The work of the American philologist George Zipf, from the 1930s, was concerned with such quantitative analyses as the relation between the frequency of words in text and text length, the frequency of words and their antiquity, and the relation between the rank order of an item in a word frequency list and the number of occurrences or tokens of that item in a text. Zipf (1949) set out his famous 'law', which holds that the product of a word's frequency of use in a text and that word's rank order in a frequency list is approximately constant (f × r = c).
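As a minimal illustration of the rank-frequency relation (not part of the original text, and using an invented sample sentence), a few lines of Python can count word tokens, rank the word types by frequency, and compute the product f × r for each:

```python
from collections import Counter

def rank_frequency(text):
    """Pair each word type with its frequency rank, frequency,
    and the product rank * frequency that Zipf's law concerns."""
    counts = Counter(text.lower().split())
    # Sort word types from most to least frequent; rank 1 = most frequent.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Invented toy "corpus"; a real test of the law needs a long text.
sample = "the cat sat on the mat and the dog sat on the rug"
for word, rank, freq, product in rank_frequency(sample):
    print(word, rank, freq, product)
```

On a text of realistic length, the products in the final column stay roughly level across ranks, which is what Zipf's f × r = c asserts; on a toy sentence like this the pattern is only suggestive.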
As noted above, the earliest computerized corpora compiled for linguistic research from the 1960s required the use of mainframe computers, and researchers frequently had to design their own software for analysis. Initial interest was often in lexis, including word counts, but it was quickly apparent that a computer corpus facilitated the study of permissible or likely word sequences or collocations (are we more likely to write different from, different to or different than?) and grammatical and stylistic characteristics of particular authors and genres. There was a particular interest in what characterized 'scientific style', 'newspaper style' and 'literary or imaginative style'.
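The collocation question raised above (are we more likely to write different from, different to or different than?) is exactly the kind of query early corpus software answered. As a hedged sketch, with an invented function name and a made-up scrap of text standing in for a real corpus, the counting step might look like this in Python:

```python
from collections import Counter
import re

def count_sequences(corpus, sequences):
    """Count occurrences of each candidate word sequence in a corpus string.
    Word boundaries (\\b) prevent matches inside longer words."""
    text = corpus.lower()
    return {seq: len(re.findall(r'\b' + re.escape(seq) + r'\b', text))
            for seq in sequences}

# Invented mini-corpus; real studies would run this over millions of words.
queries = ["different from", "different to", "different than"]
print(count_sequences("They spell it different from us, "
                      "but not different to them.", queries))
```

Run over a large balanced corpus, the relative counts give an empirical answer to which variant a speech community actually prefers, and in which genres.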
With a corpus stored in a computer, it is easy to find, sort and count items, either as a basis for linguistic description or for addressing language-related issues and problems. It is not surprising, therefore, that a wide range of research activities have come to be within the scope of corpus linguistics. Analyses can contribute to the making of dictionaries, word lists, descriptive grammars, diachronic and synchronic comparative studies of speech varieties, and to stylistic, pedagogical and other applications. With appropriate software it is easy to study the distribution of phonemes, letters, punctuation, inflectional and derivational morphemes, words (as variously defined), collocations, instances of particular word classes, syntactic patterns, or discourse structures. Recent work at Birmingham University described by Renouf (1993) shows how new words and new uses can be identified in corpora at the time these words enter journalistic use.
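One standard way of "finding and sorting" items of the kind described above is the key-word-in-context (KWIC) concordance, which lists every occurrence of a search word with its surrounding context. A minimal sketch (the function name and the toy token list are invented for illustration):

```python
def kwic(tokens, keyword, width=3):
    """Key Word In Context: return each occurrence of keyword
    bracketed and flanked by up to `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented toy token sequence; real concordancers work on tagged corpora.
tokens = "the cat sat on the mat".split()
for line in kwic(tokens, "the", width=2):
    print(line)
```

Sorting such concordance lines by the word to the left or right of the keyword is how collocations and recurrent phraseological units are typically brought to light.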
The scope and current concerns of a field of scholarship can sometimes be seen or defined through the topics which make up conference programmes and the content of specialist journals. In the 1990s the topics which appear on conference programmes and in journals which cover corpus linguistics include improved ways of annotating corpora, the tagging of parts of speech and the senses of polysemous word forms, improved automatic parsing, identification of collocations, phraseological units and discourse structure, text categorization, research methodology in the face of more and bigger corpora, and the application of this work in lexicography, syntactic description, translation, speech and handwriting recognition, and language teaching. Educational applications are increasingly on the agenda. At Lancaster University in 1994 and 1996 the pedagogical significance of electronic corpora was the subject of conferences on the teaching of linguistics and the teaching of languages.
In March 1993, a Georgetown University Round Table meeting in Washington, DC, on corpus-based linguistics identified the following topics as those in particular need of investigation and dissemination at a time when linguistics was returning to more text-based approaches to language:
• the design and development of text-speech corpora
• tools for searching and processing on-line corpora
• critical assessments of on-line corpora and corpus-processing tools
• methodological issues in corpus-based analysis
• applications and results in linguistics and related disciplines, including language teaching, computational linguistics, historical linguistics, discourse analysis and stylistic analysis
The scope of computer corpus-based scholarship can also be measured by some of its achievements. In lexicography the revision of the Oxford English Dictionary, its publication in electronic form on CD-ROM and the publication of new learners' dictionaries of English by other major publishers were all based on corpora. The completion of the 100-million-word British National Corpus in 1994 set a new standard in corpus design and compilation. Another important international standard set in corpus preparation and formatting has been in the gradual adoption of the Standard Generalized Markup Language (SGML) through the Text Encoding Initiative (TEI) (see Section 2.6.5). In the analysis of corpora there have been improvements in the accuracy of the automatic grammatical tagging and parsing of texts. There has also been a substantial and rapidly growing amount of descriptive detail on the elements and structure of languages (particularly English) arising from corpus-based research.
Widdowson, H.G. Linguistics. Oxford: Oxford University Press, 1996, pp. 69–77.
Linguistics, like language itself, is dynamic and therefore subject to change. It would lose its validity otherwise, for like all areas of intellectual enquiry, it is continually questioning established ideas and questing after new insights. That is what enquiry means. Its very nature implies a degree of instability. So although there is, in linguistics, a reasonably secure conceptual common ground, which this book has sought to map out, there is, beyond that, a variety of different competing theories, different visions and revisions, disagreements and disputes, about what the scope and purpose of the discipline should be. There are three related issues which are particularly prominent in current debate. One has to do with the very definition of the discipline and takes us back to the question of idealization. Another issue concerns the nature of linguistic data and has come into prominence with the development of computer programs for the analysis of large corpora of language. A third issue raises the question of accountability and the extent to which linguistic enquiry should be made relevant to the practical problems of everyday life.
The scope of linguistics
[...] linguistics has traditionally been based on an idealization which abstracts the formal properties of the language code from the contextual circumstances of actual instances of use, seeking to identify some relatively stable linguistic knowledge (langue, or competence) which underlies the vast variety of linguistic behaviour (parole, or performance). It was also pointed out that there are two reasons for idealizing to such a degree of abstraction. One has to do with practical feasibility: it is convenient to idealize in this way because the actuality of language behaviour is too elusive to capture by any significant generalization. But the other reason has to do with theoretical validity, and it is this which motivates Chomsky's competence-performance distinction. The position here is that the data of actual behaviour are disregarded not because they are elusive but because they are of little real theoretical interest: they do not provide reliable evidence for the essential nature of human language. Over recent years, this formalist definition of the scope of linguistics has been challenged with respect to both feasibility and validity.
As far as feasibility is concerned, it has been demonstrated that the data of behaviour are not so resistant to systematic account as they were made out to be. There are two aspects of behaviour. One is psychological and concerns how linguistic knowledge is organized for access and what the accessing processes might be in both the acquisition and use of language. This has been a subject of enquiry in psycholinguistics. The second aspect of behaviour is sociological. This accessing of linguistic knowledge is prompted by some communicative need, some social context which calls for an appropriate use of language. These conditions for appropriateness can be specified, as indeed was demonstrated in part in the discussion of pragmatics. The account of the relationship between linguistic code and social context is the business of sociolinguistics.
Psycholinguistic work on accessing processes and sociolinguistic work on appropriateness conditions have demonstrated that there are aspects of behaviour that can be systematically studied, and that rigorous enquiry does not depend on the high degree of abstraction proposed in formalist linguistics. In other words, psycholinguistics and sociolinguistics have things to say about language which are also within the legitimate scope of the discipline. Such a point of view would be a tolerant and neighbourly one: we stake out different areas of language study, each with its own legitimacy.