The scope of corpus linguistics
Corpus linguistics is based on bodies of text as the domain of study and as the source of evidence for linguistic description and argumentation. It has also come to embody methodologies for linguistic description in which quantification of the distribution of linguistic items is part of the research activity. As Leech (1992:107) has noted, the focus of study is on performance rather than competence, and on observation of language in use leading to theory rather than vice versa.
It would be misleading, however, to suggest that corpus linguistics is a theory of language in competition with other theories of language such as transformational grammar, and still more misleading to suggest that it is a new or separate branch of linguistics. Linguists have always needed sources of evidence for theories about the nature, elements, structure and functions of language, and as a basis for stating what is possible in a language. At various times, such evidence has come from intuition or introspection, from experimentation or elicitation, and from descriptions based on observations of occurrence in spoken or written texts. In the case of corpus-based research, the evidence is derived directly from texts. In this sense corpus linguistics differs from approaches to language which depend on introspection for evidence. In his celebrated work, Coral Gardens and their Magic, Malinowski (1935: 9) wrote about the paradigm shift which he considered necessary in the linguistics of the day:
The neglect of the obvious has often been fatal to the development of scientific thought. The false conception of language as a means of transfusing ideas from the head of the speaker to that of the listener has, in my opinion, largely vitiated the philological approach to language. The view set forth here is not merely academic: it compels us, as we shall see, to correlate other activities, to interpret the meaning - text; and this means a new departure in the handling of linguistic evidence. It will also force us to define meaning in terms of experience and situation.
Linguists may not see the necessity for such a sea change today. However, corpus linguists often have different concerns from many other linguists. They are typically concerned not only with what words, structures or uses are possible in a language but also with what is probable, that is, what is likely to occur in language use. The use of a corpus as a source of evidence, however, is not necessarily incompatible with any linguistic theory, and progress in the language sciences as a whole is likely to benefit from a judicious use of evidence from various sources: texts, introspection, elicitation or other types of experimentation as appropriate. Any scientific enterprise must be empirical in the sense that it has to be supported or falsified by evidence and, in the final analysis, statements made about language have to stand up to the evidence of language use. The evidence can be based on the introspective judgment of speakers of the language or on a corpus of text. The difference lies in the richness of the evidence and in the confidence we can have in its generalizability, validity and reliability. The boundaries, therefore, between corpus-based description and argumentation and other approaches to language description are not rigid, and linguists of varied theoretical persuasions now use corpora for evidence which is complementary to evidence obtained from other sources.
Corpus linguistics, like all linguistics, is concerned primarily with the description and explanation of the nature, structure and use of language and languages and with particular matters such as language acquisition, variation and change. Corpus linguistics has nevertheless developed something of a life of its own within linguistics, with a tendency sometimes to focus on lexis and lexical grammar rather than pure syntax. This is partly a result of using methodologies such as concordancing where the contextual evidence available in a single line of wide-carriage computer printout of 130 characters is sometimes too limited for the analysis of syntax or discourse.
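The limitation of a fixed-width concordance line can be illustrated with a minimal sketch. The function name kwic, its parameters and the sample sentences are assumptions introduced here for illustration only; they do not reproduce any particular concordancing package.

```python
import re

def kwic(text, keyword, width=130):
    """Return keyword-in-context lines of at most `width` characters.

    Hypothetical helper: each line shows the keyword with an equal
    number of characters of context on either side.
    """
    side = (width - len(keyword)) // 2  # context characters per side
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - side):m.start()]
        right = text[m.end():m.end() + side]
        # Pad so that the keyword is aligned in the same column on every line.
        line = left.rjust(side) + m.group(0) + right.ljust(side)
        lines.append(line.replace("\n", " "))
    return lines

# Invented sample text, not taken from any corpus.
sample = (
    "She was quite obviously tired. It was not quite what he expected, "
    "and absolutely not what she wanted."
)

for line in kwic(sample, "quite", width=40):
    print(line)
```

With a narrow width the context on either side is cut off in mid-clause, which is why a single concordance line is often adequate for studying lexis but too short for syntax or discourse.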
Work in corpus linguistics is currently associated with several quite different activities. Scholars working in the field tend to be identified with one or more of them. The first group of researchers consists of corpus makers or compilers. These scholars are concerned with the design and compilation of corpora, the collection of texts and their preparation and storage for later analysis.
A second group of researchers has been concerned with developing tools for the analysis of corpora. Important contributions to software development, especially for the syntactic analysis of corpora, have been associated particularly, but not exclusively, with researchers in computational linguistics. These researchers have been concerned with the use of corpora to develop, among other things, algorithms for natural language processing and the modelling of linguistic theories.
A third group of researchers consists of descriptive linguists whose main concern has been to make use of computerized corpora to describe reliably the lexicon and grammar of languages, covering both the linguistic systems we use and our likely use of those systems. It is the probabilistic aspect of corpus-based descriptive linguistic studies which especially distinguishes them from conventional descriptive fieldwork in linguistics or lexicography. That is, corpus-based descriptive linguistics is concerned not only with what is said or written, where, when and by whom, but also with how often particular forms are used. The measurement of the distribution of words and grammar has encouraged new ways of studying the linguistic basis of variation in text types, language change and regional and other varieties of language. The corpus provides contexts for the study of meaning in use and, by making available techniques for extracting linguistic information from texts on a scale previously undreamed of, it facilitates linguistic investigations where empiricism is text-based.
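The idea of measuring how often a form is used, and comparing that across text types, can be sketched as follows. The tokenizer, the function name relative_frequencies, the normalization per 1,000 words and the two toy texts are all assumptions made for illustration; real studies use far larger samples and more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(text, per=1000):
    """Return each word's frequency normalized to occurrences per `per` tokens."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count * per / total for word, count in counts.items()}

# Two invented snippets standing in for different text types.
fiction = "She said she would go"
news = "The minister said the bill would pass"

fiction_freq = relative_frequencies(fiction)
news_freq = relative_frequencies(news)

# Normalizing by text length makes samples of different sizes comparable.
for word in ("said", "would"):
    print(word, fiction_freq[word], round(news_freq[word], 1))
```

Normalizing to a common base is what allows frequencies from samples of unequal size to be compared at all.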
A fourth area of activity, which has been among the most innovative outcomes of the corpus revolution, has been the exploitation of corpus-based linguistic description for use in a variety of applications such as language learning and teaching, and natural language processing by machine, including speech recognition and translation.
At the present time in corpus linguistics, some researchers tend to focus on issues in corpus design, others on methods for text analysis and processing, and still others, probably the majority, on corpus-based linguistic description and the application of such descriptions.
Although the scope of corpus linguistics may be defined in terms of what people do with corpora, it would be a mistake to assume that corpus linguistics is simply a faster way of describing how a language works, or that it is simply about the nature of linguistic evidence. Analysis of a corpus by means of standard corpus linguistic research software can and frequently does reveal facts about a language which we might never previously have thought of seeking. Altenberg's (1991a) study of amplifier collocations in English, for example, raised questions about semantic classes of maximizers and boosters such as perfectly or awfully which probably would not have been asked without the evidence of a corpus. He found, for example, that frequent maximizers such as quite tend to collocate with non-scalar words (quite obviously) while absolutely has a greater tendency than other maximizers to collocate with negatives (absolutely not). The major shift in methodology associated with corpus linguistics comes not from theory but rather from what the use of corpora makes possible.
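The kind of collocational evidence described above can be gathered, in outline, by counting the words that immediately follow a given amplifier. The sketch below is an assumption-laden illustration, not Altenberg's procedure: the function name right_collocates, the span parameter and the sample sentences are invented, and the simple tokenizer ignores sentence boundaries.

```python
import re
from collections import Counter

def right_collocates(text, node, span=1):
    """Count the words occurring within `span` tokens to the right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Invented sentences for illustration only.
sample = (
    "That is quite obviously wrong. Absolutely not, she said. "
    "He was quite right, and quite obviously pleased. Absolutely not!"
)

print(right_collocates(sample, "quite").most_common())
print(right_collocates(sample, "absolutely").most_common())
```

On a large corpus, tabulating such counts for each amplifier is what makes tendencies of this kind visible, including tendencies the analyst had not thought to look for.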
As we have seen, corpus linguistics goes beyond the use of corpora as a source of evidence in linguistic description. It also revives and carries on a concern of some linguists with the statistical distribution of linguistic items in the context of use. From the 1920s there was, especially in the United States and the United Kingdom, a tradition of word counting in texts in order to discover the most frequent, and arguably therefore the most pedagogically useful, words and grammatical structures for language teaching purposes.
From the 1930s, Prague School linguists undertook quantitative studies (mainly of Czech, English and Russian) of the frequency of certain grammatical processes, the relative frequencies of different parts of speech, the location and distribution of information in the sentence, and the statistical distribution of syllable types and structures. Some of this work was directed towards comparative stylistic analysis (e.g. Kramsky, 1972) and some towards quantitative comparisons of varieties of English (e.g. Duskova, 1977). Such Prague School quantitative studies, which were carried out manually, differ from modern computer corpus-based studies particularly in the size of the corpora and in their representativeness. Duskova, for example, studied 10,000 finite verb forms from 10 plays to draw conclusions about the functions and use of the preterite and the perfect in British and American English, but it is not clear why these 10 plays were chosen as representative of contemporary English. Nevertheless, the Prague School focus on quantitative studies was commendable at a time when orthodox linguistics eschewed them. Other quantitative studies were directed towards discovering the 'statistical laws' of text.