The main focus of this book is on four major areas of activity in corpus linguistics:
• corpus design and development
• corpus-based descriptions of aspects of English structure and use
• the particular techniques and tools used in corpus analysis
• applications of corpus-based linguistic description
Readers may choose to work through the book in the above order or to begin with the sections dealing with corpus-based descriptions of English in order to become familiar first with some of the results of corpus analysis. In focusing on the contribution of corpus linguistics to the description of English and on some of the central issues and problems which are being addressed within corpus linguistics, the book also attempts to bring together disparate work which is often hard to get hold of. However, such is the speed of development and change in corpus linguistics at the present time that anyone writing about it must be conscious that it would be easy to produce a Ptolemaic picture of the field, with the world distorted and with Terra Australis Incognita, the Great Southern Continent, both misconceived and misplaced. Work relevant for corpus linguistics is being done in many fields, including computer science and artificial intelligence, as well as in various branches of descriptive and applied linguistics. It would not be surprising if some of the scholars contributing to corpus linguistics from these and other perspectives found that their work is inadequately represented here. However, they can be assured that such neglect is not intended.
Because corpus linguistics is a field where activity is increasing very rapidly and where there is as yet no magisterial perspective, even the very notion of what constitutes a valid corpus can still be controversial. It also needs to be understood at the outset that not every use of computers with bodies of text is part of corpus linguistics. For example, the aim of Project Gutenberg to distribute 10,000 texts to 100 million computer users by the year 2001 is not in itself part of corpus linguistics although texts included in this ambitious project may conceivably provide textual data for corpus analysis. Similarly, contemporary reviews of computing in the humanities show the enormous extent of corpus-based work in literary studies. While some of the methodology used in literary studies resembles some of the activity being undertaken in corpus linguistics, research on authorial attribution or thematic structure, for example, does not come within the scope of this book. Nor does the book attempt to cover systematically the wide range of corpus-based work being undertaken in computational linguistics in such areas of natural language processing as speech recognition and machine translation.
Although there have been spectacular advances in the development and use of electronic corpora, the essential nature of text-based linguistic studies has not necessarily changed as much as is sometimes suggested. In this book, reference is made to corpus studies which were undertaken manually before computers were available. Corpus linguistics did not begin with the development of computers but there is no doubt that computers have given corpus linguistics a huge boost by reducing much of the drudgery of text-based linguistic description and vastly increasing the size of the databases used for analysis. It should be made clear, however, that corpus linguistics is not a mindless process of automatic language description. Linguists use corpora to answer questions and solve problems. Some of the most revealing insights on language and language use have come from a blend of manual and computer analysis. It is now possible for researchers with access to a personal computer and off-the-shelf software to do linguistic analysis using a corpus, and to discover facts about a language which have never been noticed or written about previously. The most important skill is not to be able to program a computer or even to manipulate available software (which, in any case, is increasingly user-friendly). Rather, it is to be able to ask insightful questions which address real issues and problems in theoretical, descriptive and applied language studies. Many of the key problems and challenges in corpus linguistics are associated with the following questions:
• How can we best exploit the opportunities which arise from having texts stored in machine-retrievable form?
• What linguistic theories will best help structure corpus-based research?
• What linguistic phenomena should we look for?
• What applications can make use of the insights and improved descriptions of languages which come out of this research?
In answering these and other questions, corpus linguistics has the potential to provide solutions and new directions to some of the major issues and problems in the study of human communication.
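As one hedged illustration of what machine-retrievable text makes possible, the following minimal Python sketch produces a simple keyword-in-context (concordance) listing. The function name `kwic`, the `width` parameter and the sample sentences are assumptions invented for this illustration; they are not drawn from any particular corpus or software package discussed in this book.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every whole-word match of keyword.

    Each line shows up to `width` characters of left context, the matched
    word in square brackets, and up to `width` characters of right context.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Invented sample text, used only to show the output format.
sample = (
    "Linguists use corpora to answer questions. "
    "A corpus can be small. Large corpora are now common."
)

concordance = kwic(sample, "corpora")
for line in concordance:
    print(line)
```

A listing of this kind only gathers the evidence in one place. Deciding which word to search for, and what the resulting patterns mean, remains the analyst's task.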
The definition of a corpus as a collection of texts in an electronic database can beg many questions, for there are many different kinds of corpora. Some dictionary definitions suggest that corpora necessarily consist of structured collections of text specifically compiled for linguistic analysis, that they are large or that they attempt to be representative of a language as a whole. This is not necessarily so. Not all corpora which can be used for linguistic research were originally compiled for that purpose. Historically, it is not even the case that corpora are necessarily stored electronically so that they can be machine-readable, although this is nowadays the norm. [...] electronic corpora can consist of whole texts or collections of whole texts. They can consist of continuous text samples taken from whole texts; they can even be made up of collections of citations. At one extreme an electronic dictionary may serve as a kind of corpus for certain types of linguistic research, while at the other extreme a huge unstructured archive of texts may be used for similar purposes by corpus linguists.
Corpora have been compiled for many different purposes, which in turn influence the design, size and nature of the individual corpus. Some current corpora intended for linguistic research have been designed for general descriptive purposes; that is, they have been designed so that they can be examined or trawled to answer questions at various linguistic levels on the prosody, lexis, grammar, discourse patterns or pragmatics of the language. Other corpora have been designed for specialized purposes such as discovering which words and word meanings should be included in a learners' dictionary; which words or meanings are most frequently used by workers in the oil industry or economics; or what differences there are between uses of a language in different geographical, social, historical or work-related contexts.
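The question of which words are most frequently used in a specialized variety can be approached, in its simplest form, by counting word forms. The following is a minimal Python sketch under stated assumptions: the two sample sentences are invented, the names `tokenize` and `word_frequencies` are hypothetical, and the regular-expression tokenizer is deliberately crude (it ignores lemmatization, multi-word terms and word meanings, all of which a real specialized-corpus study would need to address).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    Word-internal apostrophes (as in "don't") are kept; all other
    punctuation and digits are discarded.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_frequencies(texts):
    """Count word-form frequencies across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Invented sentences standing in for a small specialized corpus.
sample_texts = [
    "The rig crew checked the drill pipe before drilling resumed.",
    "Drilling mud is pumped down the drill pipe.",
]

freqs = word_frequencies(sample_texts)
for word, count in freqs.most_common(5):
    print(word, count)
```

Even in this toy sample the most frequent form is a function word, which suggests why raw frequency lists usually need to be compared against a general reference corpus before specialized vocabulary can be identified.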
A distinction is sometimes made between a corpus and a text archive or text database. Whereas a corpus designed for linguistic analysis is normally a systematic, planned and structured compilation of text, an archive is a text repository, often huge and opportunistically collected, and normally not structured. It is generally the case, as Leech (1991:11) suggested, that 'the difference between an archive and a corpus must be that the latter is designed or required for a particular "representative" function'. It is nevertheless not always easy to see unequivocally what a corpus is representing, in terms of language variety.
Databases which are not made up of samples but constitute an entire population of data may consist of a single book (e.g. George Eliot's Middlemarch) or of a number of works. These corpora may be the work of a single author (e.g. the complete works of Jane Austen) or of several authors (e.g. medieval lyrics), or may comprise all the editions of a particular newspaper in a given year. Some projects have assembled all the known available texts in a particular genre or from a particular historical period. Some of these databases or text archives described in Section 2.4 are very large indeed, and although they have rarely yet been used as corpora for linguistic research, there is no reason why they should not be in the future. In many respects it is thus the use to which the body of textual material is put, rather than its design features, which defines what a corpus is.