A corpus constitutes an empirical basis not only for identifying the elements and structural patterns which make up the systems we use in a language, but also for mapping out our use of these systems. A corpus can be analysed and compared with other corpora or parts of corpora to study variation. Most importantly, it can be analysed distributionally to show how often particular phonological, lexical, grammatical, discoursal or pragmatic features occur, and also where they occur.
In the early 1980s it was possible to list on a few fingers the main electronic corpora which a small band of devotees had put together over the previous two decades for linguistic research. These corpora were available to researchers on a non-profit basis, and were initially available for processing only on mainframe computers. The development of more powerful microcomputers from the mid-1970s and the advent of CD-ROM in the 1980s made corpus-based research more accessible to a much wider range of participants.
By the 1990s there were many corpus-making projects in various parts of the world. Lancashire (1991) shows the huge range of corpora, archives and other electronic databases available or being compiled for a wide variety of purposes. Some of the largest corpus projects have been undertaken for commercial purposes, by dictionary publishers. Other projects in corpus compilation or analysis are on a smaller scale, and do not necessarily become well known. Undertaken as part of graduate theses or undergraduate projects, they enable students to gain original insights into the structure and use of language.
The role of computers in corpus linguistics
The analysis of huge bodies of text 'by hand' can be prone to error and is not always exhaustive or easily replicable. Although manual analysis has made an important contribution over the centuries, especially in lexicography, it was the availability of digital computers from the middle of the 20th century which brought about a radical change in text-based scholarship. Rather than initiating corpus research, developments in information technology changed the way we work with corpora. Instead of using index cards and dictionary 'slips', lexicographers and grammarians could use computers to store huge amounts of text and retrieve particular words, phrases or whole chunks of text in context, quickly and exhaustively, on their screens. Furthermore the linguistic items could be sorted in many different ways, for example, taking account of the items they collocate with and their typical grammatical behaviour.
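The kind of sorting by collocates described here can be sketched in a few lines of Python. This is a minimal illustration, not any particular lexicographer's method; the function name, the node word and the sample phrase are invented for the example:

```python
from collections import Counter
import re

def right_collocates(text, node, span=1):
    """Count the word forms occurring within `span` tokens to the
    right of each occurrence of `node` -- one simple way retrieved
    items can be sorted by what they collocate with."""
    tokens = re.findall(r"\w+", text.lower())
    coll = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            coll.update(tokens[i + 1:i + 1 + span])
    return coll.most_common()

# Invented sample: which words follow the node word 'strong'?
pairs = right_collocates("strong tea strong coffee strong tea", "strong")
# -> [('tea', 2), ('coffee', 1)]
```

A real concordancer would of course work over megabytes of text and a wider collocational span, but the principle, retrieve every occurrence of a node word and tabulate its neighbours, is the same.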
Corpus linguistics is thus now inextricably linked to the computer, which has introduced incredible speed, total accountability, accurate replicability, statistical reliability and the ability to handle huge amounts of data. With modern software, computer-based corpora are easily accessible, greatly reducing the drudgery and sheer bureaucracy of dealing with the increasingly large amounts of data used for compiling dictionaries and other information sources. In addition to greatly increased reliability in such basic tasks as searching, counting and sorting linguistic items, computers can show accurately the probability of occurrence of linguistic items in text. They have thus facilitated the development of mathematical bases for automatic natural language processing, and brought to linguistic studies a high degree of accuracy of measurement which is important in all science. Computers have permitted linguists to work with a large variety of texts and thus to seek generalizations about language and language use which can go beyond particular texts or the intuitions of particular linguists. The quantification of language use through corpus-based studies has led to scientifically interesting generalizations and has helped renew or strengthen links between linguistic description and various applications. Machine translation, text-to-speech synthesis, content analysis and language teaching have been among the beneficiaries.
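The basic counting task that underlies such probability-of-occurrence statements can be sketched as follows. This is a deliberately simple illustration under an invented tokenization and an invented sample sentence; real corpus software must handle punctuation, markup and very large files:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Tokenize a text into lowercase word forms and return, for each
    form, its raw count and its relative frequency (an estimate of its
    probability of occurrence in the text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (n, n / total) for word, n in counts.items()}

# Invented 13-token sample: 'the' occurs 4 times, so its relative
# frequency is 4/13.
sample = "the cat sat on the mat and the dog sat by the door"
freqs = word_frequencies(sample)
```

Counts of this kind, accumulated over millions of words, are what make it possible to say how probable a given item is in a given variety of text.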
Some idea of the changes which the computer has made possible in text studies can be gauged from a report in an early issue of the ALLC Bulletin, the forerunner of the journal Literary and Linguistic Computing. A brief report by Govindankutty (1973) on the coming of the computer to Dravidian linguistics captures the moment of transition between manual and electronic databases. The 300,000-word text he was working with is small by today's standards, but what took the researcher and his long-suffering colleagues nearly six years of data management and analysis could, 20 years later, be carried out in minutes.
It took nearly six years' hard labour and the co-operation of colleagues and students to complete the Index of Kamparamayanam, the longest middle Tamil text, in the Kerala University under the supervision of Professor V. I. Subramoniam. The text consists of nearly 12,500 stanzas and each stanza has four lines; each line has an average of six words. All the words and some of the suffixes were listed on small cards by the late Mr. T. Velaven who is the architect of this voluminous index. Later, the cards were sorted into alphabetical order and each item was again arranged according to the ascending order of the stanza and line. Finally, each entry was checked with the text and the meaning and grammatical category were noted. The completed index consists of about 3,500 typed pages (28 x 20 cm).
While indexing, some suffixes such as case were listed separately. This posed some problems when I started to work on the grammar of the language of the text. When it was necessary to find out after what kind of words and after which phonemes and morphemes the alternants of a suffix occur, it became necessary again to go through all the entries. Though I have tried to work out the frequency of all the suffixes, for want of time it was not completely possible. However, the frequency study helped to unearth different strata in the linguistic excavation and indirectly emphasized that it is a sine qua non, at least, for such a descriptive and historical study.
Though it took a lot of time, energy and patience, the birth of an index brought with it an unknown optimism in the grammatical description. After completing the index and the grammatical study of Kamparamayanam, three months ago I started indexing Ramacaritam, an early Malayalam text, using small cards. This project is being carried out in the Leiden University with the guidance of Professor F. B. J. Kuiper. While I was half my way through the indexing, Dr. B. J. Hoff of the Linguistics Department informed me of the work done in the Institute for Dutch Lexicology with the help of a computer. When I discussed the problems with Dr. F. de Tollenaere, who is the head of this institute, he outlined with great enthusiasm how a computer can be utilized for this purpose. Immediately, I started transcribing the text and now it is being punched on paper tape, using an AREA paper tape punch at the Institute. This paper tape punch, having an extra shift, has twice the eighty-eight standard possibilities, which results in one hundred and seventy-six different punching codes, which for the computer has the value of one hundred and seventy-six characters. Moreover, a coding system makes it possible to have up to two hundred and seven possibilities, which are also available at the output stage, as the Institute has at its disposal a print train with two hundred and seven symbols.
To a present-day corpus linguist, even the laborious data entry by punched paper seems quaintly archaic, and Govindankutty's task could now be undertaken on a personal computer accessed directly through a keyboard.
Until the mid-1980s corpus linguistics typically involved mainframe computing and was largely associated with universities having access to large machines. In the 1970s, with shared access to a standard mainframe, it could take an hour or more to make a concordance consisting of all the instances of a word such as when in a one-million-word corpus. By the late 1980s, the time taken to run such a program had been reduced to minutes. In the 1990s, the same job can be done just as quickly on the faster personal computers running at 60 or more megahertz. Hard disk drives of 500 megabytes or more on personal computers and input from a CD-ROM are now common, thus facilitating storage and rapid analysis.
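The concordancing task described here, finding every instance of a word such as *when* together with its surrounding co-text, is conceptually simple, which is why hardware speed was the limiting factor. A minimal keyword-in-context (KWIC) sketch in Python, with an invented sample text and an invented context-window parameter, might look like this:

```python
import re

def concordance(text, keyword, context=3):
    """Return every occurrence of `keyword` as a KWIC line:
    `context` tokens of left co-text, the bracketed keyword,
    and `context` tokens of right co-text."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>30} [{keyword}] {right}")
    return lines

# Invented sample containing three instances of 'when'.
text = ("When the corpus was complete, the team asked when "
        "the next stage could begin and when funding would arrive.")
for line in concordance(text, "when"):
    print(line)
```

Run over a one-million-word corpus, a routine like this performs in seconds the search that once monopolized an hour of shared mainframe time; the production systems of the period differed mainly in how they indexed the text, not in what a concordance line contains.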
In the early 1980s a captive computer scientist or friendly computer programmer was almost indispensable to assist many aspiring corpus linguists to cope with inevitable technical problems associated with data management and the programming skills necessary for corpus analysis. By the 1990s, improvements in personal computers of the kind already mentioned, and the availability of commercial software packages designed for corpus analysis, have meant that most corpus linguists can now concentrate not on how to program and use a computer but on problems and issues in linguistics which can be addressed through a corpus.