In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific universe of texts.
One of the largest freely available corpora of English, and the only large and balanced corpus of American English, is the Corpus of Contemporary American English (COCA).
COCA was released in 2008 and is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that the same developers have created or modified, including the British National Corpus (with the same architecture and interface) and the 100-million-word TIME Corpus (1920s-2000s).
The corpus contains more than 400 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2009, and the corpus is updated once or twice a year. Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language.
The interface allows searching for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. Users can also search for surrounding words (collocates) within a ten-word window, which often gives good insight into the meaning and use of a word.
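The collocate search described above can be illustrated with a minimal sketch. It is not COCA's implementation: the function name `collocates`, the crude regular-expression tokenizer, and the sample sentences are all assumptions made for illustration, and the window is taken here to mean a number of tokens on either side of the search word.

```python
import re
from collections import Counter

def collocates(text, node, span=10):
    """Count the words found within `span` tokens on either side of `node`.

    Hypothetical helper for illustration only; a real corpus interface
    works on pre-tokenized, tagged text rather than a raw string.
    """
    tokens = re.findall(r"[a-z']+", text.lower())  # very rough tokenizer
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]       # words before the node
            right = tokens[i + 1:i + 1 + span]      # words after the node
            counts.update(left + right)
    return counts

# Invented sample text, not corpus data.
sample = (
    "The strong wind blew all night. A strong coffee helped in the morning. "
    "She made a strong argument for strong tea."
)
top = collocates(sample, "strong", span=2)
print(top.most_common(5))
```

On a real corpus the raw counts would normally be filtered or weighted (for example, by ignoring very common function words), since the most frequent neighbours of almost any word are words like "the" and "a".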
The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:
- By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorials, or scientific journals
- Over time: comparisons between different years from 1990 to the present
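Comparisons like these only make sense when counts are normalized by the size of each section, since a genre section and a single year do not contain the same number of words. The sketch below shows the usual per-million normalization. The section sizes follow the figures given earlier (400 million words split over five genres, 20 million words per year); the raw counts and the function name `per_million` are invented for illustration.

```python
def per_million(raw_count, section_size):
    """Return occurrences per million words for one section of a corpus."""
    return raw_count * 1_000_000 / section_size

# Section sizes in words, derived from the description of the corpus.
GENRE_SIZE = 80_000_000   # 400 million words / 5 genres
YEAR_SIZE = 20_000_000    # 20 million words per year

# Hypothetical raw counts for a single word (not real COCA data).
raw_counts = {"spoken": 4_000, "fiction": 2_400, "academic": 400}

by_genre = {genre: per_million(n, GENRE_SIZE) for genre, n in raw_counts.items()}
one_year = per_million(500, YEAR_SIZE)  # hypothetical count in a single year

for genre, freq in by_genre.items():
    print(f"{genre:10s} {freq:6.1f} per million")
print(f"{'one year':10s} {one_year:6.1f} per million")
```

With normalized figures, a count of 500 in one year can be compared directly with a count of 4,000 in a whole genre section, which the raw numbers alone would not allow.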
How does the corpus run?
You can carry out semantically-based queries of the corpus.
For example, you can compare and contrast the collocates of two related words, such as man/woman, to determine the difference in meaning or use between them. You can also find the frequency and distribution of synonyms, compare their frequency in different genres, and use these word lists as part of other queries.
Finally, you can easily create your own lists of semantically-related words, and then use them directly as part of the query.
Using the web interface, you can search by words (mysterious), phrases (nooks and crannies or faint + noun), lemmas (all forms of words, like sing or tall), wildcards (un*ly or r?n*), and more complex searches such as un-X-ed adjectives or verb + any word + a form of ground. Notice that from the “frequency results” window you can click on the word or phrase to see it in context in the lower window.
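The wildcard patterns mentioned above can be tried out on a small word list. This sketch assumes that `*` stands for any number of characters and `?` for exactly one, which is how Python's standard `fnmatch` module interprets them; the helper name `wildcard_search` and the vocabulary are made up for illustration and say nothing about how the corpus itself evaluates such queries.

```python
from fnmatch import fnmatchcase

def wildcard_search(pattern, vocabulary):
    """Return the words in `vocabulary` that match a shell-style wildcard pattern."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]

# Invented vocabulary standing in for the word list of a corpus.
vocab = ["unlikely", "unruly", "under", "run", "running", "rain", "ran", "ring", "rn"]

un_ly = wildcard_search("un*ly", vocab)   # starts with "un", ends with "ly"
r_n = wildcard_search("r?n*", vocab)      # "r", any one character, "n", then anything

print(un_ly)
print(r_n)
```

Note that "rain" does not match `r?n*`, because `?` consumes exactly one character and the third letter must then be "n".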
The first option in the search form allows you to see either a list of all matching strings or a chart display that shows the frequency in the five “macro” registers (spoken, fiction, popular magazines, newspapers, and academic journals).
Look for the frequency of funky, whom, incredibly + adjective, or forms of need + to + VERB. Via the chart display, you can also see the frequency of the word or phrase in subregisters, such as movie scripts, children’s fiction, women’s magazines, or medical journals. With the list display, you can also see the frequency of each matching string in each of the major sections of the corpus.
You can also search for collocates (words near a given word), which often provides insight into the meaning of that word.
You can also include information about genre or a specific time period directly as part of the query. This allows you to see how words and phrases vary across speech and many different types of written texts. We can easily find which words and phrases occur much more frequently in one register than another, such as good + [noun] in fiction, or verbs in the slot [we * that] in academic writing.
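A slot query such as [we * that] can be approximated on plain text with a regular expression. The sketch below is only an illustration under stated assumptions: the function name `slot_fillers` and the sample sentences are invented, and unlike the corpus interface it has no part-of-speech information, so it returns any single word that fills the slot, not only verbs.

```python
import re
from collections import Counter

def slot_fillers(text, before, after):
    """Count the single words that appear between `before` and `after`."""
    pattern = re.compile(
        rf"\b{re.escape(before)}\s+(\w+)\s+{re.escape(after)}\b",
        re.IGNORECASE,
    )
    # findall returns the captured slot word for every match.
    return Counter(word.lower() for word in pattern.findall(text))

# Invented sample imitating academic prose, not corpus data.
sample = (
    "We found that the effect holds. We argue that it matters. "
    "Earlier we found that it varies. We hope to continue."
)
fillers = slot_fillers(sample, "we", "that")
print(fillers.most_common())
```

Running the same function over text from two registers and comparing the resulting counts gives a rough picture of which fillers are typical of each register.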
Comparison with other corpora:
COCA offers a balance of availability, size, genres, and currency that is not found in other corpora, including the ANC, the BNC, the BOE, or the OEC.
The chart below provides a summary of the features of the different corpora:
However, this American corpus can be compared with others that have very similar characteristics and that follow much the same structure when searching for the meaning of a word or phrase.
One such example is the CDE (Corpus del Español), created by Mark Davies, a professor of corpus linguistics in the Department of Linguistics and English Language at Brigham Young University who, from 1992 to 2003, was a professor of Spanish linguistics at Illinois State University.
Some of its characteristics can be found simply by looking at the introductory web page of the corpus:
As with COCA, the interface of this Spanish corpus makes it possible to search in different ways: by exact words or phrases, labels, lemmas, grammatical categories, or any combination of these. It also offers other options, such as comparing different words across categories, genres, and so on.
COCA, Corpus of Contemporary American English. Retrieved: 22 March, 2010 at 21:00 from http://www.americancorpus.org/
Wikipedia, The Free Encyclopedia. Retrieved: 18 April, 2010 at 17:53 from http://en.wikipedia.org/wiki/Text_corpus
Mark Davies: Corpus Linguistics BYU. Retrieved: 10 May, 2010 at 20:42 from http://davies-linguistics.byu.edu/personal/
Corpus del Español. Retrieved: 10 May, 2010 at 20:45 from http://www.corpusdelespanol.org/
Corpus de Referencia del Español Actual. Retrieved: 10 May, 2010 at 20:43 from http://corpus.rae.es/creanet.html
Corpus Diacrónico del Español CORDE. Retrieved: 10 May, 2010 at 20:44 from http://corpus.rae.es/cordenet.html