Deusto Reviewer on Language Resources

June 15, 2010

Corpus Review

Filed under: Reference corpus, Text corpus — Tamara Nogueira @ 12:22 pm

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific domain.

One of the largest freely available corpora of English, and the only large and balanced corpus of American English, is the Corpus of Contemporary American English (COCA).

COCA was released in 2008 and is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). It is also related to other large corpora that its developers have created or adapted, including the British National Corpus (offered with the same architecture and interface) and the 100-million-word TIME Corpus (1920s-2000s).

The corpus contains more than 400 million words of text and is evenly divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2009, and it is updated once or twice a year. Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language.

The interface allows searching for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. Users can also search for surrounding words (collocates) within a ten-word window, which often gives good insight into the meaning and use of a word.
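To make the idea of a collocate window concrete, here is a minimal sketch in Python. It is not how COCA itself is implemented; the function name `collocates`, the sample sentence, and the choice to apply the window on each side of the search word are all assumptions made for illustration.

```python
from collections import Counter


def collocates(tokens, node, window=10):
    """Count the words found within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Clamp the window to the start and end of the token list.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the search word itself
                counts[tokens[j]] += 1
    return counts


# Invented sample text, split naively on whitespace.
text = "the old man sat by the sea and the young woman watched the old man"
tokens = text.split()
result = collocates(tokens, "man", window=2)
print(result.most_common())
```

A real corpus would first need proper tokenization and lemmatization, and would usually rank collocates by an association measure rather than by raw counts.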

The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:

  • By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorials, or scientific journals
  • Over time: comparisons between different years from 1990 to the present
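Comparing frequencies across genres only makes sense when raw counts are normalized by the size of each section. The sketch below shows that arithmetic in Python. The hit counts are invented for illustration, and the section size of 80 million words is only an approximation derived from the figures above (about 400 million words split evenly over five genres).

```python
def per_million(hits, section_size):
    """Convert a raw hit count into a frequency per million words."""
    return hits * 1_000_000 / section_size


# Hypothetical raw counts for one word; these are NOT real COCA figures.
raw_hits = {"spoken": 400, "fiction": 1200, "academic": 40}

# Approximate section sizes: roughly 400 million words over five genres.
section_sizes = {"spoken": 80_000_000, "fiction": 80_000_000, "academic": 80_000_000}

normalized = {
    genre: per_million(raw_hits[genre], section_sizes[genre])
    for genre in raw_hits
}

for genre, freq in sorted(normalized.items(), key=lambda item: -item[1]):
    print(f"{genre:10s} {freq:6.2f} per million words")
```

The same calculation applies to comparisons over time, with one section per year instead of one per genre.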

How does the corpus work?

You can carry out semantically-based queries of the corpus.

For example, you can contrast and compare the collocates of two related words, such as man/woman, to determine the difference in meaning or use between them. You can also find the frequency and distribution of synonyms, compare their frequency in different genres, and use these word lists as part of other queries.

Finally, you can easily create your own lists of semantically-related words, and then use them directly as part of the query.

Brief Tour:

Using the web interface, you can search by words (mysterious), phrases (nooks and crannies or faint + noun), lemmas (all forms of words, like sing or tall), wildcards (un*ly or r?n*), and more complex searches such as un-X-ed adjectives or verb + any word + a form of ground. Notice that from the “frequency results” window you can click on the word or phrase to see it in context in the lower window.
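The wildcard patterns above can be illustrated with a small Python sketch that translates them into regular expressions. It assumes that `*` stands for any number of word characters and `?` for exactly one; the helper names `wildcard_to_regex` and `search` and the word list are hypothetical, and this is not a description of the corpus's own query engine.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression.

    Assumed semantics: `*` matches zero or more word characters,
    `?` matches exactly one word character.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def search(pattern, words):
    """Return the words that match the wildcard pattern as a whole."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]


# Invented word list for illustration.
words = ["unlikely", "unruly", "until", "running", "ran", "rind", "rn"]
print(search("un*ly", words))
print(search("r?n*", words))
```

Using `fullmatch` means the pattern must cover the whole word, so `un*ly` does not match "until".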

The first option in the search form lets you see either a list of all matching strings or a chart display that shows the frequency in the five “macro” registers (spoken, fiction, popular magazines, newspapers, and academic journals).

For example, look up the frequency of funky, whom, incredibly + adjective, or forms of need + to + VERB. Via the chart display, you can also see the frequency of the word or phrase in subregisters, such as movie scripts, children’s fiction, women’s magazines, or medical journals. With the list display, you can see the frequency of each matching string in each of the major sections of the corpus.

You can also search for collocates (words near a given word), which often provide insight into the meaning of that word.

You can also include information about genre or a specific time period directly as part of the query. This allows you to see how words and phrases vary across speech and many different types of written texts. You can easily find which words and phrases occur much more frequently in one register than another, such as good + [noun] in fiction, or verbs in the slot [we * that] in academic writing.

Comparison with Other Corpora:

COCA offers a balance of availability, size, genres, and currency that is not found in other corpora, including the ANC, the BNC, the BOE, or the OEC.

The chart below provides a summary of the features of the different corpora:

However, this American corpus can be compared with others that have very similar characteristics and that follow much the same structure when searching for the meaning of a word or phrase.

One such example is the CDE, the Corpus del Español, created by Mark Davies, a professor of Corpus Linguistics in the Department of Linguistics and English Language at Brigham Young University and, from 1992 to 2003, a professor of Spanish Linguistics at Illinois State University.

Some of its characteristics can easily be found just by looking at the introductory web page of the corpus:

As with COCA, the interface of this Spanish corpus lets you search in different ways: exact words or phrases, tags, lemmas, grammatical categories, or any combination of these. It also offers other options, such as comparing different words across different categories or genres.

SOURCES

COCA, Corpus of Contemporary American English. Retrieved: 22 March, 2010 at 21:00 from http://www.americancorpus.org/

Wikipedia, The Free Encyclopedia. Retrieved: 18 April, 2010 at 17:53 from http://en.wikipedia.org/wiki/Text_corpus

Mark Davies: Corpus Linguistics BYU. Retrieved: 10 May, 2010 at 20:42 from http://davies-linguistics.byu.edu/personal/

Corpus del Español. Retrieved: 10 May, 2010 at 20:45 from http://www.corpusdelespanol.org/

Corpus de Referencia del Español Actual. Retrieved: 10 May, 2010 at 20:43 from http://corpus.rae.es/creanet.html

Corpus Diacrónico del Español CORDE. Retrieved: 10 May, 2010 at 20:44 from http://corpus.rae.es/cordenet.html

June 4, 2010

My Slideshare: The OED

Filed under: Dictionary — Tamara Nogueira @ 12:30 pm

The Oxford English Dictionary is an unsurpassed guide to the meaning, history, and pronunciation of over half a million words, both present and past.

It traces the usage of words through 2.5 million quotations from a wide range of international English language sources, from classic literature and specialist periodicals to film scripts and cookery books.

The OED is a historical dictionary that covers words from across the English-speaking world, from North America to South Africa, from Australia and New Zealand to the Caribbean. It also offers the best in etymological analysis and in listing of variant spellings, and it shows pronunciation using the International Phonetic Alphabet.

History

When the members of the Philological Society of London decided, in 1857, that existing English language dictionaries were incomplete and deficient, and called for a complete re-examination of the language from Anglo-Saxon times onward, they knew they were embarking on an ambitious project. However, even they didn’t realize the full extent of the work they initiated, or how long it would take to achieve the final result.

The project proceeded slowly after the Society’s first grand statement of purpose. Eventually, in 1879, the Society made an agreement with the Oxford University Press and James A. H. Murray to begin work on a New English Dictionary.

The new dictionary was planned as a four-volume, 6,400-page work that would include all English language vocabulary from the Early Middle English period onward, plus some earlier words if they had continued to be used into Middle English.

Murray and his team did manage to publish the first part in 1884, but a far more comprehensive effort was required, so work on the Dictionary continued over the next four decades and new editors joined the project. Murray now had a large team directed by himself, Henry Bradley, W.A. Craigie, and C.T. Onions. These men worked steadily, producing fascicle after fascicle until finally, in April 1928, the last volume was published.

Structure

The structure of the OED is very different from that of a dictionary of current English, in which only present-day senses are covered and the most common meanings or senses are described first. For each word in the OED, the various groupings of senses are dealt with in chronological order according to the quotation evidence. In a complex entry with many strands, the development over time can be seen in a structure with several ‘branches’.

Modern Era

In 1992 the Oxford English Dictionary again made history when a CD-ROM edition of the work was published. Suddenly a massive, twenty-volume work that takes up four feet of shelf space and weighs 150 pounds was reduced to a slim, shiny disk that takes up virtually no space and weighs just a few ounces.

The Oxford English Dictionary on CD-ROM has been a great success. The electronic format has revolutionized the way people use the Dictionary to search and retrieve information. Complex investigations into word origins or quotations that would have been impossible to conduct using the print edition now take only a few seconds. Because the electronic format makes the Oxford English Dictionary so easy to use, its audience now embraces all kinds of interested readers beyond the confines of the scholarly community.

Using the OED:

The OED allows you to find the word or phrase you need in the full text of the dictionary, or in selected areas such as quotations or etymologies.

  1. To look up a word or phrase, simply type it into the box and hit return or click the magnifying glass. You can use wildcards in all searches.
  2. Entry Version: gives the version and publication date of the entry. A button links to an earlier version of the entry when available.
  3. Full Text Search: type a word or phrase into the box to find it in the full text of the Dictionary, or in a selected area from the drop-down list.
  4. More Options: Search for two words or phrases occurring near each other.
  5. List by Entry
  6. List by Date
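
The proximity search under More Options can be pictured with a short Python sketch. It is only an illustration of the general technique, not the OED's own search engine; the function name `near`, the `max_distance` parameter, and the sample sentence are assumptions.

```python
def near(tokens, first, second, max_distance=5):
    """Return (i, j) position pairs where `first` and `second`
    occur within `max_distance` tokens of each other."""
    positions_first = [i for i, t in enumerate(tokens) if t == first]
    positions_second = [i for i, t in enumerate(tokens) if t == second]
    return [
        (a, b)
        for a in positions_first
        for b in positions_second
        if 0 < abs(a - b) <= max_distance
    ]


# Invented sample text, split naively on whitespace.
tokens = "as mad as a march hare and as mad as a hatter".split()
print(near(tokens, "mad", "hare", max_distance=4))
print(near(tokens, "mad", "hatter", max_distance=3))
```

Each pair gives the token positions of the two words, which could then be used to show the surrounding context of every match.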

Entry Versions & Fast Searching

OED Online enables you to see how an entry has changed over time.

A unique feature of OED Online is the ability to see what different versions of the text, such as the Second Edition and a revised draft, said about a word, and to compare them at the click of a button.

Every entry is labelled with date of publication and a description of which text it is from, so that its status can be clearly seen.

In the case of those entries which have already been revised and had new research incorporated, a ‘Revised draft’ version will be available. These draft entries have not been previously published, and may be altered in the future if further relevant material comes to light.

To help you look up the words you want, the site offers several different ways to search.

The most straightforward is the simple Find Word search, available at the top right-hand corner of every page. This restricts the search to the defined words and phrases.

Simply enter the word you’re looking for, click Find Word, and the entry will be displayed if there is a single match, or a results list will be displayed if there is more than one match.

Specific Search & Phrase Search

To search for references to the word ‘ghost’ in the titles of quoted works, enter ‘ghost’ into the search box, select ‘quotation work’ from the pull-down list, and click Start Search.

This produces the following results list from the Second Edition, which includes links to the entries containing the quotations, as well as direct links into the body of the quotations themselves, surrounded by a little context to show where the word ‘ghost’ has been found.

Just as with the Find Word facility, where you could enter a word, phrase, or pattern with wildcards, so you may also search anywhere in the different text areas or in the whole Dictionary text for occurrences of a phrase which interests you.

So to find variations on phrases such as mad as a hatter or mad as a March hare, type ‘mad as a’ in the first search box, using ‘full text’ as the selected area, to see how inventively this formula has been used.

As usual, the results list appears:

SOURCES:

Oxford English Dictionary. Retrieved: 15 March, 2010 at 21:11 from http://www.oed.com/about/

Oxford English Dictionary. Wikipedia, The Free Encyclopedia. Retrieved: 15 March, 2010 at 21:30 from http://es.wikipedia.org/wiki/Oxford_Dictionary

Oxford English Dictionaries Online (OLDO). Retrieved: 10 May, 2010 at 19:47 from http://www.wordreference.com/english/OLDO-es.aspx

The Concise Oxford Dictionary and Thesaurus. Retrieved: 10 May, 2010 at 19:50 from http://www.babylon.com/dictionary/oxford/?id=227&tree=5&level=3
