CORPUS REVIEW

COCA stands for “Corpus of Contemporary American English”. It is the largest freely available corpus of English and the only large, balanced corpus of American English. COCA was released in 2008 and is now used by thousands of users every month (linguists, teachers, translators, etc.). COCA is also related to other large corpora, for example the British National Corpus and the TIME Corpus.

Here is a view of the main page:

The corpus contains more than 400 million words of text, divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2009 and is updated once or twice a year. Its design makes it easy for users to look at ongoing changes in the language.

The corpus is composed of more than 400 million words in more than 160,000 texts. As mentioned previously, for each year the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals:

  • Spoken: (83 million words) Transcripts of unscripted conversation from more than 150 different TV and radio programs, for example All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.
  • Fiction: (79 million words) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
  • Popular Magazines: (84 million words) Nearly 100 different magazines. A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.
  • Newspapers: (79 million words) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc.  There is a good mix between different sections of the newspaper: local news, opinion, sports, financial, etc.
  • Academic Journals: (79 million words) Nearly 100 different peer-reviewed journals, covering classification categories such as B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.
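
As a quick check on the figures above, the per-genre sizes can be added up and turned into shares of the whole corpus. This is only a small sketch that uses the word counts quoted in the list; the variable names are my own and nothing here comes from the COCA site itself.

```python
# Word counts per genre, in millions, as quoted in the list above.
genre_sizes = {
    "spoken": 83,
    "fiction": 79,
    "popular magazines": 84,
    "newspapers": 79,
    "academic journals": 79,
}

# Total size of the corpus in millions of words.
total = sum(genre_sizes.values())

# Share of the whole corpus held by each genre, as a percentage.
shares = {genre: round(100 * size / total, 1) for genre, size in genre_sizes.items()}

print(f"Total: {total} million words")
for genre, share in shares.items():
    print(f"{genre}: {share}%")
```

The total comes to 404 million words, which agrees with the “more than 400 million” figure, and each genre holds roughly a fifth of the corpus.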

Because of copyright and licensing issues, the texts themselves are not available for download, under any circumstances.

Here is an example of how the website works.

The “search area” (shown in the previous picture) is on the left of the main page. The first step is to type in a word. The interface allows you to search for exact words or phrases, wildcards, lemmas, parts of speech, or combinations of these.
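
To make the difference between a wildcard search and a lemma search more concrete, here is a toy sketch in Python. It is not the COCA query syntax or the way the site works internally: the word list, the lemma table and both function names are invented for illustration.

```python
import fnmatch

# Toy word list standing in for corpus text (invented for illustration).
tokens = ["walk", "walked", "walking", "talk", "ran", "run", "running", "runs"]

# Hypothetical lemma table: each word form mapped to its dictionary form.
lemma_of = {
    "walk": "walk", "walked": "walk", "walking": "walk",
    "talk": "talk",
    "ran": "run", "run": "run", "running": "run", "runs": "run",
}

def wildcard_search(pattern, words):
    """Return the words matching a pattern in which * stands for any characters."""
    return [w for w in words if fnmatch.fnmatchcase(w, pattern)]

def lemma_search(lemma, words):
    """Return every word form whose lemma is the given one."""
    return [w for w in words if lemma_of.get(w) == lemma]

print(wildcard_search("walk*", tokens))
print(lemma_search("run", tokens))
```

The wildcard search only looks at spelling, so “walk*” finds walk, walked and walking. The lemma search looks at the dictionary form, so “run” also finds the irregular form ran, which a spelling pattern such as “run*” would miss.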

The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:

  • By genre: compare spoken, fiction, popular magazines, newspapers, and academic texts, or even sub-genres (or domains) such as movie scripts, sports magazines, newspaper editorials, or scientific journals.
  • Over time: compare different years from 1990 to the present.
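
Because the genres are not exactly the same size, raw counts are best compared after converting them to a rate per million words. The sketch below shows that arithmetic. The genre sizes are the ones quoted earlier in this review, but the hit counts are invented numbers for an imaginary search word, and the function name is my own.

```python
# Genre sizes in millions of words, as quoted earlier in this review.
genre_sizes_millions = {
    "spoken": 83,
    "fiction": 79,
    "popular magazines": 84,
    "newspapers": 79,
    "academic journals": 79,
}

# Hypothetical raw hit counts for one search word (invented numbers).
raw_hits = {
    "spoken": 8300,
    "fiction": 1580,
    "popular magazines": 4200,
    "newspapers": 3950,
    "academic journals": 790,
}

def per_million(hits, sizes_millions):
    """Convert raw hit counts into hits per million words for each genre."""
    return {genre: hits[genre] / sizes_millions[genre] for genre in hits}

rates = per_million(raw_hits, genre_sizes_millions)
for genre, rate in rates.items():
    print(f"{genre}: {rate:.1f} per million")
```

In this made-up example, popular magazines have more raw hits than newspapers (4200 against 3950), but both come out at 50 per million once the size of each genre is taken into account.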

The following pictures show the charts where searches and comparisons by frequency can be carried out:

You can also easily carry out semantically-based searches. For example, you can compare the collocates of two related words (little/small, democrats/republicans, men/women), to determine the difference in meaning or use between these words.
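
The idea behind comparing collocates can be sketched with a few lines of Python. This is a simplified illustration on invented sentences, not the method the website uses: it only counts the words found within two positions of the search word, whereas real tools usually also rank collocates with some association score. The function name and the window size are my own choices.

```python
from collections import Counter

# Invented sentences standing in for corpus text.
sentences = [
    "a small amount of money",
    "a small number of people",
    "a little bit of luck",
    "a little girl and a little boy",
    "the small town had a little shop",
]

def collocates(node, texts, window=2):
    """Count the words appearing within `window` positions of `node`."""
    counts = Counter()
    for text in texts:
        words = text.split()
        for i, word in enumerate(words):
            if word == node:
                start = max(0, i - window)
                # Words before and after the node word, excluding the node itself.
                neighbours = words[start:i] + words[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts

little = collocates("little", sentences)
small = collocates("small", sentences)

# Collocates seen with one word but never with the other.
only_little = sorted(set(little) - set(small))
only_small = sorted(set(small) - set(little))

print("only with little:", only_little)
print("only with small:", only_small)
```

Even in this tiny sample, “small” appears next to quantity words such as amount and number, while “little” appears next to girl and boy, which is the kind of difference in use that the comparison is meant to bring out.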

You can find the frequency and distribution of synonyms for nearly 60,000 words, compare their frequency in different genres, and use these word lists as part of other queries.

Finally, you can easily create your own lists of semantically-related words, and then use them directly as part of the query.
