Oxford English Corpus & British National Corpus

A text corpus is a large, structured set of texts that is stored and processed electronically. Such corpora are used for statistical analysis and hypothesis testing, for example by checking how often a word or construction occurs.

There are two main types of corpus: a monolingual corpus, which covers a single language, and a multilingual corpus, which covers text data in multiple languages.

The Oxford English Corpus is an example of a monolingual corpus: it covers only the English language. It is used by the developers of the Oxford English Dictionary and by Oxford University Press’s language research programme.

The Oxford English Corpus is thought to be the largest corpus of its kind, containing over two billion words. Its words come from many different kinds of writing, in contrast to other databases, which only offer examples taken from specific kinds of writing. It is based mainly on material collected from pages on the World Wide Web and other online sources; it also uses some printed texts such as academic journals.

The corpus is divided into 20 major subject areas or subcorpora.

The corpus is stored in a digital version formatted in XML and is analysed with the Sketch Engine software.

An important feature of the Oxford English Corpus is that it keeps track of changes in the language. The spelling and usage of words change over time, so the corpus aims to register those changes and incorporate them. It also keeps track of informal usage and of the common mistakes found in everyday writing such as emails.
Using the Corpus:
The Oxford English Corpus is intended to be used in many different areas as a way of studying the English language.
  • The Corpus can distinguish two words with similar meanings by the company they keep. Words do not appear alone; they are associated with other words, and those associations set near-synonyms apart. As an example, the web page of the Oxford English Corpus contrasts the words eccentric and quirky. Their collocational profile is given in a table of three columns: the first lists adverbs modifying the word, as in “slightly eccentric”; the second shows nouns modified by the word, such as “eccentric character”; and the third gives adjectives co-occurring with the word. The table shows that the two words are slightly different: “Whereas eccentric is associated with being elderly, rich, or reclusive, quirky is most strongly associated with being humorous or youthful”.
  • The Corpus also shows the most common ways in which new words and expressions are coined, such as the suffixes -fest, -speak, -tastic, and -ville.
  • The Oxford English Corpus lists which expressions are written as one word and which as a two-word phrase, and it distinguishes between usage in British English and in American English. Words such as someday, anymore and sometime are written as one word more often in American English than in British English. In general, such one-word forms are more common in American English than in British English.
  • The Oxford English Corpus also helps dictionary makers identify new usages of words. As an example, until 1999 the adjective edgy had a single dictionary sense:
edgy adj. tense, nervous, or irritable.
However, a second sense has since emerged and is now recorded:
edgy adj. 1 tense, nervous, or irritable. 2 informal, avant-garde and unconventional.
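
The collocational profile described in the first point above can be illustrated with a small sketch. This is not how the Oxford English Corpus or Sketch Engine computes its tables; it simply counts the words that occur next to a target word. The function name `collocates`, the `window` parameter and the toy text are all assumptions made for illustration.

```python
from collections import Counter
import re


def collocates(text, target, window=1):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts


# Toy text invented for this example; it is NOT taken from the Oxford English Corpus.
toy_text = (
    "An eccentric millionaire lived next to a slightly eccentric professor. "
    "The quirky comedy had a quirky humour, and the eccentric millionaire loved it."
)

eccentric_profile = collocates(toy_text, "eccentric")
quirky_profile = collocates(toy_text, "quirky")

print(eccentric_profile.most_common(3))
print(quirky_profile.most_common(3))
```

Comparing the two counters shows which neighbours are typical of each word. A real profile would also separate the neighbours by part of speech, as the three-column table does.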

The British National Corpus (BNC), on the other hand, offers a wide range of samples of written and spoken English taken from different sources. This 100-million-word text corpus is a sample of spoken and written British English covering a number of genres from the late twentieth century. The corpus has been released in several editions, the latest being the BNC XML Edition, which appeared in 2007.

The project to build the corpus began in 1991 and was finished in 1994. Since then no new texts have been added to the corpus, but there were several revisions before the second edition, BNC World, was released in 2001. Moreover, after the project was completed, two sub-corpora with material from the BNC were released: first the BNC Sampler and then the BNC Baby.

Among its written sources, the BNC takes samples from regional and national newspapers and from specialist periodicals and journals aimed at all ages. Its sources also include academic books, popular fiction, letters, and school and university essays. The written part makes up the bulk of the corpus, 90%.

The spoken part is smaller (10%) and includes orthographic transcriptions of unscripted informal conversations, radio recordings and a number of other sources.

What type of corpus is the BNC?

  • Monolingual: It deals with modern British English, not other languages used in Britain. However, non-British English and foreign-language words do occur in the corpus.
  • Synchronic: It covers British English of the late twentieth century, rather than the historical development which produced it.
  • General: It includes many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.
  • Sample: For written sources, samples of 45,000 words are taken from various parts of single-author texts. Shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million limit, and avoids over-representing idiosyncratic texts.
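
The sampling rule for single-author texts can be sketched as follows. The 45,000-word limit comes from the description above; everything else is an assumption for illustration, in particular taking one contiguous stretch at a random starting position, which is not necessarily how the BNC compilers chose their extracts. The sketch does not cover multi-author texts, which are included in full.

```python
import random

SAMPLE_LIMIT = 45_000  # word limit quoted in the description above


def take_sample(words, limit=SAMPLE_LIMIT, rng=None):
    """Return the whole text if it fits within `limit` words; otherwise return
    one contiguous stretch of `limit` words from a random position.
    (The random starting position is an assumption, not documented BNC practice.)"""
    if len(words) <= limit:
        return list(words)
    rng = rng or random.Random()
    start = rng.randrange(len(words) - limit + 1)
    return words[start:start + limit]


# Toy "texts" made of placeholder tokens.
short_text = ["word"] * 1_200
long_text = [f"w{i}" for i in range(120_000)]

short_sample = take_sample(short_text)
long_sample = take_sample(long_text, rng=random.Random(0))

print(len(short_sample), len(long_sample))  # 1200 45000
```

Capping each long text in this way is what lets more texts fit within the overall 100-million-word limit.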

Background:

The BNC project was carried out and is managed by the BNC Consortium, an industrial/academic consortium led by Oxford University Press, of which the other members are major dictionary publishers Addison-Wesley Longman and Larousse Kingfisher Chambers; academic research centres at Oxford University Computing Services (OUCS), the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University, and the British Library’s Research and Innovation Centre. The project was funded by the commercial partners, the Science and Engineering Council (now EPSRC) and the DTI under the Joint Framework for Information Technology (JFIT) programme. Additional support was provided by the British Library and the British Academy.

How to use the Corpus:

The BNC offers a Search option that is very easy to use. The Simple Search box is on the main page of the Corpus. It lets us look up a word quickly by typing it into the search box and clicking OK. We are then taken to a page listing up to 50 randomly selected instances. The page also shows the frequency of the search string, and we can check the source of each example by clicking the code that appears before it.
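
The behaviour of the simple search can be imitated in a short sketch: count every occurrence of the word, then show at most 50 randomly chosen examples together with their source codes. This is only an illustration and not the BNC's own implementation; the function `simple_search`, the source codes `T01` and `T02` and the toy sentences are invented.

```python
import random
import re

MAX_HITS = 50  # number of instances shown, as described above


def simple_search(sentences, word, rng=None):
    """Return (frequency, hits) for `word`.

    `sentences` is a list of (source_code, sentence) pairs. `frequency` is the
    total number of occurrences; `hits` holds up to MAX_HITS randomly chosen
    pairs whose sentence contains the word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    frequency = 0
    matches = []
    for code, sentence in sentences:
        found = len(pattern.findall(sentence))
        if found:
            frequency += found
            matches.append((code, sentence))
    rng = rng or random.Random()
    hits = rng.sample(matches, min(MAX_HITS, len(matches)))
    return frequency, hits


# Toy data: the source codes and sentences are hypothetical.
toy_sentences = [("T01", f"Sentence {i} mentions the corpus once.") for i in range(60)]
toy_sentences.append(("T02", "Nothing relevant here."))

frequency, hits = simple_search(toy_sentences, "corpus", rng=random.Random(1))
print(frequency, len(hits))  # 60 50
```

Showing a random selection rather than the first 50 matches gives a fairer impression of how the word is used across the whole corpus.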

The BNC also allows more complex searches. For instance, instead of searching for a single word, we can use the _ symbol, which matches any word. As an example, if we search football_match, the Corpus returns five results for the query, that is, five examples where the two words appear.

The = character, on the other hand, restricts the search by part of speech. If we search out=AVP, we restrict the search to the examples where out is used as an adverb. A list of the available part-of-speech codes can be found on the homepage of the British National Corpus.
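
The idea behind a part-of-speech restriction can be shown on a small list of tagged tokens. This is a sketch under assumptions: the helper names `parse_query` and `search_tagged` are hypothetical, and the tokens are tagged by hand for illustration rather than taken from the BNC. Only the query form out=AVP comes from the text above.

```python
def parse_query(query):
    """Split a query such as 'out=AVP' into (word, tag); tag is None if absent."""
    word, _, tag = query.partition("=")
    return word.lower(), (tag or None)


def search_tagged(tokens, query):
    """Return the positions of tokens matching the word and, if given, the tag."""
    word, tag = parse_query(query)
    return [
        i
        for i, (token, token_tag) in enumerate(tokens)
        if token.lower() == word and (tag is None or token_tag == tag)
    ]


# Hand-tagged toy tokens (an assumption, not real BNC data):
# "She went out" / "He ran out the door"
tagged = [
    ("She", "PNP"), ("went", "VVD"), ("out", "AVP"),
    ("He", "PNP"), ("ran", "VVD"), ("out", "PRP"), ("the", "AT0"), ("door", "NN1"),
]

adverb_hits = search_tagged(tagged, "out=AVP")
all_hits = search_tagged(tagged, "out")
print(adverb_hits, all_hits)  # [2] [2, 5]
```

Without the tag both occurrences of out are found; with =AVP only the adverbial one remains.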

Moreover, we can enclose a regular expression in braces: { }. With this option we can search for the different forms of a verb. For instance, if we search for the verb drink, we will find the forms drink, drank and drunk.
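
To see how one regular expression can cover the three forms, here is a minimal sketch using Python's `re` module. The pattern `dr[aiu]nk` is an assumption chosen for illustration; the regular-expression dialect accepted by the BNC search box may differ from Python's.

```python
import re

# Hypothetical pattern: "dr", then one of a/i/u, then "nk".
pattern = re.compile(r"dr[aiu]nk")

tokens = "They drink tea , she drank coffee , he has drunk water and bought drinks".split()

# fullmatch requires the whole token to fit the pattern, so "drinks" is excluded.
forms = [token for token in tokens if pattern.fullmatch(token)]
print(forms)  # ['drink', 'drank', 'drunk']
```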

Furthermore, the BNC allows even more complex queries, but these have to be made with XAIRA (XML Aware Indexing and Retrieval Architecture). XAIRA makes it possible to search on the spelling of a word, among other options. This XML search engine can be installed on our computer from the main page of the British National Corpus (installing XAIRA).
