The British National Corpus

This article is about the British National Corpus. Since I’m sure many people won’t know what a corpus is, I will start with a few lines on corpora in general, and then I will focus on the British National Corpus and try to explain how it works.


What is a corpus?

According to the Oxford Dictionary, a corpus is “a collection of written or spoken material in machine-readable form, assembled for the purpose of linguistic research”.

The plural of “corpus” is usually “corpora”.

What are they used for?

Corpora store language data whose features can be analyzed by means of tagging and concordancing programs, and they help in studying linguistic competence. They are also used for statistical analysis and hypothesis testing, such as checking how often a form occurs or validating linguistic rules within a specific universe of texts.
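To make the idea of concordancing and counting occurrences more concrete, here is a minimal sketch in Python. It is not how any real concordancing program works internally: the tokenizer is deliberately naive, the three sentences are invented for illustration, and the function names `tokenize` and `concordance` are my own.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# A tiny invented "corpus", for illustration only.
sample_text = (
    "The house was old. She bought a house near the river. "
    "Nobody lived in the house next door."
)

tokens = tokenize(sample_text)
frequencies = Counter(tokens)          # how often each word occurs
hits = concordance(tokens, "house")    # every occurrence shown in context

for line in hits:
    print(line)
print("house occurs", frequencies["house"], "times")
```

A real corpus tool does the same two things on a much larger scale: it lists each occurrence of a word with its surrounding context, and it counts occurrences so that claims about usage can be checked against numbers.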

Types of corpora

There are many different types of corpora, and they can contain written language, spoken language, or both. “General Corpora” consist of general texts, that is, texts that do not belong to a single text type, subject field, or register. When the texts are all taken from one dialect or one subject area, we call them “Sublanguage Corpora”.


I have chosen the British National Corpus for my article because I believe it is of great importance. It was the one that most attracted me when I looked through different lists of corpora.

What is the British National Corpus?

According to the British National Corpus’ website itself, “the British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written”.

Most of the collection (90%) is written. It includes extracts from different kinds of documents, such as national newspapers and university essays. The spoken part is important too: it consists of orthographic transcriptions of unscripted informal conversations and of spoken language collected in different contexts.

When was it created?

Work on the corpus started in 1991, and it was finished three years later, in 1994.

What sort of corpus is it?

  1. Monolingual: It covers only one language, British English. This doesn’t mean that foreign words cannot appear.
  2. Synchronic: It covers British English of the late 20th century, not the historical development that produced it.
  3. General: It is not limited to one style, but includes different fields, genres and registers. As mentioned before, it includes not only written language, like some other corpora, but also spoken language.
  4. Sample: Very long texts are sampled down to 45,000 words, which allows a wider coverage of texts within the 100 million word limit.
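The sampling idea in the last point can be sketched in a few lines of Python. This is only an illustration under my own assumptions: I take one contiguous run of words from a random starting point, which is not necessarily the procedure the BNC compilers followed, and the name `sample_words` is hypothetical. Only the 45,000-word cap comes from the description above.

```python
import random

MAX_SAMPLE_WORDS = 45_000  # cap mentioned in the list above


def sample_words(words, max_words=MAX_SAMPLE_WORDS, rng=None):
    """Return `words` unchanged if short enough; otherwise return one
    contiguous run of `max_words` words from a random starting point.
    (The contiguous-run strategy is an assumption for illustration.)"""
    if len(words) <= max_words:
        return words
    rng = rng or random.Random()
    start = rng.randrange(len(words) - max_words + 1)
    return words[start:start + max_words]


# Two invented "texts": one below the cap and one far above it.
short_text = ["word"] * 10_000
long_text = [f"w{i}" for i in range(200_000)]

short_sample = sample_words(short_text)
long_sample = sample_words(long_text, rng=random.Random(0))

print(len(short_sample))  # the short text is kept whole
print(len(long_sample))   # the long text is cut down to the cap
```

Capping each long text in this way is what leaves room for variety: even if every sample reached the full 45,000 words, 100 million words would still hold more than 2,000 different texts.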

A quick look at the British National Corpus website

This is a screenshot of the website (click on the image to enlarge it).

This is how the Corpus works when we want to look up examples of a particular word. First of all, we have to type the word we are looking for in the box labelled “Search the Corpus”. In this case, I decided to use a very simple and common word: house.

Once we’ve written our word, we just have to click on “Go” and we will get up to 50 examples, as seen here:

As I have already shown, it is very easy to look up a word, so you should try it yourself and discover new things about this interesting Corpus!


