Review: British National Corpus

The British National Corpus (BNC) is a monolingual, 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century.

The project was carried out and is managed by the BNC Consortium. The consortium is led by Oxford University Press and includes academic research centres at Oxford University Computing Services (OUCS), the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University, and the British Library's Research and Innovation Centre. Work on building the corpus commenced in 1991 and was completed in 1994. The first general release of the corpus for European researchers was announced in February 1995.

Later on, a phase of tagging improvement was undertaken at Lancaster University with funding from the Engineering and Physical Sciences Research Council. Correction and validation of the bibliographic and contextual information in all the BNC Headers was also carried out for this second version of the corpus, known as the BNC World Edition. BNC World was made available for world-wide distribution in 2001.

In response to user feedback, the original SGML version of the corpus was later converted into XML, and the resulting BNC XML Edition was released in 2007. All versions of the BNC can be used with XAIRA, an open-source application that also works with other corpora or texts in XML format.

What sort of corpus is the BNC?

“The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age, region and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.”

Monolingual: It deals with modern British English, not other languages used in Britain. However, non-British English and foreign-language words do occur in the corpus.

Synchronic: It covers British English of the late twentieth century, rather than the historical development which produced it.

General: It includes many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.

Sample: For written sources, samples of 45,000 words are taken from various parts of single-author texts. Shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million word limit, and avoids over-representing idiosyncratic texts.

How does the corpus work?

You can search for a single word or a phrase, restrict searches by part of speech, search in parts of the corpus only, and much more. To run a search, type a word or phrase in the search box and press the Return key; the service shows up to 50 random hits from the corpus.

In addition to just finding a word or phrase, the Simple Search service can also be used for more complex queries. Use the _ character to match any single word, for example bread _ butter finds bread and butter, bread or butter, bread with butter, etc. Use the = character to restrict searches by part of speech, for example house=VVB finds only verbal uses of house. Use braces { and } to enclose a regular expression, for example {s[iau]ng} finds sing, sang or sung.
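To make the three query operators concrete, here is a minimal Python sketch of how such a query could be matched against part-of-speech tagged text. It is not the Simple Search implementation; the function names (parse_query, find_matches) and the sample tokens are invented for illustration, and the tags other than VVB are assumed only to give the example something to match against.

```python
import re


def parse_query(query):
    """Turn a Simple Search style query into one predicate per query term."""
    predicates = []
    for term in query.split():
        if term == "_":
            # _ matches any single word
            predicates.append(lambda word, tag: True)
        elif term.startswith("{") and term.endswith("}"):
            # {...} encloses a regular expression that must match the whole word
            pattern = re.compile(term[1:-1], re.IGNORECASE)
            predicates.append(
                lambda word, tag, p=pattern: p.fullmatch(word) is not None
            )
        elif "=" in term:
            # word=TAG restricts the word to a single part-of-speech tag
            wanted_word, wanted_tag = term.split("=", 1)
            predicates.append(
                lambda word, tag, w=wanted_word.lower(), t=wanted_tag:
                    word.lower() == w and tag == t
            )
        else:
            # a plain word matches itself, ignoring case
            predicates.append(lambda word, tag, w=term.lower(): word.lower() == w)
    return predicates


def find_matches(query, tagged_tokens):
    """Return every run of consecutive tokens that satisfies the query."""
    predicates = parse_query(query)
    if not predicates:
        return []
    size = len(predicates)
    hits = []
    for start in range(len(tagged_tokens) - size + 1):
        window = tagged_tokens[start:start + size]
        if all(pred(word, tag) for pred, (word, tag) in zip(predicates, window)):
            hits.append(" ".join(word for word, _ in window))
    return hits


# Invented sample data: (word, tag) pairs, not taken from the corpus.
SAMPLE = [
    ("They", "PNP"), ("house", "VVB"), ("refugees", "NN2"),
    ("in", "PRP"), ("the", "AT0"), ("old", "AJ0"), ("house", "NN1"),
    ("bread", "NN1"), ("and", "CJC"), ("butter", "NN1"),
    ("bread", "NN1"), ("with", "PRP"), ("butter", "NN1"),
    ("she", "PNP"), ("sang", "VVD"),
]

print(find_matches("bread _ butter", SAMPLE))
print(find_matches("house=VVB", SAMPLE))
print(find_matches("{s[iau]ng}", SAMPLE))
```

On this sample, the first query picks up both bread and butter and bread with butter, the second finds only the verbal use of house and skips the noun, and the third finds sang. The real service searches the whole corpus and returns a random selection of hits, which this sketch does not attempt.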

As for the search drive _ crazy:

Clicking the blue code of capital letters and numbers next to a hit takes the user to the bibliographic record of the source from which the example was taken.