Third Review: Text Corpuses

INTRODUCTION:

The Corpus diacrónico del español (CORDE) is a textual corpus covering all the periods and places in which the Spanish language has been spoken, from the beginnings of the language up to 1974, where the Corpus de referencia del español actual (CREA) takes over. The CORDE is designed for extracting information to study words and their meanings, as well as grammar and its use over time.

The CORDE dates back to 1994, when the Academy raised the possibility of applying new information technologies to create a data bank that would improve the quality of its working materials and make data access easier. It currently has about 250 million records, the largest set of lexical records in the history of the Spanish language.

The corpus collects written texts of many different kinds. They are divided into prose and verse and, within each, into narrative, lyrical, dramatic, scientific and technical, historical, legal, religious, journalistic and other texts. The aim is to cover all geographical, historical and genre varieties so that the whole is sufficiently representative.

Today, the CORDE is an essential tool for any diachronic study of the Spanish language. The Academy uses it systematically to document words, to classify some of them as old-fashioned or obsolete, and to trace the origin of terms, their tradition in the language, their first attestations and so on.

One of the most important objectives of the diachronic corpus, however, is to serve as the basic material for producing the Nuevo diccionario histórico.

TEXT ACQUISITION:

The texts that reach the CORDE come from several sources:

– Books scanned with an optical character recognition program.
– Other books obtained in electronic format.
– Some texts typed in by hand, because no modern edition existed of certain works that were chosen for inclusion on account of the peculiarity of their language.

SIZE AND SELECTION CRITERIA:

http://www.rae.es/rae/gestores/gespub000019.nsf/(voAnexos)/arch475E744872738671C125716500381CF8/$FILE/TamanoycriteriosCORDE.htm

ENCODING:

All the materials processed for the CORDE have been given a series of textual mark-ups, established according to the international standard SGML (Standard Generalized Markup Language) and the recommendations of the TEI (Text Encoding Initiative). This opens up many possibilities for information retrieval and makes it possible to exchange texts with other corpora.

The diachronic corpus includes texts in verse; for these, a set of tags has been selected that captures the basic features of such texts.

Textual matters such as preliminary compositions, tasas (the official pricing notices of early printed books), censorship, approvals, licenses and the involvement of several authors have been marked with tags that make it possible to distinguish the main author from the other authors involved.

MAINTENANCE AND CURRENT STATE:

The new version of the CORDE contains 250 million forms from texts of all periods in the history of the Spanish language up to 1974. This version increases the volume of texts that can be consulted: new works have been added and others have been completed.

However, this new batch of works entails a great deal of revision, and editions included earlier have to be replaced by more up-to-date ones. Detected errors must also be corrected, which requires constant work.

The query system has three main windows. The first deals with building the query profile: it has a field for typing the word we are looking for, plus some selection criteria that make it easier to pick out a documentary subset of the corpus dynamically.

EXAMPLE WITH THE WORD “NACIÓN”:

The results page offers statistical information about the query and the option of applying filters to reduce the number of documents and examples, in case the number of documents exceeds the limits or is excessive for the user's purposes. As an example, I looked up the word “nación”. The first thing it reports is “13097 casos en 1867 documentos” (13097 cases in 1867 documents).

If you click on “Ver Estadística”, some basic statistical data about the query appear in a general overview that is very useful for identifying the scope, thematic areas and chronological distribution of the examples. Charts show the number of cases and their percentage of the total, classified by subject, period or region.

As we can see, the term “nación” appears most often in documents classified as “historical prose”. The largest number of cases falls in the year 1820 (9502 cases), and most of the texts are from Spain.
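To put the quoted figures in proportion, here is a minimal Python sketch that works out the average number of cases per document and the share of cases dated 1820. The numbers are the ones reported above; the variable and function names are my own and are not part of the CORDE interface.

```python
# Figures quoted in the text for the CORDE query "nación".
total_cases = 13097       # "13097 casos en 1867 documentos"
total_documents = 1867
cases_in_1820 = 9502      # cases dated 1820


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100 * part / whole


cases_per_document = total_cases / total_documents
share_1820 = share(cases_in_1820, total_cases)

print(f"Average cases per document: {cases_per_document:.1f}")
print(f"Share of cases from 1820: {share_1820:.1f}%")
```

On these figures, each document contains about seven cases on average, and a little under three quarters of all cases come from 1820.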

This makes a lot of sense, mainly for these reasons:

  1. The work from which most of the examples come is “Satiras y panfletos del Trienio Constitucional (1820-1823)”.
  2. The “Trienio Liberal” or “Trienio Constitucional” took place during those three years.
  3. It was the reign of Fernando VII, “El Deseado”.
  4. On 1 January 1820, the “pronunciamiento” of Colonel Rafael del Riego took place in the Sevillian town of Las Cabezas de San Juan.
  5. Although he had little success at the beginning, Riego immediately proclaimed the restoration of the Cadiz Constitution (1812, “La Pepa”) and the re-establishment of the constitutional authorities.
  6. Support for the military coup grew stronger, and the uprising lasted until March 10.
  7. On that date, Fernando VII published a manifesto accepting the Cadiz Constitution, which established a parliamentary monarchy.

 Rafael del Riego

This was therefore a date of great importance, and it is no wonder that the word appears so often in documents of that time.

The documents can be viewed in full (normal) or in a summarized version (resumido), depending on the researcher's objectives. If you want more precise results, you can always enter values under “Agrupación” and “Marcas”.

Examples can be classified in several ways: we can sort the occurrences of the word or expression by cases, authors, year, country, subject or title.

If we click on “Recuperar”, in the “Obtención de Ejemplos” section, we see the first page of documents containing the word “nación”. As indicated above the chart, this is only the first of 38 pages of results. The first document is anonymous, dates from 1910 and comes from the Spanish work “Solidaridad Obrera. Periódico sindicalista, 4 de noviembre de 1910”.

If we select some of the results and press the “Concordancias” option, examples of the uses of the word “nación” are shown, together with the work each fragment belongs to and its year.
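The idea behind a concordance view (each occurrence of the word shown with a little context on either side) can be sketched in a few lines of Python. This is only an illustration of the general technique, not the way the CORDE itself works; the function name, the `width` parameter and the sample sentence are all invented for the example.

```python
import re


def concordance(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line for each match of `keyword`."""
    lines = []
    # Whole-word, case-insensitive match; the keyword is escaped so that it
    # is treated as plain text rather than as a regular expression.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Invented sample sentence, not taken from the CORDE.
sample = "La nación celebra la Constitución y toda la nación espera la paz."
for line in concordance(sample, "nación", width=20):
    print(line)
```

Each printed line shows one occurrence in square brackets with up to 20 characters of context on each side, which is roughly what the “Concordancias” page offers along with the reference and year.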

If we take another corpus for comparison, such as the British National Corpus (BNC), we see that it differs from the CORDE in some respects. I find more disadvantages in the BNC than in the CORDE.

ABOUT THE BNC:

Firstly, it shows no statistical charts, which are very useful for viewing the search term as a whole. Secondly, the BNC presents the results at random, in no particular order, which makes research more complicated and less precise.

The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century.

The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text.

The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age groups, regions and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.

PURPOSES OF THE BNC

The purpose of a language corpus is to provide language workers with evidence of how language is really used, evidence that can then be used to inform and substantiate individual theories about what words might or should mean. Traditional grammars and dictionaries tell us what a word ought to mean, but only experience can tell us what a word is used to mean. This is why dictionary publishers, grammar writers, language teachers, and developers of natural language processing software alike have been turning to corpus evidence as a means of extending and organizing that experience.


SELECTION CRITERIA

Domain: The domain of a text indicates the kind of writing it contains.

• 75% of the written texts were to be chosen from informative writing, with roughly equal quantities from the fields of applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science and world affairs.

• 25% of the written texts were to be imaginative, that is, literary and creative works.

Medium: The medium of a text indicates the kind of publication in which it occurs. The classification used is quite broad.

• 60% of written texts were to be books.

• 25% were to be periodicals (newspapers, etc.).

• Between 5 and 10% should come from other kinds of miscellaneous published material (brochures, advertising leaflets, etc.).

• Between 5 and 10% should come from unpublished written material such as personal letters and diaries, essays and memoranda, etc.

• A small amount (less than 5%) should come from material written to be spoken (for example, political speeches, play texts, broadcast scripts, etc.).
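As a rough check on these design targets, the short Python sketch below works out what “roughly equal quantities” means for each informative field and confirms that the medium ranges can add up to 100%. The percentages are those listed above; treating “less than 5%” as a range from 0 to 5 is my own assumption, as are all the names used.

```python
# Domain targets quoted in the text (percentages of written texts).
INFORMATIVE_PERCENT = 75
INFORMATIVE_FIELDS = [
    "applied sciences", "arts", "belief & thought", "commerce & finance",
    "leisure", "natural & pure science", "social science", "world affairs",
]

# Medium targets as (minimum, maximum) percentages. "Less than 5%" is
# modelled as the range 0-5, which is an assumption made for this sketch.
MEDIUM_TARGETS = {
    "books": (60, 60),
    "periodicals": (25, 25),
    "miscellaneous published": (5, 10),
    "unpublished": (5, 10),
    "written to be spoken": (0, 5),
}

# An equal split of the informative share across the eight fields.
per_field = INFORMATIVE_PERCENT / len(INFORMATIVE_FIELDS)

# Smallest and largest totals that the medium ranges allow.
lowest_total = sum(low for low, _ in MEDIUM_TARGETS.values())
highest_total = sum(high for _, high in MEDIUM_TARGETS.values())

print(f"Each informative field: about {per_field:.1f}% of written texts")
print(f"Medium targets add up to between {lowest_total}% and {highest_total}%")
```

Since 100% lies inside the range the medium targets allow, the figures are consistent with each other, and each informative field works out at a little over 9% of the written texts.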

LOOKING FOR EXAMPLES IN THE BNC

The corpus returns a random selection of 50 solutions from all the results for “nation”. Unlike the CORDE, it does not show any statistical charts and it does not give the option to specify authors or dates. You just enter a word or phrase.
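To show why a random selection gives results in no useful order, here is a small Python sketch of drawing 50 hits at random from a larger list. It is only a guess at the general behaviour, not the BNC's actual implementation; the function name, the `limit` and `seed` parameters and the placeholder hits are hypothetical.

```python
import random
from typing import List, Optional


def sample_solutions(hits: List[str], limit: int = 50,
                     seed: Optional[int] = None) -> List[str]:
    """Return at most `limit` hits, chosen at random without repetition."""
    if len(hits) <= limit:
        # Few enough hits to show them all, in their original order.
        return list(hits)
    rng = random.Random(seed)
    return rng.sample(hits, limit)


# Placeholder hits standing in for real corpus lines.
all_hits = [f"hit {number}" for number in range(1, 1001)]
shown = sample_solutions(all_hits, limit=50, seed=1)
print(len(shown), "of", len(all_hits), "hits shown")
```

Because the 50 lines are picked at random, nothing guarantees that they are spread evenly over dates or authors, which is why chronological conclusions are hard to draw from them.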

Searching the corpus

CONCLUSION

I did not find any relevant information about the term “nation” in the BNC, because the results are shown at random and are not organized chronologically. The first result, for instance, was from the book “The Tragedy of Belief” by John Fulton, about which I found nothing relevant beyond the fact that it is a 1991 text about Irish politics. By contrast, the CORDE allowed me to do fairly thorough research on the term “nación” and showed me why the results for the term were so abundant in the year 1820.


2 thoughts on “Third Review: Text Corpuses”

  1. Joseba Abaitua June 15, 2010 / 10:42 am

    Your analysis of the word “nación” was very interesting. Thank you!
