Deusto Reviewer on Language Resources

June 3, 2010

Third Review: Text Corpuses

Filed under: Reference corpus, Text corpus | Esti Vivanco @ 11:14 pm

INTRODUCTION:

The Corpus diacrónico del español (CORDE) is a textual corpus covering every period and place in which Spanish has been spoken, from the beginnings of the language up to 1974, the point at which the Corpus de referencia del español actual (CREA) takes over. The CORDE is designed as a source of information for studying words and their meanings, as well as grammar and its use over time.

The CORDE dates back to 1994, when the Academy (the Real Academia Española) raised the possibility of applying new information technologies to create a data bank that would improve the quality of its working materials and make data access easier. It currently holds about 250 million records, the largest set of lexical records on the history of the Spanish language.

The corpus collects written texts of very different kinds. These are divided into prose and poetry and, within each mode, into narrative, lyrical, dramatic, scientific-technical, historical, legal, religious, journalistic and so on. The aim is to cover all geographical, historical and genre varieties so that the whole is representative enough.

Today, CORDE is a necessary tool for any diachronic study of the Spanish language. The Academy uses the CORDE systematically to document words, to classify some of them as old-fashioned or obsolete, and to establish the origin of terms, their tradition in the language, the first attestation of words…

But one of the most important objectives of the diachronic corpus is to serve as a basic material for the production of the Nuevo diccionario histórico.

TEXT ACQUISITION:

The texts that reach the CORDE come from diverse sources:

- Books which are scanned with optical character recognition software.
- Other books obtained in electronic format.
- Some texts typed in by hand, because there was no modern edition of certain pieces that were selected for inclusion on account of the peculiarity of their language.

SIZE AND SELECTION CRITERIA:

http://www.rae.es/rae/gestores/gespub000019.nsf/(voAnexos)/arch475E744872738671C125716500381CF8/$FILE/TamanoycriteriosCORDE.htm

ENCODING:

All the materials processed in the CORDE carry textual mark-up that follows the international standard SGML (Standard Generalized Markup Language) and the recommendations of the TEI (Text Encoding Initiative). This mark-up opens up many possibilities for information retrieval and makes it possible to exchange texts with other corpora.

The diachronic corpus includes texts in verse; for these, a set of tags has been selected that captures the basic features of such texts.

Textual issues such as preliminary compositions, official price statements (tasas), censorship notices, approvals, licences and the involvement of several authors have been marked with tags that make it possible to distinguish the main author from the other contributors.

MAINTENANCE AND CURRENT STATE:

The new version of CORDE contains 250 million forms belonging to texts from all periods of the history of the Spanish language until 1974. This new version increases the volume of texts that can be consulted. New works have been included and some others have been completed.

However, this new batch of works requires a great deal of revision, and editions included earlier have to be replaced with more up-to-date ones. Detected errors must also be corrected, which requires constant work.

The query system has three main windows. The first is for building the query profile: it has a field for typing the word we are looking for, plus some selection criteria that make it easier to pick out a documentary subset of the corpus dynamically.

EXAMPLE WITH THE WORD “NACIÓN”:

The results page offers statistical information about the query and lets you apply filters that reduce the number of documents and examples, in case the results exceed the system's limits or are too many for the user's purposes. As an example, I have looked up the word “nación”. The first thing it reports is “13097 casos en 1867 documentos” (13,097 cases in 1,867 documents).

If you click on “Ver Estadística”, some basic statistical data about the query appear in a general view that is very useful for seeing the scope, the subject areas and the chronological distribution of the examples. Charts show the number of cases and their percentages, classified by subject, period or country.

As we can see, the term “nación” appears most often in documents of “historical prose”. Most of the cases of “nación” are from the year 1820 (9502 cases) and most of the texts are from Spain.

This makes a lot of sense, for the following reasons:

  1. Most of the examples come from the work “Sátiras y panfletos del Trienio Constitucional (1820-1823)”.
  2. The “Trienio Liberal” or “Trienio Constitucional” began at that date and lasted those three years.
  3. It was the reign of Fernando VII, “El Deseado”.
  4. On 1 January 1820, the “pronunciamiento” of Colonel Rafael del Riego took place in the Sevillian town of Las Cabezas de San Juan.
  5. Although he had little success at the beginning, Riego immediately proclaimed the restoration of the Cádiz Constitution (1812, “La Pepa”) and the re-establishment of constitutional authorities.
  6. Support for the military coup grew stronger and kept the uprising going until March 10.
  7. On that date, Fernando VII published a manifesto accepting the Cádiz Constitution, which established a parliamentary monarchy.

 Rafael del Riego

Therefore, 1820 was a year of great importance, and it is no wonder that the word appears so often in documents of that time.
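To put the figures quoted for “nación” into proportion, here is a minimal Python sketch. It only uses the numbers reported in this review; the page size of 50 documents is my own assumption, chosen because it would be consistent with 38 pages of results, and it is not something the CORDE interface states.

```python
import math

# Figures quoted in the review for the query "nación" in CORDE.
total_cases = 13097      # "13097 casos"
total_documents = 1867   # "en 1867 documentos"
cases_in_1820 = 9502     # cases dated 1820
result_pages = 38        # pages of results reported by the interface

# Share of all cases that fall in 1820, and average cases per document.
share_1820 = 100 * cases_in_1820 / total_cases
cases_per_document = total_cases / total_documents

# Assumption (not stated in the review): a fixed page size of 50 documents.
assumed_page_size = 50
expected_pages = math.ceil(total_documents / assumed_page_size)

print(f"Share of cases from 1820: {share_1820:.1f}%")
print(f"Average cases per document: {cases_per_document:.2f}")
print(f"Pages at {assumed_page_size} documents per page: {expected_pages}")
```

In other words, almost three quarters of all the cases of “nación” come from a single year, which is why the historical context of 1820 matters so much for reading these results.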

The documents can be displayed in full (normal) or in a summarized version (resumido), depending on the objectives of the researcher. If you want the results to be more precise, you can always enter data in “Agrupación” and “Marcas”.

The examples can be sorted in several ways. Thus, we can order the results for a word or expression by cases, authors, year, country, subject or title.

If we click on “Recuperar”, in the section “Obtención de Ejemplos”, we will see the first page of documents containing the word “nación”. As indicated above the table, this is only the first of 38 pages of results. The first document is anonymous and dates from 1910: the Spanish publication “Solidaridad Obrera. Periódico sindicalista, 4 de noviembre de 1910”.

If we select some of the results above and choose the option “Concordancias”, some examples of the uses of the word “nación” will show up, each with the reference of the work that the fragment belongs to and the year.
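A concordance shows each occurrence of the search word together with a little text on either side. The following Python sketch illustrates the idea only; it is not how the CORDE builds its concordances. The function name `kwic`, the `width` parameter and the two sample sentences are made up for illustration and are not CORDE output.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every whole-word match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Sample sentences for illustration only (not taken from CORDE).
sample = (
    "La nación española es libre e independiente. "
    "La soberanía reside esencialmente en la Nación."
)

for line in kwic(sample, "nación"):
    print(line)
```

A real corpus tool would also attach the title and year of the source to each line, as the CORDE does.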

If we take another corpus for comparison, for example the British National Corpus (BNC), we will see that it is very different from the CORDE in some respects. I find more disadvantages in the BNC than in the CORDE.

ABOUT THE BNC:

First, the BNC query interface shows no statistical charts, which are very useful for seeing the search term as a whole. Second, it shows the results at random and in no particular order, which makes research more complicated and less accurate.

The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written.

The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text.

The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age, region and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.

PURPOSES OF THE BNC

The purpose of a language corpus is to provide language workers with evidence of how language is really used, evidence that can then be used to inform and substantiate individual theories about what words might or should mean. Traditional grammars and dictionaries tell us what a word ought to mean, but only experience can tell us what a word is used to mean. This is why dictionary publishers, grammar writers, language teachers, and developers of natural language processing software alike have been turning to corpus evidence as a means of extending and organizing that experience.


SELECTION CRITERIA

Domain: The domain of a text indicates the kind of writing it contains.

• 75% of the written texts were to be chosen from informative writings, with roughly equal quantities from the fields of applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science and world affairs.

• 25% of the written texts were to be imaginative, that is, literary and creative works.

Medium: The medium of a text indicates the kind of publication in which it occurs. The classification used is quite broad.

• 60% of written texts were to be books.

• 25% were to be periodicals (newspapers etc.).

• Between 5 and 10% should come from other kinds of miscellaneous published material (brochures, advertising leaflets, etc.).

• Between 5 and 10% should come from unpublished written material such as personal letters and diaries, essays and memoranda, etc.

• A small amount (less than 5%) should come from material written to be spoken (for example, political speeches, play texts, broadcast scripts, etc.).
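As a rough illustration of what these design targets mean in absolute terms, the sketch below turns the percentages into word counts. It assumes that the percentages apply to the number of words, which the criteria above do not say explicitly (they speak of “texts”), so the results are only orders of magnitude and not the actual composition of the BNC. The helper `share` is a hypothetical name.

```python
TOTAL_WORDS = 100_000_000  # "a 100 million word collection"

def share(total, percent):
    """Integer share of a total for a whole-number percentage."""
    return total * percent // 100

# Written and spoken parts, as described in the BNC overview (90% / 10%).
written = share(TOTAL_WORDS, 90)
spoken = share(TOTAL_WORDS, 10)

# Design targets for the written part (assumed here to be word-based).
domain_targets = {
    "informative": share(written, 75),
    "imaginative": share(written, 25),
}
medium_targets = {
    "books": share(written, 60),
    "periodicals": share(written, 25),
}

print(f"written: {written:,} words, spoken: {spoken:,} words")
for name, words in {**domain_targets, **medium_targets}.items():
    print(f"{name}: about {words:,} words")
```

The three remaining medium categories are given only as ranges (5 to 10%, or less than 5%), so they are left out of the calculation.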

LOOKING FOR EXAMPLES IN THE BNC

The corpus gives a random selection of 50 solutions from all the results for “nation”. Unlike the CORDE, it does not show any statistical charts and it does not give the option to specify authors or dates. You just enter a word or phrase.

Searching the corpus

CONCLUSION

I did not find any relevant information about the term “nation” in the BNC, because the results are shown at random and are not organized chronologically. For instance, the first result was from the book “The Tragedy of Belief”, by John Fulton, about which I could find out little apart from the fact that it is a text about Irish politics from the year 1991. The CORDE, in contrast, allowed me to do quite thorough research on the term “nación” and let me find out why the results for the term were so abundant in the year 1820.



Third Review: Machine Translator

Filed under: Uncategorized Language Resources | Lorena Sainz-Maza @ 7:54 pm

In the following review, I have translated four texts with the MT system Google Translate. The main purpose of this review is to show the wide variety of mistakes that an MT system can make.

For each text, I first give the original, then the version produced by the MT, and then my own translation, which aims to be an accurate rendering of the original. Finally, I analyse some of the main errors that the MT has produced in each translation. [The first version of this review uses colours to make the MT's main mistakes easier to tell apart.]

FIRST TEXT 

  • Original text in English

Volcanic ash: Europe flights grounded for third day 


Virtually all of Europe’s major airports remain closed as a huge plume of volcanic ash drifts south and east across the continent from Iceland.
Millions of air travellers are stranded as thousands of flights are being cancelled for a third day.
The disruption from the spread of ash would continue into Sunday, European aviation agency Eurocontrol said.
Airlines are losing some £130m ($200m) a day in an unprecedented shutdown of commercial air travel.
“Forecasts suggest that the cloud of volcanic ash will persist and that the impact will continue for at least the next 24 hours,” a statement from Eurocontrol said at around 0830 GMT. 

  • Google translator version

La ceniza volcánica: vuelos de Europa a tierra por tercer día.
Prácticamente todos los principales aeropuertos de Europa permanecen cerrados como enorme columna de cenizas volcánicas se desplaza al sur y al este a través del continente de Islandia.
Millones de viajeros aéreos están varados en miles de vuelos han sido cancelados por tercer día consecutivo.
La interrupción de la expansión de cenizas continuará hasta el domingo, Eurocontrol Agencia Europea de la aviación, dijo.
Las aerolíneas están perdiendo unos £ 130 millones (200 millones de dólares) al día en un cierre sin precedentes del transporte aéreo comercial.
“Las previsiones indican que la nube de ceniza volcánica se mantendrá y que el impacto continuará por lo menos durante las próximas 24 horas”, dijo un comunicado de Eurocontrol en torno a las 0830 GMT. 

  • Human translation

Ceniza volcánica: se para el servicio aéreo europeo por tercer día consecutivo.
Virtualmente todos los principales aeropuertos europeos permanecen cerrados a medida que una enorme columna de ceniza volcánica se mueve empujada por el viento a través del continente desde Islandia.
Millones de viajeros aéreos se han quedado tirados ya que miles de vuelos están siendo cancelados por tercer día consecutivo.
La interrupción por la expansión de ceniza continuaría hasta el domingo, ha informado Eurocontrol, la agencia europea de aviación.
Las líneas aéreas están perdiendo unos 130 millones de libras (200 millones de dólares) al día en una paralización sin precedentes del transporte comercial aéreo.
“Las previsiones sugieren que la nube de ceniza volcánica persistirá y que el impacto continuará durante al menos las próximas 24 horas”, ha declarado un comunicado de Eurocontrol alrededor de las 08.30 GMT. 

  • MAIN MISTAKES OF THE MACHINE TRANSLATOR

Conjunctions: The conjunction “as” appears twice in the text and the MT has not translated it correctly in either case. In the first case, the MT has given it a comparative meaning (como) when it should be causal; in the second, it has been treated as a preposition (en) when, once again, it should be translated with a causal meaning (ya que, puesto que; in the first case, a medida que would also be possible).

Prepositions: The preposition “from” has been translated as “de la”, which makes little sense. The sentence means something like “the disruption due to the spread of ash…” and, therefore, it could be translated with a causal meaning or using the preposition “por” (la interrupción por la expansión de la ceniza…).

Meaning of words: In some cases, the MT has misunderstood the meaning of words such as “grounded”, “drifts” or “suggest”, which, from my point of view, should have more accurate translations. Moreover, the verb “said” is translated literally every time, and I think that in Spanish this does not sound as good as in English.

Verb tenses: In general, tenses are well used by the MT. However, there is a clear mistake with the passive form “are being cancelled”, which has been translated as “han sido cancelados”. Although it does not sound bad in Spanish, the correct form would have been “están siendo cancelados”.

Indefinite article: The MT has omitted the indefinite article in “a huge plume of…”. It should be pointed out that, in Spanish, indefinite articles are as important as in English and, thus, they should not be left out.

Sentence order: In general terms, the text can be well understood. Yet, there are some constructions that sound slightly awkward. For instance, “transporte aéreo comercial” sounds a lot better to me if the order is changed to “transporte comercial aéreo”. Furthermore, “European aviation agency Eurocontrol” has been translated with the right words, but I think that it would sound better as an apposition (Eurocontrol, la agencia europea de aviación).

 

SECOND TEXT 

  • Original text in  English

Cheryl Cole Wants Dignified Divorce 


Cheryl Cole is planning to “divorce with dignity”.
The ‘Fight for This Love’ singer - who split from her soccer star spouse Ashley Cole in February following allegations he had cheated on her with several women – is reportedly planning to take control of legal proceedings in order to end her marriage amicably.
“She is speaking to Ashley, they are texting - she even calls him babe – but as much as she loves him, there is no way back. This is divorce with dignity, it is Cheryl’s style.”

  • Google translator version

Cheryl Cole quiere divorcio digno
Cheryl Cole es la planificación de “divorcio con dignidad”.
La “lucha por la cantante This Love” - que se separó de su esposa la estrella de fútbol Ashley Cole de febrero después de las denuncias que había engañado con varias mujeres – está planeando tomar el control de los procedimientos legales para poner fin a su matrimonio de manera amistosa.
“Ella está hablando con Ashley, son mensajes de texto - que incluso le llama bebé – pero tanto como ella lo ama, no hay vuelta atrás. Este es el divorcio con dignidad, que es el estilo de Cheryl. 

  • Human translation

Cheryl Cole quiere un divorcio digno
Cheryl Cole está planeando “divorciarse con dignidad”.
La cantante de “Fight for this love” -que se separó de su cónyuge la estrella de fútbol Ashley Cole en febrero siguiendo las alegaciones de que le había sido infiel con varias mujeres- está, según se informa, planeando tomar el control de los procedimientos legales para poner fin a su matrimonio de manera amistosa.
“Está hablando con Ashley, se mandan mensajes -ella incluso le llama babe- pero por mucho que ella le quiere, no hay vuelta atrás. Esto es un divorcio con dignidad, es el estilo de Cheryl”. 

  • MAIN MISTAKES OF THE MACHINE TRANSLATOR

Relative clauses: For no apparent reason, the MT tends to construct relative clauses even though there is no relative clause in the original version. For instance, “she even calls him babe” has been translated as “que incluso le llama bebé” and “…with dignity, it is Cheryl’s style” as “que es el estilo de Cheryl”.

Articles and pronouns: In this second translation, the MT has not identified the need for an indefinite article in the Spanish version and, therefore, it has translated the title of the text as “Cheryl Cole quiere divorcio digno”, which sounds rather “naked” in Spanish, since a native Spanish speaker would have the intuition that the indefinite article “un” is missing. What is more, in the last sentence, “This is divorce with dignity”, the MT has spontaneously added the definite article “el” and translated the demonstrative pronoun as “este”. Personally, I cannot understand the reason for constructing the sentence that way.

Prepositions: The MT has made the mistake of translating the preposition “in” as “de”, which is rather strange, because in most cases the English preposition that corresponds to “de” is “of”.

Meaning of words: In general, the MT has made good word choices when translating. Yet, in the case of “spouse”, it has used the word “esposa”, which is wrong because it does not agree in gender with the person it refers to (Ashley Cole, a man); a better translation would be “cónyuge”.

Verb tenses: Sometimes, the MT struggles to translate the present continuous of some verbs. The verb “is planning” has been translated as a noun, “planificación”, and the same has happened with “are texting”, translated as “mensajes de texto”. However, the MT has been able to recognize the present continuous in “is speaking”.

Sentence construction: The MT is not capable of understanding some structures, for example “but as much as” or “following allegations”, maybe because it does not recognize the lexicon or simply the grammatical structure. Besides, titles of songs do not need to be translated; otherwise the title can end up mixed with the rest of the text so that nothing can be understood (The ‘Fight for This Love’ singer > La “lucha por la cantante This Love”).

 

THIRD TEXT 

  • Original text in English

Nickelback-How you remind me 

Never made it as a wise man
I couldn’t cut it as a poor man stealing
Tired of living like a blind man
I’m sick of seeing without a sense of feeling
and this is how you remind me
This is how you remind me of what I really am
This is how you remind me of what I really am 

(chorus)
It’s not like you to say sorry
I was waiting on a different story
this time I’m mistaken
for handing you a heart worth breaking
and I’ve been wrong
I’ve been down
into the bottom of every bottle
these five words in my head
scream are we having fun yet 

 

  • Google translator version

Nickelback-Como me recuerdas 

Nunca lo hizo como un hombre sabio
No podía cortarlo como un hombre pobre robando
Cansado de vivir como un ciego
Estoy harto de ver sin un sentimiento de
y así es como me recuerdas
Así es como me recuerdas lo que realmente soy
Así es como me recuerdas lo que realmente soy 

(Estribillo)
No es como que digas lo siento
Estaba esperando en una historia diferente
esta vez estoy equivocada
para entregar un corazón digno de romperse
y me he equivocado
He estado en
en la parte inferior
de cada botella
estas cinco palabras en mi cabeza
grito se nos divertimos aún 

  • Human translation

Nunca actué como un hombre sabio
No pude detenerme, cual mendigo que necesita robar
Cansado de vivir como un hombre ciego
Estoy harto de ver sin poder sentir
Y es así como me recuerdas
Es así como me recuerdas lo que realmente soy
Es así como me recuerdas lo que realmente soy

(Estribillo)
No es que tengas que disculparte
Soy yo quien estaba esperando una historia diferente
Esta vez estoy avergonzado
Por encargarte un corazón que merece la pena romper
Y estuve equivocado
Estuve deprimido
En el fondo de cada botella
Estas cinco palabras en mi cabeza
Gritan: ¿Nos estamos divirtiendo juntos todavía? 

  • MAIN MISTAKES OF THE MACHINE TRANSLATOR

In general, more mistakes can be seen in this third text, probably because it is a song, which requires more precision since subjects are sometimes omitted and so forth.

Agreement of person, number and gender: In the first line of the song, the MT has not been able to recognize the person of the verb and has used the 3rd person singular (hizo), although the context shows that the 1st person singular is needed. Moreover, the line “this time I’m mistaken” has been translated in the feminine (equivocada) when it should have been “equivocado”. Besides, in the last line, the verb “scream” has a plural subject (these five words), so it should have been translated in the 3rd person plural (gritan).

Meaning of words & expressions: The meaning of some words has been misunderstood by the MT. For instance, “cut” translated as “cortar” (detener), “down” translated as “en” (deprimido) or “into the bottom” translated as “en la parte inferior” (en el fondo).

Prepositions: The MT has made a mistake with the preposition “for”, which has been translated as “para”. It is true that “for” can mean “para” but, in this case, it has a causal meaning and, thus, it is more appropriate to translate it as “por”.

Sentence construction: Some sentences have been translated incorrectly, maybe because the MT did not recognize the lexicon or the grammatical structure. For instance: “It’s not like you to say sorry” > “No es como que digas lo siento”; “…a heart worth breaking” > “un corazón digno de romperse”.

Questions in direct speech: The MT's translation of the question (“se nos divertimos aún”) cannot be understood at all. The only correct thing in it is the choice of person and number (1st person plural); the rest of the question is nonsense. (Are we having fun yet? > ¿Nos estamos divirtiendo todavía?).

  

FOURTH TEXT 

  • Original text in English

The Lion King 


A young lion prince is born in Africa, thus making his uncle Scar the second in line to the throne. Scar plots with the hyenas to kill King Mufasa and Prince Simba, thus making himself King. The King is killed and Simba is led to believe by Scar that it was his fault, and so flees the kingdom in shame. After years of exile he is persuaded to return home to overthrow the usurper and claim the kingdom as his own thus completing the “Circle of Life“. 

  • Google Translator version

El Rey León
Un león joven príncipe
ha nacido en África, con lo que su tío Scar el segundo en la línea de sucesión al trono. Scar parcelas con las hienas para matar al rey Mufasa y Simba Príncipe, con lo que el propio rey. El rey muere y Simba es llevado a creer por la cicatriz que era culpa suya, por lo que huye del reino de vergüenza. Después de años de exilio, está convencido de regresar a sus hogares para derrocar al usurpador y la reclamación del reino como su propia completando así el “Circle of Life“. 

  • Human Translation

El Rey León
Un joven príncipe león nace en África, convirtiendo de ese modo a su tío Scar en el segundo en la línea de sucesión al trono. Scar planea con las hienas matar al rey Mufasa y al príncipe Simba, haciendo así que él sea el rey. Scar mata al rey y hace creer a Simba que él es el culpable, así que Simba huye del reino avergonzado. Después de años de exilio, persuaden a Simba para que vuelva a casa con el fin de derrocar al usurpador y reclamar el reino como suyo, completando así “El Círculo de la Vida”.

  • MAIN MISTAKES OF THE MACHINE TRANSLATOR

Order of sentences/Sentence construction: The very first sentence of the text sounds a bit funny: “un león joven príncipe”. Clearly, the order of the adjectives and nouns has been mixed up. Furthermore, the expression “thus making” has been translated incorrectly twice. In both cases, the word “making” has not been translated and, therefore, the sentence is incomplete. Likewise, the name of Simba’s uncle, Scar, has been translated into Spanish in one place (cicatriz), which makes the sentence complete nonsense. In addition, the last expression, “Circle of Life”, although it is between quotation marks, could have been translated into Spanish (Círculo de la vida).

Verb tenses: The MT clearly has some problems when translating the passive form of some verbs; this is the case of “is born” and “is persuaded”. When this happens, the best solution may be to change the order of the sentence and put the verb into the active present simple; that is, instead of saying “Simba es convencido para…”, change it into “Convencen a Simba…”.

Meaning of words: The MT has made a big mistake with the word “plots”, which has been translated as a noun (parcelas) and not as a verb. Likewise, the expression “in shame” could have been translated better by using the adjective instead of the preposition plus the noun (de vergüenza > avergonzado).

Agreement in number: In the sentence “…persuaded to return home”, “home” has been translated in the plural (sus hogares) even though nothing calls for the plural: the sentence is talking about Simba alone and, therefore, it should be singular.

 

SUMMARY OF THE MAIN MACHINE TRANSLATOR MISTAKES 

Above all, lexical errors stand out. The title of the first text alone shows that the word “grounded” is not translated accurately. Another example is the word “stranded”, which has been translated as “están varados” although, in this particular case, it means something like “se han quedado tirados”. In addition, the MT does not take into account that words and expressions can have more than one rendering; that is the case of “in shame” (fourth text), among many others, which has been translated as “de vergüenza” when the translation “avergonzado” probably suits better there. Another important mistake is that the MT tends to translate everything into Spanish although, sometimes, names should remain in their original language; that is the case of “Fight for This Love”, which is the name of a song, and Scar, the name of Simba’s uncle. Likewise, it is worth mentioning that, in the first text, the author repeatedly uses the word “said”, which sounds perfect in an English journalistic text but does not sound that good in Spanish, so the sentences turn out slightly awkward.

Regarding grammar mistakes, the MT fails to recognise some verb forms in three of the texts. For example, in the first text, “would continue” has been translated as “continuará”, which is a future simple. In the second text, to translate “is planning”, the MT has chosen “es la planificación”, which is nonsense because the verb has been translated as a noun. And, finally, in the fourth text, the MT has translated the passive “is born” as “ha nacido”, which is a present perfect.

Moreover, on the syntactic level, word order has been the main problem. That is the case of “European aviation agency Eurocontrol”, translated as “Eurocontrol Agencia Europea de la aviación”, which in Spanish would be better expressed as an apposition: “Eurocontrol, la agencia europea de aviación”. Furthermore, in the fourth text, the MT has omitted the verb “making” when translating the sentence into Spanish (thus making his uncle Scar the second in line to the throne > con lo que su tío Scar el segundo en la línea de sucesión al trono). Besides, there have been some errors when translating prepositions into Spanish, such as “in February” translated as “de febrero”. Likewise, some mistakes with regard to gender and number can be seen (spouse > esposa, when in any case it should have been esposo; return home > a sus hogares, which should be translated as a su hogar). It is also important to mention that the MT has introduced the relative pronoun “que” when it was not necessary, for instance, “This is divorce with dignity, it is Cheryl’s style” translated as “Este es el divorcio con dignidad, que es el estilo de Cheryl”.

Finally, from a stylistic point of view, “£130m” has been translated as “£ 130 millones”, and it would have been more natural to spell out the unit, that is, “130 millones de libras”.
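Some of the differences between the MT output and the human translation can also be located automatically. The sketch below is only an illustration and not the method used in this review, which was done by hand: it uses Python's standard `difflib` module to list the word-level differences between two versions of a sentence. The function name `word_diff` is hypothetical.

```python
from difflib import SequenceMatcher

def word_diff(mt, human):
    """List the word-level differences between an MT output and a human translation."""
    mt_words, human_words = mt.split(), human.split()
    matcher = SequenceMatcher(a=mt_words, b=human_words, autojunk=False)
    # Keep only the spans that differ: (operation, MT words, human words).
    return [
        (tag, mt_words[i1:i2], human_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Title of the second text: MT version versus human translation.
differences = word_diff(
    "Cheryl Cole quiere divorcio digno",
    "Cheryl Cole quiere un divorcio digno",
)
print(differences)
```

Here the only difference reported is the insertion of “un”, the missing indefinite article discussed above. A diff like this shows where two versions differ, but it cannot say which one is right or why, so it does not replace the human analysis.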


June 1, 2010

Review: MT – Google Translate

Filed under: Uncategorized Language Resources | Jennifer Isasi @ 9:41 am

Google Translate is a service provided by Google Inc. to translate a section of text, or a webpage, into another language. The service limits the number of paragraphs, or range of technical terms, that will be translated. It is important to mention that Google has managed to develop and use its own translation software, unlike other translation services such as Babel Fish, AOL, and Yahoo, which use SYSTRAN.

In order to acquire its huge amount of linguistic data, Google used United Nations documents, and so Google now has a 6-language corpus of around 20 billion words’ worth of human translations.

How does it work?

First, we have to choose the languages in which we want to work:

The next step is to paste the text we want to translate into the window.

The MT then does its job automatically, and the translated version of the pasted text appears below.

As you can see, the tool also offers the possibility of listening to the translation.

However, Google Translate, like other automatic translation tools, has its limitations. While it can help the reader to understand the general content of a foreign language text, it does not always deliver accurate translations. Showing this is the main goal of this review. In three short translations, from English to Spanish, I will check some of the mistakes made by the MT that can only be corrected with human supervision.

FIRST TEXT

Original text from Vogue UK:

VICTORIA BECKHAM has landed herself yet another Vogue cover.

The fashion designer and former Spice Girl features on the front of the May issue of German Vogue which, as it’s a denim special, comes in denim packaging.

In the magazine, Victoria talks about her ideal pair of jeans: “The legs should be cut super long, with narrow knee. There is nothing more unattractive than to feel bruised on the waist.”

Inside, meanwhile, Posh strikes a pose dressed in a lace blouse and perched on a playground duck.

Beckham appeared on the cover of British Vogue back in April 2008, followed by Vogue India the same year and Vogue Russia in February 2009.

Google Translator:

Victoria Beckham se ha volvió a ser portada de Vogue.

La diseñadora de moda y ex Spice Girl características en el frente de la edición de mayo de la revista Vogue alemana que, como es un especial de mezclilla, viene en envases de mezclilla.

En la revista, Victoria habla sobre su pareja ideal de jeans: “Las piernas deben ser cortados super largo, con la rodilla estrecha. No hay nada más atractivo que sentirse golpeado en la cintura.”

En el interior, por su parte, las huelgas una pose elegante vestido con una blusa de encaje y se posó en un pato patio de recreo.

Beckham apareció en la portada del Vogue británico en abril de 2008, seguida por la India Vogue el mismo año y Rusia Vogue en febrero de 2009.

Human Translation:

Victoria Beckham ha vuelto a ser portada de Vogue.

La diseñadora de moda y ex Spice Girl aparece en la portada del número de mayo de la Vogue alemana, la cual, como es un especial de jeans, viene en un paquete de denim.

En la revista, Victoria habla sobre sus vaqueros ideales: “Las piernas deben ser super largas, con la rodilla estrecha. No hay nada menos atractivo que sentir magulladuras en la cintura.”

En el interior, entretanto, Posh posa vestida con una blusa de encaje y sentada sobre un pato de patio de colegio.

Beckham apareció en la portada de la Vogue inglesa en abril de 2008, seguida por la Vogue india el mismo año y la Vogue rusa en febrero de 2009.

SECOND TEXT

Original text from BBC news:

A princess hoping to break a world record and a man dressed as the Angel of the North are among those running this year’s London Marathon.

Up to 36,000 people are taking on the world famous 26.2-mile course through the city.

Princess Beatrice, 21, hopes to become the first royal to complete the route.

She has joined a “human caterpillar” of 34 runners aiming to beat the world record for the most people to finish a marathon while tied together.

They are tied together two by two with bungee cords.

The princess, who is running for Children in Crisis, said she was “very, very excited”.

The Icelandic ash cloud had threatened to disrupt the race and many of the elite athletes only arrived after boarding a specially chartered flight from Madrid on Thursday.

Google Translator:

Una princesa con la esperanza de romper un récord mundial y un hombre vestido como el Ángel del Norte se encuentran entre los que dirigen este año maratón de Londres.

Hasta 36.000 personas están tomando en el mundo famoso campo de 26,2 kilómetros a través de la ciudad.
Princesa Beatriz, de 21 años, espera convertirse en la primera real para completar la ruta.
Ella se ha unido a una oruga “humana” de los 34 corredores con el objetivo de batir el récord mundial para la mayoría de la gente para terminar una maratón mientras atadas.
Ellos están unidos de dos en dos con cuerdas elásticas.
La princesa, que se está ejecutando para Niños en Crisis, dijo que estaba “muy, muy emocionado”.
La nube de cenizas islandesa había amenazado con interrumpir la carrera y muchos de los atletas de élite sólo llegó después de abordar un vuelo especial fletado desde Madrid, el jueves.

Human Translation:

Una princesa con la esperanza de romper un récord mundial y un hombre vestido como el Ángel del Norte están entre los que correrán este año el Maratón de Londres.

Hasta 36.000 personas van a tomar parte en la mundialmente famosa carrera de 26,2 millas a través de la ciudad.

La princesa Beatrice, de 21 años, espera convertirse en la primera persona de la realeza en completar la ruta.

Se ha unido a la “cadena humana” de 34 corredores que tratan de batir el récord mundial del mayor número de personas que terminan un maratón estando atados.

Están atados de dos en dos con cuerdas elásticas.

La princesa, que corre a favor de Children in Crisis, dijo que estaba “muy, muy emocionada”.

La nube de cenizas islandesa había amenazado con interrumpir la carrera y muchos de los atletas de élite sólo llegaron tras embarcar en un vuelo especial fletado desde Madrid, el jueves.

THIRD TEXT

Original text from Guardian.co.uk:

Red Queen is my daughter, says Bonham Carter

When she is older, two-year-old Nell Burton may view her father’s film of Alice in Wonderland with a particularly curious eye. For her mother, Helena Bonham Carter, last night revealed that she based her tyrannical Red Queen on her and Tim Burton’s young daughter.

While the toddler is presumably not given to ordering the beheadings of those who surround her, Bonham Carter said Nell was the main inspiration for the bossy, spiteful monarch who maintains a reign of terror over Wonderland in Burton’s 3D CGI reimagining.

“I thought: well, she’s a toddler, because she’s got the big head,” Bonham Carter revealed at a press conference ahead of last night’s London premiere. “She’s a tyrant … toddlers are tyrants. The ‘no sympathy for any other living creature’ – that’s our toddler, in fact.

Google Translator:

Reina Roja es mi hija, dice Bonham Carter

Cuando ella es mayor, de dos años de edad, Nell Burton puede ver las películas de su padre de Alice in Wonderland, con un ojo particularmente curioso. Para su madre, Helena Bonham Carter, anoche reveló que basó su tiránica Reina Roja en ella y la joven hija de Tim Burton.
Aunque el niño probablemente no es dado a ordenar la decapitación de quienes le rodean, Bonham Carter dijo Nell fue la principal inspiración para el monarca autoritario, rencoroso que mantiene un reinado de terror sobre las Maravillas de Burton en 3D CGI reinvención.

“Yo pensé: bueno, es un niño pequeño, porque tiene la cabeza grande”, reveló Bonham Carter en una conferencia de prensa antes del estreno de la noche pasada en Londres. ”Ella es un tirano … los niños son tiranos. El” ninguna simpatía por cualquier otro ser vivo “- ese es nuestro hijo, de hecho.


Human Translation:

Es posible que cuando sea más mayor, Nell Burton, de dos años, vea la película de su padre Alicia en el País de las Maravillas con ojos particularmente curiosos. Ya que su madre, Helena Bonham Carter, reveló anoche que basó su tiránica Reina Roja en la hija que ella y Tim Burton tienen en común.

Mientras que la pequeña probablemente no ordene la decapitación de aquellos que la rodean, Bonham Carter dijo que Nell fue la principal inspiración para la mandona y maliciosa monarca que mantiene un reinado de terror sobre El País de las Maravillas en la reinvención en 3D IGC de Burton.

“Pensé: bueno, es la pequeña, porque tiene la cabeza grande”, reveló Bonham Carter en una conferencia de prensa antes del estreno de anoche en Londres. “Ella es una tirana… los niños son tiranos. La ‘no simpatía por ningún otro ser viviente’ – esa es nuestra pequeña, en realidad”.

MAIN MISTAKES MADE BY MT:

Working with an MT may speed up the translation process, but the output always needs to be supervised by a person. These are some of the main mistakes that the machine usually makes when dealing with these two languages (each example is given as original > MT output > human translation):

  1. Pronouns and articles: This might be due to the differences in rules between both languages. Spanish requires articles in places where English omits them (for example, before a title followed by a name), so the translator sometimes leaves them out. “Princess Beatrice” > “Princesa Beatriz” > “La princesa Beatrice”.
  2. Verb tenses: MTs often have problems with verb tenses. They confuse the imperfect with the past perfect and mishandle conditionals, the subjunctive, the active or passive voice, etc. “VICTORIA BECKHAM has landed herself yet another” > “Victoria Beckham se ha volvió a ser” > “Victoria Beckham ha vuelto a ser portada de Vogue.”
  3. Abbreviations: “3D CGI” > “3D CGI” > “3D IGC”. Here it is important to adapt the abbreviation to the target language so that readers can understand it.
  4. Reference and gender: The machine often fails to choose the correct gender or number because English nouns and adjectives are not inflected for gender, and English adjectives are not inflected for number. “toddler” > “el niño” > “la pequeña”. As toddler is used for both genders, the machine cannot go back and find the referent in order to choose the correct form.
  5. Word meanings: There are many problems with the lexicon. When a word has more than one meaning, the machine cannot choose the correct one by taking the context into account. That is the work of the translator: to decide which of the many meanings fits best. “features” > “características” > “aparece”. Here “feature” is used as a verb, but the machine has taken it as a noun. The same happens with “running”, which the machine renders as “dirigen” and “ejecutando” where “correrán” and “corre” are needed.
  6. Other mistakes could be mentioned: word order, punctuation, the rendering of cultural items, idiomatic expressions, etc.
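
The comparisons in this review were made by eye. As a rough aid, a word-level diff can flag where the MT output and the human translation diverge before a person judges each difference. What follows is only a minimal sketch in Python using the standard difflib module; the function name word_diff is hypothetical (it is not part of Google Translate or any other tool), and the sentence pair is taken from the second text. It highlights differing spans only and does not classify the type of mistake.

```python
import difflib


def word_diff(mt: str, human: str) -> list[tuple[str, str, str]]:
    """Return (operation, mt_words, human_words) for each span that differs.

    Hypothetical helper for illustration: it splits both sentences on
    whitespace and compares them word by word (case-sensitive).
    """
    mt_words, human_words = mt.split(), human.split()
    matcher = difflib.SequenceMatcher(a=mt_words, b=human_words, autojunk=False)
    return [
        (tag, " ".join(mt_words[i1:i2]), " ".join(human_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the spans where the two versions disagree
    ]


# Sentence pair from the second text: MT output vs. human translation.
mt = "Ellos están unidos de dos en dos con cuerdas elásticas."
human = "Están atados de dos en dos con cuerdas elásticas."

for tag, mt_span, human_span in word_diff(mt, human):
    print(f"{tag}: {mt_span!r} -> {human_span!r}")
```

A diff like this only shows where the two versions differ; deciding whether a difference is a pronoun, tense, gender, or lexical problem still requires human judgment.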

For further reading,

Using Google Translation in Cross-Lingual Retrieval.

Sources:
