Tatoeba Project

In linguistics, corpora (the plural of “corpus”) are large sets of language samples taken from real-life texts in a given context. Usually they attempt to capture everyday speech in ordinary situations, although they can be more specific (e.g. samples of English essays written by non-native students). They can be studied to analyze the characteristics of natural language, to extract statistics on the use of certain expressions or common errors, or to guide learners of a particular language by showing how native speakers would generally express themselves.

Tatoeba (Japanese for “for example”) is an ongoing, non-commercial, collaborative project to collect sentences in many different languages together with their translations. Started in 2006, it currently serves as a cross-lingual aligned corpus with more than 750,000 sentences distributed among more than 80 languages.

Tatoeba is an online, open language resource whose data is available to anyone and can be accessed through the website’s multilingual interface (e.g. Japanese homepage, Basque homepage, Arabic homepage). Sentences are released under a Creative Commons license and can be downloaded as a .csv file, so that they can be freely incorporated into other sites or even textbooks. Users can register to contribute by providing written or audio-recorded sentences, adding new translations, commenting, or correcting mistakes in spelling, grammar, etc.
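As a rough illustration of how the downloaded data might be reused, the sketch below counts sentences per language. It assumes a tab-separated layout with three columns (sentence id, language code, text); that layout, the sample rows, and the name count_by_language are assumptions made for illustration, so check the actual export before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded export.
# Assumed layout: id <TAB> language code <TAB> sentence text.
SAMPLE = (
    "1\teng\tI have to go to sleep.\n"
    "2\tjpn\t私は眠らなければなりません。\n"
    "3\teng\tWhat are you doing?\n"
    "4\tepo\tKion vi faras?\n"
)


def count_by_language(handle):
    """Return a Counter mapping language code to number of sentences."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # skip empty or malformed rows
        counts[row[1]] += 1
    return counts


counts = count_by_language(io.StringIO(SAMPLE))
print(counts.most_common())
```

For a real file, the same function could be given a handle opened with open(path, encoding="utf-8", newline="") in place of the in-memory sample.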

Nevertheless, to make sure that the contents offered are of sufficiently good quality to be used educationally, Tatoeba has a few moderators who review user activity. In addition, not every person is immediately given the right to make any type of edit, and only those members who build up a record of reliability through positive participation become “trusted users”.

How to use Tatoeba

Clicking on the “browse” tab, you can perform sentence searches in different ways:

1. By words:

Using Boolean logic operators, you can perform queries to look for sentences including a specific word, two or more specific words, some word among more than one alternative, only certain words but not others, exact matches of a string of words, etc.
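The kinds of queries described above can be sketched over a small in-memory list. This is not Tatoeba’s search engine or its query syntax; the function search, its parameter names, and the sample sentences are hypothetical and only illustrate how AND, OR, NOT, and exact-phrase conditions combine.

```python
import re

# Hypothetical sample sentences used only for this illustration.
SENTENCES = [
    "The cat sat on the mat.",
    "A dog chased the cat.",
    "The dog sleeps all day.",
    "Birds sing in the morning.",
]


def words(sentence):
    """Return the set of lowercased word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def search(sentences, all_of=(), any_of=(), none_of=(), phrase=None):
    """Filter sentences by AND (all_of), OR (any_of), NOT (none_of),
    and an optional case-insensitive substring match (phrase)."""
    results = []
    for s in sentences:
        tokens = words(s)
        if not all(w.lower() in tokens for w in all_of):
            continue  # AND: every listed word must be present
        if any_of and not any(w.lower() in tokens for w in any_of):
            continue  # OR: at least one listed word must be present
        if any(w.lower() in tokens for w in none_of):
            continue  # NOT: no listed word may be present
        if phrase is not None and phrase.lower() not in s.lower():
            continue  # phrase: the string must appear as written
        results.append(s)
    return results


print(search(SENTENCES, all_of=["dog"], none_of=["cat"]))
```

Because the phrase condition is a plain substring check, it would also match inside longer words; a real search engine would normally match on word boundaries.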


2. By language:

You can look for a sentence in the language you want and decide whether to see all of its translations or only those appearing in a specific language.

National flags are used as symbols to indicate which language a sentence is written in.

If a single sentence can be translated into a language in several ways, users are allowed to submit all the translations they consider necessary. Tatoeba allows near-duplicates as long as their presence is helpful.

You can also indicate whether you want to find a direct translation or an indirect one. An indirect translation occurs when a translation is itself translated, so that this second translation is no longer directly connected with the original sentence.

Indirectly translated sentences could lose some nuances and might not be as accurate, but Tatoeba nevertheless accepts their submission.
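One way to picture the difference is as a graph in which sentences are nodes and each translation is a link. The sketch below treats links as symmetric and defines an indirect translation as a sentence exactly two links away; both choices, along with the sample ids and function names, are assumptions for illustration rather than a description of how Tatoeba stores its data.

```python
from collections import defaultdict

# Hypothetical translation links between sentence ids.
# Here 2 and 4 translate 1, and 3 translates 2.
LINKS = [(1, 2), (2, 3), (1, 4)]


def build_graph(links):
    """Build an undirected adjacency map from pairs of linked ids."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def direct_translations(graph, sentence_id):
    """Sentences linked to sentence_id by a single translation step."""
    return set(graph.get(sentence_id, set()))


def indirect_translations(graph, sentence_id):
    """Sentences reachable only through a translation of a translation."""
    direct = direct_translations(graph, sentence_id)
    indirect = set()
    for neighbor in direct:
        indirect |= graph.get(neighbor, set())
    # Exclude the sentence itself and anything already directly linked.
    return indirect - direct - {sentence_id}


graph = build_graph(LINKS)
print(direct_translations(graph, 1))
print(indirect_translations(graph, 1))
```

In this sample, sentence 3 is only an indirect translation of sentence 1 because it was produced from sentence 2, which is where nuances could be lost.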

As of March 2011, the three languages with the most sentences in the database are (in order) English, Japanese and Esperanto.

3. By list:
Users can create public or private lists to group sentences within a category. Public lists are accessible to everyone, and include examples such as “tourist sentences (all languages)”, “sayings & idioms (all languages)”, “natural-sounding Spanish sentences”, etc.

4. By tag:
Many types of tags can be attached to sentences in order to help classify or define them. Tags are used to mark different aspects of a sentence, from semantic qualities (related to topics such as “weather”, “family”, etc.) to notes concerning usage (“female speaker”, “colloquial”, etc.). However, because there is currently no uniform tagging system, some tags are unnecessarily repetitive (e.g. “Plato”, “by Plato” and “By-Plato”), others are almost or completely unused (e.g. “quotation”, “most important”), and a few do not seem transparent enough (e.g. “651884”, “@©hange”).

5. By user:

Sentences added or “owned” by a member can be found on that member’s profile.

Because not all users are equally reliable or experienced, this feature helps you locate the most trusted contributions.

6. By audio:

Only those sentences for which an audio recording is available are displayed (at present, more than 9,500).

Evolution

The enormous number of sentences submitted in less than five years attests to Tatoeba’s success. Although the database was at first taken from the Tanaka Corpus of parallel Japanese-English sentences, the collection has been widely modified, corrected, and expanded to include many new sentences and other languages. While the project was started by a single person (who is still the only administrator on the site), there are now several moderators who share the power to maintain the database by editing or deleting erroneous sentences.

Pageview statistics and the number of sentences submitted during the whole year of 2010 show that Tatoeba has seen a huge increase in popularity and activity in the span of a few months.

Tatoeba is currently a language resource checked by tens of thousands of people every day, but with its constantly growing community, we can be confident that the project will achieve much more in the course of time.

References:

  • Tanaka Corpus (February 3, 2011). In EDRDG Wiki. Retrieved March 13, 2011