Position Paper - FLaReNet


Proposition for a web 2.0 version of linguistic resource creation

Gregory Grefenstette

Exalead

10 place de la Madeleine, 75008 Paris, France

Gregory.Grefenstette@exalead.com

Position

Linguistic resources should be free. It is in the interest of every language community that language resources for their language are freely available. The existence of language resources for a language can serve two purposes: defending the existence of the language thus preserving cultural heritage, and providing a means of communicating into and from the language thus easing trade. Language resources can help ease the problem of language variation which is an impediment to information access and information transmission.

Dimensions of the problem

Kevin Scannell of Saint Louis University currently monitors 446 languages on the web (Scannell, 2007). His web page http://borel.slu.edu/crubadan/stadas.html lists these languages and the number of words that he has found for each by crawling the web. Many languages (for example, Abua, Akurio, Bashkir, Bhojpuri and Chayahuita) have only one or two documents (one often being the Universal Declaration of Human Rights 1, currently available in 370 languages) and a few thousand words.

Sources of lexicons

One source for lexicon extraction is online news. Google News exists in 70 national versions, though many of these share a language; for example, Spanish is used in 9 of the national versions. A European Union initiative, the European Media Monitor (EMM) 2, monitors news in 43 languages (Steinberger et al., 2007). The Natural Language Processing Group of the Computer Science Department of Leipzig University has downloaded and packaged corpora 3 from public sources in different sizes (100k, 300k, 1 million and 3 million sentences). They provide sentence corpora for research purposes for Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Sorbian, Swedish, and Turkish (Biemann et al., 2007).

As for Wikipedia, there are currently 269 language versions, though some of them are very small. For example, the Chichewa Wikipedia contains only 65 articles, although Chichewa is spoken by 9.3 million people in Zambia and Malawi.

Proposal

All these resources and sources are insufficient to solve the problem of creating free and complete resources for the world's languages, even for the 446 that have some web presence.

We propose creating a Web 2.0 site that uses the same community computing power that generates millions of blogs to solve the problem of creating basic language resources for all the world’s languages, starting with the 446 present on the web. Though the CLARIN project (Boves et al., 2009) aims at providing high-quality language resources for the panoply of NLP activities from tokenization to speech, there is a need for simpler and complete resources for all languages, and we believe that harnessing the power of web users can make this dream a reality.

We think that a website can be created that will allow end users to adopt a certain number of words, in packets of ten, for example. Following the example of the construction of the Oxford English Dictionary, when James Murray invited volunteers from around the world to submit evidence of word usages, we can create a site where users take a certain number of words and provide meaningful resources for them.
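As a minimal sketch of the "packets of ten" idea, the following Python snippet splits a word list into fixed-size packets that could each be handed to one volunteer. The function name `make_packets` and the placeholder word list are assumptions made for illustration; the paper does not specify how packets would be formed or assigned.

```python
from typing import Iterator, List


def make_packets(words: List[str], packet_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive packets of at most `packet_size` words.

    Hypothetical helper: the last packet may be shorter than the others.
    """
    if packet_size < 1:
        raise ValueError("packet_size must be at least 1")
    for start in range(0, len(words), packet_size):
        yield words[start:start + packet_size]


# Placeholder vocabulary of 25 items, standing in for words found on the web.
words = [f"word{i}" for i in range(25)]
packets = list(make_packets(words))
```

With 25 words and the default size, this produces two full packets and one packet of five words.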

For example, we could give users a number of words, such as the following French words:

étrangères, libération, mauvaise, avantage, représentent

and ask them to return, for each word, the surface form, the lemma, the major part of speech and a few English 4 translations.

1 http://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx

2 http://emm.newsbrief.eu/overview.html

3 http://corpora.informatik.uni-leipzig.de/download.html

For example:

étrangères ; étranger ; ADJ ; foreign, stranger

libération ; libération ; N ; liberation

mauvaise ; mauvais ; ADJ ; bad

avantage ; avantage ; N ; advantage

représentent ; représenter ; V ; represent
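The semicolon-separated answers shown above could be read into a small record type before being stored. The sketch below is one possible way to do this in Python; the class name `LexiconEntry` and the function `parse_entry` are hypothetical, and the assumption that each answer has exactly four fields with comma-separated translations is inferred from the examples rather than stated as a specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LexiconEntry:
    surface: str             # word form as given to the user
    lemma: str               # dictionary form
    pos: str                 # major part of speech, e.g. ADJ, N, V
    translations: List[str]  # a few English translations


def parse_entry(line: str) -> LexiconEntry:
    """Parse one 'surface ; lemma ; POS ; translations' answer line."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    surface, lemma, pos, raw_translations = fields
    # Translations are assumed to be separated by commas.
    translations = [t.strip() for t in raw_translations.split(",") if t.strip()]
    return LexiconEntry(surface, lemma, pos, translations)


entry = parse_entry("étrangères ; étranger ; ADJ ; foreign, stranger")
```

Rejecting lines that do not have four fields is a design choice of this sketch: it keeps malformed answers out of the resource instead of guessing at the user's intent.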

Surowiecki, J. (2004). The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday Books.

User responses could be controlled by any of the known user rating systems available on the web. For example, the same words could be given to multiple users, and the users who reply most like other users would be ranked more highly than those who give outlying answers (Surowiecki, 2004). Users could be further ranked by the number of words they have “solved” for a given language.
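One simple way to rank users by how closely they agree with others is to compare each answer with the majority answer for the same word. The Python sketch below illustrates this under stated assumptions: answers are reduced to a (lemma, part of speech) pair, the function name `agreement_scores` and the user names are invented for the example, and majority voting is only one of many rating schemes the proposal would allow.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple

Answer = Tuple[str, str]  # (lemma, part of speech); an assumed simplification


def agreement_scores(answers: Dict[str, Dict[str, Answer]]) -> Dict[str, float]:
    """Return, per user, the fraction of answers matching the majority answer.

    `answers` maps each word to a dict of {user: answer}. When several
    answers tie for first place, the one counted first is treated as the
    majority.
    """
    agreed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for by_user in answers.values():
        majority_answer, _ = Counter(by_user.values()).most_common(1)[0]
        for user, answer in by_user.items():
            total[user] += 1
            if answer == majority_answer:
                agreed[user] += 1
    return {user: agreed[user] / total[user] for user in total}


# Hypothetical answers from three users for two of the example words.
answers = {
    "mauvaise": {
        "ana": ("mauvais", "ADJ"),
        "ben": ("mauvais", "ADJ"),
        "cho": ("mauvaise", "N"),
    },
    "avantage": {
        "ana": ("avantage", "N"),
        "ben": ("avantage", "N"),
        "cho": ("avantage", "N"),
    },
}
scores = agreement_scores(answers)
# Highest agreement first; users with outlying answers fall to the end.
ranking = sorted(scores, key=lambda user: -scores[user])
```

Here the user who gave an outlying answer for "mauvaise" ends up with a lower score than the two users who agreed with each other.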

These simple representations of words could be used as a springboard for much wider resource creation, for example by adding language-dependent frequencies to each word from search engine probing. Another example would be the generation of multiword expressions and their translations (Grefenstette, 1999).

We will defend this approach and sketch how it can be implemented.

References

Boves, L., Carlson, R., Hinrichs, E., House, D., Krauwer, S., Lemnitzer, L., Vainio, M., & Wittenburg, P. (2009). Resources for Speech Research: Present and Future Infrastructure Needs. In Interspeech (pp. 1803-1806). Brighton, UK.

Biemann, C., Heyer, G., Quasthoff, U., & Richter, M. (2007). The Leipzig corpora collection - monolingual corpora of standard size. In Proceedings of Corpus Linguistics 2007. Birmingham, UK.

Grefenstette, G. (1999). The World Wide Web as a resource for example-based machine translation tasks. In Proceedings of Aslib Conference on Translating and the Computer 21. London.

Scannell, K. P. (2007). The Crúbadán Project: corpus building for under-resourced languages. In Fairon, C., Naets, H., Kilgarriff, A., & de Schryver, G.-M. (Eds.), Building and exploring Web corpora. Proceedings of the WAC3 Conference (pp. 5-15). Louvain: Presses Universitaires de Louvain.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006) (pp. 2142-2147). Genoa, Italy.

4 A language other than English could be used, for example Chinese, but for the moment English is the most used language on the Internet.
