Position Paper - FLaReNet

Proposition for a Web 2.0 version of linguistic resource creation

Gregory Grefenstette 

Exalead 

10 place de la Madeleine, 75008 Paris, France 

Gregory.Grefenstette@exalead.com 

Position 

Linguistic resources should be free. It is in the interest of every language community that resources for its language be freely available. The existence of language resources for a language can serve two purposes: defending the existence of the language, thus preserving cultural heritage, and providing a means of communicating into and from the language, thus easing trade. Language resources can help ease the problem of language variation, which is an impediment to information access and information transmission.

Dimensions of the problem 

Kevin Scannell of Saint Louis University currently monitors 446 languages on the web (Scannell, 2007). His web page http://borel.slu.edu/crubadan/stadas.html lists these languages and the number of words that he has found for each by crawling the web. For many languages (for example, Abua, Akurio, Bashkir, Bhojpuri and Chayahuita), he has found only one or two documents (one often being the Universal Declaration of Human Rights,¹ currently available in 370 languages) and a few thousand words.

Sources of lexicons 

One source for lexicon extraction is online news. Google News exists in 70 national versions, though many share a language; Spanish, for example, is used in 9 of the national versions. A European Union initiative, the European Media Monitor (EMM),² monitors news in 43 languages (Steinberger et al., 2006). The Natural Language Processing Group of the Computer Science Department of Leipzig University has downloaded and packaged corpora³ from public sources in different sizes (100k, 300k, 1 million and 3 million sentences). They provide sentence corpora for research purposes for Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Sorbian, Swedish, and Turkish (Biemann et al., 2007).

As for Wikipedia, there are currently 269 language versions, though some are very small. For example, the Chichewa Wikipedia contains only 65 articles, although the language is spoken by 9.3 million people in Zambia and Malawi.

Proposal 

All these resources and sources are insufficient to solve the problem of creating free and complete resources for the world’s languages, even for the 446 that have some web presence.

We propose creating a Web 2.0 site that uses the same community computing power that generates millions of blogs to solve the problem of creating basic language resources for all the world’s languages, starting with the 446 present on the web. Though the CLARIN project (Boves et al., 2009) aims at providing high-quality language resources for the panoply of NLP activities from tokenization to speech, there is a need for simpler and complete resources for all languages, and we believe that harnessing the power of web users can make this dream a reality.

We think that a website can be created that will allow end users to adopt a certain number of words, in packets of ten, for example. Following the example of the construction of the Oxford English Dictionary, for which James Murray invited volunteers from around the world to submit evidence of word usages, we can create a site where users take a certain number of words and provide meaningful resources for them.

¹ http://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx
² http://emm.newsbrief.eu/overview.html
³ http://corpora.informatik.uni-leipzig.de/download.html

For example, we could give the users a number of words, such as the following French words:

étrangères, libération, mauvaise, avantage, représentent

and ask the users to return the surface form, the lemma, the major part of speech and a few English⁴ translations. For example:

étrangères ; étranger ; ADJ ; foreign, stranger
libération ; libération ; N ; liberation
mauvaise ; mauvais ; ADJ ; bad
avantage ; avantage ; N ; advantage
représentent ; représenter ; V ; represent
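
The four-field answer format illustrated above (surface form, lemma, part of speech, comma-separated translations, with fields separated by semicolons) can be read mechanically. The following is a minimal Python sketch under that assumption; the names `LexiconEntry` and `parse_entry` are hypothetical and chosen only for illustration, and the proposal itself does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LexiconEntry:
    """One user-supplied answer for one word (hypothetical structure)."""
    surface: str
    lemma: str
    pos: str
    translations: Tuple[str, ...]


def parse_entry(line: str) -> LexiconEntry:
    """Parse 'surface ; lemma ; POS ; translation, translation' into an entry."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 ';'-separated fields, got {len(fields)}: {line!r}")
    surface, lemma, pos, gloss = fields
    # Translations are comma-separated; empty items are dropped.
    translations = tuple(t.strip() for t in gloss.split(",") if t.strip())
    return LexiconEntry(surface, lemma, pos, translations)


entry = parse_entry("étrangères ; étranger ; ADJ ; foreign, stranger")
print(entry.lemma, entry.pos, entry.translations)
```

Rejecting lines that do not have exactly four fields is one possible design choice; a real site would more likely collect each field through a separate form input.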

Surowiecki, J. (2004). The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday Books.

Users’ responses could be controlled by any of the known user rating systems available on the web. For example, the same words could be given to multiple users, and the users who reply most like other users would be ranked more highly than those who give outlying answers (Surowiecki, 2004). Users could be further ranked by the number of words they have “solved” for a given language.
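
One simple way to realize the agreement-based ranking described above is majority voting: for each word, the most frequent answer is taken as the consensus, and each user is scored by the share of their answers that match it, with the number of words answered as a tie-breaker. The sketch below is only one possible reading of the idea; the function name `rank_users` and the data layout (word, then user, then answer) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def rank_users(answers):
    """Rank users by agreement with the majority answer for each word.

    answers maps word -> {user: answer}; answers must be hashable,
    for example (lemma, part of speech) tuples.
    Returns (user, agreement rate) pairs, best first.
    """
    agree = defaultdict(int)   # answers matching the majority, per user
    solved = defaultdict(int)  # words answered, per user
    for word, by_user in answers.items():
        if not by_user:
            continue
        # Ties are resolved by first-seen order, a simplification.
        majority, _ = Counter(by_user.values()).most_common(1)[0]
        for user, answer in by_user.items():
            solved[user] += 1
            if answer == majority:
                agree[user] += 1
    scores = {user: agree[user] / solved[user] for user in solved}
    # Sort by agreement rate, then by words solved, then by name.
    return sorted(scores.items(),
                  key=lambda item: (-item[1], -solved[item[0]], item[0]))


answers = {
    "mauvaise": {"ana": ("mauvais", "ADJ"), "ben": ("mauvais", "ADJ"),
                 "cyd": ("mauvaise", "N")},
    "avantage": {"ana": ("avantage", "N"), "ben": ("avantage", "N"),
                 "cyd": ("avantage", "N")},
}
print(rank_users(answers))
```

In this toy data, the hypothetical users ana and ben always agree with the majority, while cyd gives one outlying answer and is ranked last.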

These simple representations of words could be used as a springboard for much wider resource creation, for example by adding language-dependent frequencies to each word from search engine probing. Another example would be the generation of multiword expressions and their translations (Grefenstette, 1999).

We will defend this approach and sketch how it can be implemented.

References 

Biemann, C., Heyer, G., Quasthoff, U., & Richter, M. (2007). The Leipzig corpora collection - monolingual corpora of standard size. In Proceedings of Corpus Linguistics 2007. Birmingham, UK.

Boves, L., Carlson, R., Hinrichs, E., House, D., Krauwer, S., Lemnitzer, L., Vainio, M., & Wittenburg, P. (2009). Resources for Speech Research: Present and Future Infrastructure Needs. In Interspeech (pp. 1803-1806). Brighton, UK.

Grefenstette, G. (1999). The World Wide Web as a resource for example-based machine translation tasks. In Proceedings of Aslib Conference on Translating and the Computer 21. London.

Scannell, K.P. (2007). The Crúbadán Project: corpus building for under-resourced languages. In Fairon, C., Naets, H., Kilgarriff, A. and de Schryver, G.-M. (eds.) Building and exploring Web corpora. Proceedings of the WAC3 Conference (pp. 5-15). Louvain: Presses Universitaires de Louvain.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006) (pp. 2142-2147). Genoa, Italy.

⁴ A language other than English could be used, for example Chinese, but for the moment English is the most used language on the Internet.
