Position Paper - FLaReNet
Proposition for a web 2.0 version of linguistic resource creation<br />
Gregory Grefenstette<br />
Exalead<br />
10 place de la Madeleine, 75008 Paris, France<br />
Gregory.Grefenstette@exalead.com<br />
<strong>Position</strong><br />
Linguistic resources should be free. It is in the interest of<br />
every language community that language resources for<br />
their language are freely available. The existence of<br />
language resources for a language can serve two purposes:<br />
defending the existence of the language thus preserving<br />
cultural heritage, and providing a means of communicating<br />
into and from the language thus easing trade. Language<br />
resources can help ease the problem of language variation<br />
which is an impediment to information access and<br />
information transmission.<br />
Dimensions of the problem<br />
Kevin Scannell of St Louis University currently<br />
monitors 446 languages on the web (Scannell, 2007). His<br />
web page http://borel.slu.edu/crubadan/stadas.html lists<br />
these languages and the number of words that he has found<br />
for each by crawling the web. Many languages (for<br />
example, Abua, Akurio, Bashkir, Bhojpuri, Chayahuita,<br />
etc.) have only one or two documents (one often being the<br />
Universal Declaration of Human Rights 1 , currently<br />
available in 370 languages) and a few thousand words.<br />
Sources of lexicons<br />
One source for lexicon extraction is online news. Google<br />
News exists in 70 national versions, though many of them are<br />
in the same language; for example, Spanish is used in 9 of<br />
the national versions. A European Union initiative, the<br />
European Media Monitor, EMM 2 , monitors news in 43<br />
languages (Steinberger et al., 2007). The Natural Language<br />
Processing Group of the Computer Science Department of<br />
Leipzig University has downloaded and packaged corpora 3<br />
from public sources in different sizes (100k, 300k, 1<br />
million and 3 million sentences). They provide sentence<br />
corpora for research purposes for Catalan, Danish, Dutch,<br />
English, Estonian, Finnish, French, German, Italian,<br />
Japanese, Korean, Norwegian, Sorbian, Swedish, and<br />
Turkish (Biemann et al., 2007).<br />
As for Wikipedia, there are currently 269 language<br />
versions, though some of them are very small.<br />
For example, the Chichewa Wikipedia contains only<br />
65 articles, although Chichewa is spoken by 9.3 million people<br />
in Zambia and Malawi.<br />
Proposal<br />
All these resources and sources are insufficient to solve the<br />
problem of creating free and complete resources for the<br />
world's languages, even for the 446 that have some web<br />
presence.<br />
We propose creating a Web 2.0 site that uses the same<br />
community computing power that generates millions of<br />
blogs to solve the problem of creating basic language<br />
resources for all the world’s languages, starting with the<br />
446 present on the web. Though the CLARIN project<br />
(Boves et al., 2009) aims at providing high quality<br />
language resources for the panoply of NLP activities from<br />
tokenization to speech, there is a need for simpler and<br />
complete resources for all languages, and we believe that<br />
harnessing the power of web users can provide the<br />
realization of this dream.<br />
We think that a website can be created that will allow end<br />
users to adopt a certain number of words, in packets of ten,<br />
for example. Following the example of the construction of<br />
the Oxford English dictionary when James Murray invited<br />
volunteers from around the world to submit evidence of<br />
word usages, we can create a site where users can take a<br />
certain number of words and provide meaningful<br />
resources.<br />
For example, we could give the users a number of words<br />
such as the following French words<br />
étrangères, libération, mauvaise, avantage, représentent<br />
1 http://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx<br />
2 http://emm.newsbrief.eu/overview.html<br />
3 http://corpora.informatik.uni-leipzig.de/download.html
and ask the users to return the surface form, the lemma, the<br />
major part of speech and a few English 4 translations. For<br />
example,<br />
étrangères ; étranger ; ADJ ; foreign, stranger<br />
libération ; libération ; N ; liberation<br />
mauvaise ; mauvais ; ADJ ; bad<br />
avantage ; avantage ; N ; advantage<br />
représentent ; représenter ; V ; represent<br />
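As an illustration only, records in the semicolon-separated layout shown above could be read into a small data structure as follows. This is a minimal sketch: the names `LexicalEntry` and `parse_entry` are hypothetical, and the paper does not prescribe any storage or exchange format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LexicalEntry:
    """One user-supplied record (hypothetical structure, not from the paper)."""
    surface: str                   # surface form as presented to the user
    lemma: str                     # dictionary form
    pos: str                       # major part of speech, e.g. ADJ, N, V
    translations: Tuple[str, ...]  # one or more English glosses


def parse_entry(line: str) -> LexicalEntry:
    """Parse a 'surface ; lemma ; POS ; gloss, gloss' record."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    surface, lemma, pos, glosses = fields
    # Glosses are comma-separated; empty items are dropped.
    translations = tuple(g.strip() for g in glosses.split(",") if g.strip())
    return LexicalEntry(surface, lemma, pos, translations)


entry = parse_entry("étrangères ; étranger ; ADJ ; foreign, stranger")
```

Rejecting records that do not have exactly four fields is one simple way to catch malformed submissions before they are compared across users.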
Surowiecki, J. The Wisdom of Crowds: Why the Many Are<br />
Smarter Than the Few and How Collective Wisdom Shapes<br />
Business, Economies, Societies and Nations. Doubleday Books,<br />
2004<br />
User responses could be controlled by any of the known<br />
user rating systems available on the web. For example, the<br />
same words could be given to multiple users, and the users<br />
who reply most like other users would be ranked more highly<br />
than those who give outlying answers (Surowiecki, 2004).<br />
Users could be further ranked by the number of words they<br />
have “solved” for a given language.<br />
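One possible reading of this ranking idea is sketched below, using simple majority agreement as a stand-in for whichever rating system would actually be chosen. The function name, the user names, and the shape of the `answers` mapping are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def rank_users_by_agreement(answers):
    """Rank users by how often their answer matches the majority answer.

    `answers` maps each word to a dict of {user: answer}, where an answer is
    any hashable value, e.g. a (lemma, part of speech) tuple. This layout is
    an assumption; the paper does not specify one.
    """
    agreed = defaultdict(int)  # answers matching the majority, per user
    total = defaultdict(int)   # words answered, per user
    for by_user in answers.values():
        counts = Counter(by_user.values())
        # On a tie, most_common keeps the answer that was encountered first.
        majority, _ = counts.most_common(1)[0]
        for user, answer in by_user.items():
            total[user] += 1
            if answer == majority:
                agreed[user] += 1
    scores = {user: agreed[user] / total[user] for user in total}
    # Highest agreement first; user name breaks ties for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical users and answers.
answers = {
    "mauvaise": {"ana": ("mauvais", "ADJ"), "ben": ("mauvais", "ADJ"),
                 "cy": ("mauvaise", "N")},
    "avantage": {"ana": ("avantage", "N"), "ben": ("avantage", "N"),
                 "cy": ("avantage", "N")},
}
ranking = rank_users_by_agreement(answers)
# ranking == [("ana", 1.0), ("ben", 1.0), ("cy", 0.5)]
```

The per-user `total` count corresponds to the number of words a user has "solved", so the same pass could feed the second ranking criterion mentioned above.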
These simple representations of words could be used as a<br />
springboard for much wider resource creation, for example<br />
by adding language dependent frequencies to each word<br />
from search engine probing. Another example would be<br />
the generation of multiword expressions and their<br />
translations (Grefenstette, 1999).<br />
We will defend this approach and sketch how it can be<br />
implemented.<br />
References<br />
Boves, L., Carlson, R., Hinrichs, E., House, D., Krauwer, S.,<br />
Lemnitzer, L., Vainio, M., & Wittenburg, P. (2009). Resources<br />
for Speech Research: Present and Future Infrastructure Needs. In<br />
Interspeech (pp. 1803-1806). Brighton, UK.<br />
Biemann, C., Heyer, G., Quasthoff, U., & Richter, M. (2007). The<br />
Leipzig corpora collection - monolingual corpora of standard<br />
size. In Proceedings of Corpus Linguistics 2007. Birmingham,<br />
UK.<br />
Grefenstette, G. 1999. The World Wide Web as a resource for<br />
example-based machine translation tasks. In Proceedings of Aslib<br />
Conference on Translating and the Computer 21. London.<br />
Scannell, K.P. (2007). The Crúbadán Project: corpus building for<br />
under-resourced languages. In Fairon, C., Naets, H., Kilgarriff, A.<br />
and de Schryver, G.-M. (eds.) Building and exploring Web<br />
corpora. Proceedings of the WAC3 Conference. Louvain: Presses<br />
Universitaires de Louvain. 5-15.<br />
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T.,<br />
Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A multilingual<br />
aligned parallel corpus with 20+ languages. In Proceedings of the<br />
5th International Conference on Language Resources and<br />
Evaluation (LREC'2006) (pp. 2142-2147). Genoa, Italy.<br />
4 Some other language than English could be used, for example<br />
Chinese, but for the moment, English is the most used language<br />
on the Internet.