11.07.2015 Views

Natural Language Processing for Less Privileged Languages

Natural Language Processing for Less Privileged Languages

Natural Language Processing for Less Privileged Languages

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Natural</strong> <strong>Language</strong> <strong>Processing</strong> <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s: Where do wecome from? Where are we going?Anil Kumar Singh<strong>Language</strong> Technologies Research CentreIIIT, Hyderabad, Indiaanil@research.iiit.ac.inAbstractIn the context of the IJCNLP workshopon <strong>Natural</strong> <strong>Language</strong> <strong>Processing</strong> (NLP) <strong>for</strong><strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, we discuss theobstacles to research on such languages.We also briefly discuss the ways to makeprogress in removing these obstacles. Wemention some previous work and commenton the papers selected <strong>for</strong> the workshop.1 IntroductionWhile computing has become ubiquitous in the developedregions, its spread in other areas such asAsia is more recent. However, despite the fact thatAsia is a dense area in terms of linguistic diversity(or perhaps because of it), many Asian languagesare inadequately supported on computers. Even basicNLP tools are not available <strong>for</strong> these languages.This also has a social cost.NLP or Computational Linguistics (CL) basedtechnologies are now becoming important and futureintelligent systems will use more of these techniques.Most of NLP/CL tools and technologies aretailored <strong>for</strong> English or European languages. Recently,there has been a rapid growth of IT industryin many Asian countries. This is now the perfecttime to reduce the linguistic, computational andcomputational linguistics gap between the ‘moreprivileged’ and ‘less privileged’ languages.The IJCNLP workshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong><strong>Language</strong> is aimed at bridging this gap. Onlywhen a basic infrastructure <strong>for</strong> supporting regionallanguages becomes available can we hope <strong>for</strong> a moreequitable availability of opportunities made possibleby the language technology. There have alreadybeen attempts in this direction and this workshopwill hopefully take them further.Figure-1 shows one possible view of the computationalinfrastructure needed <strong>for</strong> language processing<strong>for</strong> a particular language, or more preferably, <strong>for</strong> aset of related languages.In this paper, we will first discuss various aspectsof the problem. We will then look back at the workalready done. After that, we will present some suggestion<strong>for</strong> future work. But we will begin by addressinga minor issue: the terminology.2 TerminologyThere can be a debate about the correct term <strong>for</strong> thelanguages on which this workshop focuses. Thereare at least four candidates: less studied (LS) languages,resource scarce (RS) languages, less computerized(LC) languages, and less privileged (LP)languages. Out of these, two (LS and RS) are toonarrow <strong>for</strong> our purposes. LC is admittedly more objective,but it also is somewhat narrow in the sensethat it does not cover the lack of resources <strong>for</strong> creatingresources (finance) and the lack of linguisticstudy. We have used LP because it is more generaland covers all the aspects of the problem. However,it might be preferable to use LC in many contexts.As the common element among all these termsis the adjective ‘less’ (‘resoure scarce’ can beparaphrased as ‘with less resources’), perhaps wecan avoid the terminological debate by calling thelanguages covered by any such terms as the L-languages.7Proceedings of the IJCNLP-08 Workshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, pages 7–12,Hyderabad, India, January 2008. c○2008 Asian Federation of <strong>Natural</strong> <strong>Language</strong> <strong>Processing</strong>


More <strong>Privileged</strong><strong>Less</strong> <strong>Privileged</strong>NLP / CLOther PrivilegesNLP / CLComputerizationLinguistic StudySource(Finance, Human Resources,Equipment, Socio-Political Support, etc.)ComputerizationLinguistic StudyDestinationFigure 2: The four dimensions of the problem: The Source is where we come from and Destination is wherewe are going. The problem is to go from the Source to the Destination and the solution is non-trivial.3.5 Other PrivilegesOne of the major reasons why building language resourcesand providing language processing capabilities<strong>for</strong> the L-languages is going to be a very difficulttask is the fact that these languages lack theprivileges which make it possible to build languageresources and NLP/CL tools. By ‘privileges’ wemean the availability of finance, equipment, humanresources, and even political and social support <strong>for</strong>reducing the lack of computing and language processingsupport <strong>for</strong> the L-languages. The lack ofsuch ‘privileges’ may be the single biggest reasonwhich is holding back the progress towards providingcomputing and language processing support <strong>for</strong>these languages.4 Some (Partially) Successful Ef<strong>for</strong>tsThe problem seems to be insurmountable, but therehas been some progress. More importantly, the urgencyof solving this problem (even if partially) isbeing realized by more and more people. Some recentevents or ef<strong>for</strong>ts which tried to address the problemand which have had some impact in improvingthe situation are:• The LREC conferences and workshops 2 .• Workshop on ”Shallow Parsing in South Asian<strong>Language</strong>s”, IJCAI-07, India.2 www.lrec-conf.org9• EMELD and the Digital Tools Summit in Linguistics,2006, USA.• Workshop on <strong>Language</strong> Resources <strong>for</strong> EuropeanMinority <strong>Language</strong>s, 1998, Spain.• Projects supported by ELRA on the Basic <strong>Language</strong>Resource Kit (BLARK) that targets thespecifications of a minimal kits <strong>for</strong> each languageto support NLP tools development 3 .• There is also a corresponding project at LDC(the <strong>Less</strong> Commonly Taught <strong>Language</strong>s 4 ).• The IJCNLP Workshop on Named EntityRecognition <strong>for</strong> South and South Asian <strong>Language</strong>s5 .This list is, of course, not exhaustive. There aremany papers relevant to the theme of this workshopat the IJCNLP 2008 main conference 6 , as at someprevious major conferences. There is also a very relevanttutorial (Mihalcea, 2008) at the IJCNLP 2008conference about building resources and tools <strong>for</strong>languages with scarce resources.Even the industry is realizing the importance ofproviding computing support <strong>for</strong> some of the L-languages. In the last few years there have beenmany announcements about the addition of some3 http://www.elda.org/blark4 http://projects.ldc.upenn.edu/LCTL5 http://ltrc.iiit.ac.in/ner-ssea-08/6 http://ijcnlp2008.org


such language to a product or a service and alsoof the addition of better facilities (input methods,transliteration, search) in an existing product or service<strong>for</strong> some L-language.5 Towards a SolutionSince the problem is very much like the conservationof the Earth’s environment, there is no easy solution.It is not even evident that a complete solutionis possible. However, we can still try <strong>for</strong> the bestpossible solution. Such a solution should have someprerequisites. As Figure-2 shows, the ‘other privileges’dimension of the problem has to be a majorelement of the solution, but it is not something overwhich researchers and developers have much control.This means that we will have to find ways towork even with very little of these ‘other privileges’.This is the key point that we want to make in thispaper because it implies that the methods that havebeen used <strong>for</strong> English (a language with almost unlimited‘privileges’) may not be applicable <strong>for</strong> theL-languages. Many of these methods assume theavailability of certain things which simply cannot beassumed <strong>for</strong> the L-languages. For example, there isno reasonable ground to assume that there will be(in the near future) corpus even with shallow levelsof annotation <strong>for</strong> Avadhi or Dogri or Konkani, letalone a treebank like resource. There<strong>for</strong>e, we haveto look <strong>for</strong> methods which can work with unannotatedcorpus. Moreover, these methods should alsonot require a lot of work from trained linguists becausesuch linguists may not be available to work onthese languages. There is one approach, however,that can still allow us to build resources and tools<strong>for</strong> these languages. This is the approach of adaptingthe resources of a linguistically close but moreprivileged language. It is this area which needs tobe studied and explored more thoroughly because itseems to be the only practical way to make the kindof progress that is required urgently. The processof resource adaptation will have to studied from linguistic,computational, and other practical points ofview. Since ‘other privileges’ are a major factor asdiscussed earlier, some ways of calculating the costof adaptation have also to be found.Another very general but important point is thatwe will have to build multilingual systems as far10as possible so that the cost per language is reduced.This will require innovation in terms of modeling aswell as engineering.6 Some Comments about the WorkshopThe scope of the workshop included topics such asthe following:• Archiving and creation of interoperable dataand metadata <strong>for</strong> less privileged languages• Support <strong>for</strong> less privileged language on computers.This includes input methods, display,fonts, encoding converters, spell checkers,more linguistically aware text editors etc.• Basic NLP tools such as sentence marker, tokenizer,morphological analyzer, transliterationtools, language and encoding identifiers etc.• Advanced NLP tools such as POS taggers, localword grouper, approximate string search, tools<strong>for</strong> developing language resources.There were a relatively large number of submissionsto the workshop and the overall quality wasat least above average. The most noteworthy fact isthat the variety of papers submitted (and selected)was pleasantly surprising. The workshop includespaper on topics as diverse as Machine Translation(MT) from text to sign language (an L-language onwhich very few people have worked) to MT fromspeech to speech. And from segmentation and stemmingto parser adaptation. Also, from input methods,text editor and interfaces to part of speech(POS) tagger. The variety is also remarkable interms of the languages covered and research locations.In addition, the workshop includes three invitedtalks: the first on building language resources by resourceadaptation (David and Maxwell, 2008); thesecond on cross-language resource sharing (Sornlertlamvanich,2008b); and the third on breaking theZipfian barrier in NLP (Choudhury, 2008). It canbe said that the workshop has been a moderate success.We hope it will stimulate further work in thisdirection.


of the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong><strong>Language</strong>s, Hyderabad, India.Anne David and Michael Maxwell. 2008. Building languageresources: Ways to move <strong>for</strong>ward. Invited Talkat the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong><strong>Language</strong>s, 2008. Hyderabad, India.Timo Honkela David Ellis, Mathias Creutz and MikkoKurimo. 2008. Speech to speech machine translation:Biblical chatter from finnish to english. In Proceedingsof the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong><strong>Language</strong>s, Hyderabad, India.M. B. Emeneau. 1956. India as a linguistic area. Linguistics,32:3-16.M. B. Emeneau. 1980. <strong>Language</strong> and linguistic area. Essaysby Murray B. Emeneau. Selected and introducedby Anwar S. Dil. Stan<strong>for</strong>d University Press.Sandeva Goonetilleke, Yoshihiko Hayashi, Yuichi Itoh,and Fumio Kishino. 2008. Srishell primo: A predictivesinhala text input system. In Proceedings of theIJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s,Hyderabad, India.Zin Maung Maung and Yoshiki Mikami. 2008. A rulebasedsyllable segmentation of myanmar text. In Proceedingsof the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong><strong>Privileged</strong> <strong>Language</strong>s, Hyderabad, India.Michael Maxwell and Anne David. 2008. Joint grammardevelopment by linguists and computer scientists. InProceedings of the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong><strong>Privileged</strong> <strong>Language</strong>s, Hyderabad, India.Sandipan Sarkar and Sivaji Bandyopadhyay. 2008. Designof a rule-based stemmer <strong>for</strong> natural language textin bengali. In Proceedings of the IJCNLP Workshopon NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, Hyderabad,India.Thoudam Doren Singh and Sivaji Bandyopadhyay. 2008.Morphology driven manipuri pos tagger. In Proceedingsof the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong><strong>Language</strong>s, Hyderabad, India.Virach Sornlertlamvanich, Thatsanee Charoenporn,Kergrit Robkop, and Hitoshi Isahara. 2008a. Kui:an ubiquitous tool <strong>for</strong> collective intelligence development.In Proceedings of the IJCNLP Workshopon NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, Hyderabad,India.Virach Sornlertlamvanich. 2008b. Cross language resourcesharing. Invited Talk at the IJCNLP Workshopon NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, 2008. Hyderabad,India.Krishnakumar Veeraraghavan and Indrani Roy. 2008.Acharya - a text editor and framework <strong>for</strong> workingwith indic scripts. In Proceedings of the IJCNLPWorkshop on NLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, Hyderabad,India.Daniel Zeman and Philip Resnik. 2008. Cross-languageparser adaptation between related languages. In Proceedingsof the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong><strong>Privileged</strong> <strong>Language</strong>s, Hyderabad, India.Rada Mihalcea. 2008. How to add a new language on thenlp map: Building resources and tools <strong>for</strong> languageswith scarce resources. Tutorial at the Third InternationalJoint Conference on <strong>Natural</strong> <strong>Language</strong> <strong>Processing</strong>(IJCNLP). Hyderabad, India.Jackson Muhirwe and Trond Trosterud. 2008. Finitestate solutions <strong>for</strong> reduplication in kinyarwanda language.In Proceedings of the IJCNLP Workshop onNLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, Hyderabad, India.Chirag Patel and Karthik Gali. 2008. Part of speech tagger<strong>for</strong> gujarati using conditional random fields. InProceedings of the IJCNLP Workshop on NLP <strong>for</strong> <strong>Less</strong><strong>Privileged</strong> <strong>Language</strong>s, Hyderabad, India.Hammam Riza. 2008. Indigenous languages of indonesia:Creating language resources <strong>for</strong> language preservation.In Proceedings of the IJCNLP Workshop onNLP <strong>for</strong> <strong>Less</strong> <strong>Privileged</strong> <strong>Language</strong>s, Hyderabad, India.12

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!