Natural Language Processing for Less Privileged Languages

Natural Language Processing for Less Privileged Languages: Where do wecome from? Where are we going?Anil Kumar SinghLanguage Technologies Research CentreIIIT, Hyderabad, Indiaanil@research.iiit.ac.inAbstractIn the context of the IJCNLP workshopon Natural Language Processing (NLP) forLess Privileged Languages, we discuss theobstacles to research on such languages.We also briefly discuss the ways to makeprogress in removing these obstacles. Wemention some previous work and commenton the papers selected for the workshop.1 IntroductionWhile computing has become ubiquitous in the developedregions, its spread in other areas such asAsia is more recent. However, despite the fact thatAsia is a dense area in terms of linguistic diversity(or perhaps because of it), many Asian languagesare inadequately supported on computers. Even basicNLP tools are not available for these languages.This also has a social cost.NLP or Computational Linguistics (CL) basedtechnologies are now becoming important and futureintelligent systems will use more of these techniques.Most of NLP/CL tools and technologies aretailored for English or European languages. Recently,there has been a rapid growth of IT industryin many Asian countries. This is now the perfecttime to reduce the linguistic, computational andcomputational linguistics gap between the ‘moreprivileged’ and ‘less privileged’ languages.The IJCNLP workshop on NLP for Less PrivilegedLanguage is aimed at bridging this gap. Onlywhen a basic infrastructure for supporting regionallanguages becomes available can we hope for a moreequitable availability of opportunities made possibleby the language technology. There have alreadybeen attempts in this direction and this workshopwill hopefully take them further.Figure-1 shows one possible view of the computationalinfrastructure needed for language processingfor a particular language, or more preferably, for aset of related languages.In this paper, we will first discuss various aspectsof the problem. We will then look back at the workalready done. After that, we will present some suggestionfor future work. But we will begin by addressinga minor issue: the terminology.2 TerminologyThere can be a debate about the correct term for thelanguages on which this workshop focuses. Thereare at least four candidates: less studied (LS) languages,resource scarce (RS) languages, less computerized(LC) languages, and less privileged (LP)languages. Out of these, two (LS and RS) are toonarrow for our purposes. LC is admittedly more objective,but it also is somewhat narrow in the sensethat it does not cover the lack of resources for creatingresources (finance) and the lack of linguisticstudy. We have used LP because it is more generaland covers all the aspects of the problem. However,it might be preferable to use LC in many contexts.As the common element among all these termsis the adjective ‘less’ (‘resoure scarce’ can beparaphrased as ‘with less resources’), perhaps wecan avoid the terminological debate by calling thelanguages covered by any such terms as the L-languages.7Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 7–12,Hyderabad, India, January 2008. c○2008 Asian Federation of Natural Language Processing

More PrivilegedLess PrivilegedNLP / CLOther PrivilegesNLP / CLComputerizationLinguistic StudySource(Finance, Human Resources,Equipment, Socio-Political Support, etc.)ComputerizationLinguistic StudyDestinationFigure 2: The four dimensions of the problem: The Source is where we come from and Destination is wherewe are going. The problem is to go from the Source to the Destination and the solution is non-trivial.3.5 Other PrivilegesOne of the major reasons why building language resourcesand providing language processing capabilitiesfor the L-languages is going to be a very difficulttask is the fact that these languages lack theprivileges which make it possible to build languageresources and NLP/CL tools. By ‘privileges’ wemean the availability of finance, equipment, humanresources, and even political and social support forreducing the lack of computing and language processingsupport for the L-languages. The lack ofsuch ‘privileges’ may be the single biggest reasonwhich is holding back the progress towards providingcomputing and language processing support forthese languages.4 Some (Partially) Successful EffortsThe problem seems to be insurmountable, but therehas been some progress. More importantly, the urgencyof solving this problem (even if partially) isbeing realized by more and more people. Some recentevents or efforts which tried to address the problemand which have had some impact in improvingthe situation are:• The LREC conferences and workshops 2 .• Workshop on ”Shallow Parsing in South AsianLanguages”, IJCAI-07, India.2 www.lrec-conf.org9• EMELD and the Digital Tools Summit in Linguistics,2006, USA.• Workshop on Language Resources for EuropeanMinority Languages, 1998, Spain.• Projects supported by ELRA on the Basic LanguageResource Kit (BLARK) that targets thespecifications of a minimal kits for each languageto support NLP tools development 3 .• There is also a corresponding project at LDC(the Less Commonly Taught Languages 4 ).• The IJCNLP Workshop on Named EntityRecognition for South and South Asian Languages5 .This list is, of course, not exhaustive. There aremany papers relevant to the theme of this workshopat the IJCNLP 2008 main conference 6 , as at someprevious major conferences. There is also a very relevanttutorial (Mihalcea, 2008) at the IJCNLP 2008conference about building resources and tools forlanguages with scarce resources.Even the industry is realizing the importance ofproviding computing support for some of the L-languages. In the last few years there have beenmany announcements about the addition of some3 http://www.elda.org/blark4 http://projects.ldc.upenn.edu/LCTL5 http://ltrc.iiit.ac.in/ner-ssea-08/6 http://ijcnlp2008.org

such language to a product or a service and alsoof the addition of better facilities (input methods,transliteration, search) in an existing product or servicefor some L-language.5 Towards a SolutionSince the problem is very much like the conservationof the Earth’s environment, there is no easy solution.It is not even evident that a complete solutionis possible. However, we can still try for the bestpossible solution. Such a solution should have someprerequisites. As Figure-2 shows, the ‘other privileges’dimension of the problem has to be a majorelement of the solution, but it is not something overwhich researchers and developers have much control.This means that we will have to find ways towork even with very little of these ‘other privileges’.This is the key point that we want to make in thispaper because it implies that the methods that havebeen used for English (a language with almost unlimited‘privileges’) may not be applicable for theL-languages. Many of these methods assume theavailability of certain things which simply cannot beassumed for the L-languages. For example, there isno reasonable ground to assume that there will be(in the near future) corpus even with shallow levelsof annotation for Avadhi or Dogri or Konkani, letalone a treebank like resource. Therefore, we haveto look for methods which can work with unannotatedcorpus. Moreover, these methods should alsonot require a lot of work from trained linguists becausesuch linguists may not be available to work onthese languages. There is one approach, however,that can still allow us to build resources and toolsfor these languages. This is the approach of adaptingthe resources of a linguistically close but moreprivileged language. It is this area which needs tobe studied and explored more thoroughly because itseems to be the only practical way to make the kindof progress that is required urgently. The processof resource adaptation will have to studied from linguistic,computational, and other practical points ofview. Since ‘other privileges’ are a major factor asdiscussed earlier, some ways of calculating the costof adaptation have also to be found.Another very general but important point is thatwe will have to build multilingual systems as far10as possible so that the cost per language is reduced.This will require innovation in terms of modeling aswell as engineering.6 Some Comments about the WorkshopThe scope of the workshop included topics such asthe following:• Archiving and creation of interoperable dataand metadata for less privileged languages• Support for less privileged language on computers.This includes input methods, display,fonts, encoding converters, spell checkers,more linguistically aware text editors etc.• Basic NLP tools such as sentence marker, tokenizer,morphological analyzer, transliterationtools, language and encoding identifiers etc.• Advanced NLP tools such as POS taggers, localword grouper, approximate string search, toolsfor developing language resources.There were a relatively large number of submissionsto the workshop and the overall quality wasat least above average. The most noteworthy fact isthat the variety of papers submitted (and selected)was pleasantly surprising. The workshop includespaper on topics as diverse as Machine Translation(MT) from text to sign language (an L-language onwhich very few people have worked) to MT fromspeech to speech. And from segmentation and stemmingto parser adaptation. Also, from input methods,text editor and interfaces to part of speech(POS) tagger. The variety is also remarkable interms of the languages covered and research locations.In addition, the workshop includes three invitedtalks: the first on building language resources by resourceadaptation (David and Maxwell, 2008); thesecond on cross-language resource sharing (Sornlertlamvanich,2008b); and the third on breaking theZipfian barrier in NLP (Choudhury, 2008). It canbe said that the workshop has been a moderate success.We hope it will stimulate further work in thisdirection.

of the IJCNLP Workshop on NLP for Less PrivilegedLanguages, Hyderabad, India.Anne David and Michael Maxwell. 2008. Building languageresources: Ways to move forward. Invited Talkat the IJCNLP Workshop on NLP for Less PrivilegedLanguages, 2008. Hyderabad, India.Timo Honkela David Ellis, Mathias Creutz and MikkoKurimo. 2008. Speech to speech machine translation:Biblical chatter from finnish to english. In Proceedingsof the IJCNLP Workshop on NLP for Less PrivilegedLanguages, Hyderabad, India.M. B. Emeneau. 1956. India as a linguistic area. Linguistics,32:3-16.M. B. Emeneau. 1980. Language and linguistic area. Essaysby Murray B. Emeneau. Selected and introducedby Anwar S. Dil. Stanford University Press.Sandeva Goonetilleke, Yoshihiko Hayashi, Yuichi Itoh,and Fumio Kishino. 2008. Srishell primo: A predictivesinhala text input system. In Proceedings of theIJCNLP Workshop on NLP for Less Privileged Languages,Hyderabad, India.Zin Maung Maung and Yoshiki Mikami. 2008. A rulebasedsyllable segmentation of myanmar text. In Proceedingsof the IJCNLP Workshop on NLP for LessPrivileged Languages, Hyderabad, India.Michael Maxwell and Anne David. 2008. Joint grammardevelopment by linguists and computer scientists. InProceedings of the IJCNLP Workshop on NLP for LessPrivileged Languages, Hyderabad, India.Sandipan Sarkar and Sivaji Bandyopadhyay. 2008. Designof a rule-based stemmer for natural language textin bengali. In Proceedings of the IJCNLP Workshopon NLP for Less Privileged Languages, Hyderabad,India.Thoudam Doren Singh and Sivaji Bandyopadhyay. 2008.Morphology driven manipuri pos tagger. In Proceedingsof the IJCNLP Workshop on NLP for Less PrivilegedLanguages, Hyderabad, India.Virach Sornlertlamvanich, Thatsanee Charoenporn,Kergrit Robkop, and Hitoshi Isahara. 2008a. Kui:an ubiquitous tool for collective intelligence development.In Proceedings of the IJCNLP Workshopon NLP for Less Privileged Languages, Hyderabad,India.Virach Sornlertlamvanich. 2008b. Cross language resourcesharing. Invited Talk at the IJCNLP Workshopon NLP for Less Privileged Languages, 2008. Hyderabad,India.Krishnakumar Veeraraghavan and Indrani Roy. 2008.Acharya - a text editor and framework for workingwith indic scripts. In Proceedings of the IJCNLPWorkshop on NLP for Less Privileged Languages, Hyderabad,India.Daniel Zeman and Philip Resnik. 2008. Cross-languageparser adaptation between related languages. In Proceedingsof the IJCNLP Workshop on NLP for LessPrivileged Languages, Hyderabad, India.Rada Mihalcea. 2008. How to add a new language on thenlp map: Building resources and tools for languageswith scarce resources. Tutorial at the Third InternationalJoint Conference on Natural Language Processing(IJCNLP). Hyderabad, India.Jackson Muhirwe and Trond Trosterud. 2008. Finitestate solutions for reduplication in kinyarwanda language.In Proceedings of the IJCNLP Workshop onNLP for Less Privileged Languages, Hyderabad, India.Chirag Patel and Karthik Gali. 2008. Part of speech taggerfor gujarati using conditional random fields. InProceedings of the IJCNLP Workshop on NLP for LessPrivileged Languages, Hyderabad, India.Hammam Riza. 2008. Indigenous languages of indonesia:Creating language resources for language preservation.In Proceedings of the IJCNLP Workshop onNLP for Less Privileged Languages, Hyderabad, India.12

Natural Language Processing for Less Privileged Languages

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?