13.07.2015 Views

Arab Knowledge Report 2009: Towards Productive

[...] requires the adaptation of a number of available technologies to make them Arabic-compliant. Technical solutions must also be found for certain questions, which fall into two groups: the first connected with the Arabic language itself, the second with the preparation of Arabic content for in-depth processing. An example of the first group of issues is optical character recognition technology for Arabic letters and for reading from the screen. The second group contains spellchecking and grammar checking systems. Developing the software necessary to perform these tasks is extremely difficult. Automated grammar checking, for example, must handle the difficulty posed by the excessive length and flexible word order of Arabic sentences when compared to the strict word order of English. Some difficulties are attributable to the lack of a standard punctuation system and to the need, in grammar checking, for a coherent system to parse sentences as a basis for error checking.

Preparing Arabic texts for deeper processing (preparatory to indexing or searching, for example) requires the development of software that permits morphological analysis, automatic vocalisation,[37] and automated parsing. A system for machine parsing Arabic sentences is considered a key requirement for Arabic to catch up with second generation applications of natural language processing. These include systems for machine comprehension and narrative structural analysis of the languages.
Some Arab and foreign businesses are making notable efforts in these fields, but the pace of work and the results achieved remain insufficient (see Box 4-5).

Discussion of the Arabic language is not limited to the generation and unification of technical terms among groups of those working in ICT but includes everything connected to Arabic-language word

BOX 4-5
Arabic Language Processing Systems: machine translation, grammar checking, and searching

The production and deployment of Arabic digital content on the net requires the availability of translation systems to and from the main languages. More effective Arabic search engines are also required. Technologies to mine, process, and retrieve content also require automated indexing and summarising systems.[38] In addition, it is essential to develop advanced systems for automatic speech processing, including automated speech analysis, generation, and recognition in Arabic.[39]

Machine translation systems: a number of software systems for machine translation to and from Arabic exist. One prominent example is the Google system. This adopts statistical methods which make it impossible for the quality of its translations of texts to go beyond very modest limits, rendering it unfit for serious translation. There is also software that adopts an overly simple linguistically and lexically based analytical model. Since their launch around three decades ago, attempts to improve the performance of such machine translation systems have failed. Another system, developed by an Arab company, is based on a transformational model and relies on a limited base of linguistic rules and lexical data, which limits the possibilities of improving its performance.

Grammar checking: neither of the two grammar checking systems in use uses an automated parser, relying instead on a store of contextual examples.
They are thus incapable of recognising grammatical errors that occur when the words and syntactical elements in question are far apart, and of adding the syntactically significant final vowels to words, especially in the long sentences prevalent in Arabic texts. Of the three systems for morphological analysis, two are distinguished by complete linguistic coverage of the whole of the Arabic lexicon, and one of these enjoys a coherent linguistic foundation which makes it capable of deriving semantic elements from morphological and lexical aspects. Among the faults of the third system are the errors it generates when dealing with words with multiple and compound affixes.

Arabic search engines: there is an extremely limited number of search engines for Arabic texts on the internet. Many of the sites which allow the discovery of Arabic texts are no more than directories comprising lists of Arabic website addresses (the portal www.arabsgate.com is a prime example). The Google Arabic search engine is reckoned to be the most used Arabic search engine on the net. In addition to being far from meeting most of the search requirements for cultural and educational applications, it also enjoys only modest success in meeting most of the requirements of the ordinary user. This search engine does not take into account the complex derivational and morphological formation of Arabic words in comparison with the simple formation of English words for which the system was designed. It searches for a word as it appears in the text without paying attention to its lexical lemma, which may appear in as many as a thousand forms as a result of the affixing of prefixes and suffixes to the Arabic word. This search engine is also incapable of broadening the scope of a search on the basis of the user's search terms.
Thus, when the user enters a word like “boy” (fata), “desert” (sahra’), or “tree” (shajara), the search engine will not return texts containing the plurals “boys” (fityan), “deserts” (sahara), or “trees” (ashjar). And when searching for a verb, if the user enters a third-person form “[he] condemns” (yudin), Google will not return other related morphological forms like “[you/she] condemn/s” (tudin), “[we] condemn” (nudin), and “condemners” (mudinun).

Adapted from the draft background paper for the Report by ‘Abd al-Ilah al-Diwahji, in Arabic.
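
The search behaviour the box finds lacking, broadening a query to related morphological forms, can be illustrated with a minimal Python sketch. It assumes a tiny hand-written table of the transliterated forms quoted above; the names RELATED_FORMS, expand_query, exact_search, and expanded_search are hypothetical, and the "documents" are artificial token strings rather than real Arabic text. A working system would presumably obtain related forms from a morphological analyser instead of a lookup table.

```python
# Hypothetical toy data: groups of related transliterated forms, taken from
# the examples in the box. A real system would obtain these from a
# morphological analyser, not a hand-written table.
RELATED_FORMS = [
    {"fata", "fityan"},                      # boy / boys
    {"sahra'", "sahara"},                    # desert / deserts
    {"shajara", "ashjar"},                   # tree / trees
    {"yudin", "tudin", "nudin", "mudinun"},  # forms related to "condemn"
]

# Reverse index: surface form -> the whole group it belongs to.
FORM_TO_GROUP = {form: group for group in RELATED_FORMS for form in group}


def expand_query(term):
    """Return the term plus every related form known to the toy table."""
    return set(FORM_TO_GROUP.get(term, {term}))


def exact_search(term, documents):
    """Surface-form matching: a word is found only as it appears in the text."""
    return [doc for doc in documents if term in doc.split()]


def expanded_search(term, documents):
    """Match a document if it contains any related form of the term."""
    forms = expand_query(term)
    return [doc for doc in documents if forms & set(doc.split())]


# Artificial "documents": whitespace-separated transliterated tokens.
DOCUMENTS = ["fityan ashjar", "sahara", "nudin mudinun"]

print(exact_search("fata", DOCUMENTS))     # no document contains "fata" itself
print(expanded_search("fata", DOCUMENTS))  # matches via the plural "fityan"
```

The lookup table stands in for the hard part. The point of the sketch is only that matching on a group of related forms, rather than on the literal query string, is what lets a search for "fata" reach a text containing "fityan".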
