Journal of Emerging Technologies in Web Intelligence Contents

More documents

Recommendations

Info

148 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 2, NO. 2, MAY 2010Web Based Hindi to PunjabiMachine Translation SystemVishal Goyal and Gurpreet Singh LehalDepartment of Computer Science, Punjabi University, Patiala, Punjab, India{vishal.pup,gslehal}@gmail.comAbstract - Hindi and Punjabi are closely related languageswith lots of similarities in syntax and vocabulary BothPunjabi and Hindi languages have originated from Sanskritwhich is one of the oldest language. In terms of speakers,Hindi is third most widely spoken language and Punjabi istwelfth most widely spoken language. Punjabi language ismostly used in the Northern India and in some areas ofPakistan as well as in UK, Canada and USA. Hindi is thenational language of India and is spoken and used by thepeople all over the country. In the present research, BasicHindi to Punjabi machine translation system using directtranslation approach has been developed. The results ofthis translation system are surprisingly good. The systemincludes lexicon based translation, transliteration andcontinuously improving the system through machinelearning module. It also takes care of basic word sensedisambiguation.Index Terms - Machine Translation System, Closely relatedLanguages, Hindi, Punjabi, Natural Language Processing,Computational Linguistics, Transliteration.I. INTRODUCTIONMT is a field of research that has been around sincethe birth of electronic computers. Warren Weaver, adirector of the Rockefeller Foundation received muchcredit for bringing the concept of MT to the publicwhen he published an influential paper on usingcomputers for translation in 1949. MT is the name forcomputerized methods that automate all or part of theprocess of translating from one human language toanother. Fully-automatic general purpose high qualitymachine translation system (FGH-MT) is extremelydifficult to build. In fact, there is no system in theworld of any pair of languages which qualifies to becalled FGH-MT. This paper explains the methodologyfollowed for developing the machine translationsystem between closely related languages – Hindi andPunjabi. Closely related languages have lots ofsimilarities in syntax and vocabulary. MachineTranslation between closely related languages is easierthan between language pairs that are not related witheach other. Having many parts of their grammars andvocabularies in common reduces the amount of effortneeded to develop a translation system between relatedlanguages. Closely related languages are the languages ofpeople who have similar cultures and common historicalroots.II. HISTORYThe first attempt to verify the hypothesis that relatedlanguages are easier to translate started in mid 80s atCharles University in Prague [FEMTI; Hajic et al. 2000].The project was called RUSLAN and aimed at thetranslation of documentation in the domain of operatingsystems for mainframe computers. From that date to tilldate so many examples are there in history which supportthe argument that with close languages, the quality ofMT system, with simple techniques, is better. To name afew one are CESILKO (a system for translating Czechand Slovak), MT system for translating Turkish ToCrimean Tatar etc. We are also trying to strengthen thesame concept by experimenting with a word for worddirect translation system for Hindi to Punjabi. Theselanguages are very closely related and have manyfeatures in common.III. SYSTEM DESCRIPTIONThe major task behind direct Machine translationsystem is developing an exhaustive lexicon consisting ofsource language words along with its correspondingtranslated version of the target language. There is nomachine readable dictionary available for Hindi toPunjabi language. Two traditional dictionaries – onefrom Bhasha Vibhag, Patiala and another from NationalBook Trust has been published. We got it digitized andmoulded required for machine translation purpose whichis itself a big job. Now lexicon consists of approx.1,00,000 words. This lexicon is used for word for wordtranslation. Extending the lexicon demands large hindicorpus. Sometimes it is available in Unicode format andsometimes it is available in number of other non standardfonts like Susha, Agra, Krutidev etc. which needs to beconverted into Unicode first. Conversion is being donethrough Font Converter that converts non standard fontsinto Unicode Format. The beauty of the developed Fontconverter is that it is able to convert MS-Access files, MSWord files, HTML files and Text Files. Because thecorpus can be in any file type, so it handles all thepossible file types. Besides translation it performs thetask of transliteration as well. Transliteration means toreplace character by character of word from sourcelanguage character to target language character likeेमचंद into ਪੇਮਚਂਦ. This system is basically an extension© 2010 ACADEMY PUBLISHERdoi:10.4304/jetwi.2.2.148-151
JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 2, NO. 2, MAY 2010 149of previous system that does not handle any word sensedisambiguation. The above said system just checks thewords to be translated in the dictionary, if found it isreplaced with the translated version stored in thedictionary, otherwise it is transliterated. But now in theextended system, the lexicon is divided into two parts –one table consists of words with no disambiguation andsecond consists of words that have multiple meaningsdepending upon the context of the word in which it hasbeen used in the sentence.IV. SYSTEM ARCHITECTUREThe architecture for the HPMTS (Hindi to PunjabiMachine Translation System) consists of number ofmodules that are listed below:a. Training the system with training corpusb. Input Text Font Conversion into UnicodeFormatc. Hindi Text Normalizationd. Finding and Replacing Collocationse. Finding and replacing named entitiesf. Word to word translation using lexiconsg. Resolving Ambiguity among wordsh. Transliteration of wordsi. Post Processingj. Improving the accuracy of the system throughmachine learning during every translation job.k. Testing the system using test corpus other thantrain corpusIn the above architecture, the most important part andstarting point is to train the system. Train the systemmeans generating the lexicon using the already existingcorpus. The second module is optional and is skipped ifthe inputted text is already in Unicode format. UnicodeFont requirement arises due to internalization of thesystem and making the system free from specific fontdependency. This font converter can be also used forconverting the non-Unicode corpus into Unicode formatcorpus. Indian language words face spellingstandardization issues, thereby resulting in multiplespelling variants for the same word. The main reason forthis phenomenon can be attributed to the phonetic natureof Indian Languages and multiple dialects. To give anidea of this data problem, these words were found –मंिजल, मिजल, मंिज़ल Third module is Hindi TextNormalization that solves this spelling variant problem.Hindi text is normalized into standard spellings before itgoes for translation. Next Module of the system find andreplaces all the collocations using the lexicon enteries. ACollocation is an expression consisting of two or morewords that correspond to some conventional way ofsaying things. Or in the other words of Firth (1957:181) :“Collocations of a given word are statements of thehabitual or customary places of that word”. This modulehelps in increasing the accuracy of the translation.Generating Lexicon for Collocations is itself achallenging task. Then comes the turn of the heart of thesystem – word for word translation uses the lexicon. Thissearch for the Hindi word in the lexicon and replaces itwith the corresponding Punjabi translated version presentin the lexicon. If this Hindi word is not found in thelexicon it searches that word in the database ofambiguous words, if found using tri-gram approach itresolves the ambiguity of word and replaces it withcorrect Punjabi meaning among multiple Punjabimeanings. For Example, the hindi word सरूप can betranslated into either of the two Punjabi words - ਸਮਾਨ,ਸੁ ੰ ਦਰ. But how will the system decide which word tochoose is basically to know the context in which theHindi word सरूप has been used in the sentence. If theword is not found in both the tables it means it is notavailable in the database and need to be transliterated.For improving the accuracy of the system, this is must toknow the system about which new words have beencome across and if they have been transliteratedaccurately or not. If they were not present in the databaseand need to be present, it is added to lexicon for futuretranslations. If it has been translated wrongly butrequired one, it is corrected first before adding to thelexicon. In this way this is the ongoing improvement ofthe system performance during every translation exercisethrough machine learning module. Post ProcessingModule takes into consideration some commongrammatical mistakes that has been done duringtranslation phases and based on the rules framed, itremoves those mistakes and increases the accuracy to thesystem. Now system has been trained a lot by number oftranslation exercises, it is time to check the accuracy ofthe system by testing the system through test data otherthan the data used for training. Testing the system is alsovery tedious task. First step in it is to prepare the testcases that covers all the possibilities.V. WEB BASED TOOLResearch must not be restricted to papers, It must bepropagated to public for use and test. Taking this aim, thewhole system has been developed as a web tool and isonline for use fre of cost. The website address ishttp://h2p.learnpunjabi.org/ . Following are the featuresof this web tool:a. Hindi Text can be written in Unicode encodingby using the most popular Hindi Font Krutidev.This concept is very useful for those who are inhabit of typing the text in Krutidev and laterthey find some source for converting them intoUnicode encoding. Thus, this feature has solvedtheir purpose in very easy manner. Now, theywill type in their style and the typed matter willalso be in Unicode.b. The text can also be input to the system fortranslation through text file. File can be readusing the Browse button provided.c. Input text can be translated into Punjabi text byjust clicking the Translate button. Withinseconds, the text is translated into Punjabi.© 2010 ACADEMY PUBLISHER
Page 1 and 2:
Journal of Emerging Technologies in
Page 3 and 4:
JOURNAL OF EMERGING TECHNOLOGIES IN
Page 6 and 7:
82 JOURNAL OF EMERGING TECHNOLOGIES
Page 8 and 9:
Page 10 and 11:
Page 12 and 13:
Page 14 and 15:
Page 16 and 17:
Page 18 and 19:
Page 20 and 21:
Page 22 and 23: 98 JOURNAL OF EMERGING TECHNOLOGIES
Page 24 and 25: 100 JOURNAL OF EMERGING TECHNOLOGIE
Page 81 and 82: Aims and ScopeCall for Papers and S
show all

Journal of Emerging Technologies in Web Intelligence Contents

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?