148 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 2, NO. 2, MAY 2010<strong>Web</strong> Based H<strong>in</strong>di to PunjabiMach<strong>in</strong>e Translation SystemVishal Goyal and Gurpreet S<strong>in</strong>gh LehalDepartment <strong>of</strong> Computer Science, Punjabi University, Patiala, Punjab, India{vishal.pup,gslehal}@gmail.comAbstract - H<strong>in</strong>di and Punjabi are closely related languageswith lots <strong>of</strong> similarities <strong>in</strong> syntax and vocabulary BothPunjabi and H<strong>in</strong>di languages have orig<strong>in</strong>ated from Sanskritwhich is one <strong>of</strong> the oldest language. In terms <strong>of</strong> speakers,H<strong>in</strong>di is third most widely spoken language and Punjabi istwelfth most widely spoken language. Punjabi language ismostly used <strong>in</strong> the Northern India and <strong>in</strong> some areas <strong>of</strong>Pakistan as well as <strong>in</strong> UK, Canada and USA. H<strong>in</strong>di is thenational language <strong>of</strong> India and is spoken and used by thepeople all over the country. In the present research, BasicH<strong>in</strong>di to Punjabi mach<strong>in</strong>e translation system us<strong>in</strong>g directtranslation approach has been developed. The results <strong>of</strong>this translation system are surpris<strong>in</strong>gly good. The system<strong>in</strong>cludes lexicon based translation, transliteration andcont<strong>in</strong>uously improv<strong>in</strong>g the system through mach<strong>in</strong>elearn<strong>in</strong>g module. It also takes care <strong>of</strong> basic word sensedisambiguation.Index Terms - Mach<strong>in</strong>e Translation System, Closely relatedLanguages, H<strong>in</strong>di, Punjabi, Natural Language Process<strong>in</strong>g,Computational L<strong>in</strong>guistics, Transliteration.I. INTRODUCTIONMT is a field <strong>of</strong> research that has been around s<strong>in</strong>cethe birth <strong>of</strong> electronic computers. Warren Weaver, adirector <strong>of</strong> the Rockefeller Foundation received muchcredit for br<strong>in</strong>g<strong>in</strong>g the concept <strong>of</strong> MT to the publicwhen he published an <strong>in</strong>fluential paper on us<strong>in</strong>gcomputers for translation <strong>in</strong> 1949. MT is the name forcomputerized methods that automate all or part <strong>of</strong> theprocess <strong>of</strong> translat<strong>in</strong>g from one human language toanother. Fully-automatic general purpose high qualitymach<strong>in</strong>e translation system (FGH-MT) is extremelydifficult to build. In fact, there is no system <strong>in</strong> theworld <strong>of</strong> any pair <strong>of</strong> languages which qualifies to becalled FGH-MT. This paper expla<strong>in</strong>s the methodologyfollowed for develop<strong>in</strong>g the mach<strong>in</strong>e translationsystem between closely related languages – H<strong>in</strong>di andPunjabi. Closely related languages have lots <strong>of</strong>similarities <strong>in</strong> syntax and vocabulary. Mach<strong>in</strong>eTranslation between closely related languages is easierthan between language pairs that are not related witheach other. Hav<strong>in</strong>g many parts <strong>of</strong> their grammars andvocabularies <strong>in</strong> common reduces the amount <strong>of</strong> effortneeded to develop a translation system between relatedlanguages. Closely related languages are the languages <strong>of</strong>people who have similar cultures and common historicalroots.II. HISTORYThe first attempt to verify the hypothesis that relatedlanguages are easier to translate started <strong>in</strong> mid 80s atCharles University <strong>in</strong> Prague [FEMTI; Hajic et al. 2000].The project was called RUSLAN and aimed at thetranslation <strong>of</strong> documentation <strong>in</strong> the doma<strong>in</strong> <strong>of</strong> operat<strong>in</strong>gsystems for ma<strong>in</strong>frame computers. From that date to tilldate so many examples are there <strong>in</strong> history which supportthe argument that with close languages, the quality <strong>of</strong>MT system, with simple techniques, is better. To name afew one are CESILKO (a system for translat<strong>in</strong>g Czechand Slovak), MT system for translat<strong>in</strong>g Turkish ToCrimean Tatar etc. We are also try<strong>in</strong>g to strengthen thesame concept by experiment<strong>in</strong>g with a word for worddirect translation system for H<strong>in</strong>di to Punjabi. Theselanguages are very closely related and have manyfeatures <strong>in</strong> common.III. SYSTEM DESCRIPTIONThe major task beh<strong>in</strong>d direct Mach<strong>in</strong>e translationsystem is develop<strong>in</strong>g an exhaustive lexicon consist<strong>in</strong>g <strong>of</strong>source language words along with its correspond<strong>in</strong>gtranslated version <strong>of</strong> the target language. There is nomach<strong>in</strong>e readable dictionary available for H<strong>in</strong>di toPunjabi language. Two traditional dictionaries – onefrom Bhasha Vibhag, Patiala and another from NationalBook Trust has been published. We got it digitized andmoulded required for mach<strong>in</strong>e translation purpose whichis itself a big job. Now lexicon consists <strong>of</strong> approx.1,00,000 words. This lexicon is used for word for wordtranslation. Extend<strong>in</strong>g the lexicon demands large h<strong>in</strong>dicorpus. Sometimes it is available <strong>in</strong> Unicode format andsometimes it is available <strong>in</strong> number <strong>of</strong> other non standardfonts like Susha, Agra, Krutidev etc. which needs to beconverted <strong>in</strong>to Unicode first. Conversion is be<strong>in</strong>g donethrough Font Converter that converts non standard fonts<strong>in</strong>to Unicode Format. The beauty <strong>of</strong> the developed Fontconverter is that it is able to convert MS-Access files, MSWord files, HTML files and Text Files. Because thecorpus can be <strong>in</strong> any file type, so it handles all thepossible file types. Besides translation it performs thetask <strong>of</strong> transliteration as well. Transliteration means toreplace character by character <strong>of</strong> word from sourcelanguage character to target language character likeेमचंद <strong>in</strong>to ਪੇਮਚਂਦ. This system is basically an extension© 2010 ACADEMY PUBLISHERdoi:10.4304/jetwi.2.2.148-151
JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 2, NO. 2, MAY 2010 149<strong>of</strong> previous system that does not handle any word sensedisambiguation. The above said system just checks thewords to be translated <strong>in</strong> the dictionary, if found it isreplaced with the translated version stored <strong>in</strong> thedictionary, otherwise it is transliterated. But now <strong>in</strong> theextended system, the lexicon is divided <strong>in</strong>to two parts –one table consists <strong>of</strong> words with no disambiguation andsecond consists <strong>of</strong> words that have multiple mean<strong>in</strong>gsdepend<strong>in</strong>g upon the context <strong>of</strong> the word <strong>in</strong> which it hasbeen used <strong>in</strong> the sentence.IV. SYSTEM ARCHITECTUREThe architecture for the HPMTS (H<strong>in</strong>di to PunjabiMach<strong>in</strong>e Translation System) consists <strong>of</strong> number <strong>of</strong>modules that are listed below:a. Tra<strong>in</strong><strong>in</strong>g the system with tra<strong>in</strong><strong>in</strong>g corpusb. Input Text Font Conversion <strong>in</strong>to UnicodeFormatc. H<strong>in</strong>di Text Normalizationd. F<strong>in</strong>d<strong>in</strong>g and Replac<strong>in</strong>g Collocationse. F<strong>in</strong>d<strong>in</strong>g and replac<strong>in</strong>g named entitiesf. Word to word translation us<strong>in</strong>g lexiconsg. Resolv<strong>in</strong>g Ambiguity among wordsh. Transliteration <strong>of</strong> wordsi. Post Process<strong>in</strong>gj. Improv<strong>in</strong>g the accuracy <strong>of</strong> the system throughmach<strong>in</strong>e learn<strong>in</strong>g dur<strong>in</strong>g every translation job.k. Test<strong>in</strong>g the system us<strong>in</strong>g test corpus other thantra<strong>in</strong> corpusIn the above architecture, the most important part andstart<strong>in</strong>g po<strong>in</strong>t is to tra<strong>in</strong> the system. Tra<strong>in</strong> the systemmeans generat<strong>in</strong>g the lexicon us<strong>in</strong>g the already exist<strong>in</strong>gcorpus. The second module is optional and is skipped ifthe <strong>in</strong>putted text is already <strong>in</strong> Unicode format. UnicodeFont requirement arises due to <strong>in</strong>ternalization <strong>of</strong> thesystem and mak<strong>in</strong>g the system free from specific fontdependency. This font converter can be also used forconvert<strong>in</strong>g the non-Unicode corpus <strong>in</strong>to Unicode formatcorpus. Indian language words face spell<strong>in</strong>gstandardization issues, thereby result<strong>in</strong>g <strong>in</strong> multiplespell<strong>in</strong>g variants for the same word. The ma<strong>in</strong> reason forthis phenomenon can be attributed to the phonetic nature<strong>of</strong> Indian Languages and multiple dialects. To give anidea <strong>of</strong> this data problem, these words were found –मंिजल, मिजल, मंिज़ल Third module is H<strong>in</strong>di TextNormalization that solves this spell<strong>in</strong>g variant problem.H<strong>in</strong>di text is normalized <strong>in</strong>to standard spell<strong>in</strong>gs before itgoes for translation. Next Module <strong>of</strong> the system f<strong>in</strong>d andreplaces all the collocations us<strong>in</strong>g the lexicon enteries. ACollocation is an expression consist<strong>in</strong>g <strong>of</strong> two or morewords that correspond to some conventional way <strong>of</strong>say<strong>in</strong>g th<strong>in</strong>gs. Or <strong>in</strong> the other words <strong>of</strong> Firth (1957:181) :“Collocations <strong>of</strong> a given word are statements <strong>of</strong> thehabitual or customary places <strong>of</strong> that word”. This modulehelps <strong>in</strong> <strong>in</strong>creas<strong>in</strong>g the accuracy <strong>of</strong> the translation.Generat<strong>in</strong>g Lexicon for Collocations is itself achalleng<strong>in</strong>g task. Then comes the turn <strong>of</strong> the heart <strong>of</strong> thesystem – word for word translation uses the lexicon. Thissearch for the H<strong>in</strong>di word <strong>in</strong> the lexicon and replaces itwith the correspond<strong>in</strong>g Punjabi translated version present<strong>in</strong> the lexicon. If this H<strong>in</strong>di word is not found <strong>in</strong> thelexicon it searches that word <strong>in</strong> the database <strong>of</strong>ambiguous words, if found us<strong>in</strong>g tri-gram approach itresolves the ambiguity <strong>of</strong> word and replaces it withcorrect Punjabi mean<strong>in</strong>g among multiple Punjabimean<strong>in</strong>gs. For Example, the h<strong>in</strong>di word सरूप can betranslated <strong>in</strong>to either <strong>of</strong> the two Punjabi words - ਸਮਾਨ,ਸੁ ੰ ਦਰ. But how will the system decide which word tochoose is basically to know the context <strong>in</strong> which theH<strong>in</strong>di word सरूप has been used <strong>in</strong> the sentence. If theword is not found <strong>in</strong> both the tables it means it is notavailable <strong>in</strong> the database and need to be transliterated.For improv<strong>in</strong>g the accuracy <strong>of</strong> the system, this is must toknow the system about which new words have beencome across and if they have been transliteratedaccurately or not. If they were not present <strong>in</strong> the databaseand need to be present, it is added to lexicon for futuretranslations. If it has been translated wrongly butrequired one, it is corrected first before add<strong>in</strong>g to thelexicon. In this way this is the ongo<strong>in</strong>g improvement <strong>of</strong>the system performance dur<strong>in</strong>g every translation exercisethrough mach<strong>in</strong>e learn<strong>in</strong>g module. Post Process<strong>in</strong>gModule takes <strong>in</strong>to consideration some commongrammatical mistakes that has been done dur<strong>in</strong>gtranslation phases and based on the rules framed, itremoves those mistakes and <strong>in</strong>creases the accuracy to thesystem. Now system has been tra<strong>in</strong>ed a lot by number <strong>of</strong>translation exercises, it is time to check the accuracy <strong>of</strong>the system by test<strong>in</strong>g the system through test data otherthan the data used for tra<strong>in</strong><strong>in</strong>g. Test<strong>in</strong>g the system is alsovery tedious task. First step <strong>in</strong> it is to prepare the testcases that covers all the possibilities.V. WEB BASED TOOLResearch must not be restricted to papers, It must bepropagated to public for use and test. Tak<strong>in</strong>g this aim, thewhole system has been developed as a web tool and isonl<strong>in</strong>e for use fre <strong>of</strong> cost. The website address ishttp://h2p.learnpunjabi.org/ . Follow<strong>in</strong>g are the features<strong>of</strong> this web tool:a. H<strong>in</strong>di Text can be written <strong>in</strong> Unicode encod<strong>in</strong>gby us<strong>in</strong>g the most popular H<strong>in</strong>di Font Krutidev.This concept is very useful for those who are <strong>in</strong>habit <strong>of</strong> typ<strong>in</strong>g the text <strong>in</strong> Krutidev and laterthey f<strong>in</strong>d some source for convert<strong>in</strong>g them <strong>in</strong>toUnicode encod<strong>in</strong>g. Thus, this feature has solvedtheir purpose <strong>in</strong> very easy manner. Now, theywill type <strong>in</strong> their style and the typed matter willalso be <strong>in</strong> Unicode.b. The text can also be <strong>in</strong>put to the system fortranslation through text file. File can be readus<strong>in</strong>g the Browse button provided.c. Input text can be translated <strong>in</strong>to Punjabi text byjust click<strong>in</strong>g the Translate button. With<strong>in</strong>seconds, the text is translated <strong>in</strong>to Punjabi.© 2010 ACADEMY PUBLISHER