
Journal of Emerging Technologies in Web Intelligence Contents



JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 2, NO. 2, MAY 2010 149

of the previous system, which does not handle any word sense disambiguation. That system simply looks up each word to be translated in the dictionary: if the word is found, it is replaced with the translation stored in the dictionary; otherwise it is transliterated. In the extended system, the lexicon is divided into two parts: one table holds words that need no disambiguation, and the second holds words that have multiple meanings depending on the context in which the word is used in the sentence.

IV. SYSTEM ARCHITECTURE

The architecture of the HPMTS (Hindi to Punjabi Machine Translation System) consists of a number of modules, listed below:

a. Training the system with a training corpus
b. Input text font conversion into Unicode format
c. Hindi text normalization
d. Finding and replacing collocations
e. Finding and replacing named entities
f. Word to word translation using lexicons
g. Resolving ambiguity among words
h. Transliteration of words
i. Post processing
j. Improving the accuracy of the system through machine learning during every translation job
k. Testing the system using a test corpus other than the training corpus

In this architecture, the most important part, and the starting point, is training the system. Training the system means generating the lexicon from an already existing corpus. The second module is optional and is skipped if the input text is already in Unicode format. The Unicode requirement arises from the internationalization of the system and from the need to free it from dependence on any specific font. The font converter can also be used to convert a non-Unicode corpus into a Unicode corpus.

Indian language words face spelling standardization issues, which result in multiple spelling variants for the same word. The main reason for this phenomenon is the phonetic nature of Indian languages and their multiple dialects. To give an idea of the problem, these variants of one word were found: मंजिल, मजिल, मंज़िल. The third module, Hindi text normalization, solves this spelling variant problem: Hindi text is normalized into standard spellings before it goes for translation.

The next module of the system finds and replaces all collocations using the lexicon entries. A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things.
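The normalization and collocation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper does not say how variants are mapped or how collocations are matched, so the choice of canonical spelling, the placeholder tokens (`hi_a`, `pa_ab`, and so on), and the greedy longest-match strategy are all assumptions made for illustration.

```python
# Hypothetical canonical spelling; the text lists the variants but does not
# say which one the system treats as standard.
CANONICAL = "मंज़िल"

# Hypothetical variant table: observed spelling -> canonical spelling.
VARIANTS = {
    "मंजिल": CANONICAL,
    "मजिल": CANONICAL,
}

# Hypothetical collocation lexicon: tuple of source tokens -> target phrase.
# The placeholder strings stand in for real Hindi and Punjabi entries.
COLLOCATIONS = {
    ("hi_a", "hi_b"): "pa_ab",
    ("hi_a", "hi_b", "hi_c"): "pa_abc",
}


def normalize(tokens):
    """Map every known spelling variant to its canonical form."""
    return [VARIANTS.get(token, token) for token in tokens]


def replace_collocations(tokens, collocations=COLLOCATIONS):
    """Replace multi-word expressions, preferring the longest match."""
    max_len = max((len(key) for key in collocations), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in collocations:
                out.append(collocations[key])
                i += n
                break
        else:
            # No collocation starts here; keep the token for later stages.
            out.append(tokens[i])
            i += 1
    return out
```

Running normalization first means the collocation lexicon only needs entries in standard spelling. A fuller version would also mark which output items are already translated so that later modules skip them.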
Or, in the words of Firth (1957:181): "Collocations of a given word are statements of the habitual or customary places of that word". This module helps increase the accuracy of the translation. Generating a lexicon for collocations is itself a challenging task.

Next comes the heart of the system: word-for-word translation using the lexicon. This module searches for the Hindi word in the lexicon and replaces it with the corresponding Punjabi translation stored there. If the Hindi word is not found in the lexicon, the module searches for it in the database of ambiguous words; if it is found there, a tri-gram approach resolves the ambiguity and the word is replaced with the correct Punjabi meaning among the multiple candidates. For example, the Hindi word सरूप can be translated into either of two Punjabi words: ਸਮਾਨ, ਸੁੰਦਰ. To decide which one to choose, the system must know the context in which सरूप is used in the sentence. If the word is found in neither table, it is not available in the database and needs to be transliterated.

To improve its accuracy, the system must know which new words it has come across and whether they were transliterated accurately. A word that was not present in the database but needs to be is added to the lexicon for future translations.
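The lookup order described above (lexicon first, then the ambiguous-word table, then transliteration) can be sketched as follows. This is an illustration under stated assumptions, not the system's actual code: the paper names a tri-gram approach without giving details, so the (previous word, word, next word) window, the count table, the fallback to the first listed sense, and the `transliterate` stub are hypothetical. Only the word सरूप and its two Punjabi candidates come from the text.

```python
from collections import Counter

WORD = "सरूप"               # ambiguous Hindi word from the text
SENSES = ("ਸਮਾਨ", "ਸੁੰਦਰ")    # its two Punjabi candidates from the text

# Hypothetical unambiguous table with a placeholder entry.
LEXICON = {"hi_x": "pa_x"}

# Ambiguous table: source word -> candidate translations.
AMBIGUOUS = {WORD: list(SENSES)}

# Hypothetical counts gathered from a training corpus:
# (previous word, ambiguous word, next word) -> how often each sense was used.
TRIGRAM_COUNTS = {
    ("hi_p", WORD, "hi_n"): Counter({SENSES[1]: 3, SENSES[0]: 1}),
}


def transliterate(word):
    """Stand-in for the real transliteration module; it only tags the word."""
    return f"<translit:{word}>"


def translate_tokens(tokens):
    """Return (translated tokens, words that had to be transliterated)."""
    out, unseen = [], []
    for i, word in enumerate(tokens):
        if word in LEXICON:
            out.append(LEXICON[word])
        elif word in AMBIGUOUS:
            prev_word = tokens[i - 1] if i > 0 else "<s>"
            next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            counts = TRIGRAM_COUNTS.get((prev_word, word, next_word))
            if counts:
                # Pick the sense seen most often in this context.
                out.append(counts.most_common(1)[0][0])
            else:
                # Assumed fallback: first listed sense when context is unseen.
                out.append(AMBIGUOUS[word][0])
        else:
            out.append(transliterate(word))
            unseen.append(word)  # kept for later review and lexicon updates
    return out, unseen
```

The second return value collects the words that fell through to transliteration, which is the information the system needs in order to decide what to add to the lexicon afterwards.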
If a new word has been translated wrongly but is needed, it is corrected before being added to the lexicon. In this way, the machine learning module improves system performance on an ongoing basis during every translation exercise. The Post Processing module takes into account some common grammatical mistakes made during the translation phases and, based on the rules framed, removes them, increasing the accuracy of the system.

Once the system has been trained through a number of translation exercises, it is time to check its accuracy by testing it on data other than the data used for training. Testing the system is also a very tedious task. The first step is to prepare test cases that cover all the possibilities.

V. WEB BASED TOOL

Research must not be restricted to papers; it must be propagated to the public for use and testing. With this aim, the whole system has been developed as a web tool and is online for use free of cost. The website address is http://h2p.learnpunjabi.org/ . The following are the features of this web tool:

a. Hindi text can be written in Unicode encoding using the most popular Hindi font, Krutidev. This is very useful for those who are in the habit of typing text in Krutidev and who later have to find some means of converting it into Unicode encoding. This feature serves their purpose easily: they type in their usual style, and the typed matter is also in Unicode.
b. Text can also be input to the system for translation through a text file. The file can be read using the Browse button provided.
c. Input text can be translated into Punjabi by just clicking the Translate button. Within seconds, the text is translated into Punjabi.

© 2010 ACADEMY PUBLISHER
