12.07.2015 Views

Journal of Emerging Technologies in Web Intelligence Contents



148 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 2, NO. 2, MAY 2010

Web Based Hindi to Punjabi Machine Translation System

Vishal Goyal and Gurpreet Singh Lehal
Department of Computer Science, Punjabi University, Patiala, Punjab, India
{vishal.pup,gslehal}@gmail.com

Abstract - Hindi and Punjabi are closely related languages with many similarities in syntax and vocabulary. Both originated from Sanskrit, which is one of the oldest languages. In terms of speakers, Hindi is the third most widely spoken language and Punjabi is the twelfth. Punjabi is mostly used in Northern India and in some areas of Pakistan, as well as in the UK, Canada and the USA. Hindi is the national language of India and is spoken and used by people all over the country. In the present research, a basic Hindi to Punjabi machine translation system using the direct translation approach has been developed. The results of this translation system are surprisingly good. The system includes lexicon based translation, transliteration, and a machine learning module that continuously improves the system. It also takes care of basic word sense disambiguation.

Index Terms - Machine Translation System, Closely related Languages, Hindi, Punjabi, Natural Language Processing, Computational Linguistics, Transliteration.

I. INTRODUCTION

MT is a field of research that has been around since the birth of electronic computers. Warren Weaver, a director of the Rockefeller Foundation, received much credit for bringing the concept of MT to the public when he published an influential paper on using computers for translation in 1949. MT is the name for computerized methods that automate all or part of the process of translating from one human language to another. A fully-automatic general purpose high quality machine translation system (FGH-MT) is extremely difficult to build. In fact, there is no system in the world, for any pair of languages, that qualifies to be called FGH-MT. This paper explains the methodology followed for developing a machine translation system between two closely related languages: Hindi and Punjabi. Closely related languages have many similarities in syntax and vocabulary. Machine translation between closely related languages is easier than between language pairs that are not related to each other. Having many parts of their grammars and vocabularies in common reduces the amount of effort needed to develop a translation system between related languages. Closely related languages are the languages of people who have similar cultures and common historical roots.

II. HISTORY

The first attempt to verify the hypothesis that related languages are easier to translate started in the mid-1980s at Charles University in Prague [FEMTI; Hajic et al. 2000]. The project was called RUSLAN and aimed at the translation of documentation in the domain of operating systems for mainframe computers. Since then, many examples have supported the argument that, for closely related languages, simple techniques yield better MT quality. Examples include CESILKO (a system for translating Czech and Slovak) and an MT system for translating Turkish to Crimean Tatar. We are also trying to strengthen the same concept by experimenting with a word for word direct translation system for Hindi to Punjabi. These languages are very closely related and have many features in common.

III. SYSTEM DESCRIPTION

The major task behind a direct machine translation system is developing an exhaustive lexicon consisting of source language words along with their corresponding translations in the target language. There is no machine readable dictionary available for Hindi to Punjabi. Two traditional dictionaries, one from Bhasha Vibhag, Patiala and another from National Book Trust, have been published. We had them digitized and adapted for machine translation, which was itself a substantial task. The lexicon now consists of approximately 100,000 words. This lexicon is used for word for word translation. Extending the lexicon demands a large Hindi corpus. Sometimes it is available in Unicode format, and sometimes it is available in a number of other non-standard fonts such as Susha, Agra and Krutidev, which need to be converted into Unicode first. Conversion is done through a Font Converter that converts non-standard fonts into Unicode format. The strength of the developed Font Converter is that it is able to convert MS-Access files, MS Word files, HTML files and text files. Because the corpus can be in any file type, the converter handles all the possible file types. Besides translation, the system also performs transliteration. Transliteration means replacing each character of a source language word with the corresponding target language character, like ेमचंद into ਪੇਮਚਂਦ. This system is basically an extension

© 2010 ACADEMY PUBLISHER
doi:10.4304/jetwi.2.2.148-151
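The word for word lexicon lookup and the character by character transliteration described in Section III can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that transliteration can be approximated by shifting each Devanagari code point (U+0900 to U+097F) by a fixed offset into the parallel Gurmukhi block (U+0A00 to U+0A7F), and that transliteration is used as a fallback for words missing from the lexicon; the paper states neither. The names `transliterate`, `translate` and `TOY_LEXICON`, and the two lexicon entries, are hypothetical.

```python
import unicodedata

# Assumption: the Devanagari and Gurmukhi Unicode blocks are laid out in
# parallel, so adding a fixed offset maps many characters to their counterpart.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GURMUKHI_OFFSET = 0x0A00 - 0x0900

# Hypothetical toy lexicon; the real system uses a much larger dictionary.
TOY_LEXICON = {
    "और": "ਅਤੇ",    # "and"
    "पानी": "ਪਾਣੀ",  # "water"
}


def transliterate(word: str) -> str:
    """Replace each Devanagari character with its Gurmukhi counterpart."""
    out = []
    for ch in word:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + GURMUKHI_OFFSET)
            # Keep the source character when the shifted code point is
            # unassigned, since not every Devanagari letter has a counterpart.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)  # non-Devanagari characters pass through unchanged
    return "".join(out)


def translate(sentence: str, lexicon: dict) -> str:
    """Word for word translation: lexicon lookup, else transliteration."""
    words = sentence.split()
    return " ".join(lexicon.get(w, transliterate(w)) for w in words)


if __name__ == "__main__":
    print(translate("और पानी", TOY_LEXICON))
    print(transliterate("मन"))
```

A fixed offset is only a rough approximation: conjuncts, nasalization marks and letters without a Gurmukhi equivalent need dedicated rules, which this sketch does not attempt.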
