13.07.2015 Views

Text Normalization System for Bangla - Center for Language ...

Text Normalization System for Bangla - Center for Language ...

Text Normalization System for Bangla - Center for Language ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Text</strong> <strong>Normalization</strong> <strong>System</strong> <strong>for</strong> <strong>Bangla</strong>Firoj Alam, S.M. Murtoza Habib, Mumit Khan<strong>Center</strong> <strong>for</strong> Research on <strong>Bangla</strong> <strong>Language</strong> Processing, BRAC University{firojalam, habibmurtoza, mumit}@bracu.ac.bdAbstractThis paper describes a process of textnormalization system <strong>for</strong> the <strong>Bangla</strong> language(exonym: Bengali) by identifying the semiotic classesfrom <strong>Bangla</strong> text corpus. After identifying the semioticclasses, a set of rules was written <strong>for</strong> tokenization andverbalization. This study is important <strong>for</strong> <strong>Text</strong>-To-Speech (TTS) system and as well as <strong>for</strong> creating alanguage model used in speech recognition.1. IntroductionThis paper describes the design and implementationof a text normalization system <strong>for</strong> <strong>Bangla</strong> <strong>Text</strong>-To-Speech (TTS) system. TTS is one of the major areas ofresearch on natural languages. The process of TTSinvolves the inputting of a text through console, file,OCR and other means, and converting the text intospoken words. Be<strong>for</strong>e a text is ready to undergo theprocess of TTS, it must first be pre-processed toremove the ambiguities and convert some NonStandard Words (NSWs) into their standard wordrepresentations, and hence pronunciations. TTSsystems work with text in everyday uses. There<strong>for</strong>e,one may expect to come across certain symbols such as$ and %, acronyms such as NATO and WHO,orthographic abbreviations such as Mr. and etc,ordinals such as 1 st and 3 rd , and many more. Whenconverting these texts to speech, one has to use thepronunciation of the words rather than theorthographic symbols so that “Mr.” will sound like“mister”, “1st” will sound like “first”, and so on.Moreover, certain numbers have to be pronounced asindividual digits or as a whole, depending on thecontext. For example, the number 88028624921 willbe pronounced as “eight eight oh two eight six twofour nine two one” if it refers to a phone number, butpronounced as “eight thousand eight hundred and twocrores eighty six lacs twenty four thousand ninehundred and twenty one” if it refers to a measurementsuch as the population. This process of converting atextual representation to another, keeping the contextin mind, is called <strong>Text</strong> <strong>Normalization</strong>. During speechrecognition in Speech-to-<strong>Text</strong> and other documentcreation systems from recognized text, inverse textnormalization is used [2].Natural language is a semiotic system, and probablyhas by far the most complicated <strong>for</strong>m to meaningrelationship. A semiotic system is a means of relatingmeaning to <strong>for</strong>m. Natural language has computerlanguages, email addresses, dates, times, telephonenumbers, postal addresses and so on. Just as withmathematics, a telephone number is not part of naturallanguage, it is a complete meaning/<strong>for</strong>m/signal systemin its own right. A semiotic class is a similar type oftoken like number, date, etc. In our system we tried toidentify all the semiotic classes and then process themaccording to their classes. To the best of ourknowledge this is the first published account of <strong>Bangla</strong>text normalization system, designed <strong>for</strong> TTS. Thisstudy develops a method to normalize <strong>Bangla</strong> textusing rules. To disambiguate tokens, we used rulesrather than using a decision tree and a decision list [1]because of unavailability a tagged corpus. The basicmodel [1] [4] [7] of a text normalization system is thesame <strong>for</strong> all languages except a language dependentrules and framework. Here our attempt is to findpossible changes and enhancements which are requiredto implement a <strong>Bangla</strong> text normalization system.A brief literature review is given in section 2,followed by a description of the methodology insection 3. The analytical results are presented anddiscussed in section 4. A summary and conclusions ofthe study are given in section 5.2. Literature reviewConsiderable work may has been done in textnormalization and disambiguation of other languagessuch as English, Hindi, Chinese, Japanese and otherwell resourced languages. Many little works has beendone on under resourced languages such as <strong>Bangla</strong>.The basic similarity among the work done is that eachinvolves repeated sequences of tokenization, tokenclassification, token sense disambiguation and standardword generation to get the normalized text. This ef<strong>for</strong>tdiffers in the ways of tokenization (delimiter, use ofFlex, etc), the methods of disambiguation (n-gramcomparison, statistical models and POS tagging) and


the methods of word representation (tagging, plainThere are various techniques used in lexicaldisambiguation in the English language. Use ofdecision lists and combining the strengths of decisiontrees, N-gram taggers and Bayesian classifiers, a textcan be processed to resolve disambiguation in TTSsynthesis [3]. Most of the work done on textnormalization uses tags after an NSW has beenidentified [1] [2] [3] [4]. After the type of the NSW isresolved, the resulted word representation is tagged.For example, in Chinese and Japanese, NUM, NDAY,NDIG, and NTIME are used <strong>for</strong> numeric NSWs [4].text).JFlex -LexicalAnalyzerInput textTokenizerSplitterClassifierTokenization3. MethodologyThis paper talks of a method to normalize <strong>Bangla</strong> text.Like other work, the basic processes are same:tokenization, token classification, token sensedisambiguation and word representation. Be<strong>for</strong>e theprocesses of Bengali text normalization are discussed,it is necessary to first discuss the different classes of<strong>Bangla</strong> NSWs and their equivalent pronunciation wordrepresentations. Where [1] uses decision tree anddecision list <strong>for</strong> disambiguation, but this work usesrule based system. The following section discusses thesemiotic class [7] (as opposed to say NSW)identification, tokenization and standard wordgeneration and disambiguation rule. The systemdiagram of text normalization procedure is shown infigure 1.According to semiotic classes a lexical analyzerwas designed to tokenize each NSW by regularexpression using the tool JFlex [8]. We assigned a tag<strong>for</strong> each token according to semiotic classes. Theoutputs of the tokenization are then used in the nextstep i.e token expander. According to the assigned tagtoken verbalization and disambiguation was per<strong>for</strong>medby the token expander.We identified a set of semiotic classes whichbelongs to the <strong>Bangla</strong> language. To do this, weselected a news corpus [9] with 18100378 tokens and384048 token types [13], <strong>for</strong>um [10] and blog [11],then we proceeded in two steps to identify the semioticclasses: (i) Python [13] script was used to identify thesemiotic class from news corpus and we manuallychecked it in the <strong>for</strong>um and blog (ii) we defined a setof rules according to context of homographs orambiguous tokens. The result is a set of semioticclasses in <strong>Bangla</strong> text as shown in table 1.DisambiguationruleToken expansionrule3.1. Figure Semiotic 1: <strong>Text</strong> class normalization identification system <strong>for</strong> <strong>Bangla</strong>Table 1: Possible token type in <strong>Bangla</strong> textSemiotic class/tokentypeEnglish text<strong>Bangla</strong> textNumbers (cardinal,ordinal, roman, floatingnumber, fraction, ratio,range)Telephone and mobilenumberYearsExampleজাভা Plat<strong>for</strong>m Independent বেলei সমেয়র সবেচেয়121,23,234; 1ম, 2য়, 3য়; I, II, III,12.23, 23,33.33; 1/2, 23/23; 12:12;12-23029567447; 0152303398 (19different <strong>for</strong>mats)2006; 1998; 98 সােলDate 022006 -06-(12 different<strong>for</strong>mats)Time4.20 িমঃ; 4.20 িমিনট;Percentage 12%Money10 ৳E-mailURLAbbreviationAcronymTokenExpanderList of word innormalized <strong>for</strong>mআমার i-মiল কানা:abc@yahoo.comসফটoয়ার http://googlegdata.googlecode.comসাiটডঃ ;মাঃ ;সাঃঢািব ;বাuিব, কিবMathematical equation (1+2=3)Look-uptable <strong>for</strong>AbbreviationAcronym,and number


In <strong>Bangla</strong> text we have English text and even Arabicand Urdu text may also be present. Other than <strong>Bangla</strong>we worked on English text within <strong>Bangla</strong> text. Thenon-natural language [7] token such as number, year,date time etc is also available in this multi-text genre(both scripts). So we are handling two types of naturallanguage [7] tokens such as <strong>Bangla</strong> and English. Ourattempt is to build such a text normalization systemthat can serve almost every domain of <strong>Bangla</strong>.Handling Arabic and Urdu text along with anyspecialist domain i.e. medicine, engineering, chemicalequations etc. is beyond the scope of this paper. Thissystem also can detect simple mathematical equations.An example of a mathematical equation is shown inthe last row of table 1.3.2. TokenizationWe defined each semiotic class to a specific tag andassigned this tag to each class of token. Thetokenization undergoes three levels such as: i.Tokenizer ii. Splitter and ii. Classifier. Like Englishand other South Asian scripts <strong>Bangla</strong> also useswhitespace to tokenize a string of characters into aseparate token. Punctuation and delimiter wereidentified and used by the splitter to classify the token.Context sensitive rules written as whitespace is not avalid delimiter <strong>for</strong> tokenizing phone numbers, year,time and floating point numbers. Finally, the classifierclassifies the token by looking at the contextual rule.Different <strong>for</strong>m of delimiters was removed in this step.For each type of token, regular expression were writtenin .jflex <strong>for</strong>mat. Then using JFlex toolkit a Lexer filewas generated. If a regular expression is matched thenwe assign a tag in list[i] and token in list [i+1]. In thisway the whole tokenization process is per<strong>for</strong>med. Allregular expressions were designed according to ourpredefined semiotic classes and the rules of the contextthat were obtained in the previous semiotic classidentification phase. This study is different than [1],where decision tree and decision list is used <strong>for</strong>disambiguation. The generated Lexer file was used inthe token expansion phase. The generated Lexer is ajava class file which was then invoked by a driver classto get the list of the token. According to the tag in thelist, each type of token expander class was theninvoked <strong>for</strong> expanding the token. For example, therules <strong>for</strong> telephone numbers are as follows:Table 2: Rule <strong>for</strong> detecting telephone numberWSP_CHAR = [ |\t]BDIGIT = [0-9]BTELNO1 = {BDIGIT}{7,7}BTELNO2= "\u09E6\u09E8"({WSP_CHAR}*|"-"){BTELNO1}BTELNO3="\u09EE\u09EE\u09E6"({WSP_CHAR}*|"-")"\u09E8"({WSP_CHAR}*|"-"){BTELNO1}BTELNO4 = "+"{BTELNO3}BTELNO5="("{WSP_CHAR}*"\u09EE\u09EE\u09E6"{WSP_CHAR}*")"{WSP_CHAR}*"\u09E8"{BTELNO1}BTELNO6="("{WSP_CHAR}*"\u09E6\u09E8"{WSP_CHAR}*")"{WSP_CHAR}*{BTELNO1}BTELNO7= "\u09EE\u09EE"({WSP_CHAR}*|"-"){BTELNO2}BTELNO8= {BTELNO1}({WSP_CHAR}+|"-"){BDIGIT}{1,4}These rules can detect the following phone numbers:9567447; 029567447; 02-9567447; 88029567447; 880 29567447; 880-2-9567447; +88029567447; (880) 29567447;(02)9567447;3.3. Verbalization & disambiguationThe token expander expands the token by verbalizingand disambiguating the ambiguous token.Verbalization [7] or standard word generation is theprocess of converting non-natural language text intostandard words or natural language text. A templatebased approach [7] such as the lexicon was used <strong>for</strong>number cardinal, ordinal, acronym, and abbreviations.For expanding the cardinal number, we have calculatedthe position of the digit rather than dividing by 10.Expanding the token cardinal number we have chosenthe following steps: (i). traverse from right to left. (ii).Map first two digits with lexicon to get the expanded<strong>for</strong>m (i.e. 10 → ten). (iii). After the expanded <strong>for</strong>m ofthe third digit insert the token “hundred”. (iv). Getexpanded <strong>for</strong>m of each pair of digit after third digitfrom the lexicon. (v). Insert the token “thousand” afterthe expanded <strong>for</strong>m fourth and fifth digit and “lakh”after expanded <strong>for</strong>m of sixth and seventh digit. Theseprocesses continue <strong>for</strong> each seven digits. Each sevendigit is divided as a separate block. After each of thesecond block (traversing from left to right) we insertthe token “koti”. So the expanded <strong>for</strong>m of token 10910is “ten thousand nine hundred ten”. Abbreviations areproductive and a new one may appear, so an automaticprocess may require solving unknown abbreviations.In [5] an automatic process is shown <strong>for</strong> the predictionof unknown abbreviations, our ef<strong>for</strong>t deals with thisproblem in an automatic way and looks <strong>for</strong> possiblechanges. In case of <strong>Bangla</strong> acronyms, most of the timepeople say the acronym as it is without expanding it.For example, দুদক /d̪ud̪ɔk/ expands to দুনিত দমন কিমশন


d̪urnit̪i d̪ɔmon komiʃɔn/ but people say it as দুদক/d̪ud̪ɔk/. So it is a matter of linguistic decision whetherwe will expand the acronym or not but <strong>for</strong> the timebeing we have expanded to the full <strong>for</strong>m by using thelexicon. <strong>Bangla</strong> has the same type of non-naturallanguage ambiguity like Hindi [1] in the token yearnumberand time-floating number. For example: (i).the token 1998 (1998) could be considered as a yearand at the same time it could be considered as numberand (ii). the token 12.80 (12.80) could be considered asa floating point number and it could be considered as atime. Context dependent hand written rules wereapplied <strong>for</strong> these ambiguities. In case of <strong>Bangla</strong>, aftertime pattern 12.30 (12.30) we have a token িমঃ (minute)so we look at the next token and decide whether it istime or a floating point number. In the worst casescenario where our context dependent rule will fail inyear-number ambiguity we are verbalizing the token asa pair of two digits. For example, the token 1998(1998) we expand it as uিনশ শত আটানbi (Nineteenhundred ninety eight) rather than eক হাজার নয় শতআটানbi (one thousand nine hundred ninety eight). Thisis not wrong in case of <strong>Bangla</strong> as both are acceptableby this culture. The natural language text is relativelystraight<strong>for</strong>ward, and <strong>Bangla</strong> does not have upper andlower case. The mispellings problem is not dealt within this system which however, may be handled in aseperate module be<strong>for</strong>e the text normalization system.Certain homograph problems exist in <strong>Bangla</strong> such asan abbreviation homograph token িমঃ may appear as“minute” or it may appear as “mister” and a part-ofspeech(POS) homograph such as কর would bepronounced as কর/kɔr/ and কেরা/kɔro/. The POShomograph problem exists in <strong>Bangla</strong> verb cases interms of grade, such as কর/kɔr/ is pronounced in 2 ndperson in<strong>for</strong>mal and কেরা/kɔro/ is pronounced in 2 ndperson <strong>for</strong>mal. We have not solved this problem as wedo not have any accessible tagged corpus or syntacticalparser which can solve this problem. The naturallanguage English text that exists within the <strong>Bangla</strong> textare kept as it is, we are not dealing any ambiguitiesthat may be present in the English text.in currency and 13 strings were in time <strong>for</strong>mat. Theaccuracy of these three types of token is: floating point100%, currency 100% and time 62% as shown infigure 2. Because there was no other token be<strong>for</strong>e orafter the floating point, our system could not detectthese as time. For example, কাজটা 10.15 থেক 11.15 িমিনেটরমেধ শষ করেবন (Finish the work between 10.15 and11.15). In this case, the first floating number has notoken be<strong>for</strong>e or after, but the second floating numberhas a token after it which is িমিনেটর (minute). So oursystem could not detect the first floating number astime. By adding more complex rules, this problem canbe solved. The POS homograph ambiguity in <strong>Bangla</strong>appears in different grades of the same verb and willbe resolved in the next step of TTS development aftertext normalization. Other than the ambiguous token therest of the token expansion per<strong>for</strong>ms satisfactoryresults in this system. This system will be availableonline at location as open source after completing the system as ageneralized <strong>for</strong>m so that it can be expanded <strong>for</strong> otherlanguages which is derived from Brahmi script.4. ResultsThe output of this work is the list of words in anormalized <strong>for</strong>m. The per<strong>for</strong>mance of this rule basedsystem is 99% <strong>for</strong> ambiguous tokens such as float,time and currency. These three types of token appearas a floating number. The test was per<strong>for</strong>med bycollecting 797 strings from one year news paper corpus[10] using python script <strong>for</strong> ambiguous token. Among797 strings 617 were floating point number, 167 wereFigure 2: Per<strong>for</strong>mance of the system5. Conclusions<strong>Text</strong> normalization is a very important issue in TTSsystems. TTS systems will be highly beneficial in


various fields and its use will make life easier innumerous aspects. In this paper, the preprocessing ofTTS, text normalization has been discussed. This is thefirst working text normalizer <strong>for</strong> <strong>Bangla</strong> that has beenimplemented using a rule based system. Hence thiswork will be helpful in future by integrating with TTSand Speech Recognition combined with thedisambiguation techniques of rule based systems andother classification systems. A tagged corpora is underdevelopment <strong>for</strong> <strong>Bangla</strong> that will be used to make adecision tree and decision list to increase theper<strong>for</strong>mance. We have also a plan to incorporateArabic and Urdu text normalizer in this system. Morework is needed to solve POS homograph and otherproblematic areas.6. AcknowledgementsThis work has been supported in part by the PANLocalization Project (www.panl10n.net), grant fromthe International Development Research <strong>Center</strong>(IDRC), Ottawa, Canada, administrated through <strong>Center</strong><strong>for</strong> Research in Urdu <strong>Language</strong> Processing (CRULP),National University of Computer and EmergingSciences, Pakistan.7. References[1] K. Panchapagesan, Partha Pratim Talukdar, N. SridharKrishna, Kalika Bali, A.G. Ramakrishnan, "Hindi <strong>Text</strong><strong>Normalization</strong>”, Fifth International Conference onKnowledge Based Computer <strong>System</strong>s (KBCS), Hyderabad,India, 19-22 December 2004. Retrieved (June, 1, 2008).Available:www.cis.upenn.edu/~partha/papers/KBCS04_HPL-1.pdf[2] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken<strong>Language</strong> Processing: A Guide To Theory Algorithm And<strong>System</strong> Development. Prentice Hall, pp. 706-720, 2001.[3] Yarowsky D., "Homograph Disambiguation in <strong>Text</strong>-to-Speech Synthesis", 2nd ESCA/IEEE Workshop on SpeechSynthesis, New Paltz, NY, pp. 244-247,[4] Olinsky, C. and Black, A., "Non-Standard Word andHomograph Resolution <strong>for</strong> Asian <strong>Language</strong> <strong>Text</strong> Analysis",ICSLP2000, Beijing, China, Retrieved (June, 1, 2008).Available:www.cs.cmu.edu/~awb/papers/ICSLP2000_usi.pdf[5] Sproat R., Black A., Chen S., Kumar S., Ostendorf M.,& Richards C, "<strong>Normalization</strong> of Non-Standard Words:WS'99 Final Report", CLSP Summer Workshop, JohnsHopkins University, 1999, , Retrieved (June, 1, 2008).Available: www.clsp.jhu.edu/ws99/projects/normal[6] A Raj A., Sarkar T., Pammi S. C., Yuvaraj S., BansalM., Prahallad K., and Black, A., "<strong>Text</strong> Processing <strong>for</strong> <strong>Text</strong>to-Speech<strong>System</strong>s in Indian <strong>Language</strong>s", ISCA SSW6, Bonn,Germany, pp. 188-193, Retrieved (June, 1, 2008).Available:www.cs.cmu.edu/~awb/papers/ssw6/ssw6_188.pdf[7] Paul Taylor, <strong>Text</strong> to Speech Synthesis. University ofCambridge, 2007. pp.71-111, (draft), Retrieved (June, 19,2008).Available:http://mi.eng.cam.ac.uk/~pat40/ttsbook_draft_2.pdf[8] Elliot Berk, JFlex - The Fast Scanner Generator <strong>for</strong>Java, 2004, version 1.4.1, http://jflex.de[9] Prothom-Alo, www.prothom-alo.com[10] “Forum”, <strong>for</strong>um.amaderprojukti.com[11] “Blog”, www.somewhereinblog.net[12] “Python”, www.python.org, version 2.5.2[13] Khair Md. Yeasir Arafat Majumder, Md. Zahurul Islam,Naushad UzZaman and Mumit Khan, “Analysis of andObservations from a <strong>Bangla</strong> News Corpus”, in proc. 9thInternational Conference on Computer and In<strong>for</strong>mationTechnology, Dhaka, <strong>Bangla</strong>desh, December 2006.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!