Text Normalization System for Bangla - Center for Language ...

Text Normalization System for BanglaFiroj Alam, S.M. Murtoza Habib, Mumit KhanCenter for Research on Bangla Language Processing, BRAC University{firojalam, habibmurtoza, mumit}@bracu.ac.bdAbstractThis paper describes a process of textnormalization system for the Bangla language(exonym: Bengali) by identifying the semiotic classesfrom Bangla text corpus. After identifying the semioticclasses, a set of rules was written for tokenization andverbalization. This study is important for Text-To-Speech (TTS) system and as well as for creating alanguage model used in speech recognition.1. IntroductionThis paper describes the design and implementationof a text normalization system for Bangla Text-To-Speech (TTS) system. TTS is one of the major areas ofresearch on natural languages. The process of TTSinvolves the inputting of a text through console, file,OCR and other means, and converting the text intospoken words. Before a text is ready to undergo theprocess of TTS, it must first be pre-processed toremove the ambiguities and convert some NonStandard Words (NSWs) into their standard wordrepresentations, and hence pronunciations. TTSsystems work with text in everyday uses. Therefore,one may expect to come across certain symbols such as$ and %, acronyms such as NATO and WHO,orthographic abbreviations such as Mr. and etc,ordinals such as 1 st and 3 rd , and many more. Whenconverting these texts to speech, one has to use thepronunciation of the words rather than theorthographic symbols so that “Mr.” will sound like“mister”, “1st” will sound like “first”, and so on.Moreover, certain numbers have to be pronounced asindividual digits or as a whole, depending on thecontext. For example, the number 88028624921 willbe pronounced as “eight eight oh two eight six twofour nine two one” if it refers to a phone number, butpronounced as “eight thousand eight hundred and twocrores eighty six lacs twenty four thousand ninehundred and twenty one” if it refers to a measurementsuch as the population. This process of converting atextual representation to another, keeping the contextin mind, is called Text Normalization. During speechrecognition in Speech-to-Text and other documentcreation systems from recognized text, inverse textnormalization is used [2].Natural language is a semiotic system, and probablyhas by far the most complicated form to meaningrelationship. A semiotic system is a means of relatingmeaning to form. Natural language has computerlanguages, email addresses, dates, times, telephonenumbers, postal addresses and so on. Just as withmathematics, a telephone number is not part of naturallanguage, it is a complete meaning/form/signal systemin its own right. A semiotic class is a similar type oftoken like number, date, etc. In our system we tried toidentify all the semiotic classes and then process themaccording to their classes. To the best of ourknowledge this is the first published account of Banglatext normalization system, designed for TTS. Thisstudy develops a method to normalize Bangla textusing rules. To disambiguate tokens, we used rulesrather than using a decision tree and a decision list [1]because of unavailability a tagged corpus. The basicmodel [1] [4] [7] of a text normalization system is thesame for all languages except a language dependentrules and framework. Here our attempt is to findpossible changes and enhancements which are requiredto implement a Bangla text normalization system.A brief literature review is given in section 2,followed by a description of the methodology insection 3. The analytical results are presented anddiscussed in section 4. A summary and conclusions ofthe study are given in section 5.2. Literature reviewConsiderable work may has been done in textnormalization and disambiguation of other languagessuch as English, Hindi, Chinese, Japanese and otherwell resourced languages. Many little works has beendone on under resourced languages such as Bangla.The basic similarity among the work done is that eachinvolves repeated sequences of tokenization, tokenclassification, token sense disambiguation and standardword generation to get the normalized text. This effortdiffers in the ways of tokenization (delimiter, use ofFlex, etc), the methods of disambiguation (n-gramcomparison, statistical models and POS tagging) and

the methods of word representation (tagging, plainThere are various techniques used in lexicaldisambiguation in the English language. Use ofdecision lists and combining the strengths of decisiontrees, N-gram taggers and Bayesian classifiers, a textcan be processed to resolve disambiguation in TTSsynthesis [3]. Most of the work done on textnormalization uses tags after an NSW has beenidentified [1] [2] [3] [4]. After the type of the NSW isresolved, the resulted word representation is tagged.For example, in Chinese and Japanese, NUM, NDAY,NDIG, and NTIME are used for numeric NSWs [4].text).JFlex -LexicalAnalyzerInput textTokenizerSplitterClassifierTokenization3. MethodologyThis paper talks of a method to normalize Bangla text.Like other work, the basic processes are same:tokenization, token classification, token sensedisambiguation and word representation. Before theprocesses of Bengali text normalization are discussed,it is necessary to first discuss the different classes ofBangla NSWs and their equivalent pronunciation wordrepresentations. Where [1] uses decision tree anddecision list for disambiguation, but this work usesrule based system. The following section discusses thesemiotic class [7] (as opposed to say NSW)identification, tokenization and standard wordgeneration and disambiguation rule. The systemdiagram of text normalization procedure is shown infigure 1.According to semiotic classes a lexical analyzerwas designed to tokenize each NSW by regularexpression using the tool JFlex [8]. We assigned a tagfor each token according to semiotic classes. Theoutputs of the tokenization are then used in the nextstep i.e token expander. According to the assigned tagtoken verbalization and disambiguation was performedby the token expander.We identified a set of semiotic classes whichbelongs to the Bangla language. To do this, weselected a news corpus [9] with 18100378 tokens and384048 token types [13], forum [10] and blog [11],then we proceeded in two steps to identify the semioticclasses: (i) Python [13] script was used to identify thesemiotic class from news corpus and we manuallychecked it in the forum and blog (ii) we defined a setof rules according to context of homographs orambiguous tokens. The result is a set of semioticclasses in Bangla text as shown in table 1.DisambiguationruleToken expansionrule3.1. Figure Semiotic 1: Text class normalization identification system for BanglaTable 1: Possible token type in Bangla textSemiotic class/tokentypeEnglish textBangla textNumbers (cardinal,ordinal, roman, floatingnumber, fraction, ratio,range)Telephone and mobilenumberYearsExampleজাভা Platform Independent বেলei সমেয়র সবেচেয়121,23,234; 1ম, 2য়, 3য়; I, II, III,12.23, 23,33.33; 1/2, 23/23; 12:12;12-23029567447; 0152303398 (19different formats)2006; 1998; 98 সােলDate 022006 -06-(12 differentformats)Time4.20 িমঃ; 4.20 িমিনট;Percentage 12%Money10 ৳E-mailURLAbbreviationAcronymTokenExpanderList of word innormalized formআমার i-মiল কানা:abc@yahoo.comসফটoয়ার http://googlegdata.googlecode.comসাiটডঃ ;মাঃ ;সাঃঢািব ;বাuিব, কিবMathematical equation (1+2=3)Look-uptable forAbbreviationAcronym,and number

In Bangla text we have English text and even Arabicand Urdu text may also be present. Other than Banglawe worked on English text within Bangla text. Thenon-natural language [7] token such as number, year,date time etc is also available in this multi-text genre(both scripts). So we are handling two types of naturallanguage [7] tokens such as Bangla and English. Ourattempt is to build such a text normalization systemthat can serve almost every domain of Bangla.Handling Arabic and Urdu text along with anyspecialist domain i.e. medicine, engineering, chemicalequations etc. is beyond the scope of this paper. Thissystem also can detect simple mathematical equations.An example of a mathematical equation is shown inthe last row of table 1.3.2. TokenizationWe defined each semiotic class to a specific tag andassigned this tag to each class of token. Thetokenization undergoes three levels such as: i.Tokenizer ii. Splitter and ii. Classifier. Like Englishand other South Asian scripts Bangla also useswhitespace to tokenize a string of characters into aseparate token. Punctuation and delimiter wereidentified and used by the splitter to classify the token.Context sensitive rules written as whitespace is not avalid delimiter for tokenizing phone numbers, year,time and floating point numbers. Finally, the classifierclassifies the token by looking at the contextual rule.Different form of delimiters was removed in this step.For each type of token, regular expression were writtenin .jflex format. Then using JFlex toolkit a Lexer filewas generated. If a regular expression is matched thenwe assign a tag in list[i] and token in list [i+1]. In thisway the whole tokenization process is performed. Allregular expressions were designed according to ourpredefined semiotic classes and the rules of the contextthat were obtained in the previous semiotic classidentification phase. This study is different than [1],where decision tree and decision list is used fordisambiguation. The generated Lexer file was used inthe token expansion phase. The generated Lexer is ajava class file which was then invoked by a driver classto get the list of the token. According to the tag in thelist, each type of token expander class was theninvoked for expanding the token. For example, therules for telephone numbers are as follows:Table 2: Rule for detecting telephone numberWSP_CHAR = [ |\t]BDIGIT = [0-9]BTELNO1 = {BDIGIT}{7,7}BTELNO2= "\u09E6\u09E8"({WSP_CHAR}*|"-"){BTELNO1}BTELNO3="\u09EE\u09EE\u09E6"({WSP_CHAR}*|"-")"\u09E8"({WSP_CHAR}*|"-"){BTELNO1}BTELNO4 = "+"{BTELNO3}BTELNO5="("{WSP_CHAR}*"\u09EE\u09EE\u09E6"{WSP_CHAR}*")"{WSP_CHAR}*"\u09E8"{BTELNO1}BTELNO6="("{WSP_CHAR}*"\u09E6\u09E8"{WSP_CHAR}*")"{WSP_CHAR}*{BTELNO1}BTELNO7= "\u09EE\u09EE"({WSP_CHAR}*|"-"){BTELNO2}BTELNO8= {BTELNO1}({WSP_CHAR}+|"-"){BDIGIT}{1,4}These rules can detect the following phone numbers:9567447; 029567447; 02-9567447; 88029567447; 880 29567447; 880-2-9567447; +88029567447; (880) 29567447;(02)9567447;3.3. Verbalization & disambiguationThe token expander expands the token by verbalizingand disambiguating the ambiguous token.Verbalization [7] or standard word generation is theprocess of converting non-natural language text intostandard words or natural language text. A templatebased approach [7] such as the lexicon was used fornumber cardinal, ordinal, acronym, and abbreviations.For expanding the cardinal number, we have calculatedthe position of the digit rather than dividing by 10.Expanding the token cardinal number we have chosenthe following steps: (i). traverse from right to left. (ii).Map first two digits with lexicon to get the expandedform (i.e. 10 → ten). (iii). After the expanded form ofthe third digit insert the token “hundred”. (iv). Getexpanded form of each pair of digit after third digitfrom the lexicon. (v). Insert the token “thousand” afterthe expanded form fourth and fifth digit and “lakh”after expanded form of sixth and seventh digit. Theseprocesses continue for each seven digits. Each sevendigit is divided as a separate block. After each of thesecond block (traversing from left to right) we insertthe token “koti”. So the expanded form of token 10910is “ten thousand nine hundred ten”. Abbreviations areproductive and a new one may appear, so an automaticprocess may require solving unknown abbreviations.In [5] an automatic process is shown for the predictionof unknown abbreviations, our effort deals with thisproblem in an automatic way and looks for possiblechanges. In case of Bangla acronyms, most of the timepeople say the acronym as it is without expanding it.For example, দুদক /d̪ud̪ɔk/ expands to দুনিত দমন কিমশন

d̪urnit̪i d̪ɔmon komiʃɔn/ but people say it as দুদক/d̪ud̪ɔk/. So it is a matter of linguistic decision whetherwe will expand the acronym or not but for the timebeing we have expanded to the full form by using thelexicon. Bangla has the same type of non-naturallanguage ambiguity like Hindi [1] in the token yearnumberand time-floating number. For example: (i).the token 1998 (1998) could be considered as a yearand at the same time it could be considered as numberand (ii). the token 12.80 (12.80) could be considered asa floating point number and it could be considered as atime. Context dependent hand written rules wereapplied for these ambiguities. In case of Bangla, aftertime pattern 12.30 (12.30) we have a token িমঃ (minute)so we look at the next token and decide whether it istime or a floating point number. In the worst casescenario where our context dependent rule will fail inyear-number ambiguity we are verbalizing the token asa pair of two digits. For example, the token 1998(1998) we expand it as uিনশ শত আটানbi (Nineteenhundred ninety eight) rather than eক হাজার নয় শতআটানbi (one thousand nine hundred ninety eight). Thisis not wrong in case of Bangla as both are acceptableby this culture. The natural language text is relativelystraightforward, and Bangla does not have upper andlower case. The mispellings problem is not dealt within this system which however, may be handled in aseperate module before the text normalization system.Certain homograph problems exist in Bangla such asan abbreviation homograph token িমঃ may appear as“minute” or it may appear as “mister” and a part-ofspeech(POS) homograph such as কর would bepronounced as কর/kɔr/ and কেরা/kɔro/. The POShomograph problem exists in Bangla verb cases interms of grade, such as কর/kɔr/ is pronounced in 2 ndperson informal and কেরা/kɔro/ is pronounced in 2 ndperson formal. We have not solved this problem as wedo not have any accessible tagged corpus or syntacticalparser which can solve this problem. The naturallanguage English text that exists within the Bangla textare kept as it is, we are not dealing any ambiguitiesthat may be present in the English text.in currency and 13 strings were in time format. Theaccuracy of these three types of token is: floating point100%, currency 100% and time 62% as shown infigure 2. Because there was no other token before orafter the floating point, our system could not detectthese as time. For example, কাজটা 10.15 থেক 11.15 িমিনেটরমেধ শষ করেবন (Finish the work between 10.15 and11.15). In this case, the first floating number has notoken before or after, but the second floating numberhas a token after it which is িমিনেটর (minute). So oursystem could not detect the first floating number astime. By adding more complex rules, this problem canbe solved. The POS homograph ambiguity in Banglaappears in different grades of the same verb and willbe resolved in the next step of TTS development aftertext normalization. Other than the ambiguous token therest of the token expansion performs satisfactoryresults in this system. This system will be availableonline at location as open source after completing the system as ageneralized form so that it can be expanded for otherlanguages which is derived from Brahmi script.4. ResultsThe output of this work is the list of words in anormalized form. The performance of this rule basedsystem is 99% for ambiguous tokens such as float,time and currency. These three types of token appearas a floating number. The test was performed bycollecting 797 strings from one year news paper corpus[10] using python script for ambiguous token. Among797 strings 617 were floating point number, 167 wereFigure 2: Performance of the system5. ConclusionsText normalization is a very important issue in TTSsystems. TTS systems will be highly beneficial in

various fields and its use will make life easier innumerous aspects. In this paper, the preprocessing ofTTS, text normalization has been discussed. This is thefirst working text normalizer for Bangla that has beenimplemented using a rule based system. Hence thiswork will be helpful in future by integrating with TTSand Speech Recognition combined with thedisambiguation techniques of rule based systems andother classification systems. A tagged corpora is underdevelopment for Bangla that will be used to make adecision tree and decision list to increase theperformance. We have also a plan to incorporateArabic and Urdu text normalizer in this system. Morework is needed to solve POS homograph and otherproblematic areas.6. AcknowledgementsThis work has been supported in part by the PANLocalization Project (www.panl10n.net), grant fromthe International Development Research Center(IDRC), Ottawa, Canada, administrated through Centerfor Research in Urdu Language Processing (CRULP),National University of Computer and EmergingSciences, Pakistan.7. References[1] K. Panchapagesan, Partha Pratim Talukdar, N. SridharKrishna, Kalika Bali, A.G. Ramakrishnan, "Hindi TextNormalization”, Fifth International Conference onKnowledge Based Computer Systems (KBCS), Hyderabad,India, 19-22 December 2004. Retrieved (June, 1, 2008).Available:www.cis.upenn.edu/~partha/papers/KBCS04_HPL-1.pdf[2] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, SpokenLanguage Processing: A Guide To Theory Algorithm AndSystem Development. Prentice Hall, pp. 706-720, 2001.[3] Yarowsky D., "Homograph Disambiguation in Text-to-Speech Synthesis", 2nd ESCA/IEEE Workshop on SpeechSynthesis, New Paltz, NY, pp. 244-247,[4] Olinsky, C. and Black, A., "Non-Standard Word andHomograph Resolution for Asian Language Text Analysis",ICSLP2000, Beijing, China, Retrieved (June, 1, 2008).Available:www.cs.cmu.edu/~awb/papers/ICSLP2000_usi.pdf[5] Sproat R., Black A., Chen S., Kumar S., Ostendorf M.,& Richards C, "Normalization of Non-Standard Words:WS'99 Final Report", CLSP Summer Workshop, JohnsHopkins University, 1999, , Retrieved (June, 1, 2008).Available: www.clsp.jhu.edu/ws99/projects/normal[6] A Raj A., Sarkar T., Pammi S. C., Yuvaraj S., BansalM., Prahallad K., and Black, A., "Text Processing for Textto-SpeechSystems in Indian Languages", ISCA SSW6, Bonn,Germany, pp. 188-193, Retrieved (June, 1, 2008).Available:www.cs.cmu.edu/~awb/papers/ssw6/ssw6_188.pdf[7] Paul Taylor, Text to Speech Synthesis. University ofCambridge, 2007. pp.71-111, (draft), Retrieved (June, 19,2008).Available:http://mi.eng.cam.ac.uk/~pat40/ttsbook_draft_2.pdf[8] Elliot Berk, JFlex - The Fast Scanner Generator forJava, 2004, version 1.4.1, http://jflex.de[9] Prothom-Alo, www.prothom-alo.com[10] “Forum”, forum.amaderprojukti.com[11] “Blog”, www.somewhereinblog.net[12] “Python”, www.python.org, version 2.5.2[13] Khair Md. Yeasir Arafat Majumder, Md. Zahurul Islam,Naushad UzZaman and Mumit Khan, “Analysis of andObservations from a Bangla News Corpus”, in proc. 9thInternational Conference on Computer and InformationTechnology, Dhaka, Bangladesh, December 2006.

Text Normalization System for Bangla - Center for Language ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?