the methods of word representation (tagging, plainThere are various techniques used in lexicaldisambiguation in the English language. Use ofdecision lists and combining the strengths of decisiontrees, N-gram taggers and Bayesian classifiers, a textcan be processed to resolve disambiguation in TTSsynthesis [3]. Most of the work done on textnormalization uses tags after an NSW has beenidentified [1] [2] [3] [4]. After the type of the NSW isresolved, the resulted word representation is tagged.For example, in Chinese and Japanese, NUM, NDAY,NDIG, and NTIME are used <strong>for</strong> numeric NSWs [4].text).JFlex -LexicalAnalyzerInput textTokenizerSplitterClassifierTokenization3. MethodologyThis paper talks of a method to normalize <strong>Bangla</strong> text.Like other work, the basic processes are same:tokenization, token classification, token sensedisambiguation and word representation. Be<strong>for</strong>e theprocesses of Bengali text normalization are discussed,it is necessary to first discuss the different classes of<strong>Bangla</strong> NSWs and their equivalent pronunciation wordrepresentations. Where [1] uses decision tree anddecision list <strong>for</strong> disambiguation, but this work usesrule based system. The following section discusses thesemiotic class [7] (as opposed to say NSW)identification, tokenization and standard wordgeneration and disambiguation rule. The systemdiagram of text normalization procedure is shown infigure 1.According to semiotic classes a lexical analyzerwas designed to tokenize each NSW by regularexpression using the tool JFlex [8]. We assigned a tag<strong>for</strong> each token according to semiotic classes. Theoutputs of the tokenization are then used in the nextstep i.e token expander. According to the assigned tagtoken verbalization and disambiguation was per<strong>for</strong>medby the token expander.We identified a set of semiotic classes whichbelongs to the <strong>Bangla</strong> language. To do this, weselected a news corpus [9] with 18100378 tokens and384048 token types [13], <strong>for</strong>um [10] and blog [11],then we proceeded in two steps to identify the semioticclasses: (i) Python [13] script was used to identify thesemiotic class from news corpus and we manuallychecked it in the <strong>for</strong>um and blog (ii) we defined a setof rules according to context of homographs orambiguous tokens. The result is a set of semioticclasses in <strong>Bangla</strong> text as shown in table 1.DisambiguationruleToken expansionrule3.1. Figure Semiotic 1: <strong>Text</strong> class normalization identification system <strong>for</strong> <strong>Bangla</strong>Table 1: Possible token type in <strong>Bangla</strong> textSemiotic class/tokentypeEnglish text<strong>Bangla</strong> textNumbers (cardinal,ordinal, roman, floatingnumber, fraction, ratio,range)Telephone and mobilenumberYearsExampleজাভা Plat<strong>for</strong>m Independent বেলei সমেয়র সবেচেয়121,23,234; 1ম, 2য়, 3য়; I, II, III,12.23, 23,33.33; 1/2, 23/23; 12:12;12-23029567447; 0152303398 (19different <strong>for</strong>mats)2006; 1998; 98 সােলDate 022006 -06-(12 different<strong>for</strong>mats)Time4.20 িমঃ; 4.20 িমিনট;Percentage 12%Money10 ৳E-mailURLAbbreviationAcronymTokenExpanderList of word innormalized <strong>for</strong>mআমার i-মiল কানা:abc@yahoo.comসফটoয়ার http://googlegdata.googlecode.comসাiটডঃ ;মাঃ ;সাঃঢািব ;বাuিব, কিবMathematical equation (1+2=3)Look-uptable <strong>for</strong>AbbreviationAcronym,and number
In <strong>Bangla</strong> text we have English text and even Arabicand Urdu text may also be present. Other than <strong>Bangla</strong>we worked on English text within <strong>Bangla</strong> text. Thenon-natural language [7] token such as number, year,date time etc is also available in this multi-text genre(both scripts). So we are handling two types of naturallanguage [7] tokens such as <strong>Bangla</strong> and English. Ourattempt is to build such a text normalization systemthat can serve almost every domain of <strong>Bangla</strong>.Handling Arabic and Urdu text along with anyspecialist domain i.e. medicine, engineering, chemicalequations etc. is beyond the scope of this paper. Thissystem also can detect simple mathematical equations.An example of a mathematical equation is shown inthe last row of table 1.3.2. TokenizationWe defined each semiotic class to a specific tag andassigned this tag to each class of token. Thetokenization undergoes three levels such as: i.Tokenizer ii. Splitter and ii. Classifier. Like Englishand other South Asian scripts <strong>Bangla</strong> also useswhitespace to tokenize a string of characters into aseparate token. Punctuation and delimiter wereidentified and used by the splitter to classify the token.Context sensitive rules written as whitespace is not avalid delimiter <strong>for</strong> tokenizing phone numbers, year,time and floating point numbers. Finally, the classifierclassifies the token by looking at the contextual rule.Different <strong>for</strong>m of delimiters was removed in this step.For each type of token, regular expression were writtenin .jflex <strong>for</strong>mat. Then using JFlex toolkit a Lexer filewas generated. If a regular expression is matched thenwe assign a tag in list[i] and token in list [i+1]. In thisway the whole tokenization process is per<strong>for</strong>med. Allregular expressions were designed according to ourpredefined semiotic classes and the rules of the contextthat were obtained in the previous semiotic classidentification phase. This study is different than [1],where decision tree and decision list is used <strong>for</strong>disambiguation. The generated Lexer file was used inthe token expansion phase. The generated Lexer is ajava class file which was then invoked by a driver classto get the list of the token. According to the tag in thelist, each type of token expander class was theninvoked <strong>for</strong> expanding the token. For example, therules <strong>for</strong> telephone numbers are as follows:Table 2: Rule <strong>for</strong> detecting telephone numberWSP_CHAR = [ |\t]BDIGIT = [0-9]BTELNO1 = {BDIGIT}{7,7}BTELNO2= "\u09E6\u09E8"({WSP_CHAR}*|"-"){BTELNO1}BTELNO3="\u09EE\u09EE\u09E6"({WSP_CHAR}*|"-")"\u09E8"({WSP_CHAR}*|"-"){BTELNO1}BTELNO4 = "+"{BTELNO3}BTELNO5="("{WSP_CHAR}*"\u09EE\u09EE\u09E6"{WSP_CHAR}*")"{WSP_CHAR}*"\u09E8"{BTELNO1}BTELNO6="("{WSP_CHAR}*"\u09E6\u09E8"{WSP_CHAR}*")"{WSP_CHAR}*{BTELNO1}BTELNO7= "\u09EE\u09EE"({WSP_CHAR}*|"-"){BTELNO2}BTELNO8= {BTELNO1}({WSP_CHAR}+|"-"){BDIGIT}{1,4}These rules can detect the following phone numbers:9567447; 029567447; 02-9567447; 88029567447; 880 29567447; 880-2-9567447; +88029567447; (880) 29567447;(02)9567447;3.3. Verbalization & disambiguationThe token expander expands the token by verbalizingand disambiguating the ambiguous token.Verbalization [7] or standard word generation is theprocess of converting non-natural language text intostandard words or natural language text. A templatebased approach [7] such as the lexicon was used <strong>for</strong>number cardinal, ordinal, acronym, and abbreviations.For expanding the cardinal number, we have calculatedthe position of the digit rather than dividing by 10.Expanding the token cardinal number we have chosenthe following steps: (i). traverse from right to left. (ii).Map first two digits with lexicon to get the expanded<strong>for</strong>m (i.e. 10 → ten). (iii). After the expanded <strong>for</strong>m ofthe third digit insert the token “hundred”. (iv). Getexpanded <strong>for</strong>m of each pair of digit after third digitfrom the lexicon. (v). Insert the token “thousand” afterthe expanded <strong>for</strong>m fourth and fifth digit and “lakh”after expanded <strong>for</strong>m of sixth and seventh digit. Theseprocesses continue <strong>for</strong> each seven digits. Each sevendigit is divided as a separate block. After each of thesecond block (traversing from left to right) we insertthe token “koti”. So the expanded <strong>for</strong>m of token 10910is “ten thousand nine hundred ten”. Abbreviations areproductive and a new one may appear, so an automaticprocess may require solving unknown abbreviations.In [5] an automatic process is shown <strong>for</strong> the predictionof unknown abbreviations, our ef<strong>for</strong>t deals with thisproblem in an automatic way and looks <strong>for</strong> possiblechanges. In case of <strong>Bangla</strong> acronyms, most of the timepeople say the acronym as it is without expanding it.For example, দুদক /d̪ud̪ɔk/ expands to দুনিত দমন কিমশন