Text Normalization System for Bangla - Center for Language ...


... the methods of word representation (tagging, plain text).

There are various techniques for lexical disambiguation in English. By using decision lists and combining the strengths of decision trees, N-gram taggers and Bayesian classifiers, a text can be processed to resolve ambiguity in TTS synthesis [3]. Most work on text normalization assigns tags after an NSW has been identified [1] [2] [3] [4]: once the type of the NSW is resolved, the resulting word representation is tagged. For example, in Chinese and Japanese, NUM, NDAY, NDIG, and NTIME are used for numeric NSWs [4].

[Figure 1 labels: Input text; JFlex lexical analyzer; Tokenization (Tokenizer, Splitter, Classifier)]

3. Methodology

This paper presents a method to normalize Bangla text. As in other work, the basic processes are the same: tokenization, token classification, token sense disambiguation and word representation. Before discussing the processes of Bangla text normalization, it is necessary to discuss the different classes of Bangla NSWs and their equivalent pronunciation word representations. Whereas [1] uses decision trees and decision lists for disambiguation, this work uses a rule-based system. The following sections discuss semiotic class [7] (as opposed to, say, NSW) identification, tokenization, and the standard word generation and disambiguation rules. The system diagram of the text normalization procedure is shown in figure 1.

According to the semiotic classes, a lexical analyzer was designed to tokenize each NSW by regular expressions using the tool JFlex [8]. We assigned a tag to each token according to its semiotic class. The outputs of the tokenization are then used in the next step, i.e. the token expander. According to the assigned tag, token verbalization and disambiguation are performed by the token expander.

We identified a set of semiotic classes that belong to the Bangla language.
3.1. Semiotic class identification

To identify these classes, we selected a news corpus [9] with 18100378 tokens and 384048 token types [13], a forum [10] and a blog [11], and then proceeded in two steps: (i) a Python [13] script was used to identify the semiotic classes in the news corpus, and we manually checked them against the forum and blog; (ii) we defined a set of rules according to the context of homographs or ambiguous tokens. The result is a set of semiotic classes in Bangla text, as shown in table 1.

[Figure 1 labels: Token Expander; Disambiguation rule; Token expansion rule; Look-up table for abbreviation, acronym, and number; List of words in normalized form]
Figure 1: Text normalization system for Bangla

Table 1: Possible token types in Bangla text
Semiotic class/token type | Example
English text, Bangla text | জাভা Platform Independent বলেই সময়ের সবচেয়ে
Numbers (cardinal, ordinal, roman, floating number, fraction, ratio, range) | 121,23,234; 1ম, 2য়, 3য়; I, II, III; 12.23, 23,33.33; 1/2, 23/23; 12:12; 12-23
Telephone and mobile number | 029567447; 0152303398 (19 different formats)
Years | 2006; 1998; 98 সালে
Date | 02-06-2006 (12 different formats)
Time | 4.20 মিঃ; 4.20 মিনিট
Percentage | 12%
Money | 10 ৳
E-mail | আমার ই-মেইল ঠিকানা: abc@yahoo.com
URL | সফটওয়ার http://googlegdata.googlecode.com সাইট
Abbreviation | ডঃ; মাঃ; সাঃ
Acronym | ঢাবি; বাউবি; কবি
Mathematical equation | (1+2=3)

In Bangla text we also find English text, and even Arabic and Urdu text may be present. Other than Bangla, we handle only English text embedded in Bangla text. Non-natural language [7] tokens such as numbers, years, dates and times also occur in this multi-script genre (in both scripts). We therefore handle two types of natural language [7] tokens: Bangla and English. Our aim is to build a text normalization system that can serve almost every domain of Bangla. Handling Arabic and Urdu text, along with specialist domains such as medicine, engineering or chemical equations, is beyond the scope of this paper. The system can also detect simple mathematical equations; an example is shown in the last row of table 1.

3.2. Tokenization

We mapped each semiotic class to a specific tag and assigned this tag to each token of that class. Tokenization proceeds in three stages: (i) tokenizer, (ii) splitter and (iii) classifier. Like English and like other South Asian scripts, Bangla uses whitespace to separate the tokens in a string of characters. Punctuation marks and delimiters were identified and used by the splitter to classify the tokens. Context-sensitive rules were written because whitespace is not a valid delimiter when tokenizing phone numbers, years, times and floating-point numbers. Finally, the classifier classifies each token by applying the contextual rules. The different forms of delimiters are removed in this step.

For each type of token, regular expressions were written in .jflex format, and a lexer file was then generated using the JFlex toolkit. When a regular expression matches, we store the tag in list[i] and the token in list[i+1]. In this way the whole tokenization process is performed. All regular expressions were designed according to our predefined semiotic classes and the context rules obtained in the earlier semiotic class identification phase. This study differs from [1], where decision trees and decision lists are used for disambiguation.

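The flat tag/token list described above can be illustrated with a small Python sketch. This is not the paper's JFlex lexer: the tag names (TEL, PERCENT, FLOAT, NUM, WORD), the simplified patterns and the `tokenize` function are all assumptions made for illustration, and Python's `re` alternation picks the first matching rule rather than the longest match that JFlex uses. It only shows how a context-sensitive rule keeps a phone number containing whitespace as one token, and how tags and tokens alternate in the output list.

```python
import re

# Hypothetical tags and simplified patterns (assumptions, not the paper's rules).
# Order matters: Python tries the alternatives left to right at each position.
RULES = [
    # Optional "02" prefix, optional spaces/tabs or a hyphen, then exactly 7 digits.
    ("TEL", r"(?<!\d)(?:02(?:[ \t]*|-))?\d{7}(?!\d)"),
    ("PERCENT", r"\d+%"),
    ("FLOAT", r"\d+\.\d+"),
    ("NUM", r"\d+"),
    ("WORD", r"[^\s\d]+"),
]

# One combined pattern with a named group per tag.
LEXER = re.compile("|".join(f"(?P<{tag}>{pat})" for tag, pat in RULES))


def tokenize(text):
    """Return a flat list where list[i] is a tag and list[i+1] is its token."""
    out = []
    for match in LEXER.finditer(text):
        out.append(match.lastgroup)  # name of the rule that matched
        out.append(match.group())    # the matched token text
    return out


if __name__ == "__main__":
    print(tokenize("ফোন 02 9567447 12% 4.20"))
```

Whitespace between tokens is skipped because no rule matches it, while the whitespace inside "02 9567447" is consumed by the TEL rule, so the number is not split.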
The generated lexer file was used in the token expansion phase. The generated lexer is a Java class, which is invoked by a driver class to obtain the list of tokens. According to the tag in the list, the corresponding token expander class is then invoked to expand the token. For example, the rules for telephone numbers are as follows:

Table 2: Rules for detecting telephone numbers
    WSP_CHAR = [ |\t]
    BDIGIT = [0-9]
    BTELNO1 = {BDIGIT}{7,7}
    BTELNO2 = "\u09E6\u09E8"({WSP_CHAR}*|"-"){BTELNO1}
    BTELNO3 = "\u09EE\u09EE\u09E6"({WSP_CHAR}*|"-")"\u09E8"({WSP_CHAR}*|"-"){BTELNO1}
    BTELNO4 = "+"{BTELNO3}
    BTELNO5 = "("{WSP_CHAR}*"\u09EE\u09EE\u09E6"{WSP_CHAR}*")"{WSP_CHAR}*"\u09E8"{BTELNO1}
    BTELNO6 = "("{WSP_CHAR}*"\u09E6\u09E8"{WSP_CHAR}*")"{WSP_CHAR}*{BTELNO1}
    BTELNO7 = "\u09EE\u09EE"({WSP_CHAR}*|"-"){BTELNO2}
    BTELNO8 = {BTELNO1}({WSP_CHAR}+|"-"){BDIGIT}{1,4}

These rules can detect the following phone numbers: 9567447; 029567447; 02-9567447; 88029567447; 880 29567447; 880-2-9567447; +88029567447; (880) 29567447; (02)9567447.

3.3. Verbalization & disambiguation

The token expander expands each token by verbalizing it and disambiguating ambiguous tokens. Verbalization [7], or standard word generation, is the process of converting non-natural language text into standard words, i.e. natural language text. A template-based approach [7] using a lexicon was used for cardinal numbers, ordinal numbers, acronyms and abbreviations. To expand a cardinal number, we work from the position of each digit rather than repeatedly dividing by 10. The steps are:
(i) Traverse the digits from right to left.
(ii) Map the first two digits to the lexicon to get their expanded form (e.g. 10 → ten).
(iii) After the expanded form of the third digit, insert the token “hundred”.
(iv) Get the expanded form of each pair of digits after the third digit from the lexicon.
(v) Insert the token “thousand” after the expanded form of the fourth and fifth digits, and “lakh” after the expanded form of the sixth and seventh digits.
These steps are repeated for every seven digits: each group of seven digits is treated as a separate block, and the token “koti” is inserted after the expanded form of every block except the rightmost one. So the expanded form of the token 10910 is “ten thousand nine hundred ten”.

Abbreviations are productive and new ones may appear, so an automatic process may be required to resolve unknown abbreviations. An automatic process for predicting unknown abbreviations is shown in [5]; our effort deals with this problem in an automatic way and looks for possible changes. In the case of Bangla acronyms, people usually say the acronym as it is, without expanding it. For example, দুদক /d̪ud̪ɔk/ expands to দুর্নীতি দমন কমিশন
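
The position-based cardinal expansion described in section 3.3 can be sketched in Python as follows. This is an illustrative reading of the steps, not the authors' implementation: the English `LEXICON` stands in for the Bangla lexicon (matching the paper's English gloss of 10910), the function names are hypothetical, only ASCII digits are handled, and the placement of “koti” follows one interpretation of the block rule.

```python
# English stand-in lexicon for 1..99 (assumption; the paper uses a Bangla lexicon).
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

LEXICON = {i: word for i, word in enumerate(ONES) if word}
for t, tens_word in enumerate(TENS):
    if tens_word:
        for o in range(10):
            LEXICON[10 * t + o] = tens_word if o == 0 else f"{tens_word}-{ONES[o]}"


def expand_block(block):
    """Expand a block of up to seven digits using digit positions only."""
    block = block.rjust(7, "0")
    # Positions counted from the right: digits 1-2, digit 3, digits 4-5, digits 6-7.
    lakh, thousand, hundred, last_two = block[0:2], block[2:4], block[4], block[5:7]
    words = []
    if int(lakh):
        words += [LEXICON[int(lakh)], "lakh"]
    if int(thousand):
        words += [LEXICON[int(thousand)], "thousand"]
    if int(hundred):
        words += [LEXICON[int(hundred)], "hundred"]
    if int(last_two):
        words.append(LEXICON[int(last_two)])
    return words


def expand_cardinal(token):
    """Verbalize a string of ASCII digits as lakh/koti style number words."""
    if not (token.isascii() and token.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    token = token.lstrip("0")
    if not token:
        return "zero"
    # Cut seven-digit blocks from the right; blocks[0] is the rightmost block.
    blocks = []
    while token:
        blocks.append(token[-7:])
        token = token[:-7]
    words = []
    for i in range(len(blocks) - 1, -1, -1):
        words += expand_block(blocks[i])
        if i > 0:
            words.append("koti")  # after every block except the rightmost
    return " ".join(words)


if __name__ == "__main__":
    print(expand_cardinal("10910"))
```

Slicing the digit string by position keeps each lookup a direct lexicon access on a one- or two-digit key, which is the point of working from positions instead of dividing by 10.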