13.07.2015 Views

Text Normalization System for Bangla - Center for Language ...


the methods of word representation (tagging, plain text).

There are various techniques used in lexical disambiguation in the English language. By using decision lists and combining the strengths of decision trees, N-gram taggers and Bayesian classifiers, a text can be processed to resolve ambiguity in TTS synthesis [3]. Most of the work done on text normalization applies tags after an NSW has been identified [1] [2] [3] [4]. After the type of the NSW is resolved, the resulting word representation is tagged. For example, in Chinese and Japanese, NUM, NDAY, NDIG, and NTIME are used for numeric NSWs [4].

3. Methodology

This paper describes a method to normalize Bangla text. As in other work, the basic processes are the same: tokenization, token classification, token sense disambiguation and word representation. Before the processes of Bengali text normalization are discussed, it is necessary to first discuss the different classes of Bangla NSWs and their equivalent pronunciation word representations. Whereas [1] uses decision trees and decision lists for disambiguation, this work uses a rule-based system. The following section discusses semiotic class [7] (as opposed to, say, NSW) identification, tokenization, standard word generation and the disambiguation rules. The system diagram of the text normalization procedure is shown in figure 1.

According to the semiotic classes, a lexical analyzer was designed to tokenize each NSW by regular expression using the tool JFlex [8]. We assigned a tag to each token according to its semiotic class. The outputs of the tokenization are then used in the next step, i.e., the token expander. According to the assigned tag, token verbalization and disambiguation were performed by the token expander.

[Figure 1: Text normalization system for Bangla. Diagram labels: Input text; Tokenization; JFlex lexical analyzer; Tokenizer; Splitter; Classifier; Token expander; Token expansion rule; Disambiguation rule; Look-up table for abbreviation, acronym, and number; List of words in normalized form.]

3.1. Semiotic class identification

We identified a set of semiotic classes that belong to the Bangla language. To do this, we selected a news corpus [9] with 18100378 tokens and 384048 token types [13], a forum [10] and a blog [11]. We then proceeded in two steps to identify the semiotic classes: (i) a Python [13] script was used to identify the semiotic classes from the news corpus, and we manually checked them in the forum and blog; (ii) we defined a set of rules according to the context of homographs or ambiguous tokens. The result is a set of semiotic classes in Bangla text, as shown in table 1.

Table 1: Possible token types in Bangla text

Semiotic class/token type | Example
English text; Bangla text | জাভা Platform Independent বলেই সময়ের সবচেয়ে
Numbers (cardinal, ordinal, roman, floating number, fraction, ratio, range) | 121,23,234; 1ম, 2য়, 3য়; I, II, III; 12.23, 23,33.33; 1/2, 23/23; 12:12; 12-23
Telephone and mobile number | 029567447; 0152303398 (19 different formats)
Years | 2006; 1998; 98 সালে
Date | 02-06-2006 (12 different formats)
Time | 4.20 মিঃ; 4.20 মিনিট
Percentage | 12%
Money | 10 ৳
E-mail | আমার ই-মেইল ঠিকানা: abc@yahoo.com
URL | সফটওয়ার http://googlegdata.googlecode.com সাইট
Abbreviation | ডঃ; মাঃ; সাঃ
Acronym | ঢাবি; বাউবি; কবি
Mathematical equation | (1+2=3)
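To make the tag-then-disambiguate idea concrete, the following is a minimal sketch in Python rather than JFlex. It is not the authors' lexical analyzer: the tag names, the regular expressions, and the two context cues (`সালে` for years, `৳` for money) are assumptions chosen to mirror a few rows of table 1. Each whitespace-separated token is matched against an ordered list of patterns, and a small context rule then relabels a cardinal number when the following token gives it away.

```python
import re

# Hypothetical tag names and patterns, loosely modelled on table 1.
# Order matters: more specific patterns are tried before general ones.
SEMIOTIC_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("URL", re.compile(r"https?://\S+")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
    ("FRACTION", re.compile(r"\d+/\d+")),
    ("RATIO", re.compile(r"\d+:\d+")),
    ("RANGE", re.compile(r"\d+-\d+")),
    ("FLOAT", re.compile(r"\d+\.\d+")),
    ("CARDINAL", re.compile(r"\d+")),
]

# Hypothetical context cues: a cardinal followed by one of these tokens
# is relabelled (cf. the "98 সালে" and "10 ৳" examples in table 1).
CONTEXT_CUES = {"সালে": "YEAR", "৳": "MONEY"}


def classify(token):
    """Return the tag of the first pattern that matches the whole token."""
    for tag, pattern in SEMIOTIC_PATTERNS:
        if pattern.fullmatch(token):
            return tag
    return "WORD"  # anything else is treated as an ordinary word


def tag_text(text):
    """Split on whitespace, tag each token, then apply the context rule."""
    tagged = [(tok, classify(tok)) for tok in text.split()]
    for i in range(len(tagged) - 1):
        tok, tag = tagged[i]
        cue = CONTEXT_CUES.get(tagged[i + 1][0])
        if tag == "CARDINAL" and cue is not None:
            tagged[i] = (tok, cue)
    return tagged


if __name__ == "__main__":
    for token, tag in tag_text("98 সালে 12% 1/2 abc@yahoo.com"):
        print(token, tag)
```

Splitting on whitespace before matching keeps Bangla words intact, since Python's `\w` does not match Bangla vowel signs. A bare `2006` stays `CARDINAL` in this sketch, and `12:12` is tagged as a ratio although it could equally be a time; resolving such cases is what the paper's disambiguation rules are for.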
