
International Journal of Computer Science and Management Research, Vol 1 Issue 1, Aug 2012
ISSN 2278 – 733X

A Novel Soundex Algorithm for Oriya Language

Sampa Chaupattnaik, Sohag Sundar Nanda, Sanghamitra Mohanty
P.G. Department of Computer Science and Application, Utkal University

Abstract

Phonetic matching plays a vital role in information retrieval in a multilingual environment. Information retrieval needs an exact match for a given string. Phonetic matching can be defined as the process of identifying the set of strings that are most likely to sound similar to a given keyword. We present a phonetic encoding for Odia that can be used by spelling checkers to provide better suggestions for misspelled words. The encoding is based on the Soundex algorithm, modified to match Odia phonetics. We start by analyzing the Soundex encoding scheme when applied to Odia. Next, we propose a new encoding that handles Odia words, including those containing conjuncts.

Key Words: Odia phonetic encoding, Soundex, spelling suggestions

Introduction

India is a country where millions of people speak a variety of languages (such as Hindi, Odia, Tamil, Telugu, Marathi, Punjabi, English, and Bengali). Indian languages use the syllable as the basic linguistic unit. Syllabic writing in Indic scripts is based on the phonetics of linguistic sounds, and the syllabic model is common to Indic scripts. Multilingual information exchange and communication tasks are therefore carried out in verbal and textual modes as part of academic, professional, and literary activities in India.

Literature survey

The Odia language has 49 letters, which are broadly divided into two categories: i) vowels (13) and ii) consonants (36).

The vowels are:
(a), (A), (i), (ii), (u), (uu), (r), (r), (l), (e), (ee), (o), (oo)

The consonants are:
(k), (kh), (g), (gh), (nga), (c), (ch), (j), (jh), (nya), (T), (Th), (D), (Dh), (N), (t), (th), (d), (dh), (n), (p), (ph), (b), (bh), (m), (y), (r), (l), (L), (b\), (s), (S), (sh), (h), anuswar, chandrabindu, bisarga

From a phonetic point of view, however, the vowels can be divided into two types:

i) Short vowels: (a), (i), (u), (r), (l)
ii) Long vowels: (A), (ii), (uu), the long form of (r), (e), (ee), (o), (oo)

Nowadays the characters (r), its long form, and (l) are no longer in use, so phonetically they need not be distinguished. Hence the short vowels reduce to three, [(a), (i), (u)], and the long vowels are [(A), (ii), (uu), (e), (ee), (o), (oo)]. Treating (e) and (ee) as the same, and likewise (o) and (oo), the long vowels reduce further to five types: [(A), (ii), (uu), (e), (o)]. Thus the vowels in Odia can be categorised into 8 types.

Phonetic chart for consonants: The consonants are divided into the following groups:

1. (k), (kh), (g), (gh), (nga): known as velar
2. (c), (ch), (j), (jh), (nya): known as palatal
3. (T), (Th), (D), (Dh), (N): known as retroflex
4. (t), (th), (d), (dh), (n): known as dental
5. (p), (ph), (b), (bh), (m): known as labial

These five groups are known as the stop consonant phonemes. The remaining consonants are grouped as follows:

1. (y), (r), (l), (L), (b\): known as semivowels
2. (s), (S), (sh), (h): known as spirants
3. anuswar, chandrabindu, visarga
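The grouping above can be sketched as a simple lookup table. This is only an illustration keyed by the transliterations used in the chart; the names `PHONETIC_GROUPS`, `CLASS_OF`, and `is_stop` are hypothetical and do not come from the paper.

```python
# Phonetic classes from the chart, keyed by transliteration (illustrative only).
PHONETIC_GROUPS = {
    "velar": ["k", "kh", "g", "gh", "nga"],
    "palatal": ["c", "ch", "j", "jh", "nya"],
    "retroflex": ["T", "Th", "D", "Dh", "N"],
    "dental": ["t", "th", "d", "dh", "n"],
    "labial": ["p", "ph", "b", "bh", "m"],
    "semivowel": ["y", "r", "l", "L", "b\\"],
    "spirant": ["s", "S", "sh", "h"],
}

# Reverse index: transliterated letter -> class name.
CLASS_OF = {
    letter: name
    for name, letters in PHONETIC_GROUPS.items()
    for letter in letters
}

STOP_CLASSES = {"velar", "palatal", "retroflex", "dental", "labial"}


def is_stop(letter: str) -> bool:
    """Return True if the transliterated letter belongs to one of the five stop groups."""
    return CLASS_OF.get(letter) in STOP_CLASSES
```

For instance, `CLASS_OF["Dh"]` gives `"retroflex"`, and `is_stop("s")` is false because (s) is a spirant.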

Soundex Algorithm:

Soundex is a phonetic algorithm for indexing names by their sound when pronounced in English. The basic aim is for names with the same pronunciation to be encoded to the same string, so that matching can occur despite minor differences in spelling. Soundex is the most widely known of all phonetic algorithms and is often used (incorrectly) as a synonym for "phonetic algorithm".

Soundex Algorithm for English:

The Soundex code for a name consists of a letter followed by three digits: the letter is the first letter of the name, and the digits encode the remaining consonants, e.g. B362 [1].

The exact algorithm is as follows:

1. Retain the first letter of the string.
2. Remove all occurrences of the following letters, unless they are the first letter: a, e, h, i, o, u, w, y.
3. Assign numbers to the remaining letters (after the first) as follows:

   Number   Letters
   1        B, F, P, V
   2        C, G, J, K, Q, S, X, Z
   3        D, T
   4        L
   5        M, N
   6        R

4. If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w, then omit all but the first.
5. Return the first four characters, padded with 0 if the code is shorter.
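The five steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the function name `soundex` and the handling of empty input are assumptions made for the example.

```python
# Digit assigned to each consonant, as in the table of step 3.
_DIGITS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {ch: digit for digit, letters in _DIGITS.items() for ch in letters}


def soundex(name: str) -> str:
    """Encode a name as its first letter followed by three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""  # assumption: empty input yields an empty code
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's own code suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so the same code may follow it
    return (first + "".join(digits) + "000")[:4]
```

With this sketch, "Robert" and "Rupert" both encode to R163, which is the matching behaviour the algorithm is designed for.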

Soundex Algorithm for Odia: Following the literature review, we classify the Odia alphabet phonetically as follows:

i) Vowels (8): (a), (i), (u) as short vowels
ii) (A), (ii), (uu), (e), (o) as long vowels

Similarly, there are 32 sounds for consonants; a few characters have no simple sound of their own. N.B.: (nga), (nya), (N), (n), (m) form the group known as nasal sounds.

We therefore need to group the Odia phonetics for vowels as well as consonants. While grouping and mapping Indic (Odia) letters to phonetic codes, the following facts are taken into consideration:

1. Group each short vowel and its long vowel under a single code:
   a) (a), (aa) as the same group
   b) (i), (ii) as the same group
   c) (u), (uu) as the same group
   d) (e), (ee) as the same group
   e) (o), (oo) as the same group
2. Group consonant families: ka, kh, g, gh, nga become a single family. The same holds for c, ch, j, jh, nya. [Refer to the above list for grouping.]
3. Group j, Y as the same.
4. Group l, L as the same.
5. Group r, Ra as the same.
6. Group s, S, sh as the same.

In this way we obtain 5 codes for vowels and 14 codes for consonants.

Hence the algorithm is as follows:

1. Scan the word from left to right and count the number of characters (consonants and vowels), including the matras.
2. Retain the first letter of the string.
3. Remove the repeated letters.
4. Assign numbers to the remaining letters (after the first) as follows:

   Code   Letters (by name)     Number
   A      a, aa                 1
   B      i, ii                 2
   C      u, uu                 3
   D      e, ee                 4
   E      o, oo                 5
   F      k, kh, g, gh          6
   G      nga                   7
   H      c, ch, j, jh          8
   I      nya                   9
   J      T, Th, D, Dh          10
   K      N                     11
   T      t, th, d, dh          12
   M      n                     13
   P      p, ph, b, bh          14
   N      m                     15
   R      r                     16
   H      y (same as j)         8
   L      L, l                  17
   S      s, S, sh              18
   Q      h                     19
   Signs such as anuswar, chandrabindu, and visarga: not coded (0)

   (Table I)

5. For a matra (vowel sign) attached to a consonant [consonant + vowel], we use the following codes:

   Code   Matra name            Number
   A      aa kara               1
   B      i kara, ii kara       2
   C      u kara, uu kara       3
   R      r kara                16
   D      e kara, ee kara       4
   E      o kara, oo kara       5

   (Table II)

6. Return the first six characters, padded with 0.
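A minimal sketch of steps 1 to 6 in Python, using the numbers of Table I and Table II, might look as follows. Several points are assumptions rather than statements of the paper: the pairing of Unicode Odia code points with the transliterations in the tables, the reading of "remove the repeated letters" as collapsing adjacent characters that share a code, the skipping of uncoded signs, and the function name `odia_soundex`.

```python
import unicodedata

# Table I / Table II numbers keyed by Unicode Odia code points.
# The code point chosen for each transliteration is an assumption.
_GROUPS = {
    "1": "\u0B05\u0B06\u0B3E",              # a, aa, aa kara
    "2": "\u0B07\u0B08\u0B3F\u0B40",        # i, ii and their matras
    "3": "\u0B09\u0B0A\u0B41\u0B42",        # u, uu and their matras
    "4": "\u0B0F\u0B10\u0B47\u0B48",        # e, ee and their matras
    "5": "\u0B13\u0B14\u0B4B\u0B4C",        # o, oo and their matras
    "6": "\u0B15\u0B16\u0B17\u0B18",        # k, kh, g, gh
    "7": "\u0B19",                          # nga
    "8": "\u0B1A\u0B1B\u0B1C\u0B1D\u0B2F",  # c, ch, j, jh, and y
    "9": "\u0B1E",                          # nya
    "10": "\u0B1F\u0B20\u0B21\u0B22",       # T, Th, D, Dh
    "11": "\u0B23",                         # N
    "12": "\u0B24\u0B25\u0B26\u0B27",       # t, th, d, dh
    "13": "\u0B28",                         # n
    "14": "\u0B2A\u0B2B\u0B2C\u0B2D",       # p, ph, b, bh
    "15": "\u0B2E",                         # m
    "16": "\u0B30\u0B43",                   # r, r kara
    "17": "\u0B32\u0B33",                   # l, L
    "18": "\u0B38\u0B36\u0B37",             # s, S, sh
    "19": "\u0B39",                         # h
}
CODES = {ch: num for num, chars in _GROUPS.items() for ch in chars}


def odia_soundex(word: str, length: int = 6) -> str:
    """Return the first character plus concatenated codes, cut or padded to `length`."""
    chars = unicodedata.normalize("NFC", word)  # keep two-part matras as one code point
    if not chars:
        return ""
    first = chars[0]
    numbers = []
    prev = CODES.get(first)
    for ch in chars[1:]:
        code = CODES.get(ch)
        if code is None:
            continue  # anuswar, chandrabindu, visarga, virama: not coded
        if code != prev:
            numbers.append(code)
        prev = code
    return (first + "".join(numbers) + "0" * length)[:length]
```

Under these assumptions, a word spelled k, aa kara, L, ii kara and the same word spelled with the short i kara both produce the first letter followed by 11720, so a long and a short vowel sign no longer separate two spellings.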

Example: Suppose the input is the word meaning "things made up". It is interpreted as the sequence of its letters and matras, so the code is 1411213, which is cut to 14112. The word meaning "pillow" is interpreted in the same way, and we obtain the same code. Thus these two words sound the same and are phonetically similar, i.e. both words give 1411213 = 14112.

Similarly, as another example, the word meaning "goddess Durga" gives 11720 and the word meaning "ink" also gives 11720.

We can also verify that such words sound the same using edit distance. For example, with the characters of one word along the rows and the characters of the other along the columns:

   0 1 1 1 1 1
   1 0 1 2 2 2
   1 1 0 1 1 2
   1 2 1 1 2 2
   1 2 2 2 1 2
   1 2 3 3 3 1

Hence the edit distance between these two strings is 1.
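As a rough illustration of the check, the standard Levenshtein edit distance can be computed with the usual dynamic-programming recurrence. The paper does not state which variant of edit distance it uses, so the choice of Levenshtein distance and the function name `edit_distance` are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

Applied to Unicode strings, the distance is counted per code point, so two Odia spellings that differ only in one matra are at distance 1.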

In the Odia language we also find conjuncts, or compound characters (juktakhara). Whereas a matra combines a vowel with a consonant, a juktakhara combines a consonant with one or more further consonants. We find several of these (116) in Odia. Some are listed below:

[List of juktakhara formations, each of the form consonant + consonant = conjunct]

Soundex Algorithm for Juktakhara

1. Scan the word from left to right and count the number of characters (i.e. consonants, vowels, and matras).
2. Retain the first character.
3. Remove the repeated characters.
4. Assign codes to the remaining characters as follows:
   i) For a juktakhara (compound character), merge its component characters, whether consonants, vowels, or matras, by following Table I and Table II.
   ii) Then make a new code and code number for that juktakhara.
5. Return the first six characters, padded with 0.
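To apply step 4, the conjuncts in a word first have to be located. The sketch below assumes Unicode Odia text, in which a conjunct is stored as consonant + virama (U+0B4D) + consonant; that representation, the consonant range used, and the name `find_conjuncts` are assumptions for illustration, not part of the paper.

```python
import unicodedata

VIRAMA = "\u0B4D"  # the sign that joins consonants into a conjunct in Unicode text


def is_consonant(ch: str) -> bool:
    """True for code points in the main Odia consonant range (ka to ha)."""
    return "\u0B15" <= ch <= "\u0B39"


def find_conjuncts(word: str) -> list:
    """Return every consonant cluster of the form C + virama + C (+ virama + C ...)."""
    word = unicodedata.normalize("NFC", word)
    clusters = []
    i = 0
    while i < len(word):
        if is_consonant(word[i]):
            j = i + 1
            # Extend the cluster for as long as a virama plus a consonant follows.
            while j + 1 < len(word) and word[j] == VIRAMA and is_consonant(word[j + 1]):
                j += 2
            if j - i > 1:
                clusters.append(word[i:j])
            i = j
        else:
            i += 1
    return clusters
```

Each cluster returned can then be split into its component consonants and coded with Table I, which is one way to read the merging described in step 4.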

Example: Let us consider two words that sound the same. The word meaning "astonished" is interpreted as the sequence of its component characters, and hence the corresponding code is 101812 = 10181. Similarly, the word meaning "bank of river" is interpreted as the same sequence of components, and its code is also 101812 = 10181.

Conclusion: We have presented a Soundex algorithm for the Oriya language, which can further be used in spelling correction for words that sound the same.
