
International Journal of Computer Science and Management Research Vol 1 Issue 1 Aug 2012 

ISSN 2278 – 733X 

A Novel Soundex Algorithm for Oriya Language 

Sampa Chaupattnaik, Sohag Sundar Nanda, Sanghamitra Mohanty

P.G. Department of Computer Science and Application, Utkal University

Abstract 

Phonetic matching plays a vital role in information retrieval in a multilingual environment. Information retrieval needs an exact match for a given string. Phonetic matching can be defined as the process of identifying a set of strings that are most likely to sound similar to a given keyword. We present a phonetic encoding for Odia that can be used by spelling checkers to provide better suggestions for misspelled words. The encoding is based on the Soundex algorithm, modified to match Odia phonetics. We start by analyzing the Soundex encoding scheme when applied to Odia. Next, we propose a new encoding that handles Odia words, including those containing conjuncts.

Key Words:

Odia phonetic encoding, Soundex, spelling suggestions 

Introduction 

India is a country where millions of people speak a variety of languages (such as Hindi, Odia, Tamil, Telugu, Marathi, Punjabi, English and Bengali). Indian languages use the syllable as the basic linguistic unit. Syllabic writing in Indic scripts is based on the phonetics of linguistic sounds, and the syllabic model is common to Indic scripts. Multilingual information exchange and communication tasks are therefore carried out in verbal and textual mode as part of academic, professional, and literary activities in India.

Literature survey 

In the Odia language there are 49 letters, broadly divided into 2 categories: i) Vowels (13) ii) Consonants (36)

List of vowels:

(a), (A), (i), (ii), (u), (uu), (r), (r), (l), (e), (ee), (o), (oo)

List of consonants:

(k), (kh), (g), (gh), (nga), (c), (ch), (j), (jh), (nya), (T), (Th), (D), (Dh), (N), (t), (th), (d), (dh), (n), (p), (ph), (b), (bh), (m), (y), (r), (l), (L), (b\), (s), (S), (sh), (h), anuswar, chandrabindu, bisarga

From a phonetic point of view, however, the vowels can be divided into 2 types, which together reduce to 8 sounds:

i) Short vowels: (a), (i), (u), (r), (l)

ii) Long vowels: (A), (ii), (uu), (r), (e), (ee), (o), (oo)

Nowadays the characters (r), (r) and (l) are no longer in use, so they are not distinguished phonetically. Hence the short vowels reduce to 3 [(a), (i), (u)] and the long vowels are [(A), (ii), (uu), (e), (ee), (o), (oo)]. The long vowels can in turn be reduced to 5 types, [(A), (ii), (uu), (e), (o)], by treating (e) and (ee) as the same and (o) and (oo) as the same. Thus the vowels in Odia fall into 8 types.



Phonetic chart for consonants: the consonants are divided into the following groups:

1. (k), (kh), (g), (gh), (nga): known as velar
2. (c), (ch), (j), (jh), (nya): known as palatal
3. (T), (Th), (D), (Dh), (N): known as retroflex
4. (t), (th), (d), (dh), (n): known as dental
5. (p), (ph), (b), (bh), (m): known as labial

The above 5 groups are known as stop consonant phonemes. The remaining consonants are grouped as follows:

1. (y), (r), (l), (L), (b\): known as semivowels
2. (s), (S), (sh), (h): known as spirants
3. anuswar, chandrabindu, visarga

Soundex Algorithm : 

Soundex is a phonetic algorithm for indexing names by their sound when 

pronounced in English. The basic aim is for names with the same pronunciation to be 

encoded to the same string so that matching can occur despite minor differences in 

spelling. Soundex is the most widely known of all phonetic algorithms and is often 

used (incorrectly) as a synonym for "phonetic algorithm". 

Soundex Algorithm for English : 

The Soundex code for a name consists of a letter followed by three numbers: 

the letter is the first letter of the name, and the numbers encode the remaining 

consonants. E.g. B362 [1] 

The exact algorithm is as follows: 




1. Retain the first letter of the string 

2. Remove all occurrences of the following letters, unless it is the first letter: 

a, e, h, i, o, u, w, y 

3. Assign numbers to the remaining letters (after the first) as follows:

Number   Letters
1        B, F, P, V
2        C, G, J, K, Q, S, X, Z
3        D, T
4        L
5        M, N
6        R

4. If two or more letters with the same number were adjacent in the original 

name (before step 1), or adjacent except for any intervening h and w, 

then omit all but the first. 

5. Return the first four characters of the result, padded with 0 if it is shorter.
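The five steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `SOUNDEX_CODES` and `soundex` are chosen here, and ignoring non-letter characters is an assumption that the steps do not state.

```python
# Letter-to-digit table from step 3.
SOUNDEX_CODES = {}
for digit, letters in {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}.items():
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of an English name."""
    # Assumption: anything outside A-Z is dropped before encoding.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]                      # step 1: keep the first letter
    prev = SOUNDEX_CODES.get(letters[0], "")   # its code still counts for step 4
    for ch in letters[1:]:
        if ch in "HW":
            continue                           # h and w do not separate equal codes
        code = SOUNDEX_CODES.get(ch, "")       # vowels and y have no code (step 2)
        if code and code != prev:
            result.append(code)                # step 4: skip adjacent duplicates
        prev = code                            # a vowel resets the previous code
    return "".join(result)[:4].ljust(4, "0")   # step 5: truncate and pad


print(soundex("Robert"), soundex("Rupert"))
```

Names that differ only in vowels or in letters of the same group, such as "Robert" and "Rupert", receive the same code under this scheme.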

Soundex Algorithm for Odia: After the literature review we classify the Odia alphabet phonetically as follows:

i) Vowels (8): (a), (i), (u) as short vowels

ii) (A), (ii), (uu), (e), (o) as long vowels

Similarly, there are 32 sounds for consonants. [The list of the 32 consonant characters, and of the five characters noted as having no simple sounds, is not recoverable here.] N.B.: (nga), (nya), (N), (n), (m) form the group known as nasal sounds.



Thus we need to group the Odia phonetics for vowels as well as consonants. While grouping and mapping Indic (Odia) letters to phonetic codes, the following facts are taken into consideration:

1. Group each short vowel and its long vowel under a single code.
   a) (a), (aa) as the same group
   b) (i), (ii) as the same group
   c) (u), (uu) as the same group
   d) (e), (ee) as the same group
   e) (o), (oo) as the same group
2. Group each consonant family into a single family: ka, kh, g, gh, nga become a single family, and likewise c, ch, j, jh, nya become a single family. [Refer to the above list for grouping]
3. Group j, Y as the same.
4. Group l, L as the same.
5. Group r, Ra as the same.
6. Group s, S, sh as the same.

In this way we find 5 codes for vowels and 14 codes for consonants.

Hence the algorithm is as follows:

1. Scan the word from left to right and count the number of characters (consonants/vowels), including the matras.
2. Retain the first letter of the string.
3. Remove the repeated letters.
4. Assign numbers to the remaining letters (after the first) as follows:



Code   Letter name                       Number
A      a, aa                             1
B      i, ii                             2
C      u, uu                             3
D      e, ee                             4
E      o, oo                             5
F      k, kh, g, gh                      6
G      nga                               7
H      c, ch, j, jh                      8
I      nya                               9
J      T, Th, D, Dh                      10
K      N                                 11
T      t, th, d, dh                      12
M      n                                 13
P      p, ph, b, bh                      14
N      m                                 15
R      r                                 16
H      y (same as j)                     8
L      L, l                              17
S      s, S, sh                          18
Q      h                                 19
       anuswar, chandrabindu, bisarga    Not coded (0)

(Table-I)

5. A matra (vowel sign) attached to a consonant [consonant + vowel] is coded as follows:

Code   Matra name            Number
A      aa kara               1
B      i kara, ii kara       2
C      u kara, uu kara       3
R      r kara                16
D      e kara, ee kara       4
E      o kara, oo kara       5

(Table-II)

6. Return the first six characters of the code, padded with 0 if it is shorter.

Example: Suppose we input a word meaning "things made up". It is interpreted character by character (consonants and matras), so the code is 1411213. Similarly, we interpret the word meaning "pillow" and obtain the same code. Thus these two words sound the same and are phonetically similar, i.e. both give 1411213 = 14112.

As another example, the word meaning "goddess Durga" gives 11720, and the word meaning "ink" also gives 11720.
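The encoding steps and the two tables can be sketched in Python as below. Several things here are assumptions rather than the authors' implementation: words are given as lists of the transliterated names used in Table-I and Table-II (not Odia characters), a matra shares the key of its vowel, "remove the repeated letters" is read as collapsing immediately repeated symbols, uncoded signs are skipped, and code 13 is taken to be the dental nasal `n`. The names `GROUPS`, `CODE`, `UNCODED` and `odia_soundex`, and the sample word, are hypothetical.

```python
# Table-I / Table-II codes keyed by transliterated name.
# A matra uses the same key as its vowel ("aa" covers the letter and aa kara).
GROUPS = {
    1: "a aa", 2: "i ii", 3: "u uu", 4: "e ee", 5: "o oo",
    6: "k kh g gh", 7: "nga", 8: "c ch j jh y", 9: "nya",
    10: "T Th D Dh", 11: "N", 12: "t th d dh",
    13: "n",  # assumption: the dental nasal
    14: "p ph b bh", 15: "m", 16: "r", 17: "l L", 18: "s S sh", 19: "h",
}
CODE = {name: str(num) for num, names in GROUPS.items() for name in names.split()}

# Signs listed as "Not coded" in Table-I; this sketch skips them.
UNCODED = {"anuswar", "chandrabindu", "bisarga"}


def odia_soundex(symbols, length=6):
    """Encode a word given as a list of transliterated letters and matras."""
    if not symbols:
        return ""
    # Step 3 (as interpreted here): collapse immediately repeated symbols.
    deduped = [symbols[0]]
    for sym in symbols[1:]:
        if sym != deduped[-1]:
            deduped.append(sym)
    # Step 2: the first letter is kept as-is; steps 4-5: code the rest.
    digits = "".join(CODE[s] for s in deduped[1:] if s not in UNCODED)
    # Step 6: first letter plus digits, six characters in total, padded with 0.
    return deduped[0] + digits[: length - 1].ljust(length - 1, "0")


# Hypothetical pair differing only in vowel length (i kara vs. ii kara).
print(odia_soundex(["k", "aa", "L", "ii"]))
print(odia_soundex(["k", "aa", "L", "i"]))
```

Because multi-digit codes are concatenated without separators, different code sequences (for example 1 followed by 12, and 11 followed by 2) yield the same digit string; the sketch reproduces that behaviour rather than resolving it.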

We can also verify that these words sound the same using edit distance. E.g.:

0 1 1 1 1 1
1 0 1 2 2 2
1 1 0 1 1 2
1 2 1 1 2 2
1 2 2 2 1 2
1 2 3 3 3 1

Hence the edit distance between these two strings is 1.
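For reference, the edit distance between two symbol sequences can be computed with the standard Levenshtein dynamic-programming recurrence, sketched below. The function name `edit_distance` and the transliterated sample words are illustrative assumptions; the sketch is not claimed to reproduce the matrix above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x by y (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical pair of words differing in one matra (ii kara vs. i kara).
print(edit_distance(["k", "aa", "L", "ii"], ["k", "aa", "L", "i"]))
```

Working on lists of letters and matras, rather than raw strings, keeps each matra as a single unit, so a change of vowel length counts as one substitution.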

In the Odia language we also find conjunct characters (juktakhara). Whereas a matra combines a vowel with a consonant, a juktakhara combines a consonant with one or more other consonants. We find several (116) juktakhara in Odia. Some are listed below:

[List of juktakhara, each given in the form consonant + consonant = conjunct; the Odia characters are not recoverable here.]

Soundex Algorithm for Juktakhara 



1. Scan the word from left to right and count the number of characters (i.e. consonants/vowels/matras).
2. Retain the first character.
3. Remove the repeated characters.
4. Assign codes to the remaining characters as follows:
   i) For a juktakhara (compound character), merge its component characters, whether consonants, vowels or matras, by following Table-I and Table-II above.
   ii) Then make a new code and code number for that juktakhara.
5. Return the first six characters of the code, padded with 0 if it is shorter.

Ex: Let us consider two words which sound the same. The first word (meaning "astonished") is interpreted character by character, including its juktakhara, and the corresponding code is 101812 = 10181. Similarly, the second word (meaning "bank of river") is interpreted in the same way, and its code is 101812 = 10181.
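One way to read the juktakhara step is that a conjunct's code is the concatenation of the Table-I codes of its component consonants. The sketch below follows that reading, which is an assumption and not confirmed by the text. Words are lists of transliterated names, with a juktakhara written as a tuple of its consonants; `juktakhara_soundex`, `unit_code` and the sample words are hypothetical, and code 13 is again taken to be the dental nasal `n`.

```python
# Table-I / Table-II codes keyed by transliterated name (13 = "n" is an assumption).
GROUPS = {
    1: "a aa", 2: "i ii", 3: "u uu", 4: "e ee", 5: "o oo",
    6: "k kh g gh", 7: "nga", 8: "c ch j jh y", 9: "nya",
    10: "T Th D Dh", 11: "N", 12: "t th d dh", 13: "n",
    14: "p ph b bh", 15: "m", 16: "r", 17: "l L", 18: "s S sh", 19: "h",
}
CODE = {name: str(num) for num, names in GROUPS.items() for name in names.split()}


def unit_code(unit):
    """Code of one unit: a single letter/matra, or a juktakhara tuple."""
    if isinstance(unit, tuple):
        # Assumed reading of step 4: merge the codes of the components.
        return "".join(CODE[c] for c in unit)
    return CODE[unit]


def juktakhara_soundex(units, length=6):
    """Encode a word whose juktakhara are given as tuples of consonants."""
    if not units:
        return ""
    # Step 3 (as interpreted here): collapse immediately repeated units.
    deduped = [units[0]]
    for unit in units[1:]:
        if unit != deduped[-1]:
            deduped.append(unit)
    first = deduped[0]
    first_text = "".join(first) if isinstance(first, tuple) else first
    digits = "".join(unit_code(u) for u in deduped[1:])
    # Step 5: first character plus digits, six characters in total, padded with 0.
    return first_text + digits[: length - 1].ljust(length - 1, "0")


# Hypothetical words: t, T, then a conjunct of s (or S) with th.
print(juktakhara_soundex(["t", "T", ("s", "th")]))
print(juktakhara_soundex(["t", "T", ("S", "th")]))
```

Both sample words yield the digits 10181 after the retained first letter, because s and S share code 18 in Table-I.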

Conclusion: We present a Soundex algorithm for the Oriya language, which can further be used in spelling correction for words that sound the same.
