13.07.2015 Views

Effectiveness of Stemming and n-grams String Similarity Matching ...


Issue 3, Volume 5, 2011, page 210

The theoretical maximum number of distinct bigrams is 26x26=676 (without spaces) and 27x27=729 (with spaces), less 1 bigram of the form (space,space), which gives a total of 728. As in English, these theoretical maxima are never reached in practice, because certain bigrams such as QQ and XY [19] simply do not occur in Malay text either. In this Malay text the number of non-zero nonstemmed bigrams is 377, only 52% of the maximum, and the number of non-zero stemmed bigrams is 355, only 49%. Table-3 contains the top 100 bigrams for both the nonstemmed and the stemmed text. To give a general idea of the whole range of both sets of bigrams, their rank-frequency distribution and zipfian distribution were plotted (Figures 1 and 2).

Trigrams

The maximum possible number of distinct trigrams is 26x26x26=17576 (without spaces) and 27x27x27=19683 (with spaces), less 702 trigrams that have a space in the middle position, less 1 trigram of the form (space,space,space), which gives a total of 18980. Appropriate storage arrays were used to hold both nonstemmed and stemmed trigrams, as indicated in Table-2 above.
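The counts above can be reproduced with a short Python sketch. It is illustrative only: the function names (`theoretical_max`, `ngrams`, `count_ngrams`) are hypothetical, and padding each word with a single space on either side is an assumption about how the "with spaces" n-grams were formed, not something the paper states here.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase  # the 26 letters assumed in the text


def theoretical_max(n, with_space=True):
    """Raw number of distinct n-grams over 26 letters, optionally plus the space."""
    size = len(ALPHABET) + (1 if with_space else 0)
    return size ** n


def ngrams(word, n):
    """n-grams of a word padded with one space on each side (assumed convention)."""
    padded = " " + word.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def count_ngrams(words, n):
    """Tally n-gram frequencies over an iterable of words."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts


# Totals after the exclusions described in the text
max_bigrams = theoretical_max(2) - 1          # drop (space, space)
max_trigrams = theoretical_max(3) - 702 - 1   # drop the 702 + 1 excluded trigrams
print(max_bigrams, max_trigrams)

# Share of the bigram space actually used (377 nonstemmed, 355 stemmed)
print(round(100 * 377 / max_bigrams), round(100 * 355 / max_bigrams))
```

The number of non-zero entries in the `Counter` returned by `count_ngrams` corresponds to the "non-zero" n-gram counts quoted in the text, so a dictionary-like structure only stores the n-grams that actually occur.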
The number of non-zero nonstemmed trigrams is 2214, which utilises only 12% of the maximum, and the number of non-zero stemmed trigrams is 1892, only 10%. Thus the size of the array allocation can be greatly reduced at run time.

The 50 most frequently occurring trigrams are listed in Table-4 above. The rank-frequency distribution and zipfian distribution for both sets of trigrams are presented in Figures 3 and 4 above.

Zipfian Distribution

The curves demonstrating the zipfian distributions of the n-grams for Malay text in Figures 1, 2, 3 and 4 have a similar hyperbolic shape, complying with Zipf's law [20], which states that the product of the frequency of words and their rank order is approximately constant. Such analysis should not be restricted to just words [1]. Luhn [21] used Zipf's law to specify an upper and a lower cut-off. The words exceeding the upper cut-off were considered to be common and those below the lower cut-off to be rare, leaving the rest as significant words. These cut-offs were established by trial and error, by estimating in both directions from the peak of the rank-order position. Thus removing the most frequent n-grams should not change the retrieval effectiveness of the documents, as examined later.

Experimental Evaluation Procedure

The experiments involved calculating the string similarity measure of each unique term in the ITD to a specified query term and ranking the terms accordingly. This procedure is the same as the automatic query expansion approach set out by Lennon et al. [4].
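The ranking step can be illustrated with a minimal Python sketch. The similarity measure used here, Dice's coefficient over sets of padded bigrams, is an assumption chosen for illustration; the function names (`bigram_set`, `dice`, `rank_terms`) and the sample terms are hypothetical and are not taken from the paper.

```python
def bigram_set(term):
    """Distinct bigrams of a term padded with one space on each side (assumed convention)."""
    padded = " " + term.lower() + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice's coefficient: 2 * |A & B| / (|A| + |B|) over the two bigram sets."""
    x, y = bigram_set(a), bigram_set(b)
    return 2 * len(x & y) / (len(x) + len(y))


def rank_terms(query, terms):
    """Return (term, score) pairs sorted by decreasing similarity to the query."""
    scored = [(term, dice(query, term)) for term in terms]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical index terms; a real run would use the unique terms of the collection.
terms = ["minum", "dimakan", "makanan", "makan"]
for term, score in rank_terms("makan", terms):
    print(f"{term:10s} {score:.3f}")
```

Terms at the top of the ranked list are the candidates that a query expansion step would add to the original query term.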
Following are the evaluation procedures that are carried out.

Table-3. The Most Frequently Occurring Bigrams (with spaces)

        Nonstemmed            Stemmed
Rank    Bigram  Frequency     Bigram  Frequency
1       an      2567          an      401
2       ka      1472          a       341
3       n       1391          ng      330
4       a       1356          s       242
5       ng      1136          g       209
6       er      1111          b       208
7       ya      1104          t       207
8       m       1067          h       204
9       ny      1062          k       200
10      me      917           i       198
11      en      909           ah      195
12      ah      868           ra      191
13      la      826           la      190
14      d       731           ka      188
15      di      707           n       184
16      h       674           at      175
17      b       632           ar      174
18      k       594           t       174
19      pe      581           r       171
20      p       578           m       167
21      ak      563           ta      166
22      be      549           p       160
23      em      530           er      158
24      at      522           ak      156
25      ta      509           ba      152
26      ra      498           ma      150
27      in      471           k       144
28      ar      462           sa      139
29      u       461           in      135
30      ga      442           u       130
31      ba      441           al      129
32      i       430           am      128
33      ke      428           en      124
34      sa      419           un      121
35      t       417           as      120
36      s       414           pa      119
37      nn      395           l       117
38      ku      391           da      115
39      ha      385           ha      113
40      ma      382           a       109
41      mu      358           ur      106
42      al      349           ga      103
43      un      349           s       99
44      am      341           ya      97
45      pa      335           ri      95
46      ik      320           na      94
47      se      316           d       93
48      tu      310           se      89
49      na      308           el      88
50      as      299           h       87

INTERNATIONAL JOURNAL OF APPLIED MATHEMATICS AND INFORMATICS
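A rank-frequency listing such as Table-3 can be inspected against Zipf's law by tabulating the product of rank and frequency. The Python sketch below is a minimal illustration; the function name `rank_frequency` is hypothetical, and the sample values are the first five nonstemmed frequencies from Table-3. Since the law is only approximate, the fit is better judged over the whole distribution, as the paper does with its plots, than over a handful of top ranks.

```python
def rank_frequency(freqs):
    """Sort frequencies in decreasing order; return (rank, frequency, rank * frequency)."""
    ordered = sorted(freqs, reverse=True)
    return [(rank, freq, rank * freq) for rank, freq in enumerate(ordered, start=1)]


# First five nonstemmed bigram frequencies from Table-3
sample = [2567, 1472, 1391, 1356, 1136]
for rank, freq, product in rank_frequency(sample):
    print(rank, freq, product)
```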
