11.07.2015 Views

Context-Aware Stemming Algorithm for Semantically Related Root ...

Context-Aware Stemming Algorithm for Semantically Related Root ...

Context-Aware Stemming Algorithm for Semantically Related Root ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Vol 5. No. 4, June 2012 ISSN 2006-1781African Journal of Computing & ICT© 2012 Afr J Comp & ICT – All Rights Reservedwww.ajocict.netTable 1: Words and the measure mList ARelateProbateConflatePiratePrelateList BDerivateActivateDemonstrateNecessitateRenovateThe main drawback of this algorithm is that it does not make useof a stem dictionary (lexicon). As per the algorithm, a list ofsuffixes is considered and with each suffix, there is a criterionunder which it may be removed from a word during the <strong>for</strong>mationof a valid stem. The main strengths of this algorithm are that it issmall, fast and reasonably simple. The algorithm feature oftenproduces stems, which are unintelligible. For example, the stemsresult <strong>for</strong> the word anxious is anxiou. This is one of the cases inwhich the suffix has been identified as denoting the plural <strong>for</strong>m ofthe given word. We also get stems like Microscop, Secur and Envi<strong>for</strong> Microscope, Secure and Envy respectively, which are clearlymeaningless. As another example, if Sand and Sander getconflated, so do wand and wander.The error committed here is that the –er part of wander has beentreated as its suffix when in fact it is part of the stem. However, theabsence of a lexicon results in improper conflations and anassociated loss of precision. Converting words like general togener and iteration to iter make it difficult to relate them todictionary entries. It also complicates the process of queryenhancement. We observe that all the approaches to stemmingmentioned not only ignore word meanings, but also operate in theabsence of any lexicon at all. The design goals of theseapproaches seem to be better retrieval and compressionper<strong>for</strong>mance and not the generation of a linguistically correct rootword that is in context. Our main goal is to maximize theproportion of the meaningful stems (words) in the stemmer output<strong>for</strong> a given set of input words, without compromising the otherper<strong>for</strong>mance measures.4. CONTEXT IN INFORMATION RETRIEVALSome definition of context is greatly dependent on the field ofapplications, as shown by the analysis of 150 different definitionsby [11]. In order to use context in developing context awareapplications, the notion of context has to be well understood. Awidely adopted definition of context <strong>for</strong> ubiquitous computing –which is also considered with context-aware IR defines context asany in<strong>for</strong>mation that can be used to characterize the situation of anentity. An entity is a person, place, or object that is consideredrelevant to the interaction between a user and an application,including the user and applications themselves [12].The central aspects in this definition are identity (user), activity(interaction with an application), location and time (as the temporalconstraints of a certain situation). In the same article, Dey definescontext awareness as a system that is context-aware if it usescontext to provide relevant in<strong>for</strong>mation and/or services to the user,where relevancy depends on the user’s task [12].Dey’s definition of context points to potential sources ofcontextual in<strong>for</strong>mation on a generic level. His definition of contextawareness, however, is directly related to the view on context <strong>for</strong>IR taken in this study, which defines an IR task’s context as anyin<strong>for</strong>mation whose change modifies the task’s outcome, <strong>for</strong>instance (stemming reduces any given query term(token), to acommon base). To achieve this goal, we develop an approach <strong>for</strong>context-aware IR that enables a system to provide relevantin<strong>for</strong>mation to the user.Research activities on context-aware IR have increased remarkablyin recent years. The ubiquitous and pervasive computingcommunities have developed numerous approaches toautomatically provide users with in<strong>for</strong>mation and services based ontheir current situation [13] <strong>for</strong> an overview. Existing context-awareapplications range from smart spaces [14] over mobile tour guides[15] to generic prototyping plat<strong>for</strong>ms [16]. <strong>Context</strong> has beensuccessfully employed in a number of ways <strong>for</strong> personalization [17& 18] and to improve retrieval from the Web [19 & 20], fromemail archives [21], from Wikipedia [22], as well as <strong>for</strong> ontologybasedalert services [23], intention recognition [24]. Diverseapproaches based on vector spaces have been proposed to enablecontext-aware in<strong>for</strong>mation retrieval [25]. Maeda identifies contextas one of his laws of simplicity, stating that “what lies in theperiphery of simplicity is definitely not peripheral” [26] transferredto IR, this statement is a clear hint that the context of a retrievaltask needs to be taken into account to make completing the task assimple as possible <strong>for</strong> the user.Lawrence [27] provides an overview of the different strategies tomake Web search context-aware. Yahoo introduced context-awaretools [20] that automatically extend the user’s query by text fromthe Web site the user is currently visiting or from the file she isworking on. Dumitrescu and Santini [28] present a context-awareIR method <strong>for</strong> documents. <strong>Context</strong>s are traces of documents theuser has been working on, which they represent as self-organizingmaps [29]. The authors point out that the advantage of thisapproach is that context does not need to be represented by logicalmeans. Semantic contexts are there<strong>for</strong>e an inevitable requirementto reason about contextual in<strong>for</strong>mation.5. CONTEXT AWARE MODIFICATIONS APPROACH TOPORTER’S ALGORITHMWe examine that the addition of more rules in order to increase theper<strong>for</strong>mance in one part of the vocabulary may cause deprivationof per<strong>for</strong>mance elsewhere. To achieve the goal, after a meticulousanalysis, new suffixes and the context in which they must beremoved have been identified and suitable changes have beenmade to the original Porter’s stemming algorithm. Altogether, 31modifications comprising of 15 additions, 11 replacements and 5deletions have been carried out so that the Porter’s stemmingalgorithm now consists of 65 rules compared to 60 in the originalalgorithm. The modifications were carried out in each step withoutchanging the order suggested by Porter. The Porter’s algorithm is abasic building block <strong>for</strong> the proposed context-aware stemmer. Thesuffix stripping method is rule-based. The following is thesummary of modification steps used by the CAS algorithm.36

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!