Filtering Search Results Using an Optimal Set of Terms Identified by an Artificial Neural Network

Tsvi Kuflik*
Department of Management Information Systems
University of Haifa, Mount Carmel, Haifa, 31905 Israel
E-mail: tsvikak@mis.hevra.haifa.ac.il

Zvi Boger
OPTIMAL - Industrial Neural Systems Ltd., Beer Sheva, 84243, Israel; and
Optimal Neural Informatics LLC, Rockville, MD, 20852, USA
E-mail: zboger@bgumail.bgu.ac.il; zvi@peeron.com

Peretz Shoval
Department of Information Systems Engineering
Ben-Gurion University, Beer Sheva, 84105 Israel
E-mail: shoval@bgumail.bgu.ac.il

Abstract

Information filtering (IF) systems usually filter data items by correlating a set of terms representing the user's interest (a user profile) with similar sets of terms representing the data items. Many techniques can be employed for constructing user profiles automatically, but they usually yield large sets of terms. Various dimensionality-reduction techniques can be applied to reduce the number of terms in a user profile. We describe a new term-selection technique that includes a dimensionality-reduction mechanism based on the analysis of a trained artificial neural network (ANN) model. Its novel feature is the identification of an optimal set of terms that can correctly classify data items that are relevant to a user. The proposed technique was compared with the classical Rocchio algorithm. We found that when all the distinct terms in the training set are used to train an ANN, the Rocchio algorithm outperforms the ANN-based filtering system; but after applying the new dimensionality-reduction technique, leaving only an optimal set of terms, the improved ANN technique outperformed both the original ANN and the Rocchio algorithm.

Keywords: Information filtering, Feature selection, User profile, Artificial neural network

* Corresponding Author


1. INTRODUCTION

When searching textual databases by keywords, a large number of the retrieved texts may not be relevant to the searcher/user. An automatic filtering system, which learns the user's preferences and filters the search results accordingly, might be useful. Information Filtering (IF) is a research area that provides tools for filtering out irrelevant information. It provides personalized assistance for continuous retrieval of information in situations of information overflow in general, and on the Internet in particular. Information filtering combines tools from the field of artificial intelligence (AI), such as intelligent agents or software robots ("softbots"), with information retrieval (IR) methods, geared to representing, indexing and retrieving content (Hanani et al. 2001; Balabanovic' & Shoham 1997; Etzioni & Weld 1994; Etzioni & Weld 1995). IF differs from traditional IR in that it deals with users who have long-term interests (information needs), expressed by means of user profiles, rather than with casual users whose needs are expressed using ad-hoc queries (Belkin & Croft 1992).

Artificial neural networks (ANN) have been used for modeling complex systems where no explicit functions correlating inputs and outputs are known (Bishop 1995), and to form predictive models from historical data. ANN have already been applied in IF and IR systems (Pal et al. 2002; Goren-Bar and Kuflik 2004). Advanced algorithms that can train large ANN models, with hundreds or thousands of inputs and outputs, are available (Guterman 1994; Boger 2002). Analysis of the trained ANN may extract useful knowledge from it (Boger & Guterman 1997). ANN modeling has been applied to the similar task of finding relevant words in e-mail messages (Boger et al. 2001).

Keyword selection for the specification of user profiles or queries is sometimes a frustrating task. Many text analysis techniques can be utilized for this task. In this paper we describe a


technique based on ANN modeling. We handle this task by training a large-scale ANN-based filter that uses all the meaningful words in the document space (i.e. the data items) as inputs, and the user-given relevance rating as output. A novel feature of the proposed technique is the analysis of the trained ANN model, which identifies the minimal set of words that can still classify data items correctly according to their relevance to the user. To test this technique we modeled and analyzed 12 sets of research-paper abstracts, each consisting of one hundred texts downloaded and rated for relevancy by a user on a subject of his interest.

The rest of this paper is structured as follows: Section 2 reviews some essential concepts in IR and IF. Section 3 provides a brief introduction to ANN modeling techniques. Section 4 describes the ANN input-reduction algorithm, and Section 5 describes possible applications of ANN to information filtering. Section 6 explains the ANN model-building technique proposed and applied in this paper, and presents the results, and Section 7 compares them with the results of the classical Rocchio algorithm (Rocchio 1971), using data from previous research. Section 8 concludes and discusses further research issues.

2. CONCEPTS IN INFORMATION RETRIEVAL AND FILTERING

Information retrieval may be characterized as "leading the user to those documents that will best enable him/her to satisfy his/her need for information" (Robertson 1981). This definition (among many others) can be described in a model of information retrieval where the user expresses his/her information need using a query, and the system conducts a search for relevant information in some data space (e.g. a database of documents).

Interesting lessons learned from IR are in three main areas: text representation, retrieval techniques, and acquisition of user information needs. The vector-space model (Salton & McGill 1983), according to which a document is represented by a (possibly weighted) vector of terms, is a very popular model in information retrieval. In this model, the user's query can be represented as a vector of keywords in a similar way. The main task of the IR system is to match the two vectors of terms and provide the user with the relevant data items that best match the query.


Various IR models enable determination of the weights of terms in documents or in queries. The classic Boolean model considers index terms to be either present or absent in a document; hence the index-term weights are all binary: 0 or 1. The Boolean model determines whether a document is relevant or not relevant; there is no partial matching. Exact matching may lead to the retrieval of too few or too many documents. In the vector-space model, non-binary weights are given to index terms in queries and/or in documents. The weights reflect the importance of terms in the query or in a document. These weights are used to compute the degree of similarity between each document and the query; thus partial matching is achieved. One well-known method to determine term weights in documents is TF*IDF - Term Frequency * Inverse Document Frequency (Salton & McGill 1983). This method assigns a weight to a term in proportion to the number of its occurrences in the document and in inverse proportion to the number of documents in which it occurs. It is based on the statistical observation that the more times a term appears in a text, the more relevant it is for representing the document, while the more documents a term appears in, the more poorly it discriminates between documents in the collection. Another model for the determination of term weights is the probabilistic model, which uses the difference in the distribution behavior of words over all documents in a collection to guide the selection of index terms (Robertson & Sparck-Jones 1976).
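As an illustration (ours, not part of the original study; the documents and query below are made up), the following Python sketch computes TF*IDF weights and ranks documents against a query vector by cosine similarity:

import math
from collections import Counter

def tfidf(docs):
    """TF*IDF: weight = term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # documents containing each term
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs], df

def cosine(u, v):
    """Degree of similarity between two sparse term vectors (partial matching)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["neural", "network", "text", "filtering"],
        ["user", "profile", "information", "filtering"],
        ["neural", "network", "training", "algorithm"]]
doc_vecs, df = tfidf(docs)
query = ["neural", "filtering"]
q_vec = {t: math.log(len(docs) / df[t]) for t in query if t in df}
for i, vec in enumerate(doc_vecs):
    print(i, round(cosine(q_vec, vec), 3))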


In IF systems, the users' long-term information needs are expressed as profiles. A content-based profile represents the user's interests/needs by a set of terms. The profile can be defined

Sebastiani 2002). The computational burden of the mathematical transformations needed for the SVM classification does not depend on the term-vector length, which is a great advantage. A possible drawback of the SVM is that the computational burden increases with the third power of the number of examples, so it may be too slow when the number of examples is large. For more information on SVM the reader is referred to Sebastiani (2002), who reviews the field of text categorization, presenting the various technologies available to the researcher and developer wishing to take up these technologies for deploying real-world applications.

3. A BRIEF INTRODUCTION TO ARTIFICIAL NEURAL NETWORK MODELING

ANN modeling is done by learning from examples. An ANN is a network of simple (sigmoidal, for example) mathematical "neurons" connected by adjustable weighted links. The most used ANN architecture is the feed-forward two-layer ANN, in which neurons are placed in one hidden layer between the data inputs and the neurons of the output layer, and information flows only from the inputs to the hidden neurons and from them to the output neurons. Training examples are presented as inputs to the ANN, which uses a "teacher" to train the model. An error is defined as the difference between the model outputs and the known "teacher" outputs. Error back-propagation algorithms adjust the initially random-valued model connection weights to decrease the error, by repeated presentations of the input vectors (Werbos 1974 and 1993; Rumelhart et al. 1986; Bishop 1995). Once the ANN is trained and verified by presenting inputs not used in the training, it is used to predict the outputs of new inputs presented to it. Examples of many types of feed-forward ANN include (Kwok 1995; Creput & Caron 1997; Kurbel et al. 1998; Yu & Liddy 1999; Yang & Liu 1999; Yang 1999; Lam & Lee 1999; Wermter 2000; Koehn 2002; Ruiz & Srinivasan 1999 and 2002).
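The two-layer architecture and the back-propagation training loop described above can be sketched in a few lines. This is a minimal illustration on made-up data, using plain gradient descent on a squared-error loss; it is not the PCA-CG/Guterman-Boger training actually used in this paper:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 5, 5                    # term inputs, hidden neurons, outputs
W_h = rng.normal(0.0, 0.5, (n_hid, n_in))       # input-to-hidden connection weights
W_o = rng.normal(0.0, 0.5, (n_out, n_hid))      # hidden-to-output connection weights
b_h, b_o = np.zeros(n_hid), np.zeros(n_out)

X = rng.choice([-1.0, 1.0], (40, n_in))         # toy term-presence inputs
T = np.eye(n_out)[rng.integers(0, n_out, 40)]   # one-hot "teacher" outputs

lr = 0.5
for _ in range(2000):
    H = sigmoid(X @ W_h.T + b_h)                # hidden-layer activations
    Y = sigmoid(H @ W_o.T + b_o)                # output-layer activations
    d_o = (Y - T) * Y * (1.0 - Y)               # output delta (squared-error loss)
    d_h = (d_o @ W_o) * H * (1.0 - H)           # error back-propagated to hidden layer
    W_o -= lr * d_o.T @ H / len(X); b_o -= lr * d_o.mean(axis=0)
    W_h -= lr * d_h.T @ X / len(X); b_h -= lr * d_h.mean(axis=0)

print("training RMS error:", np.sqrt(((Y - T) ** 2).mean()))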


connection weights from that input to all hidden neurons, making it almost constant. In both cases a constant value can be added to the bias input of the ANN after deleting this input as less relevant (Boger & Guterman 1997; Boger 2003). The detailed derivation of the input relevance calculation is given in the Appendix. The least relevant inputs may be discarded and the ANN re-trained with the reduced input set, which usually gives better prediction accuracy. The explanations for this possible improvement are: a) elimination of noise or conflicting data in the non-relevant inputs; b) reduction of the number of connection weights in the ANN, which improves the ratio of the number of examples to the number of connection weights, thus reducing the chance of over-fitting a small number of examples to a model with many parameters ("over-training").

The issue of over-training has troubled ANN model developers for a long time. It can be argued that the number of connection weights in the ANN has to be considerably smaller than the number of training examples. However, experience with real-world large systems indicates that this requirement is too conservative, as there are hidden relationships in the data that increase the generalization capacity of the trained ANN model (Boger 1997; Lawrence et al. 1997; Caruana et al. 2000).

4. ANN INPUT DIMENSION REDUCTION ALGORITHM

This section is based on Boger & Guterman (1997).

Basic statistics (Vandaele 1983) provide that the variance of a linear combination of random variables x is given by:

y = \sum_{i=1}^{n} a_i x_i = A X    (A1)
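The derivation that follows from Eq. (A1) is not reproduced in full here; for completeness, the standard identity it builds on is, in LaTeX notation:

\operatorname{Var}(y) = \operatorname{Var}\Bigl(\sum_{i=1}^{n} a_i x_i\Bigr)
  = \sum_{i=1}^{n} a_i^2 \operatorname{Var}(x_i)
  + 2 \sum_{i<j} a_i a_j \operatorname{Cov}(x_i, x_j)
  = A \Sigma A^{T}

where \Sigma is the covariance matrix of X; for uncorrelated inputs the covariance terms vanish and the variance of y reduces to the weighted sum of the input variances.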


low V_H^rel values are good candidates for elimination. Calculation of the adjustment of the bias for an output node k, compensating for the removal of hidden node j, is based on the expected value of o_j, E(o_j):

\phi'_k = \phi_k + (W_O)_{kj} E(o_j)    (A11)

where W_O represents the hidden-to-output layer weight set. A reasonable approximation of Eq. (A11) is given by:

\phi'_k = \phi_k + (W_O)_{kj} \, f\Bigl(\sum_{i=1}^{n} (W_H)_{ji} E(x_i) + \phi_j\Bigr)    (A12)

Since Eq. (A12) is approximate, additional training by PCA-CG is used to re-tune the network after hidden nodes are removed. Input variables with low V_I^rel values do not contribute much information to the network and can be eliminated. Analogous to Eq. (A12), the bias adjustment for hidden-layer node j to compensate for the removal of input i is:

\phi'_j = \phi_j + (W_H)_{ji} E(x_i)    (A13)

Procedurally, one begins by training the ANN with the whole data set, with a reasonable value of the information content used for estimating the number of hidden nodes, for example 70%. After training the ANN with the PCA-CG algorithm, the best candidates for removal according to Eqs. (A6) and (A10) are identified. The group of top candidates is removed, and the network is retrained using PCA-CG. On the retraining step, a different information content can be used in the selection of the hidden-layer architecture, according to the relative variance values of the hidden nodes. Our experience shows that optimum results are obtained when there is only one hidden node with a relative variance smaller than 10%, so a higher information content is used when the least significant hidden node has a relative variance higher than 10%, and a smaller information content is used when more than one hidden node has a relative variance smaller than 10%.
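The train-analyze-reduce cycle just described can be sketched as follows. This is a schematic reconstruction, not the authors' code: input_relevance uses a simple weight-variance proxy standing in for the V_I^rel of Eqs. (A6) and (A10), and train is any user-supplied routine (for example, the back-propagation sketch in Section 3, or PCA-CG) returning the input-to-hidden weights and the training error:

import numpy as np

def input_relevance(W_h, X):
    """Proxy for V_I^rel: the spread each input feeds into the hidden layer
    through its connection weights (a stand-in for the Appendix formulas)."""
    contrib = np.abs(W_h) * X.std(axis=0)     # |weight| scaled by input spread
    rel = contrib.sum(axis=0)                 # summed over hidden neurons
    return rel / rel.sum()

def prune_inputs(train, X, T, drop_frac=0.2, max_rounds=10):
    """Train, discard the least relevant input group, retrain; stop when the
    training error starts to rise."""
    keep = np.arange(X.shape[1])
    best_err, best_keep = np.inf, keep
    for _ in range(max_rounds):
        W_h, err = train(X[:, keep], T)
        if err > best_err:
            break                             # error increased: stop reducing
        best_err, best_keep = err, keep
        order = np.argsort(input_relevance(W_h, X[:, keep]))  # least relevant first
        n_drop = max(1, int(drop_frac * len(keep)))
        if n_drop >= len(keep):
            break
        keep = keep[np.sort(order[n_drop:])]  # retain the more relevant inputs
    return best_keep, best_err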


5. THE APPLICATION OF ANN TO INFORMATION FILTERING

The idea of matching the capabilities of ANN modeling to information retrieval is not new, and many papers deal with it. Most of them use the unsupervised self-organizing map (SOM) technique for grouping similar examples into clusters (Kohonen 1997). Thus text clusters are formed based on the similarity of keywords in the texts. Once trained, the ANN will classify new documents as belonging to one of these clusters. Recent reviews discuss ANN along with other "soft" tools for Web mining applications (Pal et al. 2002) and text categorization (Sebastiani 2002).

The basic obstacle in training ANN models is the large dimension of the inputs (terms) representing the documents. As noted, large-scale ANN models tend to get stuck in local minima during training, forcing restarts with different initial connection weights. Various techniques have been used to reduce the number of terms to a manageable one. Ahmed et al. (1999) chose the 50 terms with the highest "weirdness coefficient" from the top 5% of the most frequent terms. They defined this coefficient as the ratio of the relative term frequency in the specific corpus learned to the relative frequency of the term in natural language. Dasigi et al. (2001) used Latent Semantic Indexing (LSI) as a dimensionality-reduction method in order to make ANN training feasible. They tried several approaches for feature extraction from a collection, all by applying LSI, and even fusion of several LSI-based feature extraction approaches, in order to generate the input for training an ANN for text classification.
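The "weirdness coefficient" of Ahmed et al. (1999) is straightforward to compute. A sketch, with a small hypothetical general-language frequency table standing in for a real reference corpus:

from collections import Counter

def weirdness(corpus_tokens, ref_freq, ref_total):
    """Ratio of a term's relative frequency in the specific corpus to its
    relative frequency in general natural language (Ahmed et al. 1999)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {t: (c / total) / (ref_freq.get(t, 1) / ref_total)  # add-one for unseen terms
            for t, c in counts.items()}

ref = {"the": 70000, "network": 120, "neural": 15, "filtering": 8}  # hypothetical counts
scores = weirdness(["neural", "network", "filtering", "the", "neural"], ref, 1000000)
top_terms = sorted(scores, key=scores.get, reverse=True)[:50]       # 50 "weirdest" terms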


Yu and Liddy (1999) employed a genetic algorithm, coupled with the "Baldwin effect", to choose candidates for the reduced term vector. Wermter (2000) reduced the number of terms according to their significance value, defined as the ratio of the frequency of word w in semantic class c to the sum of the frequencies of this word in all classes. Other researchers used the well-known TF*IDF index to select the reduced set of terms (Tomsich et al. 2000; Rauber et al. 2000), while Vlajic and Card (1999) deleted from the TF*IDF terms those with very high or very low correlation. As the Guterman-Boger algorithm can easily train ANN with thousands of inputs, no such term reduction is necessary for the initial ANN model (Boger et al. 2001).

The ability of the ANN to model non-linear, non-obvious relationships can be applied to matching the textual features (inputs to the ANN) to the user relevance rating (ANN outputs). When statistical methods are applied to this modeling task, subjective selections of the number of terms and of the form of the model equations must be made. No such assumptions are needed in ANN modeling.

Supervised ANN should be preferred over unsupervised approaches like SOM, as it is more adjustable to an individual user's profile. SOM may classify texts according to their similarity, but eventually the user has to evaluate the number of clusters (too few or too many) and rank the clusters according to their degree of interest, as demonstrated by Goren-Bar and Kuflik (2004).

The most important feature of ANN modeling is that the user need not specify which features (such as keywords) to extract from the text. If the ANN can use all the words in the text as inputs, the post-training ANN analysis will reveal, according to the user profile, which words in the text are the more relevant ones. This avoids the frustration of getting too many responses to a general query, or the suspicion of missing important results because of a too-narrow subjective


selection of keywords. The trained ANN should then act as a sorter, evaluating each additional text according to the user-profile requirements.

In Section 6 we demonstrate the feasibility of ANN modeling and keyword extraction, predicting the importance ranking of unseen texts based on an ANN trained on texts ranked by the user. We employ a database used in previous modeling of user profiles (Kuflik 2003).

6. TRAINING AN ANN AS AN INFORMATION FINDER

In order to use a categorization mechanism such as ANN for document filtering, an appropriate document representation is required. In our case we used a binary vector representation of terms to represent the documents. The dataset we used contained 12 sets of 100 documents each, which were returned as results of search queries issued at academic publication repositories: EconLit, which contains abstracts, indexing, and links to full-text articles in economics journals, and abstracts books and indexes articles in books, working-paper series, and dissertations (EconLit); Geobase, which covers worldwide literature on geography, geology, and ecology (Geobase); and INSPEC, which provides scientific and technical literature in physics, electrical engineering, electronics, communications, control engineering, computers, computing, information technology, manufacturing, production and mechanical engineering (INSPEC). These documents are abstracts of academic publications in specific areas (such as Software Engineering, Machine Learning, etc.). Such databases are used by researchers as sources of information in their daily research work. Several researchers were asked to define search queries in their areas of expertise. The search queries (presented in Table 1) were used to query the repositories, and the first 100 results of every set (abstracts of papers) were returned to the researchers, who judged them for relevancy on a 1-5 scale, 1 meaning least relevant and 5 most relevant.


Trivial words in the text were discarded and the rest were stemmed and counted. For the ANN modeling, words in the bottom and top 5% of the count distribution were discarded, and the rest were used to form a binary vector, where 1 signifies the presence of a word in the text. The average length of the resulting vector was 805 terms. Thus, an ANN model was trained with the word-presence vector as input, and with five hidden neurons and five binary outputs. The ANN target for a document is a 5-bit binary vector with 1 at the appropriate relevance-ranking position and 0's at the other positions.

The ANN was trained with the Guterman-Boger set of algorithms described in the earlier sections. The trained ANN model was analyzed to identify the more relevant inputs, which were then used to train another, smaller ANN. The training and input-reduction process was repeated until we noticed an increase in the prediction error on the training example set.

Table 1: Search Queries

Set  Query
1    teleoperations or telerobotics or 'mobile robot' or 'remote control'
2    'intelligent information retreival' or 'information filtering' or 'data mining' or 'topic trees'
3    'hand gesture' or 'multimodal sign language gestures' or 'image motion' or 'motion from video' or 'video motion'
4    geographical and information and systems and ((human and geography) or (social and sciences) or (study and (urban or rural)) or (remote and sensing))
5    'virtual reality' or viewpoint or vrml or 'virtual worlds' or 'active worlds' or '3d models' or '3d worlds' or 'distributed worlds' or 'multi user distributed systems' or 'group collaboration' or 'augmented reality'
6    'executive information systems'
7    visualization
8    'machine learning' or 'data categorization' or 'text categorization' or 'cluster analysis algorithms'
9    conceptual modeling or traceability or case tools
10   'software engineering' and methodology
11   'data models' or 'database design' or 'entity relationship'
12   'functional analysis' or 'functional design' or 'object oriented analysis' or 'object oriented design' or 'analysis and design of information systems' or 'information systems development'

To evaluate the effect of the number of training examples, four different ANNs were trained for each data set, with 20, 40, 60 and 80 examples. The last 20 examples were always used as a validation set, i.e. they were not used in the training but were presented at the end of the training as "new" examples to assess the reliability of the current ANN model's predictions.
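The pre-processing described above (dropping words in the bottom and top 5% of the count distribution, forming binary presence vectors, and encoding the 1-5 rating as a 5-bit target) might look as follows; this is a schematic reconstruction, not the code actually used:

import numpy as np
from collections import Counter

def build_binary_vectors(docs, ratings, low=0.05, high=0.95):
    """Binary presence vectors over words between the bottom and top 5% of
    the count distribution, with 5-bit one-hot relevance targets."""
    counts = Counter(w for doc in docs for w in doc)
    freqs = np.array(sorted(counts.values()))
    lo = freqs[int(low * (len(freqs) - 1))]   # approximate percentile cut-offs
    hi = freqs[int(high * (len(freqs) - 1))]
    vocab = sorted(w for w, c in counts.items() if lo <= c <= hi)
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in zip(X, docs):
        for w in doc:
            if w in index:
                row[index[w]] = 1.0           # 1 signifies presence of the word
    T = np.eye(5)[np.array(ratings) - 1]      # rating 1..5 -> one-hot 5-bit target
    return X, T, vocab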


The gathering and pre-processing of the training and testing data is the first phase (usually the most time-consuming one) of ANN modeling. From previous research (Kuflik 2003) we obtained twelve users' relevancy judgments of abstracts of research papers, on a 1-5 scale. We considered the values 1, 2 and 3 as non-relevant, and the values 4 and 5 as relevant. The words in the abstracts were already stemmed, and the common words had been removed by a stop-list. We also had the filtering results of these sets for Rocchio-based filtering. Rocchio (1971) is a classical, well-known algorithm for building a user profile in the vector-space model, and thus we used these results as a reference in our research. These results are discussed and presented later, in Section 7. (For more details, see Kuflik 2003.)

In this research the aim was twofold: a) to predict the relevancy of abstracts, and b) to evaluate the ability of the ANN to identify important keywords for a user profile. All the stemmed words in an abstracts group were combined into one "keyword" list, whose length varied between 449 and 1092 terms (805 terms on average). Each abstract was transformed into a binary vector of 1's and 0's, where 1's signify the presence of a word in the abstract. The data pre-processing to the form used in the ANN training consisted of changing the 1 and 0 binary inputs to +1 and -1 values, respectively, and adding a small random noise value to them. This was done in order to avoid having an input column vector of constant -1 values. The [0,1] binary outputs were transformed into [0.1,0.9] values, avoiding the slow asymptotic approach to the target during training.

The Guterman-Boger algorithm set trained four fully connected ANN models with five neurons in the hidden layer, as this is the typical number needed for real-life data to achieve good prediction rates for each user.
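The input and output transformations just described can be written directly; the noise level below is an assumed value, as the paper does not specify it:

import numpy as np

def prepare_for_training(X, T, noise=0.01, seed=0):
    """Map binary 0/1 inputs to -1/+1 plus a small random noise value (so no
    input column is exactly constant), and squash 0/1 targets to 0.1/0.9 to
    avoid the slow asymptotic approach to the sigmoid's limits."""
    rng = np.random.default_rng(seed)
    X_t = 2.0 * X - 1.0 + rng.normal(0.0, noise, X.shape)
    T_t = 0.1 + 0.8 * T
    return X_t, T_t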


As said, the training was done with 20, 40, 60, and 80 examples, setting aside the last 20 examples as a validation set. Each model was initially trained with the full word vectors, and then with a repeatedly reduced number of inputs, until no further reduction was possible.

Figure 1: The effect of reducing the number of terms on the training and validation classification errors

The "optimal" set of inputs was selected by observing the plots of the training and validation errors as a function of the decreasing number of words used as inputs (see Figure 1).


Each panel in Figure 1 plots the change in the root mean square error of a particular (rank #) ANN output of a data set (representing the error in predicting each level of relevancy between 1 and 5). The training-example errors are marked by "*", and the validation-example errors are marked by "o". The x-scale shows the inverse of the number of input terms (so the results of the reduction process are presented from left to right). The numbers on the plot are the numbers of terms in the current ANN model. As can be seen, the progressive reduction of the original 692 terms does not have a significant effect on the training error until 12 terms (8×10^-2 on the x-scale). Thus the ANN model with 12 terms should be selected. There is a beneficial effect on some of the validation classification errors. For example, the first point pair in the lower right panel shows the prediction error of the 4th ranking when the full 632-term set is used for training the ANN. The validation error progressively decreases as the number of terms used for the ANN model training is automatically reduced, while the training error remains at essentially the same low value. When the reduced term length is less than about 30 (3×10^-2 on the x-scale), the validation error increases. This demonstrates our claim that the reduction of the terms used in the ANN model is both feasible and useful for text categorization.

7. ANN AS PREDICTOR OF TERM IMPORTANCE

As mentioned above, the trained ANN was used to filter the validation sets. The results of the ANN filtering are presented in Tables 2 and 3, while Table 4 presents the Rocchio results. Figure 2 provides a graphical representation of the overall average performance using the "F" measure (van Rijsbergen 1979). Common measures of the performance of IF systems are "precision" and "recall", where precision is the ratio of relevant documents retrieved (as judged by the user) out of the total number of documents retrieved by the system, and recall is the ratio of relevant documents retrieved out of all relevant documents available in the collection. The "F" measure that we used combines precision and recall as follows: F = 2 / (1/Precision + 1/Recall).
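In code, for a single set of retrieved and relevant document identifiers (an illustration with made-up sets):

def f_measure(retrieved, relevant):
    """Harmonic mean of precision and recall: F = 2 / (1/P + 1/R)."""
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)   # relevant fraction of what was retrieved
    recall = hits / len(relevant)       # retrieved fraction of what was relevant
    return 2.0 / (1.0 / precision + 1.0 / recall)

print(f_measure({1, 2, 3, 4}, {2, 3, 5}))   # P = 0.5, R = 2/3, F = 0.571...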


Table 2 presents the filtering results of the original (large) ANN. The "Original precision" row presents the ratio of relevant documents in every set, giving an idea of the quality of that set in general. The next four rows present the filtering results after ANN training with 20, 40, 60 and 80 documents.

Table 2: Filtering performance ("F") as a function of training set size - original ANN

Set                 1     2     3     4     5     6     7     8     9     10    11    12    Avg.
Original precision  0.49  0.46  0.45  0.36  0.29  0.27  0.26  0.19  0.19  0.18  0.12  0.11
20 examples         0.4   0.36  0.59  0     0.32  0     0     0     0     0     0     0     0.14
40 examples         0.5   0.67  0.73  0.2   0     0     0     0     0     0.29  0     0     0.20
60 examples         0.62  0.59  0.63  0     0.48  0     0     0     0     0.25  0     0     0.21
80 examples         0.5   0.63  0.67  0     0.7   0     0     0     0     0.29  0     0     0.19

As can be seen from Table 2, increasing the number of training examples does not much improve the models' performance. Paired T-tests were performed between every two consecutive set sizes; the results are: p=0.12 between filtering based on 20 examples and filtering based on 40 examples; p=0.38 between filtering based on 40 examples and filtering based on 60 examples; and p=0.5 between filtering based on 60 examples and filtering based on 80 examples. These paired T-tests reveal that there is no significant difference between the results based on the different sizes of training sets. However, there is a significant difference between filtering based on 80 examples and filtering based on 20 examples (p=0.00016), and also between filtering based on 60 examples and filtering based on 20 examples (p=0.036). The difference between filtering based on 80 examples and filtering based on 40 examples is not significant, though (p=0.084). This analysis shows that there is a slight improvement, as can be noticed in the


table and in the graphical presentation, but in general the overall filtering performance of the original ANN seems to be poor: low values of "F", below an average of 0.2 for all training set sizes.

Table 3 presents the results of filtering by the optimal ANN. The term-reduction process described in Section 4 yielded a different ANN at every step, trained with vectors containing fewer and fewer terms for every set. The optimal result achieved by a specific ANN is the best filtering result found for every case. (For every data set and for every size of training set, an optimal ANN with a different set of terms was found; this is discussed in detail later on.)

Table 3: Filtering performance ("F") as a function of training set size - optimal ANN

Set                 1     2     3     4     5     6     7     8     9     10    11    12    Avg.
Original precision  0.49  0.46  0.45  0.36  0.29  0.27  0.26  0.19  0.19  0.18  0.12  0.11
20 examples         0.67  0.86  0.94  0.57  0.72  0.25  0.56  0.67  0.25  0.18  0.50  0     0.51
40 examples         0.62  0.78  0.78  0.50  0.9   0.44  0.3   0.67  0.25  0.29  0.50  0     0.50
60 examples         0.62  0.84  0.84  0.35  0.93  0.36  0.46  0.55  0.29  0.50  0.57  0.4   0.56
80 examples         0.50  0.75  0.95  0.57  0.86  0.40  0.19  0.73  0.44  0.50  0.57  0.57  0.59

As can be seen from comparing Tables 2 and 3, the optimal ANNs performed better than the original ANNs in each and every case (with a few exceptions where the results were identical). Moreover, paired T-tests show that the differences between the optimal and the original ANNs are significant for every level of training (training set size). The optimal ANN performed significantly better than the original ANN: paired T-tests yielded p < 0.00 for every size of training set on average (over all sets). Comparing the filtering performance of the individual sets, paired T-tests reveal that the differences are significant for almost every set (with the exception of the first and last sets, where for the first set the differences are slight, and for the last set there are no results at the first steps).

Table 4 presents the results of Rocchio-based filtering. (Only 20, 40 and 60 examples were used for training, while the next 20 were used for threshold adaptation and the last 20 for validation, as in our case.)


Table 4: Filtering performance ("F") as a function of training set size - Rocchio

Set                 1     2     3     4     5     6     7     8     9     10    11    12    Avg.
Original precision  0.49  0.46  0.45  0.36  0.29  0.27  0.26  0.19  0.19  0.18  0.12  0.11
20 examples         0.55  0.75  0.73  0.41  0.61  0.23  0.29  0.33  0.4   0.25  0.0   0.28  0.4
40 examples         0.63  0.89  0.80  0.4   0.50  0.12  0.0   0.55  0.52  0.19  0.35  0.31  0.44
60 examples         0.71  0.86  0.80  0.53  0.75  0.23  0.40  0.38  0.43  0.15  0.41  0.0   0.44

Looking at Tables 2, 3 and 4, we can see that Rocchio outperforms the full ANN in most cases, but the optimal ANN outperforms Rocchio on average. Paired T-tests reveal that these differences are significant (p < 0.00).
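Paired comparisons of this kind can be reproduced with a standard paired T-test over the per-set scores. A sketch assuming SciPy is available, pairing the 20-example rows of Tables 3 and 2 over the same 12 sets:

from scipy.stats import ttest_rel

# F scores of the optimal and the original ANN (20-example rows of Tables 3 and 2)
f_optimal = [0.67, 0.86, 0.94, 0.57, 0.72, 0.25, 0.56, 0.67, 0.25, 0.18, 0.50, 0.0]
f_original = [0.40, 0.36, 0.59, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

t_stat, p_value = ttest_rel(f_optimal, f_original)   # paired: same sets, two methods
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")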


Figure 2 also presents the filtering results of the classical Rocchio algorithm, taken from Kuflik (2003). (Only 20, 40 and 60 examples were used for training, while the next 20 were used for threshold adaptation and the last 20 for validation, as in our case.) Note that Rocchio-based filtering outperforms the full ANN (for the 20, 40 and 60 example training sets used), but the optimal ANN outperforms both. Here too, in both cases, paired T-tests revealed that the differences are significant (p < 0.00).


The whole iterative process of training an ANN and reducing its term vector down to the least number of terms, for all four sizes of training example sets, takes about an hour on a 2.4 GHz PC in an interpretive MATLAB environment, and would be faster in a compiled environment. Paired T-tests show that the differences between the sizes of the optimal vectors and the original vectors, in every case (every training set size) and on average, are significant (p < 0.00 in all four cases and for the average as well).

8. CONCLUSIONS AND SUGGESTIONS FOR FURTHER RESEARCH

The results presented in Sections 6 and 7 show that the initial large ANN model, trained on the non-trivial words in a text, gives inferior predictions relative to a classical statistically derived model (Rocchio). However, the iterative term-reduction process, which results in reduced term vectors and improved performance, significantly outperforms both the original ANN model and the classical statistical Rocchio model. The reduction process does not affect the training classification until a very low number of terms is reached, typically between ten and thirty terms, as discussed in Section 6 and demonstrated in Figure 1. The term-reduction process sometimes improves the validation classification errors. The "optimal" number of terms is, on average, less than 10% of the original number of terms.

The effect of the number of training examples on the filtering efficiency shows a different pattern, depending on the example sets. For some of them the filtering is effective even with only 20 examples, and this high efficiency does not increase much with additional training examples. This was typical for sets where more than 30% of the documents were relevant (sets 1-4; see the first row of Table 2). On the other hand, in sets with less than 20% relevant documents (sets 8-12; see the first row of Table 2) there is an improvement with the number of examples.


Several important capabilities are made possible by the success of the "optimal" ANN classification. The identified "relevant" words can be analyzed by the "causal index" algorithm (Baba et al. 1990) to identify the effect of each of these words on the user preferences, and thus be used as an "intelligent" keyword set. It may be possible to use a small set of training examples to generate a useful sorting ANN model that can be incrementally re-trained with the results of the evaluation of more texts downloaded by the intelligent keyword set.

Another interesting avenue for future research is the comparison of the efficiency of the reduced-ANN-based IR with the SVM algorithm. It is not yet clear whether the SVM can efficiently classify more than two groups, or learn the minimal set of discriminating terms.

9. REFERENCES

Ahmed, K., Bale, T.A. & Burford, D. (1999). Text classification and minimal-bias training vectors. Proceedings of the International Joint Conference on Neural Networks, 4, 2816-2819, Washington, DC.
Baba, K., Enbutu, I. & Yoda, M. (1990). Explicit representation of knowledge acquired from plant historical data using neural network. Proceedings of the International Joint Conference on Neural Networks, (3), 155-160, San Diego, California.
Balabanovic', M. & Shoham, Y. (1997). Fab: content-based, collaborative recommendation. Communications of the ACM, 40(3), 66-72.
Belkin, N.J. & Croft, W.B. (1992). Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, 35(12), 29-38.
Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Clarendon Press.
Boger, Z. (1992). Application of neural networks to water and wastewater treatment plant operation. Transactions of the Instrument Society of America, 31(1), 25-33.
Boger, Z. (1997). Experience in industrial plant model development using large-scale artificial neural networks. Information Sciences - Applications, 101(3/4), 203-215.


Boger, Z. & Guterman, H. (1997). Knowledge extraction from artificial neural network models. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, SMC'97, 3030-3035, Orlando, Florida.
Boger, Z., Kuflik, T., Shoval, P. & Shapira, B. (2001). Automatic keyword identification by artificial neural networks compared to manual identification by users of filtering systems. Information Processing & Management, 37(2), 187-198.
Boger, Z. (2002). Who is afraid of the big bad ANN? Proceedings of the International Joint Conference on Neural Networks, IJCNN'02, 2000-2005, Hawaii.
Boger, Z. (2003). Selection of quasi-optimal inputs in chemometrics modeling by artificial neural network analysis. Analytica Chimica Acta, 490(1-2), 31-40.
Booker, A., Condliff, M., Greaves, M., Holt, F., Kao, A., Pierce, D.J., Poteet, S. & Wu, Y.J. (1999). Visualizing text data sets. Computing in Science & Engineering, 1(4), 26-35.
Caruana, R., Lawrence, S. & Giles, L. (2000). Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. Neural Information Processing Systems, Denver, CO.
Creput, J.C. & Caron, A. (1997). An information retrieval system using a new neural network model. Cybernetica, XL(2), 127-139.
Dasigi, V., Mann, R. & Protopopescu, V. (2001). Information fusion for text classification - an experimental comparison. Pattern Recognition, 34, 2413-2425.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Drucker, H., Shahraray, B. & Gibbon, D.C. (2002). Support vector machines: relevance feedback and information retrieval. Information Processing & Management, 38, 305-323.
EconLit, http://www.econlit.org/, accessed February 2005.
Egecioglu, O. & Ferhatosmanoglu, H. (2000). Dimensionality reduction and similarity computation by inner product approximations. Proceedings of the 9th International Conference on Information and Knowledge Management, CIKM 2000, 219-226, McLean, Virginia.


Etzioni, O. & Weld, D. (1994). A softbot-based interface to the Internet. Communications of the ACM, 37(7), 72-76.
Etzioni, O. & Weld, D. (1995). Intelligent agents on the Internet: fact, fiction and forecast. IEEE Expert, 10(4), 44-49.
Furnas, G.W., Landauer, T.K., Gomez, L.M. & Dumais, S.T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11), 964-971.
GEOBASE, http://www.oclc.org/support/documentation/firstsearch/databases/dbdetails/details/GEOBASE.htm, accessed February 2005.
Goren-Bar, D. & Kuflik, T. (2004). Supporting users' subjective categorization with SOM and LVQ. Journal of the American Society for Information Science and Technology, 56(4), 345-356.
Greenberg, S. & Guterman, H. (1996). Neural network classifiers for automatic real-world image recognition. Applied Optics, 35, 4598-4609.
Guterman, H. (1994). Application of principal component analysis to the design of neural networks. Neural, Parallel and Scientific Computing, 2, 43-54.
Hanani, U., Shapira, B. & Shoval, P. (2001). Information filtering: overview of issues, research and systems. User Modeling and User-Adapted Interaction, 11(3), 203-259.
Hull, D. (1994). Improving text retrieval for the routing problem using latent semantic indexing. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 282-291, Dublin, Ireland.
INSPEC, http://www.iee.org/Publish/INSPEC/, accessed February 2005.
Koehn, P. (2002). Combining multi-class maximum entropy text classifiers with neural network voting. In: Ranchhod & Mamede (Eds.), Advances in Natural Language Processing. Springer-Verlag, Berlin, 125-133.
Kohonen, T. (1997). Exploration of very large databases by self-organizing maps. Proceedings of the IEEE International Conference on Neural Networks, 1, PL1-6, Houston, Texas.


Kuflik, T. (2003). Methods for Definition of Content-Based and Rule-Based User Profiles in Information Filtering Systems. PhD dissertation, Ben-Gurion University of the Negev.
Kurbel, K., Singh, K. & Teuteberg, F. (1998). Search and classification of 'interesting' business applications in the World Wide Web using a neural network approach. In: Forcht, K. (Ed.), Proceedings of the 1998 IACIS Conference, 75-81, Cancun, Mexico.
Kwok, K.L. (1995). A network approach to probabilistic information retrieval. ACM Transactions on Information Systems, 13(3), 324-353.
Karypis, G. & Han, E.H. (2000). Fast dimensionality reduction algorithm with applications to document retrieval & categorization. Proceedings of the 9th ACM International Conference on Information and Knowledge Management, 12-19, McLean, VA.
Lam, S.L. & Lee, D.L. (1999). Feature reduction for neural network based text categorization. Proceedings of the 6th International Conference on Database Systems for Advanced Applications, 195-202, Hsinchu, Taiwan.
Lawrence, S., Giles, C.L. & Tsoi, A.C. (1997). Lessons in neural network training: overfitting may be harder than expected. Proceedings of the 14th National Conference on Artificial Intelligence, AAAI-97, 540-545, Menlo Park, California.
Pal, S.K., Talwar, V. & Mitra, P. (2002). Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transactions on Neural Networks, 13(5), 1163-1177.
Rauber, A., Schweighofer, E. & Merkl, D. (2000). Text classification and labeling of document clusters with self-organising maps. Journal of the Austrian Society for Artificial Intelligence (ÖGAI), 13(3), 17-23.
Robertson, S.E. (1981). The methodology of information retrieval experiment. In: Sparck Jones, K. (Ed.), Information Retrieval Experiment, Butterworth, Ch. 1, 9-31.
Robertson, S.E. & Sparck-Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146.


Rocchio, J.J. (1971). Performance indices for document retrieval. In: Salton, G. (Ed.), The SMART Retrieval System - Experiments in Automatic Document Processing, Englewood Cliffs, NJ, 57-67.
Ruiz, M.E. & Srinivasan, P. (1999). Hierarchical neural networks for text categorization. Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, 281-282, Berkeley, California.
Ruiz, M.E. & Srinivasan, P. (2002). Hierarchical text categorization using neural networks. Information Retrieval, 5(1), 87-118.
Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Salton, G. & McGill, M.J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.
Sarle, W.S. (2004). Frequently asked questions. comp.ai.neural-nets users group, ftp://ftp.sas.com/pub/neural, accessed March 2005.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.
Tomsich, P., Rauber, A. & Merkl, D. (2000). ParSOM: using parallelism to overcome memory latency in self-organizing neural networks. Proceedings of the 8th European Conference on High-Performance Computing and Networking (HPCN Europe 2000), 136-146, Amsterdam, The Netherlands.
Vandaele, W. (1983). Applied Time Series and Box-Jenkins Models. Academic Press.
van Rijsbergen, C.J. (1979). Information Retrieval. Butterworths.
Vlajic, N. & Card, H.C. (1999). Categorizing Web pages on the subject of neural networks. Journal of Network and Computer Applications, 21(2), 91-105.
Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD dissertation, Committee on Applied Mathematics, Harvard University.


Werbos, P. (1993). Roots of Back-Propagation: From Ordered Derivatives to Neural Networks and Political Forecasting. John Wiley and Sons, Inc.
Wermter, S. (2000). Neural network agents for learning semantic text classification. Information Retrieval, 3, 87-103.
Yang, Y. & Liu, X. (1999). A re-examination of text categorization methods. Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, 42-49, Berkeley, California.
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2), 69-90.
Yu, E.S. & Liddy, E.D. (1999). Feature selection in text categorization using the Baldwin effect. Proceedings of the International Joint Conference on Neural Networks, 4, 2924-2927, Washington, DC.
