Using Information Gain Attribute Evaluation to Classify ... - Telfor 2009

Using Information Gain Attribute Evaluation to Classify ... - Telfor 2009

Using Information Gain Attribute Evaluation to Classify ... - Telfor 2009

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

classified as a wrapper, because in this case the classifier algorithm is wrapped in the loop. On the contrary, filter methods do not rely on the classifier algorithm, but use other criteria based on correlation notions. Without a suitable stopping criterion, the feature selection process may run exhaustively before it stops. A feature selection process may stop under one of the following reasonable criteria: (1) a predefined number of features is selected, (2) a predefined number of iterations is reached, (3) the addition (or deletion) of any feature does not produce a better subset, (4) an optimal subset according to the evaluation criterion is obtained. The selected best feature subset needs to be validated by carrying out different tests on both the selected subset and the original set and comparing the results, using artificial data sets and real-world data sets.

III. INFORMATION GAIN ATTRIBUTE EVALUATION

Diverse feature ranking and feature selection techniques have been proposed in the machine learning literature. The purpose of these techniques is to discard irrelevant or redundant features from a given feature vector. The following attribute evaluators are used: IG, gain ratio, symmetrical uncertainty, relief-F, one-R and chi-squared. In this paper, we consider the practical usefulness of IG attribute evaluation.

Entropy is a measure commonly used in information theory that characterizes the purity of an arbitrary collection of examples; it is the foundation of IG attribute ranking methods. The entropy measure can be viewed as a measure of the system's unpredictability. The entropy of Y is

H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)    (1)

where p(y) is the marginal probability density function for the random variable Y. If the observed values of Y in the training data set S are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between features Y and X. The entropy of Y after observing X is

H(Y | X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)    (2)

where p(y|x) is the conditional probability of y given x. Given the entropy as a criterion of impurity in a training set S, we can define a measure reflecting the additional information about Y provided by X, that is, the amount by which the entropy of Y decreases. This measure is known as IG and is given by

IG = H(Y) - H(Y|X) = H(X) - H(X|Y)    (3)

IG is a symmetrical measure (see equation (3)): the information gained about Y after observing X is equal to the information gained about X after observing Y. A weakness of the IG criterion is that it is biased in favor of features with more values, even when they are not more informative.
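As a concrete illustration of equations (1)-(3), the following minimal Python sketch estimates H(Y), H(Y|X) and IG from observed frequencies. It assumes a discrete (or already discretized) feature; the function names, variable names and toy data are illustrative only and do not come from the paper.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), estimated from label frequencies."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """IG = H(Y) - H(Y|X) for a discrete feature X and class labels Y."""
    total = len(labels)
    h_y = entropy(labels)
    h_y_given_x = 0.0
    for x in set(feature_values):
        # Conditional entropy term p(x) * H(Y | X = x)
        subset = [y for xv, y in zip(feature_values, labels) if xv == x]
        h_y_given_x += (len(subset) / total) * entropy(subset)
    return h_y - h_y_given_x

# Tiny hypothetical example: one discrete feature and two class labels.
X = ["a", "a", "b", "b", "b", "a"]
Y = ["M", "M", "R", "R", "M", "M"]
print(information_gain(X, Y))   # about 0.459 bits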
IV. C4.5 DECISION TREE

Different methods exist to build decision trees, but all of them summarize the given training data in a tree structure, with each branch representing an association between feature values and a class label. One of the most famous and representative among them is the C4.5 decision tree [20]. The C4.5 decision tree works by recursively partitioning the training data set according to tests on the potential of feature values in separating the classes. The decision tree is learned from a set of training examples through an iterative process of choosing a feature and splitting the given example set according to the values of that feature. The most important question is which of the features is the most influential in determining the classification and hence should be chosen first. Entropy measures, or equivalently information gains, are used to select the most influential feature, which is intuitively deemed to be the feature with the lowest entropy (or the highest information gain). The learning algorithm works by: a) computing the entropy measure for each feature, b) partitioning the set of examples according to the possible values of the feature that has the lowest entropy, and c) repeating the process for each resulting partition. Frequencies in the partitions are used to estimate probabilities, in exactly the same way as with the Naive Bayes approach. Although feature tests are chosen one at a time in a greedy manner, they are dependent on the results of previous tests.

V. EXPERIMENTS AND RESULTS

The Connectionist Bench (Sonar, Mines vs. Rocks) data set, taken from the UCI repository of machine learning databases [20], was used for IG attribute evaluation with the C4.5 decision tree. This is the data set used by Gorman and Sejnowski in their study of the classification of sonar signals using a neural network [21]. The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

This data set contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions, and 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0, where each number represents the energy within a particular frequency band, integrated over a certain period of time.

If the object is a rock, the label associated with each record contains the letter "R", and if it is a mine (metal cylinder), the letter "M". The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.

Fig. 1 shows a sample return from the rock and the cylinder. The preprocessing of the raw signal was based on experiments with human listeners. The temporal signal was first filtered, and spectral information was extracted and used to represent the signal on the input layer.
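The experimental setup described in this section could be approximated roughly as follows. This is a sketch under stated assumptions, not the authors' actual pipeline: scikit-learn's mutual_info_classif is used as a stand-in for IG attribute evaluation on the continuous sonar features, DecisionTreeClassifier (CART) stands in for C4.5, and "sonar.csv" is an assumed local copy of the UCI data (60 numeric columns plus a class column).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Assumed local copy of the UCI Connectionist Bench (Sonar) data:
# 60 numeric columns (energy per frequency band) plus a class column "R"/"M".
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, :60].to_numpy()
y = data.iloc[:, 60].to_numpy()

# Rank features by an information-based score (stand-in for IG ranking).
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]          # best feature first

# Baseline: decision tree on all 60 features, ten-fold cross-validation.
tree = DecisionTreeClassifier(random_state=0)
baseline = cross_val_score(tree, X, y, cv=10).mean()

# Same classifier on the top-k ranked features only.
k = 10
selected = cross_val_score(tree, X[:, ranking[:k]], y, cv=10).mean()

print(f"all 60 features : {baseline:.3f}")
print(f"top {k} features : {selected:.3f}")
```

Because of the different tree learner and ranking estimator, the accuracies will not match the figures reported below exactly; the sketch only mirrors the structure of the experiment.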


Fig. 1. Amplitude displays of a typical return from the cylinder and the rock as a function of time [21].

The preprocessing used to obtain the spectral envelope is indicated schematically in Fig. 2, where a set of sampling apertures (Fig. 2a) is superimposed over the 2D display of a short-term Fourier transform spectrogram of the sonar return. The spectral envelope, P_{t0,v0}(η), was obtained by integrating over each aperture (Fig. 2b and c).

Fig. 2. The preprocessing of the sonar signal produces a sampled spectral envelope. (a) The set of sampling apertures offset temporally to correspond to the slope of the FM chirp, (b) sampling apertures superimposed over the 2D display of the short-term Fourier transform, (c) the spectral envelope obtained by integrating over each sampling aperture [21].

A supervised learning algorithm, the C4.5 decision tree, is adopted here to build the model. The purpose of the experiments described in this section is to empirically test the claim that IG attribute evaluation can improve the accuracy of the C4.5 decision tree classification algorithm. The performance of the learning algorithm with and without feature selection is taken as an indication of the success of IG attribute evaluation in selecting useful features, because the relevant features are often not known in advance for natural domains. Classification accuracy was estimated using ten-fold cross-validation.

Fig. 3. Classification accuracy of the C4.5 decision tree with IG attribute evaluation (accuracy versus number of selected features).

Fig. 3 shows, for the C4.5 decision tree, how much the accuracy on this data set was improved or degraded by IG attribute evaluation. IG attribute evaluation maintains or improves the accuracy of the C4.5 decision tree when more than 9 relevant features are used, and degrades, maintains or improves its accuracy when fewer than 9 relevant features are used. With IG attribute evaluation, the accuracy of the C4.5 decision tree improves by more than 10% on this data set. The evaluation of selected features is fast. (A rough sketch of this sweep is given at the end of the section.)

TABLE 1: GENERATED DECISION RULES

Number of most relevant features    Number of leaves    Size of tree
60 - 52                             18                  35
51 - 49                             17                  33
48 - 36                             18                  35
35 - 34                             17                  33
33 - 33                             19                  37
32 - 29                             16                  31
28 - 22                             17                  33
21 - 21                             18                  35
20 - 18                             19                  37
17 - 17                             20                  39
16 - 14                             18                  35
13 - 13                             19                  37
12 - 11                             20                  39
10 - 10                             23                  45
9 - 9                               21                  41
8 - 8                               19                  37
7 - 5                               14                  27
4 - 4                               8                   15
3 - 1                               2                   3

The C4.5 decision tree without feature selection generates 18 rules, and the size of the tree is 35. Table 1 shows that IG attribute evaluation changes the size of the trees induced by the C4.5 decision tree, depending on the number of most relevant features used. The rules obtained for this data set by the C4.5 decision tree without feature selection are:

If f_11 =
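The sweep behind Fig. 3 and Table 1 could be reproduced along the following lines. This is again a sketch: scikit-learn's CART tree and mutual-information ranking substitute for the paper's C4.5 and IG attribute evaluation, so the exact accuracies and tree sizes will differ; the leaf and node counts of the fitted sklearn tree are used as rough analogues of "number of leaves" and "size of tree", and "sonar.csv" is the same assumed local copy of the UCI data as above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed local copy of the UCI sonar data (see the previous snippet).
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, :60].to_numpy()
y = data.iloc[:, 60].to_numpy()
ranking = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

# Sweep the number of top-ranked features, as in Fig. 3 and Table 1.
for k in range(60, 0, -3):
    Xk = X[:, ranking[:k]]
    tree = DecisionTreeClassifier(random_state=0)
    acc = cross_val_score(tree, Xk, y, cv=10).mean()   # ten-fold cross-validation
    tree.fit(Xk, y)                     # fit once on all data to inspect tree size
    leaves = tree.get_n_leaves()        # rough analogue of "number of leaves"
    nodes = tree.tree_.node_count       # rough analogue of "size of tree"
    print(f"top {k:2d} features: accuracy={acc:.3f}, leaves={leaves}, nodes={nodes}")
```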


