Using Information Gain Attribute Evaluation to Classify ... - Telfor 2009

Using Information Gain Attribute Evaluation to Classify ... - Telfor 2009

Using Information Gain Attribute Evaluation to Classify ... - Telfor 2009

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

classified as a wrapper, because in this case the classifier algorithm is wrapped in the loop. On the contrary, filter methods do not rely on the classifier algorithm, but use other criteria based on correlation notions. Without a suitable stopping criterion, the feature selection process may run exhaustively before it stops. A feature selection process may stop under one of the following reasonable criteria: (1) a predefined number of features is selected, (2) a predefined number of iterations is reached, (3) the addition (or deletion) of any feature does not produce a better subset, (4) an optimal subset according to the evaluation criterion is obtained. The selected best feature subset needs to be validated by carrying out different tests on both the selected subset and the original set and comparing the results, using artificial data sets and real-world data sets.

III. INFORMATION GAIN ATTRIBUTE EVALUATION

Diverse feature ranking and feature selection techniques have been proposed in the machine learning literature. The purpose of these techniques is to discard irrelevant or redundant features from a given feature vector. The following attribute evaluators are used: IG, gain ratio, symmetrical uncertainty, relief-F, one-R and chi-squared. In this paper, we consider the practical usefulness of IG attribute evaluation.

Entropy is a measure commonly used in information theory that characterizes the purity of an arbitrary collection of examples; it is the foundation of IG attribute ranking methods. The entropy measure can be viewed as a measure of the system's unpredictability. The entropy of Y is

H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)    (1)

where p(y) is the marginal probability density function for the random variable Y. If the observed values of Y in the training data set S are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between features Y and X. The entropy of Y after observing X is

H(Y | X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)    (2)

where p(y|x) is the conditional probability of y given x. Given the entropy as a criterion of impurity in a training set S, we can define a measure reflecting the additional information about Y provided by X, that is, the amount by which the entropy of Y decreases. This measure is known as IG and is given by

IG = H(Y) - H(Y|X) = H(X) - H(X|Y)    (3)

IG is a symmetrical measure (see equation (3)): the information gained about Y after observing X is equal to the information gained about X after observing Y. A weakness of the IG criterion is that it is biased in favor of features with more values, even when they are not more informative.
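As a concrete illustration of equations (1)-(3), the following minimal Python sketch estimates H(Y), H(Y|X) and IG from observed frequencies. It assumes a discrete (or already discretized) feature; the function names, variable names and toy data are illustrative only and do not come from the paper.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), estimated from label frequencies."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """IG = H(Y) - H(Y|X) for a discrete feature X and class labels Y."""
    total = len(labels)
    h_y = entropy(labels)
    h_y_given_x = 0.0
    for x in set(feature_values):
        # Conditional entropy term p(x) * H(Y | X = x)
        subset = [y for xv, y in zip(feature_values, labels) if xv == x]
        h_y_given_x += (len(subset) / total) * entropy(subset)
    return h_y - h_y_given_x

# Tiny hypothetical example: one discrete feature and two class labels.
X = ["a", "a", "b", "b", "b", "a"]
Y = ["M", "M", "R", "R", "M", "M"]
print(information_gain(X, Y))   # about 0.459 bits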
IV. C4.5 DECISION TREE

Different methods exist to build decision trees, but all of them summarize the given training data in a tree structure, with each branch representing an association between feature values and a class label. One of the most famous and representative among them is the C4.5 decision tree [20]. The C4.5 decision tree works by recursively partitioning the training data set according to tests on the potential of feature values in separating the classes. The decision tree is learned from a set of training examples through an iterative process of choosing a feature and splitting the given example set according to the values of that feature. The most important question is which of the features is the most influential in determining the classification and hence should be chosen first. Entropy measures, or equivalently information gains, are used to select the most influential feature, which is intuitively deemed to be the feature with the lowest entropy (or the highest information gain). The learning algorithm works by: a) computing the entropy measure for each feature, b) partitioning the set of examples according to the possible values of the feature that has the lowest entropy, and c) repeating the process for each resulting partition. Frequencies in the partitions are used to estimate probabilities, in exactly the same way as with the Naive Bayes approach. Although feature tests are chosen one at a time in a greedy manner, they are dependent on the results of previous tests.

V. EXPERIMENTS AND RESULTS

The Connectionist Bench (Sonar, Mines vs. Rocks) data set, taken from the UCI repository of machine learning databases [20], was used for IG attribute evaluation with the C4.5 decision tree. This is the data set used by Gorman and Sejnowski in their study of the classification of sonar signals using a neural network [21]. The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

This data set contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions, and 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0, where each number represents the energy within a particular frequency band, integrated over a certain period of time.

If the object is a rock, the label associated with each record contains the letter "R", and if it is a mine (metal cylinder), the letter "M". The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.

Fig. 1 shows a sample return from the rock and the cylinder. The preprocessing of the raw signal was based on experiments with human listeners. The temporal signal was first filtered, and spectral information was extracted and used to represent the signal on the input layer.
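The experimental setup described in this section could be approximated roughly as follows. This is a sketch under stated assumptions, not the authors' actual pipeline: scikit-learn's mutual_info_classif is used as a stand-in for IG attribute evaluation on the continuous sonar features, DecisionTreeClassifier (CART) stands in for C4.5, and "sonar.csv" is an assumed local copy of the UCI data (60 numeric columns plus a class column).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Assumed local copy of the UCI Connectionist Bench (Sonar) data:
# 60 numeric columns (energy per frequency band) plus a class column "R"/"M".
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, :60].to_numpy()
y = data.iloc[:, 60].to_numpy()

# Rank features by an information-based score (stand-in for IG ranking).
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]          # best feature first

# Baseline: decision tree on all 60 features, ten-fold cross-validation.
tree = DecisionTreeClassifier(random_state=0)
baseline = cross_val_score(tree, X, y, cv=10).mean()

# Same classifier on the top-k ranked features only.
k = 10
selected = cross_val_score(tree, X[:, ranking[:k]], y, cv=10).mean()

print(f"all 60 features : {baseline:.3f}")
print(f"top {k} features : {selected:.3f}")
```

Because of the different tree learner and ranking estimator, the accuracies will not match the figures reported below exactly; the sketch only mirrors the structure of the experiment.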


Fig. 1. Amplitude displays of a typical return from the cylinder and the rock as a function of time [21].

The preprocessing used to obtain the spectral envelope is indicated schematically in Fig. 2, where a set of sampling apertures (Fig. 2a) is superimposed over the 2D display of a short-term Fourier transform spectrogram of the sonar return. The spectral envelope, P_{t0,v0}(η), was obtained by integrating over each aperture (Fig. 2b and c).

Fig. 2. The preprocessing of the sonar signal produces a sampled spectral envelope. (a) The set of sampling apertures offset temporally to correspond to the slope of the FM chirp, (b) sampling apertures superimposed over the 2D display of the short-term Fourier transform, (c) the spectral envelope obtained by integrating over each sampling aperture [21].

A supervised learning algorithm, the C4.5 decision tree, is adopted here to build the model. The purpose of the experiments described in this section is to empirically test the claim that IG attribute evaluation can improve the accuracy of the C4.5 decision tree classification algorithm. The performance of the learning algorithm with and without feature selection is taken as an indication of the success of IG attribute evaluation in selecting useful features, because the relevant features are often not known in advance for natural domains. Classification accuracy was estimated using ten-fold cross-validation.

Fig. 3. Classification accuracy of the C4.5 decision tree with IG attribute evaluation (accuracy versus number of selected features).

Fig. 3 shows, for the C4.5 decision tree, how much the accuracy on this data set was improved or degraded by IG attribute evaluation. IG attribute evaluation maintains or improves the accuracy of the C4.5 decision tree when more than 9 relevant features are used, and degrades, maintains or improves its accuracy when fewer than 9 relevant features are used. With IG attribute evaluation, the accuracy of the C4.5 decision tree improves by more than 10% on this data set. The evaluation of selected features is fast. (A rough sketch of this sweep is given at the end of the section.)

TABLE 1: GENERATED DECISION RULES

Number of most relevant features    Number of leaves    Size of tree
60 - 52                             18                  35
51 - 49                             17                  33
48 - 36                             18                  35
35 - 34                             17                  33
33 - 33                             19                  37
32 - 29                             16                  31
28 - 22                             17                  33
21 - 21                             18                  35
20 - 18                             19                  37
17 - 17                             20                  39
16 - 14                             18                  35
13 - 13                             19                  37
12 - 11                             20                  39
10 - 10                             23                  45
9 - 9                               21                  41
8 - 8                               19                  37
7 - 5                               14                  27
4 - 4                               8                   15
3 - 1                               2                   3

The C4.5 decision tree without feature selection generates 18 rules, and the size of the tree is 35. Table 1 shows that IG attribute evaluation changes the size of the trees induced by the C4.5 decision tree, depending on the number of most relevant features used. The rules obtained for this data set by the C4.5 decision tree without feature selection are:

If f_11 =
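The sweep behind Fig. 3 and Table 1 could be reproduced along the following lines. This is again a sketch: scikit-learn's CART tree and mutual-information ranking substitute for the paper's C4.5 and IG attribute evaluation, so the exact accuracies and tree sizes will differ; the leaf and node counts of the fitted sklearn tree are used as rough analogues of "number of leaves" and "size of tree", and "sonar.csv" is the same assumed local copy of the UCI data as above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed local copy of the UCI sonar data (see the previous snippet).
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, :60].to_numpy()
y = data.iloc[:, 60].to_numpy()
ranking = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

# Sweep the number of top-ranked features, as in Fig. 3 and Table 1.
for k in range(60, 0, -3):
    Xk = X[:, ranking[:k]]
    tree = DecisionTreeClassifier(random_state=0)
    acc = cross_val_score(tree, Xk, y, cv=10).mean()   # ten-fold cross-validation
    tree.fit(Xk, y)                     # fit once on all data to inspect tree size
    leaves = tree.get_n_leaves()        # rough analogue of "number of leaves"
    nodes = tree.tree_.node_count       # rough analogue of "size of tree"
    print(f"top {k:2d} features: accuracy={acc:.3f}, leaves={leaves}, nodes={nodes}")
```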


