Decision Tree Induction - A Heuristic Problem Reduction Approach

D. Raghu, K. Venkata Raju & Ch. Raja Jacob
Department of Computer Science Engineering, Nova College of Engineering & Technology, Jangareddygudem, West Godavari District, Andhra Pradesh, India - 534447
E-mail: raghuau@gmail.com, venkatsagar05@gmail.com, rchidipi@gmail.com

Abstract - ID3, a greedy decision tree induction technique, constructs decision trees in a top-down, recursive, divide-and-conquer manner. This supervised learning algorithm is used for classification. This paper introduces a new heuristic technique of Problem Reduction that provides an optimal solution based on the evaluation of a heuristic function applied to every node in the tree. Attribute selection is performed by choosing the best attribute, namely the one with the least heuristic function value, which yields the optimal decision tree. This gives a fast and more efficient method of constructing the decision tree, with the heuristic function playing the predominant role in selecting the best attribute for classification.

Keywords - Decision tree, Problem Reduction, Heuristic function, Attribute Selection Criteria.

I. INTRODUCTION

Data mining is a powerful technology used on data warehouses. Its purpose is to discover the relationships and rules existing in data, to predict future trends based on the existing data, and finally to fully explore and use the wealth of knowledge hidden in databases [2], [3]. Decision tree induction is the most widely used practical method for inductive learning; it plays an important role in the process of data mining and data analysis.

Classification by decision tree induction commonly uses information gain as its attribute selection measure. This paper proposes a new heuristic approach for attribute selection, in which a heuristic function is applied to each attribute in order to select the attribute that best splits the tree.

II. ID3 ALGORITHM

The ID3 algorithm [4] is a decision tree learning algorithm based on information entropy, proposed by Quinlan in 1986. The core of the ID3 algorithm is: selecting attributes at every level of decision tree nodes, using information gain as the attribute selection criterion; at each step selecting the attribute with the largest information gain to form a decision tree node; establishing branches for the different values of that attribute; and building tree nodes and branches recursively according to the instances in the various branches, until every subset of instances belongs to the same category.

Let S be a collection of s data samples. Assume the class label attribute has m distinct values, defining m different classes C_i (i = 1, 2, ..., m), and let s_i be the number of samples in class C_i. For a given sample, the expected information needed for its classification is calculated as follows [1]:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (1)

where p_i = s_i / s is the probability that an arbitrary sample belongs to class C_i. Assume attribute A has v distinct values {a_1, a_2, ..., a_v}, which partition S into v subsets {S_1, S_2, ..., S_v}; taking A as the test attribute, each subset corresponds to a branch grown from the node containing the set S. Let s_ij be the number of samples of class C_i in subset S_j. The entropy of the division by A is then given as follows [3]:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \ldots, s_{mj}) \qquad (2)

where the factor (s_{1j} + s_{2j} + ... + s_{mj}) / s is the number of samples in subset S_j divided by the total number of samples in S.
The smaller the entropy, the higher the purity of the divided subsets. For a given subset S_j, its expected information is given as [3]:

I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij} \qquad (3)

where p_ij = s_ij / |S_j| is the probability that a sample of S_j belongs to class C_i.
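For concreteness, Eqs. (1) through (3) can be evaluated directly on a list of training records. The paper gives no code, so the following Python sketch is ours: records are assumed to be dictionaries mapping attribute names to values, and the function names (expected_info, entropy_of_split) are ours as well.

import math
from collections import Counter

def expected_info(labels):
    # I(s1, ..., sm) of Eqs. (1)/(3): expected information of a collection of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def entropy_of_split(rows, attribute, goal):
    # E(A) of Eq. (2): weighted expected information after partitioning `rows` on `attribute`.
    total = len(rows)
    e = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[goal] for r in rows if r[attribute] == value]
        e += len(subset) / total * expected_info(subset)
    return e

For example, for a class distribution of 9 "yes" and 5 "no" samples, expected_info returns about 0.940 bits.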


The information gain of attribute A is then obtained from the expected information and the entropy [3]:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A) \qquad (4)

According to the ID3 algorithm, the information gain of each attribute is calculated and the attribute with the highest information gain is selected as the test attribute for the given set. A node is created for the selected test attribute, a branch is created for each value of this attribute, and the samples are divided accordingly.

III. PROBLEM REDUCTION IMPLEMENTATION

In this approach the new algorithm applies a heuristic function to each attribute and picks the best attribute, i.e. the one with the least heuristic function value. The heuristic function f(n) is taken from the Problem Reduction algorithm, where it is defined as [5]:

f(n) = estimated cost from node n to the goal node.

The algorithm terminates when f(n) = 0 or when f(n) exceeds the FUTILITY value. In standard ID3 the attribute selection criterion is the information gain heuristic; this paper instead proposes an attribute selection criterion that applies the heuristic function f(n) to each attribute.

Attribute Selection Criteria:

Step-1: Calculate f(n) for each candidate attribute, taking f(n) = E(n) with respect to the goal attribute Buys_computer.

Step-2: Calculate the FUTILITY value from the f(n) values of the candidate attributes, where m is the number of attributes.

Step-1 and Step-2 are applied recursively to build the decision tree. The algorithm terminates when f(n) = 0 or f(n) is greater than FUTILITY.
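The exact FUTILITY formula did not survive in the source. In the worked example below, exactly two of the four attributes have f(n) below the threshold, which is consistent with FUTILITY being the average of the f(n) values of the m candidate attributes; that reading is an assumption on our part. The following Python sketch, reusing expected_info and entropy_of_split from the previous sketch, implements Step-1 and Step-2 under that assumption together with the recursive Problem Reduction construction; the function names are ours, not the paper's.

from collections import Counter

def f_value(rows, attribute, goal):
    # Step-1: f(n) for an attribute is E(attribute) with respect to the goal attribute.
    return entropy_of_split(rows, attribute, goal)

def futility(rows, attributes, goal):
    # Step-2 (ASSUMED form): average of the f(n) values over the m candidate attributes.
    m = len(attributes)
    return sum(f_value(rows, a, goal) for a in attributes) / m

def majority(rows, goal):
    # Label for a solved node or a futility cut-off: the most common class in the subset.
    return Counter(r[goal] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, goal):
    # A node whose samples all share one class is solved; otherwise pick the attribute
    # with the least f(n) and recurse, stopping when f(n) exceeds the FUTILITY value.
    if len(set(r[goal] for r in rows)) == 1 or not attributes:
        return majority(rows, goal)
    best = min(attributes, key=lambda a: f_value(rows, a, goal))
    if f_value(rows, best, goal) > futility(rows, attributes, goal):
        return majority(rows, goal)      # futile: stop expanding this node
    remaining = [a for a in attributes if a != best]
    return (best, {value: build_tree([r for r in rows if r[best] == value], remaining, goal)
                   for value in set(r[best] for r in rows)})

With f(n) = E(n), an attribute whose f(n) is 0 partitions the node into pure subsets, so every resulting child is immediately recognised as solved by the first test in build_tree; this matches the "node is labelled as solved" steps in the validation below.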


IV. ALGORITHM VALIDATION

Table I gives the data set [3] affecting Buys_computer. There are four attributes: age, income, student and credit_rating. These four attributes are used to predict Buys_computer, and the decision tree is constructed as follows.

Table I: The Data Set used in the Algorithm

I. Initial Attribute

Table II: The Result Set for the Initial Attribute

According to the calculated values, of the two attributes whose f(n) is less than the FUTILITY value, "AGE" has the smaller f(n) value compared to "STUDENT", so the training set is first classified on the attribute "Age", which splits into 3 values.

II. Next Attribute with age <= 30

On the split age <= 30 the next attribute selected is "STUDENT", since its f(n) value is less than the FUTILITY value. All the resulting f(n) values are 0, so the nodes are labelled as solved and there are no further splits; this ends the branch age <= 30.

III. Next Attribute with 30 < age <= 40

Since all the f(n) values are 0, the node is labelled as solved and there is no further split; this ends the branch 30 < age <= 40.

IV. Next Attribute with age > 40

The attribute selected on the split age > 40 is "CREDIT_RATING", since its f(n) value is less than the FUTILITY value.

V. Next Attribute with age > 40 and credit = fair

Table VIII: The Result Set for attribute Credit = fair

Since all the f(n) values are 0, the node is labelled as solved and there are no further splits.

VI. Next Attribute with age > 40 and credit = excellent

Table IX: The Result Set for attribute Credit = excellent

Since all the f(n) values are 0, the node is labelled as solved and there are no further splits. The algorithm terminates, and the final decision tree is shown in Fig. 1.

Fig. 1: Decision Tree
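To tie the walkthrough back to the sketches above, the following usage example continues from them (reusing build_tree and the earlier helper functions). The training records are our transcription of the standard Buys_computer data set of Han and Kamber [3], which the paper cites for its Table I; the table itself did not survive extraction, so this transcription is an assumption.

data = [
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
]

tree = build_tree(data, ["age", "income", "student", "credit_rating"], "buys_computer")
print(tree)
# Resulting structure (dictionary key order may vary), matching Fig. 1:
# ('age', {'<=30': ('student', {'no': 'no', 'yes': 'yes'}),
#          '31..40': 'yes',
#          '>40': ('credit_rating', {'fair': 'yes', 'excellent': 'no'})})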


V. CONCLUSION

This paper presented a new approach to the attribute selection criterion, taking as the best splitting attribute the one with the least heuristic function value. It reduces the number of calculations by generating the decision tree without computing the gain factor. This work can be extended to larger databases, for which it can provide a better splitting criterion.

REFERENCES

[1] Liu Yuxun, Xie Niuniu, "Improved ID3 Algorithm", 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2010), pp. 465-468.
[2] W. J. Frawley, G. Piatetsky-Shapiro, C. J. Matheus, "Knowledge Discovery in Databases: An Overview", AAAI/MIT Press, 1991, pp. 1-27.
[3] Han Jiawei, Micheline Kamber, Data Mining: Concepts and Techniques.
[4] J. R. Quinlan, "Induction of Decision Trees", Machine Learning, 1986, pp. 81-106.
[5] E. Rich, K. Knight, Shiv Shankar B. Nair, Artificial Intelligence.
