Decision Tree Induction - A Heuristic Problem Reduction Approach

D. Raghu, K. Venkata Raju & Ch. Raja Jacob
Department of Computer Science Engineering, Nova College of Engineering & Technology, Jangareddygudem, West Godavari District, Andhra Pradesh, India - 534447
E-mail: raghuau@gmail.com, venkatsagar05@gmail.com, rchidipi@gmail.com

Abstract - ID3, a greedy decision tree induction technique, constructs decision trees in a top-down, recursive, divide-and-conquer manner. This supervised learning algorithm is used for classification. This paper introduces a new heuristic technique of Problem Reduction that provides an optimal solution based on the evaluation of a heuristic function applied to every node in the tree. Attribute selection is performed by choosing the best attribute, namely the one with the least heuristic function value, which yields the optimal decision tree. This gives a fast and more efficient method of constructing the decision tree, with the heuristic function playing the predominant role in selecting the best attribute for classification.

Keywords - Decision tree, Problem Reduction, Heuristic function, Attribute Selection Criteria.

I. INTRODUCTION

Data mining is a powerful technology used on data warehouses. Its purpose is to discover the relationships and rules existing in data, to predict future trends based on the existing data, and finally to fully explore and use the wealth of knowledge hidden in databases [2], [3]. Decision tree induction is the most widely used practical method for inductive learning; it plays an important role in the process of data mining and data analysis.

Classification by decision tree induction commonly uses information gain as its attribute selection measure. This paper proposes a new heuristic approach for attribute selection, in which a heuristic function is applied to each attribute in order to select the attribute that best splits the tree.

II. ID3 ALGORITHM

The ID3 algorithm [4] is a decision tree learning algorithm based on information entropy, proposed by Quinlan in 1986. The core of the ID3 algorithm is: selecting attributes at every level of decision tree nodes, using information gain as the attribute selection criterion; at each step selecting the attribute with the largest information gain to form a decision tree node; establishing branches for the different values of that attribute; and building tree nodes and branches recursively according to the instances in the various branches, until every subset of instances belongs to the same category.

Let S be a collection of s data samples. Assume the class label attribute has m distinct values, defining m different classes C_i (i = 1, 2, ..., m), and let s_i be the number of samples in class C_i. For a given sample, the expected information needed for its classification is calculated as follows [1]:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (1)

where p_i = s_i / s is the probability that an arbitrary sample belongs to class C_i. Assume attribute A has v distinct values {a_1, a_2, ..., a_v}, which partition S into v subsets {S_1, S_2, ..., S_v}; taking A as the test attribute, each subset corresponds to a branch grown from the node containing the set S. Let s_ij be the number of samples of class C_i in subset S_j. The entropy of the division by A is then given as follows [3]:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \ldots, s_{mj}) \qquad (2)

where the factor (s_{1j} + s_{2j} + ... + s_{mj}) / s is the number of samples in subset S_j divided by the total number of samples in S.
The smaller the entropy, the higher the purity of the divided subsets. For a given subset S_j, its expected information is given as [3]:

I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij} \qquad (3)

where p_ij = s_ij / |S_j| is the probability that a sample of S_j belongs to class C_i.
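For concreteness, Eqs. (1) through (3) can be evaluated directly on a list of training records. The paper gives no code, so the following Python sketch is ours: records are assumed to be dictionaries mapping attribute names to values, and the function names (expected_info, entropy_of_split) are ours as well.

import math
from collections import Counter

def expected_info(labels):
    # I(s1, ..., sm) of Eqs. (1)/(3): expected information of a collection of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def entropy_of_split(rows, attribute, goal):
    # E(A) of Eq. (2): weighted expected information after partitioning `rows` on `attribute`.
    total = len(rows)
    e = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[goal] for r in rows if r[attribute] == value]
        e += len(subset) / total * expected_info(subset)
    return e

For example, for a class distribution of 9 "yes" and 5 "no" samples, expected_info returns about 0.940 bits.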


The information gain of attribute A is then obtained from the expected information and the entropy [3]:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A) \qquad (4)

According to the ID3 algorithm, the information gain of each attribute is calculated and the attribute with the highest information gain is selected as the test attribute for the given set. A node is created for the selected test attribute, a branch is created for each value of this attribute, and the samples are divided accordingly.

III. PROBLEM REDUCTION IMPLEMENTATION

In this approach the new algorithm applies a heuristic function to each attribute and picks the best attribute, i.e. the one with the least heuristic function value. The heuristic function f(n) is taken from the Problem Reduction algorithm, where it is defined as [5]:

f(n) = estimated cost from node n to the goal node.

The algorithm terminates when f(n) = 0 or when f(n) exceeds the FUTILITY value. In standard ID3 the attribute selection criterion is the information gain heuristic; this paper instead proposes an attribute selection criterion that applies the heuristic function f(n) to each attribute.

Attribute Selection Criteria:

Step-1: Calculate f(n) for each candidate attribute, taking f(n) = E(n) with respect to the goal attribute Buys_computer.

Step-2: Calculate the FUTILITY value from the f(n) values of the candidate attributes, where m is the number of attributes.

Step-1 and Step-2 are applied recursively to build the decision tree. The algorithm terminates when f(n) = 0 or f(n) is greater than FUTILITY.
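The exact FUTILITY formula did not survive in the source. In the worked example below, exactly two of the four attributes have f(n) below the threshold, which is consistent with FUTILITY being the average of the f(n) values of the m candidate attributes; that reading is an assumption on our part. The following Python sketch, reusing expected_info and entropy_of_split from the previous sketch, implements Step-1 and Step-2 under that assumption together with the recursive Problem Reduction construction; the function names are ours, not the paper's.

from collections import Counter

def f_value(rows, attribute, goal):
    # Step-1: f(n) for an attribute is E(attribute) with respect to the goal attribute.
    return entropy_of_split(rows, attribute, goal)

def futility(rows, attributes, goal):
    # Step-2 (ASSUMED form): average of the f(n) values over the m candidate attributes.
    m = len(attributes)
    return sum(f_value(rows, a, goal) for a in attributes) / m

def majority(rows, goal):
    # Label for a solved node or a futility cut-off: the most common class in the subset.
    return Counter(r[goal] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, goal):
    # A node whose samples all share one class is solved; otherwise pick the attribute
    # with the least f(n) and recurse, stopping when f(n) exceeds the FUTILITY value.
    if len(set(r[goal] for r in rows)) == 1 or not attributes:
        return majority(rows, goal)
    best = min(attributes, key=lambda a: f_value(rows, a, goal))
    if f_value(rows, best, goal) > futility(rows, attributes, goal):
        return majority(rows, goal)      # futile: stop expanding this node
    remaining = [a for a in attributes if a != best]
    return (best, {value: build_tree([r for r in rows if r[best] == value], remaining, goal)
                   for value in set(r[best] for r in rows)})

With f(n) = E(n), an attribute whose f(n) is 0 partitions the node into pure subsets, so every resulting child is immediately recognised as solved by the first test in build_tree; this matches the "node is labelled as solved" steps in the validation below.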


IV. ALGORITHM VALIDATION

Table I gives the data set [3] affecting Buys_computer. There are four attributes: age, income, student and credit_rating. These four attributes are used to predict Buys_computer, and the decision tree is constructed as follows.

Table I: The Data Set used in the Algorithm

I. Initial Attribute

Table II: The Result Set for the Initial Attribute

According to the calculated values, of the two attributes whose f(n) is less than the FUTILITY value, "AGE" has the smaller f(n) value compared to "STUDENT", so the training set is first classified on the attribute "Age", which splits into 3 values.

II. Next Attribute with age <= 30

On the split age <= 30 the next attribute selected is "STUDENT", since its f(n) value is less than the FUTILITY value. All the resulting f(n) values are 0, so the nodes are labelled as solved and there are no further splits; this ends the branch age <= 30.

III. Next Attribute with 30 < age <= 40

Since all the f(n) values are 0, the node is labelled as solved and there is no further split; this ends the branch 30 < age <= 40.

IV. Next Attribute with age > 40

The attribute selected on the split age > 40 is "CREDIT_RATING", since its f(n) value is less than the FUTILITY value.

V. Next Attribute with age > 40 and credit = fair

Table VIII: The Result Set for attribute Credit = fair

Since all the f(n) values are 0, the node is labelled as solved and there are no further splits.

VI. Next Attribute with age > 40 and credit = excellent

Table IX: The Result Set for attribute Credit = excellent

Since all the f(n) values are 0, the node is labelled as solved and there are no further splits. The algorithm terminates, and the final decision tree is shown in Fig. 1.

Fig. 1: Decision Tree
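To tie the walkthrough back to the sketches above, the following usage example continues from them (reusing build_tree and the earlier helper functions). The training records are our transcription of the standard Buys_computer data set of Han and Kamber [3], which the paper cites for its Table I; the table itself did not survive extraction, so this transcription is an assumption.

data = [
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
]

tree = build_tree(data, ["age", "income", "student", "credit_rating"], "buys_computer")
print(tree)
# Resulting structure (dictionary key order may vary), matching Fig. 1:
# ('age', {'<=30': ('student', {'no': 'no', 'yes': 'yes'}),
#          '31..40': 'yes',
#          '>40': ('credit_rating', {'fair': 'yes', 'excellent': 'no'})})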


V. CONCLUSION

This paper presented a new approach to the attribute selection criterion, taking as the best splitting attribute the one with the least heuristic function value. It reduces the number of calculations by generating the decision tree without computing the gain factor. This work can be extended to larger databases, for which it can provide a better splitting criterion.

REFERENCES

[1] Liu Yuxun, Xie Niuniu, "Improved ID3 Algorithm", 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2010), pp. 465-468.
[2] W. J. Frawley, G. Piatetsky-Shapiro, C. J. Matheus, "Knowledge Discovery in Databases: An Overview", AAAI/MIT Press, 1991, pp. 1-27.
[3] Han Jiawei, Micheline Kamber, Data Mining: Concepts and Techniques.
[4] J. R. Quinlan, "Induction of Decision Trees", Machine Learning, 1986, pp. 81-106.
[5] E. Rich, K. Knight, Shiv Shankar B. Nair, Artificial Intelligence.
