Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)
Chapter 3
Supervised Learning Using Python
The measure used here is information gain: the expected reduction in entropy achieved by partitioning the examples according to a given attribute.
Specifically, the information gain, Gain(S,A), of an attribute A relative
to a collection of examples S is defined as follows:
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
So, an attribute with a higher information gain is chosen earlier, placing it closer to the root of the decision tree.
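To make the formula concrete, it can be computed directly in a few lines of Python. This is a minimal sketch on a toy dataset; the attribute values, labels, and function names are illustrative, not part of the book's code:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attribute_index):
    # Gain(S, A) = Entropy(S) - sum over values v of A of
    #              (|S_v| / |S|) * Entropy(S_v)
    total = len(labels)
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute_index], []).append(label)
    remainder = sum((len(subset) / total) * entropy(subset)
                    for subset in partitions.values())
    return entropy(labels) - remainder

# Toy data: attribute 0 perfectly separates the classes,
# while attribute 1 carries no information.
X = [('sunny', 'hot'), ('sunny', 'cold'), ('rainy', 'hot'), ('rainy', 'cold')]
y = ['yes', 'yes', 'no', 'no']
print(information_gain(X, y, 0))  # 1.0: splitting removes all entropy
print(information_gain(X, y, 1))  # 0.0: splitting removes none
```

A decision-tree learner applies exactly this comparison at every node, splitting on the attribute with the largest gain.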
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('csv file path', index_col=0)
y = df['target class column']
X = df[['col1', 'col2']]          # list of feature columns
clf = DecisionTreeClassifier()
clf.fit(X, y)
clf.predict(X_test)               # X_test: held-out rows with the same columns as X
Random Forest Classifier
A random forest classifier is an extension of a decision tree in which the algorithm builds N decision trees, each considering M features selected at random. A test instance is then classified by every tree and assigned to the target class predicted by the majority of them.
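The description above maps directly onto scikit-learn's `RandomForestClassifier`, where `n_estimators` is the N trees and `max_features` controls the M randomly selected features each split considers. This sketch uses synthetic data in place of the CSV file used earlier; the dataset and parameter values are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the CSV loaded in the earlier example.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# n_estimators = N trees; max_features = M features tried per split.
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                             random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)  # majority vote across the N trees
print(clf.score(X_test, y_test))   # accuracy on the held-out rows
```

Because each tree sees a different bootstrap sample and feature subset, the majority vote typically generalizes better than any single decision tree.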