Advanced Data Analytics Using Python_ With Machine Learning, Deep Learning and NLP Examples ( 2023)
Chapter 3
Supervised Learning Using Python
Dealing with Categorical Data
Algorithms such as support vector machines and regression require
numeric input data. So, if you are dealing with categorical data, you need
to convert it to numeric data. One strategy for conversion is to assign each
category an ordinal number and use that as its numerical score. A more
sophisticated way is to replace each category with the expected value (the
mean) of the target variable over the rows that have that category. This
works well for regression.
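# Assumes X holds the categorical feature columns and df holds the same
# rows together with the numeric target column 'floor'.
for col in X.columns:
    # Mean of the target for each distinct value of this column.
    avgs = df.groupby(col, as_index=False)['floor'].mean()
    for i, row in avgs.iterrows():
        k = row[col]
        v = row['floor']
        # Replace every occurrence of category k with its target mean.
        X.loc[X[col] == k, col] = v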
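To make the two strategies concrete, here is a minimal sketch on a made-up table. The column name neighborhood and the six rows are hypothetical and chosen only for illustration; floor is reused as the target name from the listing above. It uses a vectorized groupby and map rather than an explicit loop, which is one possible way to get the same encoding, not necessarily the approach the book intends.

```python
import pandas as pd

# Hypothetical toy data: 'neighborhood' is categorical, 'floor' is the numeric target.
df = pd.DataFrame({
    "neighborhood": ["north", "south", "north", "east", "south", "north"],
    "floor": [2, 5, 4, 1, 7, 3],
})

# Strategy 1: ordinal encoding. Each category gets an integer code; with
# cat.codes the codes follow the sorted order of the category labels.
ordinal = df["neighborhood"].astype("category").cat.codes

# Strategy 2: mean (target) encoding. Each category is replaced by the
# average of the target over the rows that share that category.
category_means = df.groupby("neighborhood")["floor"].mean()
mean_encoded = df["neighborhood"].map(category_means)

print(ordinal.tolist())       # [1, 2, 1, 0, 2, 1]
print(mean_encoded.tolist())  # [3.0, 6.0, 3.0, 1.0, 6.0, 3.0]
```

Note that the ordinal codes impose an order (east < north < south) that has no real meaning, whereas the mean encoding carries information about the target. Because it uses the target, the means should normally be computed on the training rows only and then applied to the test rows.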
For logistic regression, you can use the expected probability of the
target variable for that categorical value.
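As a small illustration of this idea, the sketch below estimates P(success = 1 | category) on a made-up table. The column name channel and the eight rows are hypothetical; success is assumed here to be stored as integers 0 and 1, so the per-category mean is the fraction of successes.

```python
import pandas as pd

# Hypothetical toy data: 'channel' is categorical, 'success' is a 0/1 target.
df = pd.DataFrame({
    "channel": ["email", "phone", "email", "web", "phone", "email", "web", "web"],
    "success": [1, 0, 0, 1, 1, 1, 0, 1],
})

# For a 0/1 target, the mean within each category equals the fraction of
# successes, which estimates P(success = 1 | channel).
success_rate = df.groupby("channel")["success"].mean()

# Replace each category label with its estimated success probability.
df["channel_encoded"] = df["channel"].map(success_rate)

print(success_rate)
```

The listing that follows derives the same conditional probability one column at a time from explicit group sizes.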
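for col in X.columns:
    if str(col) != 'success':
        if str(col) not in index:
            # P(col = value): share of rows taking each value of this column.
            feature_prob = X.groupby(col).size().div(len(df))
            # P(success, col) / P(col) = P(success | col), one row per pair.
            cond_prob = (X.groupby(['success', str(col)]).size()
                         .div(len(df))
                         .div(feature_prob, axis=0, level=str(col))
                         .reset_index(name="Probability"))
            # Drop the rows for success == '0' (the target is compared as a string).
            cond_prob = cond_prob[cond_prob.success != '0']
            cond_prob.drop("success", inplace=True, axis=1)
            # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
            cond_prob['feature_value'] = cond_prob[str(col)].apply(str).to_numpy()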