08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Classification – Detecting Poor Answers

Slimming the classifier

It is always worth looking at the actual contributions of the individual features. For logistic regression, we can directly take the learned coefficients (clf.coef_) to get an impression of each feature's impact. The higher a feature's coefficient, the larger the role that feature plays in determining whether the post is good. Conversely, a negative coefficient tells us that higher values of the corresponding feature are a stronger signal for the post to be classified as bad.
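As a minimal sketch of how the coefficients can be read off, the snippet below fits a logistic regression on synthetic data and lists the coefficients by magnitude. The data, the random seed, and the assumed ground truth (links help, exclamation marks hurt, the other two features are noise) are illustrative assumptions and not the chapter's dataset; only the feature names and clf.coef_ come from the text. Comparing coefficient magnitudes is only meaningful when the features are on similar scales, which is the case here by construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
feature_names = ["LinkCount", "NumExclams", "NumImages", "AvgSentLen"]

# Synthetic stand-in data (assumption): four features on the same scale.
n = 1000
X = rng.normal(size=(n, 4))

# Assumed ground truth: LinkCount pushes towards "good" (1),
# NumExclams pushes towards "bad" (0), the other two are pure noise.
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# clf.coef_ has shape (1, n_features) for a binary problem.
coefs = dict(zip(feature_names, clf.coef_[0]))

# Print features from most to least influential, keeping the sign.
for name, c in sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>12}: {c:+.3f}")
```

A positive value means that larger feature values move the prediction towards the positive class, and a negative value moves it towards the negative class.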

We see that LinkCount and NumExclams have the biggest impact on the overall classification decision, while NumImages and AvgSentLen play a rather minor role. While the feature importance overall makes sense intuitively, it is surprising that NumImages is basically ignored. Normally, answers containing images are rated highly. In reality, however, answers very rarely have images. So although in principle it is a very powerful feature, it is too sparse to be of any value. We could easily drop this feature and retain the same classification performance.
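The effect of dropping a sparse feature can be checked with a quick cross-validation comparison. The following is a sketch on synthetic data, not the chapter's experiment: it assumes that about 1% of answers contain an image and that an image strongly favors a good rating when present. All numbers and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 2000

# Synthetic stand-in features (assumption).
link_count = rng.normal(size=n)
num_exclams = rng.normal(size=n)
# Sparse feature: roughly 1% of the answers contain an image.
num_images = (rng.uniform(size=n) < 0.01).astype(float)

# Assumed ground truth: an image helps a lot, but only where one exists.
logits = 2.0 * link_count - 1.5 * num_exclams + 3.0 * num_images
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_full = np.column_stack([link_count, num_exclams, num_images])
X_slim = X_full[:, :2]  # same data without the NumImages column

# Mean accuracy over 5 stratified folds, with and without the sparse feature.
acc_full = cross_val_score(LogisticRegression(), X_full, y, cv=5).mean()
acc_slim = cross_val_score(LogisticRegression(), X_slim, y, cv=5).mean()

print(f"with NumImages:    {acc_full:.3f}")
print(f"without NumImages: {acc_slim:.3f}")
```

Because the sparse feature is nonzero for only a handful of rows, it can change at most that many predictions, so the two accuracies should end up very close.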
