Learning Data Mining with Python

Social Media Insight Using Naive Bayes

Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining. In this chapter, we look at disambiguating terms in social media using the Naive Bayes algorithm, which is powerful and surprisingly simple. Naive Bayes takes a few shortcuts when computing the probabilities for classification, most notably treating the features as independent of one another given the class, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this chapter serves as a baseline for text mining studies, as the process works reasonably well for a variety of datasets.
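To make the "naive" shortcut concrete, here is a minimal sketch of a Naive Bayes text classifier written from scratch. It is not the implementation used later in the chapter. The function names (`train_naive_bayes`, `predict`), the add-one smoothing, and the four toy documents are all assumptions made up for illustration. Each word contributes its own log-probability to the class score, as if words were independent of each other given the class.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class.

    `documents` is a list of token lists; `labels` holds one class per document.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # Start from the log prior of the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: does "python" refer to programming or the animal?
documents = [
    ["python", "code", "library"],
    ["python", "script", "code"],
    ["python", "snake", "zoo"],
    ["snake", "bite", "python"],
]
labels = ["programming", "programming", "animal", "animal"]

model = train_naive_bayes(documents, labels)
print(predict(model, ["python", "code"]))  # programming
print(predict(model, ["snake", "zoo"]))    # animal
```

Working with log-probabilities rather than multiplying raw probabilities is a common design choice here, since products of many small numbers can underflow on longer documents.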

We will cover the following topics in this chapter:

• Downloading data from social network APIs
• Transformers for text
• Naive Bayes classifier
• Using JSON for saving and loading datasets
• The NLTK library for extracting features from text
• The F-measure for evaluation
