
Learning Data Mining with Python


Clustering News Articles

If there is an error in obtaining a website, we simply skip it and keep going. This code will work on 95 percent of websites, and that is good enough for our application, as we are looking for general trends rather than exactness. Note that sometimes you do care about getting 100 percent of responses, and in that case you should adjust your code to accommodate more errors. The code to get those final 5 to 10 percent of websites will be significantly more complex. We therefore catch any error that could occur (it is the Internet; lots of things can go wrong), increment our error count, and continue.

except Exception as e:
    number_errors += 1
    print(e)

If you find that too many errors occur, change the print(e) line to just raise instead. This re-raises the exception, giving you the full traceback so you can debug the problem.

Now, we have a bunch of websites in our raw subfolder. After taking a look at these pages (open the created files in a text editor), you can see that the content is there, but there is HTML, JavaScript, and CSS code mixed in, as well as other content. As we are only interested in the story itself, we now need a way to extract this information from these different websites.
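As a rough illustration of what such an extraction step involves, here is a minimal sketch using only the standard library's html.parser: it skips script and style blocks and keeps the visible text. This is a simplified stand-in under stated assumptions, not the approach developed in the next section, and real news pages usually need something smarter:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

For example, feeding it a page whose body is a single paragraph returns just that paragraph's text, with all markup, scripts, and styles discarded.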

Putting it all together

After we get the raw data, we need to find the story in each file. There are a few online sources that use data mining to achieve this; you can find them listed in Appendix A, Next Steps…. It is rarely necessary to use such complex algorithms, although they can give better accuracy. This is part of data mining: knowing when to use it, and when not to.

First, we get a list of each of the filenames in our raw subfolder:

filenames = [os.path.join(data_folder, filename)
             for filename in os.listdir(data_folder)]

Next, we create an output folder for the text-only versions that we will extract:

text_output_folder = os.path.join(os.path.expanduser("~"), "Data",
                                  "websites", "textonly")
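One practical detail: the folder has to exist before any files are written into it. A minimal sketch of building and creating the path (using a temporary base directory so the example is self-contained; the book's path is rooted at the home directory instead):

```python
import os
import tempfile

# Assumption: a throwaway base stands in for os.path.expanduser("~").
base = tempfile.mkdtemp()
text_output_folder = os.path.join(base, "Data", "websites", "textonly")

# exist_ok=True makes this safe to re-run on a later crawl.
os.makedirs(text_output_folder, exist_ok=True)
```

os.makedirs creates every missing intermediate directory in one call, so "Data" and "websites" do not need to exist beforehand.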

