10.11.2016 Views

Learning Data Mining with Python


Authorship Attribution

After taking a look at these files, you will see that many of them are quite messy, at least from a data analysis point of view. There is a large Project Gutenberg disclaimer at the start of the files. This needs to be removed before we do our analysis.

We could alter the individual files on disk to remove this text. However, what happens if we lose our data? We would lose our changes and potentially be unable to replicate the study. For that reason, we will perform the preprocessing as we load the files, which ensures that our results are replicable (as long as the data source stays the same). The code is as follows:

def clean_book(document):

We first split the document into lines, because the disclaimers are delimited by marker lines that we can find by scanning line by line:

    lines = document.split("\n")

We then iterate over each line, looking for the line that marks the start of the book and the line that marks its end. The text in between is the book itself. If a marker is missing, the default values keep everything from the first line to the last. The code is as follows:

    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF THIS PROJECT GUTENBERG"):
            start = i + 1  # the book begins on the line after the start marker
        elif line.startswith("*** END OF THIS PROJECT GUTENBERG"):
            end = i  # slice ends are exclusive, so this already drops the marker line

Finally, we join those lines together with a newline character to recreate the book without the disclaimers:

    return "\n".join(lines[start:end])
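Putting the pieces above together, the whole function can be sketched as one runnable block. The `sample` text is a made-up miniature "book" used only for illustration; real Project Gutenberg files are much longer, and their marker lines may be worded differently. The sketch sets `end = i`: because Python slices exclude the end index, this drops the marker line while keeping the last line of the book.

```python
def clean_book(document):
    """Return the text between the Project Gutenberg start and end markers."""
    lines = document.split("\n")
    start = 0           # fall back to the whole text if no markers are found
    end = len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF THIS PROJECT GUTENBERG"):
            start = i + 1   # first line after the start marker
        elif line.startswith("*** END OF THIS PROJECT GUTENBERG"):
            end = i         # slice end is exclusive, so the marker is dropped
    return "\n".join(lines[start:end])


# Made-up miniature "book", for illustration only
sample = "\n".join([
    "Legal disclaimer text",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "Chapter 1",
    "It was a dark and stormy night.",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "More licence text",
])
cleaned = clean_book(sample)
print(cleaned)
```

A document without either marker is returned unchanged, because `start` and `end` keep their default values.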

From here, we can create a function that loads all of the books, performs the preprocessing, and returns them along with a class number for each author. The code is as follows:

import numpy as np

By default, our function takes the parent folder, whose subfolders contain the actual books. The default value, data_folder, is expected to be defined before this function is created. The code is as follows:

def load_books_data(folder=data_folder):
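The body of this function is not shown on this page. As a rough illustration of the behaviour described above (one subfolder per author, every book cleaned on load, one class number per author), a loader might look like the following sketch. It is not the author's implementation: the name `load_books_sketch`, the `clean` parameter, the sorted ordering, and the UTF-8 encoding are all assumptions made for this example.

```python
import os


def load_books_sketch(folder, clean=lambda text: text):
    """Hypothetical loader: one subfolder per author, one text file per book."""
    documents = []
    classes = []
    # Sort so that class numbers are assigned reproducibly across runs
    authors = sorted(
        name for name in os.listdir(folder)
        if os.path.isdir(os.path.join(folder, name))
    )
    for class_number, author in enumerate(authors):
        author_folder = os.path.join(folder, author)
        for filename in sorted(os.listdir(author_folder)):
            path = os.path.join(author_folder, filename)
            if not os.path.isfile(path):
                continue
            # The encoding is an assumption; errors="replace" tolerates odd bytes
            with open(path, encoding="utf-8", errors="replace") as f:
                documents.append(clean(f.read()))
            classes.append(class_number)
    return documents, classes
```

Passing `clean=clean_book` would apply the preprocessing at load time, so the files on disk stay untouched. The class list is a plain Python list here; it could be wrapped with `np.array(classes)` if an array is preferred.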

