10.11.2016 Views

Learning Data Mining with Python

Chapter 9

We create lists for storing the documents themselves and the author classes:

    documents = []
    authors = []

We then create a list of each of the subfolders in the parent directory, as the script creates a subfolder for each author. The code is as follows:

    subfolders = [subfolder for subfolder in os.listdir(folder)
                  if os.path.isdir(os.path.join(folder, subfolder))]
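As a minimal, self-contained sketch of this filter, the snippet below builds a throwaway directory (the folder and file names are made up for illustration) and shows that only subfolders survive the comprehension. The helper name `list_subfolders` is hypothetical, and the call to `sorted` is an addition that the book's code does not make: `os.listdir` returns entries in arbitrary order, so sorting keeps the result reproducible.

```python
import os
import tempfile


def list_subfolders(folder):
    """Return the names of the immediate subfolders of `folder`, sorted."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isdir(os.path.join(folder, name))
    )


# Hypothetical layout: two author folders and one stray file.
with tempfile.TemporaryDirectory() as demo_folder:
    os.mkdir(os.path.join(demo_folder, "author_a"))
    os.mkdir(os.path.join(demo_folder, "author_b"))
    with open(os.path.join(demo_folder, "notes.txt"), "w") as outf:
        outf.write("not an author folder")
    demo_subfolders = list_subfolders(demo_folder)

print(demo_subfolders)  # ['author_a', 'author_b']
```

The stray file is dropped because `os.path.isdir` is evaluated on the full path, not on the bare name returned by `os.listdir`.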

Next we iterate over these subfolders, assigning each subfolder a number using enumerate:

    for author_number, subfolder in enumerate(subfolders):

We then create the full subfolder path and look for all documents within that subfolder:

        full_subfolder_path = os.path.join(folder, subfolder)
        for document_name in os.listdir(full_subfolder_path):

For each of those files, we open it, read the contents, preprocess those contents, and append the result to our documents list. The code is as follows:

            with open(os.path.join(full_subfolder_path,
                                   document_name)) as inf:
                documents.append(clean_book(inf.read()))

We also append the number we assigned to this author to our authors list, which will form our classes:

                authors.append(author_number)

We then return the documents and classes (which we transform into a NumPy array for easier indexing later on):

    return documents, np.array(authors, dtype='int')
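To illustrate why the conversion helps, here is a small sketch with made-up labels (the five values below are not from the book's dataset). An array supports element-wise comparison and indexing by an array of positions, which is how train/test splits typically select samples; a plain Python list supports neither.

```python
import numpy as np

# Made-up author labels for five documents.
authors = [0, 0, 1, 2, 1]
classes = np.array(authors, dtype='int')

# Element-wise comparison yields a boolean mask.
is_author_1 = classes == 1
positions = np.flatnonzero(is_author_1)

# Indexing with an array of positions, as a train/test split would do.
subset = classes[np.array([0, 3])]

print(positions.tolist())  # [2, 4]
print(subset.tolist())     # [0, 2]
```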

We can now get our documents and classes using the following function call:

documents, classes = load_books_data(data_folder)
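Putting the fragments together, the function might look like the sketch below. Several parts are assumptions rather than the book's code: the `def` line is not shown in this excerpt, so the signature is inferred from the call above; `clean_book` is replaced by a placeholder that only strips whitespace, since its real definition is not part of this excerpt; `sorted` is added so author numbers are reproducible across runs; and the sample folders and files are invented so the example runs on its own.

```python
import os
import tempfile

import numpy as np


def clean_book(text):
    # Placeholder only: stands in for the chapter's real clean_book.
    return text.strip()


def load_books_data(folder):
    documents = []
    authors = []
    # sorted() is an addition here; os.listdir order is arbitrary.
    subfolders = sorted(
        subfolder for subfolder in os.listdir(folder)
        if os.path.isdir(os.path.join(folder, subfolder))
    )
    for author_number, subfolder in enumerate(subfolders):
        full_subfolder_path = os.path.join(folder, subfolder)
        for document_name in sorted(os.listdir(full_subfolder_path)):
            with open(os.path.join(full_subfolder_path,
                                   document_name)) as inf:
                documents.append(clean_book(inf.read()))
            # One class label per document, aligned with `documents`.
            authors.append(author_number)
    return documents, np.array(authors, dtype='int')


# Invented sample data: two authors, three short "books".
sample = {"austen": ["a.txt", "b.txt"], "dickens": ["c.txt"]}
with tempfile.TemporaryDirectory() as data_folder:
    for author, books in sample.items():
        os.mkdir(os.path.join(data_folder, author))
        for book in books:
            with open(os.path.join(data_folder, author, book), "w") as outf:
                outf.write(f"  text of {book}\n")
    documents, classes = load_books_data(data_folder)

print(documents)         # ['text of a.txt', 'text of b.txt', 'text of c.txt']
print(classes.tolist())  # [0, 0, 1]
```

Because a label is appended once per document, `documents[i]` and `classes[i]` always refer to the same book.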
