10.11.2016 Views

Learning Data Mining with Python


Chapter 12

We set a test filename so we can see this in action:

import os
filename = os.path.join(os.path.expanduser("~"), "Data", "blogs",
                        "1005545.male.25.Engineering.Sagittarius.xml")

First, we create a list that will let us store each of the posts:

all_posts = []

Then, we open the file to read:

with open(filename) as inf:

We then set a flag indicating whether we are currently in a post. We will set this to True when we find a tag indicating the start of a post and set it to False when we find the closing tag:

    post_start = False

We then create a list that stores the current post's lines:

    post = []

We then iterate over each line of the file and strip the surrounding whitespace:

    for line in inf:
        line = line.strip()

As stated before, if we find the opening tag, we indicate that we are in a new post. Likewise with the closing tag:

        if line == "<post>":
            post_start = True
        elif line == "</post>":
            post_start = False

When we find the closing tag, we also record the full post that we have found so far and start a new "current" post. This code is on the same indentation level as the previous line:

            all_posts.append("\n".join(post))
            post = []

Finally, when the line isn't a start or end tag, but we are in a post, we add the text of the current line to our current post:

        elif post_start:
            post.append(line)
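
The fragments above can be assembled into a single function, which makes the indentation of each step easier to check. This is a minimal sketch, not the book's own listing: the function name extract_posts and the sample data are hypothetical, and it assumes the start and end markers are lines containing only <post> and </post>.

```python
import os
import tempfile


def extract_posts(filename):
    """Return the text of every post found between <post> and </post> lines."""
    all_posts = []
    with open(filename) as inf:
        post_start = False  # True while we are inside a post
        post = []           # lines of the post currently being read
        for line in inf:
            line = line.strip()
            if line == "<post>":  # assumed opening tag
                post_start = True
            elif line == "</post>":  # assumed closing tag
                post_start = False
                all_posts.append("\n".join(post))
                post = []
            elif post_start:
                post.append(line)
    return all_posts


# Hypothetical sample data, written to a temporary file for demonstration only.
sample = """<Blog>
<date>01,January,2004</date>
<post>
  First post, line one.
  First post, line two.
</post>
<date>02,January,2004</date>
<post>
  Second post.
</post>
</Blog>
"""
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

posts = extract_posts(sample_path)
os.remove(sample_path)
print(posts)
```

Lines outside a post, such as the date lines in the sample, are skipped because post_start is False when they are read.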
