
Learning Data Mining with Python


Chapter 4

When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None), and set the column names. Let's look at the following code:

import pandas as pd

# ratings_filename is the path to the tab-separated ratings file defined earlier
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])

While we won't use it in this chapter, you can parse the timestamp column, which is stored as seconds since the Unix epoch, using the following line:

all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit="s")

You can view the first few records by running the following in a new cell:

all_ratings[:5]

The result will look something like this:

   UserID  MovieID  Rating            Datetime
0     196      242       3 1997-12-04 15:55:49
1     186      302       3 1998-04-04 19:22:22
2      22      377       1 1997-11-07 07:18:36
3     244       51       2 1997-11-27 05:02:03
4     166      346       1 1998-02-02 05:33:16
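To try the loading and timestamp-parsing steps without the ratings file at hand, the following minimal sketch rebuilds the five records shown above as an in-memory tab-separated string. The names raw and sample_ratings are hypothetical, and the integer timestamps are simply the displayed dates converted back to seconds since the Unix epoch.

```python
import io

import pandas as pd

# Five tab-separated records with no header row; the last field is the
# rating time in seconds since the Unix epoch (derived from the dates above).
raw = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
)

# Same options as in the text: tab delimiter, no header row, explicit names.
sample_ratings = pd.read_csv(
    io.StringIO(raw),
    delimiter="\t",
    header=None,
    names=["UserID", "MovieID", "Rating", "Datetime"],
)

# Convert the integer timestamps into proper datetime values.
sample_ratings["Datetime"] = pd.to_datetime(sample_ratings["Datetime"], unit="s")

print(sample_ratings[:5])
```

Reading from io.StringIO behaves like reading from a file, so the same call should work unchanged once the string is swapped for a real filename.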

Sparse data formats

This dataset is in a sparse format. Each row can be thought of as a cell in a large feature matrix of the type used in previous chapters, where rows are users and columns are individual movies. The first column would hold each user's review of the first movie, the second column would hold each user's review of the second movie, and so on.
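The correspondence between rows of the sparse format and cells of the feature matrix can be sketched by pivoting a few ratings into a user-by-movie table. This is only an illustration on the five sample records shown earlier, not something the chapter does with the full dataset; the names sample and dense are hypothetical.

```python
import pandas as pd

# The five sample records (UserID, MovieID, Rating) from the output above.
sample = pd.DataFrame({
    "UserID": [196, 186, 22, 244, 166],
    "MovieID": [242, 302, 377, 51, 346],
    "Rating": [3, 3, 1, 2, 1],
})

# Each (UserID, MovieID, Rating) row becomes one cell of a matrix whose
# rows are users and whose columns are movies. Cells with no rating are NaN.
dense = sample.pivot(index="UserID", columns="MovieID", values="Rating")

print(dense)
print("Filled cells:", int(dense.notna().sum().sum()))
print("Empty cells:", int(dense.isna().sum().sum()))
```

Even for five ratings, most of the resulting table is empty, which is the property the sparse format takes advantage of.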

There are roughly 1,000 users and 1,700 movies in this dataset, which means that the full matrix would be quite large, with about 1.7 million cells. Storing the whole matrix in memory could become a problem, and computing on it would be cumbersome. However, this matrix has the property that most cells are empty; that is, there is no review for most movies by most users. For example, there is no review of movie #675 by user #213, and the same is true of most other combinations of user and movie.
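As a rough illustration of why storing only the existing ratings helps, the sketch below keeps the five sample records in a dictionary keyed by (user, movie) pairs and compares the number of stored entries with the number of cells a full matrix would need. The dictionary layout and the names sample, sparse_ratings, and full_cells are assumptions made for this example, not the representation used later in the chapter.

```python
import pandas as pd

# The five sample records (UserID, MovieID, Rating) shown earlier.
sample = pd.DataFrame({
    "UserID": [196, 186, 22, 244, 166],
    "MovieID": [242, 302, 377, 51, 346],
    "Rating": [3, 3, 1, 2, 1],
})

# Sparse storage: keep an entry only for (user, movie) pairs that have a rating.
sparse_ratings = {
    (row.UserID, row.MovieID): row.Rating
    for row in sample.itertuples(index=False)
}

# A dense matrix needs one cell for every user/movie combination.
n_users = sample["UserID"].nunique()
n_movies = sample["MovieID"].nunique()
dense_cells = n_users * n_movies
density = len(sparse_ratings) / dense_cells

# Approximate size of the full matrix, using the rounded figures from the text.
full_cells = 1_000 * 1_700

print("Stored entries:", len(sparse_ratings))
print("Dense cells for the sample:", dense_cells)
print("Fraction of cells filled:", density)
print("Approximate cells in the full matrix:", full_cells)

# A missing combination is simply absent from the dictionary.
print((213, 675) in sparse_ratings)
```

Looking up a pair that was never rated costs nothing in storage, whereas the dense matrix reserves a cell for it regardless.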
