08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Getting Started with Python Machine Learning

Using SciPy's genfromtxt(), we can easily read in the data.

import scipy as sp
data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")

We have to specify tab as the delimiter so that the columns are correctly determined.
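The effect of the tab delimiter can be tried without the real file. The following is a minimal sketch, not the book's code: the three rows in sample_tsv are made up to imitate the layout of web_traffic.tsv, and the file is replaced by an in-memory buffer. Because recent SciPy releases have deprecated the NumPy functions re-exported under the scipy namespace, the sketch calls NumPy's genfromtxt() directly.

```python
import io

import numpy as np

# Hypothetical stand-in for web_traffic.tsv: hour <TAB> hits, one row per hour.
# The second row carries an invalid hit count, written as the text "nan".
sample_tsv = "1\t2272\n2\tnan\n3\t1386\n"

# genfromtxt accepts a file-like object as well as a filename.
# delimiter="\t" splits each line into columns at the tab character.
data = np.genfromtxt(io.StringIO(sample_tsv), delimiter="\t")

print(data.shape)  # (number of rows, number of columns)
print(data)
```

Each line is split into two fields, so the result is a two-column floating-point array in which the invalid entry appears as nan.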

A quick check shows that we have correctly read in the data.

>>> print(data[:10])
[[ 1.00000000e+00 2.27200000e+03]
 [ 2.00000000e+00 nan]
 [ 3.00000000e+00 1.38600000e+03]
 [ 4.00000000e+00 1.36500000e+03]
 [ 5.00000000e+00 1.48800000e+03]
 [ 6.00000000e+00 1.33700000e+03]
 [ 7.00000000e+00 1.88300000e+03]
 [ 8.00000000e+00 2.28300000e+03]
 [ 9.00000000e+00 1.33500000e+03]
 [ 1.00000000e+01 1.02500000e+03]]
>>> print(data.shape)
(743, 2)

We have 743 data points with two dimensions.

Preprocessing and cleaning the data

When working with SciPy, it is more convenient to separate the dimensions into two
vectors, each of size 743. The first vector, x, will contain the hours and the other, y,
will contain the web hits in that particular hour. This splitting is done using the
special index notation of SciPy, which lets us select the columns individually.

x = data[:,0]
y = data[:,1]

There is much more to the way data can be selected from a SciPy array.
Check out http://www.scipy.org/Tentative_NumPy_Tutorial for more details on
indexing, slicing, and iterating.
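To illustrate the index notation, here is a small sketch on a made-up three-row array laid out like data (hours in column 0, hits in column 1). The values and the names first_two_rows and last_hits are assumptions chosen for illustration only.

```python
import numpy as np

# Made-up array in the same layout as data: column 0 = hour, column 1 = hits.
data = np.array([[1.0, 2272.0],
                 [2.0, np.nan],
                 [3.0, 1386.0]])

x = data[:, 0]  # every row, first column (the hours)
y = data[:, 1]  # every row, second column (the hits)

first_two_rows = data[:2]  # rows 0 and 1, all columns
last_hits = data[-1, 1]    # single element: last row, second column

print(x)
print(y)
```

The colon selects everything along an axis, so data[:, 0] reads as "all rows, column 0" and yields a one-dimensional vector with one entry per row.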

One caveat is that y still contains some invalid values, nan. The question is, what
can we do with them? Let us check how many hours contain invalid data.

>>> sp.sum(sp.isnan(y))
8
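The count works because isnan() returns a Boolean array and summing it treats each True as 1. The sketch below shows this on a made-up vector, again using NumPy directly rather than the scipy aliases; the values in y and the names invalid, n_invalid, and invalid_positions are illustrative assumptions, not the book's data.

```python
import numpy as np

# Made-up hit counts with two invalid entries.
y = np.array([2272.0, np.nan, 1386.0, np.nan, 1488.0])

invalid = np.isnan(y)                        # Boolean mask, True where y is nan
n_invalid = int(np.sum(invalid))             # each True counts as 1
invalid_positions = np.flatnonzero(invalid)  # indices of the invalid entries

print(n_invalid)
print(invalid_positions)
```

Keeping the mask in its own variable also makes it easy to see which hours are affected, not only how many.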

