
Big(ger) Data

While computers keep getting faster and have more memory, the size of the data has grown as well. In fact, data has grown faster than computational speed, and this means that it has grown faster than our ability to process it.

It is not easy to say what is big data and what is not, so we will adopt an operational definition: when data is so large that it becomes too cumbersome to work with, we refer to it as big data. In some areas, this might mean petabytes of data or trillions of transactions: data that will not fit on a single hard drive. In other cases, it may be one hundred times smaller, but still difficult to work with.

We will first build upon some of the experience of the previous chapters and work with what we can call the medium data setting (not quite big data, but not small either). For this, we will use a package called jug, which allows us to do the following (a short sketch of its use appears after the list):

• Break up your pipeline into tasks
• Cache (memoize) intermediate results
• Make use of multiple cores, including multiple computers on a grid
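As a first taste of what this looks like, here is a minimal sketch of a jug script, assuming a file named jugfile.py and toy functions invented purely for illustration. Calling a function decorated with TaskGenerator does not run it immediately; it builds a task whose result jug caches on disk once it has been computed.

# jugfile.py -- a minimal sketch; the file name and functions are illustrative
from jug import TaskGenerator

@TaskGenerator
def double(x):
    # Stand-in for an expensive computation
    return 2 * x

@TaskGenerator
def add_all(values):
    # Combine the partial results into a final answer
    return sum(values)

# These calls build Task objects (the pipeline); nothing runs yet.
# Running `jug execute jugfile.py` performs the work and memoizes each result.
partials = [double(i) for i in range(10)]
final = add_all(partials)

Running jug execute jugfile.py once computes the tasks; starting several copies of the same command (on different cores, or on machines sharing the same working directory) lets them cooperate on whatever tasks remain, and jug status jugfile.py reports progress.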

The next step is to move to true "big data", and we will see how to use the cloud (in particular, the Amazon Web Services infrastructure). There, we will use another Python package, starcluster, to manage clusters.

Learning about big data

The expression "big data" does not mean a specific amount of data, neither in the<br />

number of examples nor in the number of gigabytes, terabytes, or petabytes taken<br />

up by the data. It means the following:<br />

• Data has been growing faster than processing power
• Some of the methods and techniques that worked well in the past now need to be redone, as they do not scale well
