08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Big(ger) Data

• Your algorithms cannot assume that the entire dataset fits in RAM
• Managing data becomes a major task in itself
• Using computer clusters or multicore machines becomes a necessity and not a luxury

This chapter will focus on this last piece of the puzzle: how to use multiple cores (either on the same machine or on separate machines) to speed up and organize your computations. This will also be useful in other medium-sized data tasks.

Using jug to break up your pipeline into tasks

Often, we have a simple pipeline: we preprocess the initial data, compute features, and then we need to call a machine learning algorithm with the resulting features.

Jug is a package developed by Luis Pedro Coelho, one of the authors of this book. It is open source (using the liberal MIT License) and can be useful in many areas but was designed specifically around data analysis problems. It simultaneously solves several problems, for example:

• It can memoize results to a disk (or a database), which means that if you ask it to compute something you have computed before, the result is instead read from the disk.
• It can use multiple cores or even multiple computers on a cluster. Jug was also designed to work very well in batch computing environments that use a queuing system such as Portable Batch System (PBS), the Load Sharing Facility (LSF), or the Oracle Grid Engine (OGE, earlier known as Sun Grid Engine). This will be used in the second half of the chapter as we build online clusters and dispatch jobs to them.
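To make the first point concrete, the idea of memoizing results to disk can be sketched with the standard library alone. This is only an illustration of the general technique, not how jug itself is implemented: the names `memoize_to_disk`, `slow_square`, and `calls` are hypothetical, and the cache is assumed to be a temporary directory of pickle files.

```python
import hashlib
import os
import pickle
import tempfile


def memoize_to_disk(cache_dir):
    """Return a decorator that caches results as pickle files in cache_dir."""
    def decorator(func):
        def wrapper(*args):
            # Hash the function name and arguments to get a stable file name.
            key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                # Computed before: read the stored result from disk instead.
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


calls = []  # records each real computation (for illustration only)
cache_dir = tempfile.mkdtemp()  # assumed cache location


@memoize_to_disk(cache_dir)
def slow_square(x):
    calls.append(x)
    return x * x


first = slow_square(12)   # computed and written to disk
second = slow_square(12)  # read back from disk; the body does not run again
```

Because the cache lives on disk rather than in memory, a second process (or a later run of the same script) pointed at the same directory would also find the stored result, which is what makes this approach a natural fit for splitting work across cores or machines.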

About tasks<br />

Tasks are the basic building block of jug. A task is just a function and values for its arguments, for example:

def double(x):
    return 2*x
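One way to picture "a function and values for its arguments" is as a small object that stores both and only calls the function when asked. The sketch below is a hypothetical stand-in written for illustration (the class name `SimpleTask` and its `run` method are assumptions, not jug's own API):

```python
class SimpleTask:
    """Hypothetical stand-in for a task: a function plus values for its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self._done = False
        self._result = None

    def run(self):
        # Execute the function only once and remember the result.
        if not self._done:
            self._result = self.func(*self.args)
            self._done = True
        return self._result


def double(x):
    return 2 * x


task = SimpleTask(double, 4)  # nothing is computed yet
value = task.run()            # now double(4) is evaluated
```

Separating the description of the work (function and arguments) from its execution is what allows a framework to decide later where and when each task runs, or to skip it when a stored result already exists.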

