10.11.2016 Views

Learning Data Mining with Python

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chapter 12<br />

MapReduce originates from Google, where it was developed <strong>with</strong> distributed<br />

computing in mind. It also introduces fault tolerance and scalability improvements.<br />

The "original" research for MapReduce was published in 2004, and since then there<br />

have been thousands of projects, implementations, and applications using it.<br />

While the concept is similar to many previous concepts, MapReduce has become a<br />

staple in big data analytics.<br />

Intuition<br />

MapReduce has two main steps: the Map step and the Reduce step. These are built<br />

on the functional programming concepts of mapping a function to a list and reducing<br />

the result. To explain the concept, we will develop code that will iterate over a list of<br />

lists and produce the sum of all numbers in those lists.<br />

There are also shuffle and combine steps in the MapReduce paradigm, which we<br />

will see later.<br />

To start <strong>with</strong>, the Map step takes a function and applies it to each element in a<br />

list. The returned result is a list of the same size, <strong>with</strong> the results of the function<br />

applied to each element.<br />

To open a new I<strong>Python</strong> Notebook, start by creating a list of lists <strong>with</strong> numbers in<br />

each sublist:<br />

a = [[1,2,1], [3,2], [4,9,1,0,2]]<br />

Next, we can perform a map, using the sum function. This step will apply the sum<br />

function to each element of a:<br />

sums = map(sum, a)<br />

While sums is a generator (the actual value isn't computed until we ask for it),<br />

the above step is approximately equal to the following code:<br />

sums = []<br />

for sublist in a:<br />

results = sum(sublist)<br />

sums.append(results)<br />

The reduce step is a little more complicated. It involves applying a function to each<br />

element of the returned result, to some starting value. We start <strong>with</strong> an initial value,<br />

and then apply a given function to that initial value and the first value. We then<br />

apply the given function to the result and the next value, and so on.<br />

[ 275 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!