09.10.2023 Views

Advanced Data Analytics Using Python_ With Machine Learning, Deep Learning and NLP Examples ( 2023)

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Chapter 7

Analytics at Scale

A MapReduce program has two major components: the mapper and

the reducer. In the mapper, the input is split into small units. Generally,

each line of input file becomes an input for each map job. The mapper

processes the input and emits a key-value pair to the reducer. The reducer

receives all the values for a particular key as input and processes the data

for final output.

The following pseudocode is an example of counting the frequency of

words in a document:

map(String key, String value):

// key: document name

// value: document contents

for each word w in value:

EmitIntermediate(w, "1");

reduce(String key, Iterator values):

// key: a word

// values: a list of counts

int result = 0;

for each v in values:

result += ParseInt(v);

Emit(AsString(result));

Partitioning Function

Sometimes it is required to send a particular data set to a particular reduce

job. The partitioning function solves this purpose. For example, in the

previous MapReduce example, say the user wants the output to be stored

in sorted order. Then he mentions the number of the reduce job 32 for 32

146

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!