Introduction and MapReduce - SNAP - Stanford University
CS246: Mining Massive Datasets<br />
Jure Leskovec, Stanford University<br />
http://cs246.stanford.edu
TAs:<br />
Bahman Bahmani<br />
Juthika Dabholkar<br />
Pierre Kreitmann<br />
Lu Li<br />
Aditya Ramesh<br />
Office hours:<br />
Jure: Tuesdays 9-10am, Gates 418<br />
See course website for TA office hours<br />
1/8/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
Course website:<br />
http://cs246.stanford.edu<br />
Lecture slides (at least 6h before the lecture)<br />
Announcements, homeworks, solutions<br />
Readings!<br />
Readings: Book Mining of Massive Datasets<br />
by Anand Rajaraman and Jeffrey D. Ullman<br />
Free online:<br />
http://i.stanford.edu/~ullman/mmds.html<br />
4 longer homeworks: 40%<br />
Theoretical and programming questions<br />
All homeworks (even if empty) must be handed in<br />
Assignments take time. Start early!<br />
How to submit?<br />
Paper: Box outside the class and in the Gates east wing<br />
We will grade on paper!<br />
You should also submit electronic copy:<br />
1 PDF/ZIP file (writeups, experimental results, code)<br />
Submission website: http://cs246.stanford.edu/submit/<br />
SCPD: Only submit electronic copy & send us email<br />
7 late days for the quarter:<br />
Max 5 late days per assignment<br />
Short weekly quizzes: 20%<br />
Short e-quizzes on Gradiance (see course website!)<br />
First quiz is already online<br />
You have 7 days to complete it. No late days!<br />
Final exam: 40%<br />
March 19 at 8:30am<br />
It’s going to be fun and hard work<br />
Homework schedule:<br />
Date Out In<br />
1/11 HW1 -<br />
1/25 HW2 HW1<br />
2/8 HW3 HW2<br />
2/22 HW4 HW3<br />
3/7 - HW4<br />
No class: 1/16: Martin Luther King Jr.<br />
2/20: Presidents’ Day<br />
Recitation sessions:<br />
Review of probability and statistics<br />
Installing and working with Hadoop<br />
We prepared a virtual machine with Hadoop preinstalled<br />
HW0 helps you write your first Hadoop program<br />
See course website!<br />
We will announce the dates later<br />
Sessions will be recorded<br />
Algorithms (CS161)<br />
Dynamic programming, basic data structures<br />
Basic probability (CS109 or Stat116)<br />
Moments, typical distributions, MLE, …<br />
Programming (CS107 or CS145)<br />
Your choice, but C++/Java will be very useful<br />
We provide some background, but<br />
the class will be fast-paced<br />
CS345a (Data Mining) was split into two courses:<br />
CS246: Mining massive datasets:<br />
Methods/algorithms oriented course<br />
Homeworks (theory & programming)<br />
No class project<br />
CS341: Project in mining massive datasets:<br />
Project oriented class<br />
Lectures/readings related to the project<br />
Unlimited access to Amazon EC2 cluster<br />
We intend to keep the class small<br />
Taking CS246 is basically a prerequisite<br />
For questions/clarifications use Piazza!<br />
If you don’t have an @stanford.edu email address,<br />
email us and we will register you<br />
To communicate with the course staff use<br />
cs246-win1112-staff@lists.stanford.edu<br />
We will post announcements to<br />
cs246-win1112-all@lists.stanford.edu<br />
If you are not registered or auditing, send us email<br />
and we will subscribe you!<br />
You are welcome to sit in & audit the class<br />
Send us email saying that you will be auditing<br />
Much of the course will be devoted to<br />
ways to do data mining on the Web:<br />
Mining to discover things about the Web<br />
E.g., PageRank, finding spam sites<br />
Mining data from the Web itself<br />
E.g., analysis of click streams, similar products at<br />
Amazon, making recommendations<br />
Much of the course will be devoted to<br />
large-scale computing for data mining<br />
Challenges:<br />
How to distribute computation?<br />
Distributed/parallel programming is hard<br />
Map-reduce addresses all of the above<br />
Google’s computational/data manipulation model<br />
Elegant way to work with big data<br />
High-dimensional data:<br />
Locality Sensitive Hashing<br />
Dimensionality reduction<br />
Clustering<br />
The data is a graph:<br />
Link Analysis: PageRank, Hubs & Authorities<br />
Machine Learning:<br />
k-NN, Perceptron, SVM, Decision Trees<br />
Data is infinite:<br />
Mining data streams<br />
Applications:<br />
Association Rules<br />
Recommender systems<br />
Advertising on the Web<br />
Web spam detection<br />
Discovery of patterns and models that are:<br />
Valid: hold on new data with some certainty<br />
Useful: should be possible to act on the item<br />
Unexpected: non-obvious to the system<br />
Understandable: humans should be able to<br />
interpret the pattern<br />
Subsidiary issues:<br />
Data cleansing: detection of bogus data<br />
Visualization: something better than MBs of output<br />
Warehousing of data (for retrieval)<br />
Predictive Methods<br />
Use some variables to predict unknown<br />
or future values of other variables<br />
Descriptive Methods<br />
Find human-interpretable patterns that<br />
describe the data<br />
Scalability<br />
Dimensionality<br />
Complex and Heterogeneous Data<br />
Data Quality<br />
Data Ownership and Distribution<br />
Privacy Preservation<br />
Streaming Data<br />
Overlaps with:<br />
Databases: Large-scale (non-main-memory) data<br />
Machine learning: Complex methods, small data<br />
Statistics: Models<br />
Different cultures:<br />
To a DB person, data mining<br />
is an extreme form of<br />
analytic processing –<br />
queries that examine large<br />
amounts of data<br />
Result is the query answer<br />
To a statistician, data mining is<br />
the inference of models<br />
Result is the parameters of the model<br />
[Venn diagram: Data Mining at the intersection of Statistics/AI, Database systems, and Machine Learning/Pattern Recognition]<br />
A big data-mining risk is that you will<br />
“discover” patterns that are meaningless.<br />
Bonferroni’s principle: (roughly) if you look in<br />
more places for interesting patterns than your<br />
amount of data will support, you are bound to<br />
find crap<br />
Joseph Rhine was a parapsychologist in the<br />
1950s who hypothesized that some people<br />
had Extra-Sensory Perception<br />
He devised an experiment where subjects<br />
were asked to guess 10 hidden cards – red or<br />
blue<br />
He discovered that almost 1 in 1000 had ESP –<br />
they were able to get all 10 right!<br />
He told these people they had ESP and called<br />
them in for another test of the same type<br />
Alas, he discovered that almost all of them<br />
had lost their ESP<br />
What did he conclude?<br />
He concluded that you shouldn’t tell people<br />
they have ESP; it causes them to lose it <br />
[Diagram: a single machine with CPU, memory, and disk: the traditional setting for Machine Learning, Statistics, and “Classical” Data Mining]<br />
20+ billion web pages x 20KB = 400+ TB<br />
1 computer reads 30-35 MB/sec from disk<br />
~4 months to read the web<br />
~1,000 hard drives to store the web<br />
Takes even more to do something useful<br />
with the data!<br />
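As a back-of-the-envelope check of the figures above (a sketch; the page count, page size, and disk bandwidth are the rough numbers from the slide):

```python
# Rough figures from the slide: 20+ billion pages at ~20 KB each,
# read sequentially at ~35 MB/sec by a single machine.
pages = 20e9
page_bytes = 20e3                    # ~20 KB per page
total_bytes = pages * page_bytes     # 4e14 bytes = 400 TB
read_rate = 35e6                     # bytes per second
months = total_bytes / read_rate / (30 * 24 * 3600)  # roughly 4 months
```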
Standard architecture is emerging:<br />
Cluster of commodity Linux nodes<br />
Gigabit ethernet interconnect<br />
[Diagram: racks of commodity nodes (CPU, memory, disk) connected by switches]<br />
1 Gbps between any pair of nodes in a rack<br />
Each rack contains 16-64 nodes<br />
2-10 Gbps backbone between racks<br />
In Aug 2006 Google had ~450,000 machines<br />
Large-scale computing for data mining<br />
problems on commodity hardware<br />
Challenges:<br />
How do you distribute computation?<br />
How can we make it easy to write distributed<br />
programs?<br />
Machines fail:<br />
One server may stay up 3 years (1,000 days)<br />
If you have 1,000 servers, expect to lose 1/day<br />
In Aug 2006 Google had ~450,000 machines<br />
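The failure rate follows from simple arithmetic (a sketch assuming the slide's figure of ~1,000 days of uptime per server):

```python
mtbf_days = 1000                       # one server stays up ~1,000 days
small_cluster = 1000 / mtbf_days       # ~1 failure per day with 1,000 servers
google_2006 = 450000 / mtbf_days       # ~450 failures per day at Google's 2006 scale
```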
Idea:<br />
Bring computation close to the data<br />
Store files multiple times for reliability<br />
Map-reduce addresses these problems<br />
Google’s computational/data manipulation model<br />
Elegant way to work with big data<br />
Storage Infrastructure – File system<br />
Google: GFS<br />
Hadoop: HDFS<br />
Programming model<br />
Map-Reduce<br />
Problem<br />
If nodes fail, how to store data persistently?<br />
Answer<br />
Distributed File System:<br />
Provides global file namespace<br />
Google GFS; Hadoop HDFS;<br />
Typical usage pattern<br />
Huge files (100s of GB to TB)<br />
Data is rarely updated in place<br />
Reads and appends are common<br />
Chunk Servers<br />
File is split into contiguous chunks<br />
Typically each chunk is 16-64MB<br />
Each chunk replicated (usually 2x or 3x)<br />
Try to keep replicas in different racks<br />
Master node<br />
a.k.a. Name Node in Hadoop’s HDFS<br />
Stores metadata<br />
Might be replicated<br />
Client library for file access<br />
Talks to master to find chunk servers<br />
Connects directly to chunk servers to access data<br />
Reliable distributed file system<br />
Data kept in “chunks” spread across machines<br />
Each chunk replicated on different machines<br />
Seamless recovery from disk or machine failure<br />
[Diagram: chunks C0, C1, C2, C3, C5 and data D0, D1 replicated across chunk servers 1 through N]<br />
Bring computation directly to the data!<br />
Warm-up task:<br />
We have a huge text document<br />
Count the number of times each<br />
distinct word appears in the file<br />
Sample application:<br />
Analyze web server logs to find popular URLs<br />
Case 1:<br />
File too large for memory, but all ⟨word, count⟩<br />
pairs fit in memory<br />
Case 2:<br />
Count occurrences of words:<br />
words(doc.txt) | sort | uniq -c<br />
where words takes a file and outputs the words in it,<br />
one per line<br />
Captures the essence of MapReduce<br />
Great thing is it is naturally parallelizable<br />
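The Unix pipeline in Case 2 can be mimicked in a few lines (a sketch; `words` is the hypothetical helper from the slide, implemented here as whitespace splitting):

```python
import itertools

def words(text):
    # stand-in for the hypothetical `words` command: one word per "line"
    return text.split()

def sort_uniq_c(ws):
    # mimic `sort | uniq -c`: sort, then count runs of equal adjacent words
    return [(len(list(group)), w) for w, group in itertools.groupby(sorted(ws))]

counts = sort_uniq_c(words("the crew of the space shuttle"))
```

Each stage is independent of the others, which is exactly why the real pipeline parallelizes so naturally.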
Sequentially read a lot of data<br />
Map:<br />
Extract something you care about<br />
Group by key: Sort and Shuffle<br />
Reduce:<br />
Aggregate, summarize, filter or transform<br />
Write the result<br />
Outline stays the same; map and reduce<br />
change to fit the problem<br />
[Diagram: input key-value pairs (k, v) are fed to parallel map calls, producing intermediate key-value pairs]<br />
[Diagram: intermediate key-value pairs are grouped by key into (k, list-of-values) groups; parallel reduce calls then produce the output key-value pairs]<br />
Input: a set of key/value pairs<br />
Programmer specifies two methods:<br />
Map(k, v) → ⟨k’, v’⟩*<br />
Takes a key-value pair and outputs a set of key-value pairs<br />
E.g., key is the filename, value is a single line in the file<br />
There is one Map call for every (k,v) pair<br />
Reduce(k’, ⟨v’⟩*) → ⟨k’, v’’⟩*<br />
All values v’ with same key k’ are reduced<br />
together and processed in v’ order<br />
There is one Reduce function call per unique key k’<br />
The crew of the space shuttle<br />
Endeavor recently returned to<br />
Earth as ambassadors,<br />
harbingers of a new era of<br />
space exploration. Scientists<br />
at NASA are saying that the<br />
recent assembly of the Dextre<br />
bot is the first step in a long-term<br />
space-based<br />
man/machine partnership.<br />
"The work we’re doing now --<br />
the robotics we're doing -- is<br />
what we're going to need to<br />
do to build any work station<br />
or habitat structure on the<br />
moon or Mars," said Allard<br />
Beutel.<br />
Big document<br />
Provided by the<br />
programmer<br />
MAP:<br />
reads input and<br />
produces a set of<br />
key value pairs<br />
(the, 1)<br />
(crew, 1)<br />
(of, 1)<br />
(the, 1)<br />
(space, 1)<br />
(shuttle, 1)<br />
(Endeavor, 1)<br />
(recently, 1)<br />
….<br />
(key, value)<br />
Group by key:<br />
Collect all pairs<br />
with same key<br />
(crew, 1)<br />
(crew, 1)<br />
(space, 1)<br />
(the, 1)<br />
(the, 1)<br />
(the, 1)<br />
(shuttle, 1)<br />
(recently, 1)<br />
…<br />
(key, value)<br />
Provided by the<br />
programmer<br />
Reduce:<br />
Collect all values<br />
belonging to the<br />
key and output<br />
(crew, 2)<br />
(space, 1)<br />
(the, 3)<br />
(shuttle, 1)<br />
(recently, 1)<br />
…<br />
(key, value)<br />
Sequentially read the data<br />
Only sequential reads
map(key, value):<br />
// key: document name; value: text of the document<br />
for each word w in value:<br />
emit(w, 1)<br />
reduce(key, values):<br />
// key: a word; value: an iterator over counts<br />
result = 0<br />
for each count v in values:<br />
result += v<br />
emit(key, result)<br />
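The pseudocode above can be turned into a runnable single-machine simulation (a sketch only; in a real Hadoop job the map and reduce calls run on different machines and the shuffle moves data over the network):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of the document
    for w in value.split():
        yield (w, 1)

def reduce_fn(key, values):
    # key: a word; values: list of counts
    yield (key, sum(values))

def map_reduce(inputs):
    groups = defaultdict(list)      # shuffle: group intermediate pairs by key
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    out = []
    for k in sorted(groups):        # one reduce call per unique key
        out.extend(reduce_fn(k, groups[k]))
    return out

result = map_reduce([("doc.txt", "the crew of the space shuttle")])
```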
Map-Reduce environment takes care of:<br />
Partitioning the input data<br />
Scheduling the program’s execution across a<br />
set of machines<br />
H<strong>and</strong>ling machine failures<br />
Managing required inter-machine<br />
communication<br />
MAP:<br />
reads input and<br />
produces a set of<br />
key value pairs<br />
Group by key:<br />
Collect all pairs<br />
with same key<br />
Reduce:<br />
Collect all values<br />
belonging to the<br />
key and output<br />
Big document<br />
Input and final output are stored on a<br />
distributed file system:<br />
Scheduler tries to schedule map tasks “close” to<br />
physical storage location of input data<br />
Intermediate results are stored on local FS<br />
of map and reduce workers<br />
Output is often input to another Map-Reduce task<br />
Master data structures:<br />
Task status: (idle, in-progress, completed)<br />
Idle tasks get scheduled as workers become<br />
available<br />
When a map task completes, it sends the master<br />
the location and sizes of its R intermediate files,<br />
one for each reducer<br />
Master pushes this info to reducers<br />
Master pings workers periodically<br />
to detect failures<br />
Map worker failure<br />
Map tasks completed or in-progress at worker are<br />
reset to idle<br />
Reduce workers are notified when task is<br />
rescheduled on another worker<br />
Reduce worker failure<br />
Only in-progress tasks are reset to idle<br />
Master failure<br />
MapReduce task is aborted and client is notified<br />
M map tasks, R reduce tasks<br />
Rule of thumb:<br />
Make M and R much larger than the number of<br />
nodes in cluster<br />
One DFS chunk per map is common<br />
Improves dynamic load balancing and speeds<br />
recovery from worker failure<br />
Usually R is smaller than M<br />
because output is spread across R files<br />
Fine granularity tasks: map tasks >> machines<br />
Minimizes time for fault recovery<br />
Can pipeline shuffling with map execution<br />
Better dynamic load balancing<br />
Problem<br />
Slow workers significantly lengthen the job<br />
completion time:<br />
Other jobs on the machine<br />
Bad disks<br />
Weird things<br />
Solution<br />
Near end of phase, spawn backup copies of tasks<br />
Whichever one finishes first “wins”<br />
Effect<br />
Dramatically shortens job completion time<br />
Often a map task will produce many pairs of<br />
the form (k,v1), (k,v2), … for the same key k<br />
E.g., popular words in the Word Count example<br />
Can save network time by<br />
pre-aggregating values at<br />
the mapper:<br />
combine(k, list(v1)) → v2<br />
Combiner is usually same<br />
as the reduce function<br />
Works only if reduce<br />
function is commutative and associative<br />
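A minimal sketch of the effect, simulating one mapper locally (the function names are illustrative, not Hadoop's API):

```python
from collections import Counter

def map_fn(doc):
    for w in doc.split():
        yield (w, 1)

def combine(pairs):
    # local pre-aggregation on the mapper; valid here because the reduce
    # function (addition) is commutative and associative
    local = Counter()
    for k, v in pairs:
        local[k] += v
    return list(local.items())

doc = "the the the crew of the shuttle"
raw = list(map_fn(doc))        # 7 pairs would cross the network
combined = combine(raw)        # only 4 pairs, one per distinct word
```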
Inputs to map tasks are created by contiguous<br />
splits of input file<br />
Reduce needs to ensure that records with the<br />
same intermediate key end up at the same<br />
worker<br />
System uses a default partition function:<br />
hash(key) mod R<br />
Sometimes useful to override:<br />
E.g., hash(hostname(URL)) mod R ensures URLs<br />
from a host end up in the same output file<br />
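Both partition functions can be sketched as follows (a sketch; `zlib.crc32` stands in for the system's hash function, since Python's built-in string hash is randomized per process):

```python
import zlib
from urllib.parse import urlparse

R = 4  # number of reduce tasks

def default_partition(key):
    # default partitioner: hash(key) mod R
    return zlib.crc32(key.encode()) % R

def host_partition(url):
    # override: hash(hostname(URL)) mod R, so every URL from the same
    # host lands in the same reduce task, and hence the same output file
    return zlib.crc32(urlparse(url).hostname.encode()) % R
```

`host_partition` always agrees for two URLs on the same host, while the default partitioner may send them to different reducers.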
Suppose we have a large web corpus<br />
Look at the metadata file<br />
Lines of the form (URL, size, date, …)<br />
For each host, find the total number of bytes<br />
i.e., the sum of the page sizes for all URLs from<br />
that host<br />
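In map-reduce terms the task is one line per phase (a sketch with made-up metadata records; the hostname extraction is deliberately crude):

```python
from collections import defaultdict

# hypothetical metadata lines: (URL, size in bytes, date)
records = [
    ("http://a.com/x", 1000, "2012-01-08"),
    ("http://a.com/y", 500, "2012-01-08"),
    ("http://b.com/z", 200, "2012-01-08"),
]

def map_fn(record):
    url, size, _date = record
    yield (url.split("/")[2], size)   # key by hostname, value is page size

def reduce_fn(host, sizes):
    yield (host, sum(sizes))          # total bytes served by this host

groups = defaultdict(list)            # shuffle: group sizes by hostname
for rec in records:
    for k, v in map_fn(rec):
        groups[k].append(v)
totals = dict(p for k, vs in groups.items() for p in reduce_fn(k, vs))
```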
Other examples:<br />
Link analysis and graph processing<br />
Machine Learning algorithms<br />
Statistical machine translation:<br />
Need to count number of times every 5-word<br />
sequence occurs in a large corpus of documents<br />
Very easy with <strong>MapReduce</strong>:<br />
Map:<br />
Extract (5-word sequence, count) from document<br />
Reduce:<br />
Combine counts<br />
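The Map step slides a 5-word window over each document (a sketch; the function name is illustrative):

```python
def map_fn(doc_text, n=5):
    # emit (5-word sequence, 1) for every window position in the document
    tokens = doc_text.split()
    for i in range(len(tokens) - n + 1):
        yield (" ".join(tokens[i:i + n]), 1)

pairs = list(map_fn("a b c d e f"))
```

The Reduce step is then identical to word count: sum the 1s per sequence.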
Compute the natural join R(A,B) ⋈ S(B,C)<br />
R and S are each stored in files<br />
Tuples are pairs (a,b) or (b,c)<br />
A B<br />
a1 b1<br />
a2 b1<br />
a3 b2<br />
a4 b3<br />
⋈<br />
B C<br />
b2 c1<br />
b2 c2<br />
b3 c3<br />
=<br />
A C<br />
a3 c1<br />
a3 c2<br />
a4 c3<br />
Use a hash function h from B-values to 1...k<br />
A Map process turns:<br />
Each input tuple R(a,b) into key-value pair (b,(a,R))<br />
Each input tuple S(b,c) into (b,(c,S))<br />
Map processes send each key-value pair with<br />
key b to Reduce process h(b).<br />
Hadoop does this automatically; just tell it what k is.<br />
Each Reduce process matches all the pairs<br />
(b,(a,R)) with all (b,(c,S)) and outputs (a,b,c).<br />
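The three steps can be simulated on one machine (a sketch; in Hadoop the grouping by h(b) happens in the shuffle rather than in user code):

```python
from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

# Map: key every tuple by its B-value and tag it with its relation
tagged = [(b, ("R", a)) for a, b in R] + [(b, ("S", c)) for b, c in S]

# Shuffle: group by B-value (hash-partitioned by h(b) on a real cluster)
groups = defaultdict(list)
for b, tagged_value in tagged:
    groups[b].append(tagged_value)

# Reduce: match every R-tuple with every S-tuple sharing the same b
joined = []
for b, pairs in groups.items():
    r_vals = [v for tag, v in pairs if tag == "R"]
    s_vals = [v for tag, v in pairs if tag == "S"]
    joined += [(a, b, c) for a in r_vals for c in s_vals]
```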
1. Communication cost = total I/O of all<br />
processes.<br />
2. Elapsed communication cost = max of I/O<br />
along any path.<br />
3. (Elapsed) computation costs analogous, but<br />
count only running time of processes.<br />
For a map-reduce algorithm:<br />
Communication cost = input file size + 2 × (sum of<br />
the sizes of all files passed from Map processes to<br />
Reduce processes) + the sum of the output sizes of<br />
the Reduce processes.<br />
Elapsed communication cost is the sum of the<br />
largest input + output for any map process, plus<br />
the same for any reduce process<br />
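Plugging hypothetical sizes into the first formula (the numbers are illustrative, not from the slides):

```python
# illustrative sizes, in I/O units
input_size = 1000      # input files read by the Map processes
intermediate = 1000    # total size of files passed from Map to Reduce
output_size = 300      # total output written by the Reduce processes

# communication cost = input + 2 x intermediate + output
# (intermediate files are counted twice: once written, once read)
comm_cost = input_size + 2 * intermediate + output_size
```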
Either the I/O (communication) or processing<br />
(computation) cost dominates<br />
Ignore one or the other<br />
Total costs tell what you pay in rent from your<br />
friendly neighborhood cloud<br />
Elapsed costs are wall-clock time using<br />
parallelism<br />
Total communication<br />
cost = O(|R|+|S|+|R ⋈ S|)<br />
Elapsed communication cost = O(s)<br />
We put a limit s on the amount of input or output that<br />
any one process can have. s could be:<br />
What fits in main memory<br />
What fits on local disk<br />
We pick k and the number of Map processes<br />
so that the I/O limit s is respected<br />
With proper indexes, computation cost is linear<br />
in the input + output size<br />
So computation costs are like comm. costs<br />
Google<br />
Not available outside Google<br />
Hadoop<br />
An open-source implementation in Java<br />
Uses HDFS for stable storage<br />
Download: http://lucene.apache.org/hadoop/<br />
Aster Data<br />
Cluster-optimized SQL Database that also<br />
implements MapReduce<br />
Ability to rent computing by the hour<br />
Additional services e.g., persistent storage<br />
Amazon’s “Elastic Compute Cloud” (EC2)<br />
Aster Data and Hadoop can both be run on<br />
EC2<br />
For CS341 (offered next quarter) Amazon will<br />
provide free access for the class<br />
Jeffrey Dean and Sanjay Ghemawat:<br />
MapReduce: Simplified Data Processing on<br />
Large Clusters<br />
http://labs.google.com/papers/mapreduce.html<br />
Sanjay Ghemawat, Howard Gobioff, and<br />
Shun-Tak Leung: The Google File System<br />
http://labs.google.com/papers/gfs.html<br />
Hadoop Wiki<br />
Introduction<br />
http://wiki.apache.org/lucene-hadoop/<br />
Getting Started<br />
http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop<br />
Map/Reduce Overview<br />
http://wiki.apache.org/lucene-hadoop/HadoopMapReduce<br />
http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses<br />
Eclipse Environment<br />
http://wiki.apache.org/lucene-hadoop/EclipseEnvironment<br />
Javadoc<br />
http://lucene.apache.org/hadoop/docs/api/<br />
Releases from Apache download mirrors<br />
http://www.apache.org/dyn/closer.cgi/lucene/hadoop/<br />
Nightly builds of source<br />
http://people.apache.org/dist/lucene/hadoop/nightly/<br />
Source code from subversion<br />
http://lucene.apache.org/hadoop/version_control.html<br />
Programming model inspired by functional language<br />
primitives<br />
Partitioning/shuffling similar to many large-scale sorting<br />
systems<br />
NOW-Sort ['97]<br />
Re-execution for fault tolerance<br />
BAD-FS ['04] <strong>and</strong> TACC ['97]<br />
Locality optimization has parallels with Active<br />
Disks/Diamond work<br />
Active Disks ['01], Diamond ['04]<br />
Backup tasks similar to Eager Scheduling in Charlotte<br />
system<br />
Charlotte ['96]<br />
Dynamic load balancing solves similar problem as River's<br />
distributed queues<br />
River ['99]<br />