Introduction and MapReduce - SNAP - Stanford University

Chunk servers
  File is split into contiguous chunks
  Typically each chunk is 16-64 MB
  Each chunk is replicated (usually 2x or 3x)
  Try to keep replicas in different racks

Master node
  a.k.a. Name Node in Hadoop's HDFS
  Stores metadata
  Might be replicated

Client library for file access
  Talks to master to find chunk servers
  Connects directly to chunk servers to access data

1/8/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
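The three roles above can be sketched as a toy, in-memory model. This is only an illustration under stated assumptions, not GFS or HDFS code: the class names `ChunkServer` and `Master`, the helpers `write_file` and `read_file`, the 8-byte chunk size, and the round-robin placement rule are all hypothetical choices made to keep the example small.

```python
CHUNK_SIZE = 8    # bytes; toy value (the slide cites 16-64 MB)
REPLICATION = 3   # the slide says usually 2x or 3x


class ChunkServer:
    """Hypothetical chunk server: stores chunk bytes keyed by chunk id."""

    def __init__(self, name, rack):
        self.name = name
        self.rack = rack
        self.chunks = {}


class Master:
    """Hypothetical master: holds only metadata, never file contents."""

    def __init__(self, servers):
        self.servers = servers
        self.file_chunks = {}  # filename -> ordered list of chunk ids
        self.locations = {}    # chunk id -> list of ChunkServer objects

    def place(self, chunk_index):
        """Pick up to REPLICATION servers, each in a different rack."""
        chosen, racks = [], set()
        n = len(self.servers)
        for i in range(n):
            server = self.servers[(chunk_index + i) % n]
            if server.rack not in racks:
                chosen.append(server)
                racks.add(server.rack)
            if len(chosen) == REPLICATION:
                break
        return chosen  # shorter than REPLICATION if there are too few racks


def write_file(master, name, data):
    """Split data into contiguous chunks and store each on its replicas."""
    chunk_ids = []
    for index, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunk_id = f"{name}#{index}"
        replicas = master.place(index)
        for server in replicas:
            server.chunks[chunk_id] = data[start:start + CHUNK_SIZE]
        master.locations[chunk_id] = replicas
        chunk_ids.append(chunk_id)
    master.file_chunks[name] = chunk_ids


def read_file(master, name):
    """Ask the master where chunks live, then read from chunk servers directly."""
    parts = []
    for chunk_id in master.file_chunks[name]:
        server = master.locations[chunk_id][0]  # any replica would do
        parts.append(server.chunks[chunk_id])
    return b"".join(parts)


if __name__ == "__main__":
    servers = [ChunkServer("cs1", "rackA"), ChunkServer("cs2", "rackA"),
               ChunkServer("cs3", "rackB"), ChunkServer("cs4", "rackC")]
    master = Master(servers)
    write_file(master, "notes.txt", b"mining massive datasets")
    print(read_file(master, "notes.txt"))
```

Note that the master in this sketch never touches file bytes: it only answers "which servers hold this chunk", which mirrors the slide's point that clients connect directly to chunk servers for data.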
Reliable distributed file system
  Data kept in "chunks" spread across machines
  Each chunk replicated on different machines
  Seamless recovery from disk or machine failure

[Figure: chunk replicas spread across chunk servers]
  Chunk server 1: C0, C5, C1, C2
  Chunk server 2: D0, C5, C1, C3
  Chunk server 3: C2, D0, C5, D1
  …
  Chunk server N: C0, D0, C5, C2

Bring computation directly to the data!
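Recovery from a machine failure can be illustrated with a small sketch: when a chunk server is lost, every chunk it held drops below its target replica count and is copied to another live server. The function name `recover`, the server names, the sample placement, and the least-loaded-server rule are assumptions for illustration only; the slides do not specify a recovery policy, and rack awareness is left out here for brevity.

```python
def recover(placement, failed, servers, target=2):
    """Return a new placement after the server `failed` is lost.

    placement: dict mapping chunk id -> set of server names holding a replica
    servers:   all server names, including the failed one
    target:    desired number of replicas per chunk
    """
    live = [s for s in servers if s != failed]
    # Drop the failed server from every replica set.
    new = {chunk: set(reps) - {failed} for chunk, reps in placement.items()}
    # Count how many chunks each live server currently holds.
    load = {s: sum(s in reps for reps in new.values()) for s in live}

    for chunk in sorted(new):
        replicas = new[chunk]
        if not replicas:
            raise RuntimeError(f"all replicas of {chunk} were lost")
        while len(replicas) < target:
            candidates = [s for s in live if s not in replicas]
            if not candidates:
                break  # too few live servers to reach the target
            # Assumed policy: least-loaded server, ties broken by name.
            dest = min(candidates, key=lambda s: (load[s], s))
            replicas.add(dest)  # stands in for copying from a surviving replica
            load[dest] += 1
    return new


if __name__ == "__main__":
    placement = {"C0": {"s1", "s2"}, "C1": {"s1", "s3"}, "D0": {"s2", "s3"}}
    after = recover(placement, "s1", ["s1", "s2", "s3", "s4"], target=2)
    for chunk in sorted(after):
        print(chunk, sorted(after[chunk]))
```

Because each chunk had a surviving replica on another machine, the loss of `s1` costs no data; the sketch only has to restore the replica count.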
