29.12.2014 Views

Magellan Final Report - Office of Science - U.S. Department of Energy



Chapter 10

MapReduce Programming Model

The MapReduce programming model [13] was developed by Google to address its large-scale data processing requirements. Just reading terabytes of data can be overwhelming. Thus, the MapReduce programming model is based on the idea of achieving linear scalability by using clusters of standard commodity computers. Nutch, an open-source search project that was facing similar scalability challenges, implemented the ideas described in the MapReduce [13] and Google File System [33] papers. The broader applicability of MapReduce and distributed file systems to other projects was recognized, and Hadoop was split off from Nutch as a separate project in 2006.

MapReduce and Hadoop have gained significant traction in the last few years as a means of processing large volumes of data. The MapReduce programming model consists of two functions familiar from functional programming, map and reduce, each of which is applied in parallel to a data block. The inherent support for parallelism built into the programming model enables horizontal scaling and fault tolerance. The open-source implementation, Hadoop, is now widely used by global internet companies such as Facebook, LinkedIn, and Netflix. Teams have developed components on top of Hadoop, resulting in a rich ecosystem for web applications and data-intensive computing.
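To make the map and reduce roles concrete, the following is a minimal single-process sketch in Python that counts words across data blocks. It is an illustration only, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample blocks are hypothetical, and the blocks are processed sequentially here, whereas a real MapReduce framework would run the map and reduce calls in parallel across a cluster.

```python
from collections import defaultdict


def map_words(block):
    """Map: emit a (word, 1) pair for every word in one data block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(blocks):
    # Each map call depends only on its own block, so the calls are
    # independent and could run on separate machines.
    mapped = [pair for block in blocks for pair in map_words(block)]
    grouped = shuffle(mapped)
    # Each reduce call depends only on one key's values, so these calls
    # are independent as well.
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input: three small "data blocks".
blocks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(blocks)
```

Because no map call reads another block and no reduce call reads another key, the work can be spread over more machines as the data grows, and a failed call can be rerun on its own; this independence is what the model relies on for horizontal scaling and fault tolerance.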

A number of scientific applications have characteristics in common with MapReduce/Hadoop jobs. Scientific applications that employ a high degree of parallelism or need to operate on large volumes of data might benefit from MapReduce and Hadoop. However, it is not apparent that the MapReduce and Hadoop ecosystem is sufficiently flexible to be used for scientific applications without significant development overhead.

Recently, there have been a number of implementations of MapReduce and similar data processing tools [21, 25, 36, 45, 59, 73, 87]. Apache Hadoop was the most popular implementation of MapReduce at the start of the Magellan project, and it continues to gain traction in various communities. It has evolved rapidly as the leading platform and has spawned an entire ecosystem of supporting products. Thus, we use Hadoop as a representative MapReduce implementation for core Magellan efforts in this area. We have also collaborated with other groups to compare different MapReduce implementations (Section 10.6.2). Additionally, Magellan staff were involved with an alternative implementation of MapReduce (described in Section 10.6.3).

In Sections 10.1, 10.2, and 10.3 we provide an overview of MapReduce, Hadoop, and the Hadoop ecosystem. In Section 10.4 we describe our experiences with porting existing applications to Hadoop using the Streaming model. In Section 10.5 we describe our benchmarking effort with Hadoop, and in Section 10.6 we describe other collaborative efforts. We present a discussion of the use of Hadoop for scientific applications in Section 10.7. Additional case studies using Hadoop are described in Chapter 11.

