Magellan Final Report - Office of Science - U.S. Department of Energy
Chapter 10

MapReduce Programming Model
The MapReduce programming model [13] was developed by Google to address its large-scale data processing requirements. Just reading terabytes of data can be overwhelming. Thus, the MapReduce programming model is based on the idea of achieving linear scalability by using clusters of standard commodity computers. Nutch, an open source search project that was facing similar scalability challenges, implemented the ideas described in the MapReduce [13] and Google File System [33] papers. The broader applicability of MapReduce and distributed file systems to other projects was recognized, and Hadoop was split off as a separate project from Nutch in 2006.
MapReduce and Hadoop have gained significant traction in the last few years as a means of processing large volumes of data. The MapReduce programming model consists of two functions familiar from functional programming, map and reduce, that are each applied in parallel to a data block. The inherent support for parallelism built into the programming model enables horizontal scaling and fault tolerance. The open-source implementation Hadoop is now widely used by global internet companies such as Facebook, LinkedIn, and Netflix. Teams have developed components on top of Hadoop, resulting in a rich ecosystem for web applications and data-intensive computing.
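The map and reduce functions described above can be illustrated with the canonical word-count example. The sketch below is a minimal, single-process simulation (not Hadoop itself): each input block is mapped independently to (word, 1) pairs, the pairs are grouped by key as the framework's shuffle stage would do, and reduce sums the values for each key.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input block.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework would
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return (key, sum(values))

# Each block could be processed by a different worker; map has no
# cross-block dependencies, which is what enables horizontal scaling.
blocks = ["the quick brown fox", "the lazy dog the end"]
pairs = [pair for block in blocks for pair in map_phase(block)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # "the" occurs three times across the two blocks
```

Because map invocations are independent and reduce operates per key, a failed worker's blocks or keys can simply be re-executed elsewhere, which is the basis of the model's fault tolerance.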
A number of scientific applications have characteristics in common with MapReduce/Hadoop jobs. A class of scientific applications which employ a high degree of parallelism or need to operate on large volumes of data might benefit from MapReduce and Hadoop. However, it is not apparent that the MapReduce and Hadoop ecosystem is sufficiently flexible to be used for scientific applications without significant development overheads.
Recently, there have been a number of implementations of MapReduce and similar data processing tools [21, 25, 36, 45, 59, 73, 87]. Apache Hadoop was the most popular implementation of MapReduce at the start of the Magellan project, and it continues to gain traction in various communities. It has evolved rapidly as the leading platform and has spawned an entire ecosystem of supporting products. Thus we use Hadoop as a representative MapReduce implementation for core Magellan efforts in this area. We have also collaborated with other groups to compare different MapReduce implementations (Section 10.6.2). Additionally, Magellan staff were involved with an alternative implementation of MapReduce (described in Section 10.6.3).
In Sections 10.1, 10.2, and 10.3 we provide an overview of MapReduce, Hadoop, and the Hadoop ecosystem. In Section 10.4 we describe our experiences with porting existing applications to Hadoop using the Streaming model. In Section 10.5 we describe our benchmarking effort with Hadoop, and Section 10.6 describes other collaborative efforts. We present a discussion of the use of Hadoop for scientific applications in Section 10.7. Additional case studies using Hadoop are described in Chapter 11.
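The Streaming model referenced above lets existing programs in any language act as map and reduce stages by reading lines from standard input and writing tab-separated key/value lines to standard output. The sketch below is a generic illustration of that contract, not one of the Magellan-ported applications; the function names are illustrative, and in a real job the two stages would be separate scripts passed to Hadoop via the streaming jar's -mapper and -reducer options.

```python
def mapper(lines):
    # Streaming mapper: consume raw text lines, emit "key<TAB>value" lines.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming reducer: Hadoop delivers mapper output sorted by key, so
    # lines with the same key arrive consecutively and can be summed
    # without holding the whole input in memory.
    current, total = None, 0
    for line in lines:
        key, value = line.rsplit("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Local simulation of the streaming pipeline: map | sort | reduce.
demo = list(reducer(sorted(mapper(["to be or not to be"]))))
print(demo)
```

Because the stages communicate only through text streams, a legacy executable can often be dropped into a Hadoop job unmodified, which is why Streaming is a common first step when porting existing scientific codes.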