Table 10.1: Comparison of HDFS with parallel file systems (GPFS and Lustre)

                             HDFS                           GPFS and Lustre
Storage Location             Compute Node                   Servers
Access Model                 Custom (except with FUSE)      POSIX
Typical Replication          3                              1
Typical Stripe Size          64 MB                          1 MB
Concurrent Writes            No                             Yes
Performance Scales with      No. of Compute Nodes           No. of Servers
Scale of Largest Systems     O(10k) Nodes                   O(100) Servers
User/Kernel Space            User                           Kernel
10.1 MapReduce<br />
MapReduce is a programming model that enables users to achieve large-scale parallelism when processing and generating large data sets. The MapReduce model has been used to process large amounts of index, log, and user behavior data at global internet companies including Google, Yahoo!, Facebook, and Twitter. The MapReduce model was designed to be implemented and deployed on very large clusters of commodity machines. It was originally developed, and is still heavily used, for massive data analysis tasks. MapReduce jobs were originally proposed to run on the proprietary Google File System (GFS), a distributed file system [33]. GFS supports several features that are specifically suited to MapReduce workloads, such as data locality and data replication.
The basic steps in the MapReduce programming model consist of a map function that transforms the input records into intermediate output data, grouped by a user-specified key, and a reduce function that processes the intermediate data to generate a final result. Both the map and reduce functions consume and generate key/value pairs. The reduce phase receives a sorted list of key/value pairs from the MapReduce framework. In the MapReduce model, all parallel instances of the mappers or reducers are the same program; any differentiation in their behavior must be based solely on the inputs they operate on.

The canonical example of MapReduce is a word-count algorithm, where the map emits each word as the key and the number 1 as the value, and the reduce sums the values for each key, thus counting the total number of occurrences of each word.
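To make the data flow concrete, the following Python sketch runs the word-count example end to end in memory. It is a minimal illustration, not any particular framework's API: the function names (map_fn, reduce_fn, word_count) are invented for this example, and the explicit sort stands in for the shuffle/sort step that a real MapReduce framework performs across many nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Map: emit (word, 1) for every word in one input record (a line of text).
    for word in record.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for a single word.
    return (word, sum(counts))

def word_count(records):
    # Map phase: apply the map function to every input record.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle/sort phase: the framework groups intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct key.
    return [reduce_fn(word, (count for _, count in group))
            for word, group in groupby(intermediate, key=itemgetter(0))]

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(lines))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```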
10.2 Hadoop<br />
Hadoop is an open-source distributed computing platform that implements the MapReduce model. Hadoop consists of two core components: the job management framework that handles the map and reduce tasks, and the Hadoop Distributed File System (HDFS) [76]. Hadoop's job management framework is highly reliable and available, using techniques such as replication and automated restart of failed tasks. The framework also includes optimizations for heterogeneous environments and workloads, e.g., speculative (redundant) execution that reduces delays due to stragglers.
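As a rough illustration of how Hadoop runs user code, the sketch below recasts the word-count mapper and reducer in the style of Hadoop Streaming, where each task is an ordinary executable that reads key/value pairs from standard input and writes them to standard output. Because the tasks are stateless, the framework can restart or speculatively re-run them on another node. The function names, the single-file layout, and the command-line dispatch are assumptions made for illustration; a real streaming job would package the mapper and reducer as separate scripts.

```python
import sys

def streaming_mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Map task: emit "word<TAB>1" for every word on every input line.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def streaming_reducer(stdin=sys.stdin, stdout=sys.stdout):
    # Reduce task: input arrives sorted by key, so all counts for a given
    # word are contiguous and can be summed with one running total.
    current_word, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            total += int(count)
        else:
            if current_word is not None:
                stdout.write(f"{current_word}\t{total}\n")
            current_word, total = word, int(count)
    if current_word is not None:
        stdout.write(f"{current_word}\t{total}\n")

if __name__ == "__main__":
    # Illustrative dispatch: run the script as a map task or a reduce task.
    (streaming_mapper if sys.argv[1] == "map" else streaming_reducer)()
```

In a streaming job, the framework splits the input among the map tasks, sorts the intermediate pairs by key, and feeds each reducer a contiguous, sorted stream, which is why the running-total pattern above suffices.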
HDFS is a highly scalable, fault-tolerant file system modeled after the Google File System. The data locality features of HDFS are used by the Hadoop scheduler to schedule the I/O-intensive map computations closer to the data. The scalability and reliability characteristics of Hadoop suggest that it could be used as an engine for running scientific ensembles. However, the target application for Hadoop is very loosely coupled analysis of huge data sets. Table 10.1 shows a comparison of HDFS with parallel file systems such as GPFS and Lustre. HDFS fundamentally differs from parallel file systems in its underlying storage model: HDFS relies on local storage on each compute node, while parallel file systems are typically served from a set of dedicated I/O servers. Most parallel file systems use the POSIX interface, which enables applications to be portable across different file systems on various HPC systems. HDFS, on the other hand, has its own interface, requiring applications to be rewritten to leverage HDFS features. FUSE [31] provides a POSIX interface on top of HDFS, but with a performance penalty.
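The interface difference can be made concrete with a short sketch: with a POSIX file system (GPFS, Lustre, or HDFS exposed through a FUSE mount) ordinary file I/O works unchanged, while native HDFS access goes through an HDFS-aware client. The mount point, the NameNode host name, and the use of the third-party pyarrow package below are assumptions made purely for illustration.

```python
# POSIX access: identical code works on GPFS, Lustre, or a FUSE-mounted HDFS;
# the path shown is a hypothetical mount point.
with open("/mnt/hdfs/data/input.txt", "rb") as f:
    posix_bytes = f.read()

# Native HDFS access: requires an HDFS-aware client (here, pyarrow) and a
# connection to the NameNode instead of plain open().
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode.example.org", 8020)
with hdfs.open_input_stream("/data/input.txt") as f:
    hdfs_bytes = f.read()
```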