Magellan Final Report - Office of Science - U.S. Department of Energy
This benchmarking work highlights the strengths and gaps in each of the frameworks. Additional application benchmarking will be needed to further understand and improve these frameworks for scientific use. It is also important to maintain a benchmark suite aimed at scientific application usage and to track performance data for new MapReduce implementations as they become available.
10.6.3 MARIANE
MARIANE, an alternative MapReduce implementation, was developed at SUNY Binghamton and used Magellan resources for its experiments. In addition, Magellan staff participated in the study. The paper was published at Grid 2011 [25].
The MapReduce model in its most popular form, Hadoop [4], uses the Hadoop Distributed File System (HDFS) [76] as the input/output manager and fault-tolerance support system for the framework. The use of Hadoop, and of HDFS, is however not directly compatible with HPC environments, because Hadoop implicitly assumes dedicated resources with non-shared attached disks. Hadoop has successfully worked at large scale for a myriad of applications; however, it is not suitable for scientific (legacy) applications that rely on a POSIX-compliant file system in the grid/cloud/HPC setting, and HDFS is not POSIX compliant. In this study, we investigate the use of the General Parallel File System (GPFS) and the Network File System (NFS) for large-scale data support in a MapReduce context through our implementation, MARIANE.
For this purpose we picked three applications: two of them, UrlRank and “Distributed Grep”, are provided with the Hadoop application package, and a third is of our own design: an XML parser for arrays of doubles, tested here under induced node failures to evaluate fault tolerance. Even though we limit our evaluation to NFS and GPFS, our proposed design is compatible with a wide set of parallel and shared-block file systems. We present the design implications for a successful MapReduce framework in HPC contexts, along with performance data collected from evaluating this approach to MapReduce alongside Apache Hadoop on the NERSC Magellan cluster.
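To make the “Distributed Grep” benchmark concrete, the following is a minimal sketch of its map and reduce phases in Python. The function names and the in-memory driver are our own illustration, not the actual Hadoop or MARIANE code; a real framework would run the map tasks in parallel over input splits.

```python
import re

def grep_map(line, pattern):
    """Map phase: emit (line, 1) if the line matches the pattern."""
    return [(line, 1)] if re.search(pattern, line) else []

def grep_reduce(line, counts):
    """Reduce phase: sum occurrence counts for each matching line."""
    return (line, sum(counts))

# Toy driver standing in for the framework's shuffle/group step.
lines = ["error: disk full", "ok", "error: disk full"]
mapped = [kv for ln in lines for kv in grep_map(ln, r"error")]
grouped = {}
for key, value in mapped:
    grouped.setdefault(key, []).append(value)
results = dict(grep_reduce(k, vs) for k, vs in grouped.items())
# results == {"error: disk full": 2}
```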
Design considerations in MARIANE
The design <strong>of</strong> MARIANE rests upon <strong>of</strong> three principal modules, representing the tenets <strong>of</strong> the MapReduce<br />
model. Input/Output management and distribution rests within the Splitter. Concurrency management<br />
in the role <strong>of</strong> TaskController, while fault-tolerance dwells with the FaultTracker. Figure 10.12 shows<br />
the design used for MARIANE.<br />
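The division of responsibilities among the three modules can be sketched as follows. This is an illustrative skeleton only; the class and method names are our own invention and do not reflect the actual MARIANE API.

```python
class Splitter:
    """Input/output management: computes per-worker byte ranges over a
    file on the shared file system (no data is copied between nodes)."""
    def split(self, file_size, n_workers):
        chunk = file_size // n_workers
        return [(i * chunk,
                 file_size if i == n_workers - 1 else (i + 1) * chunk)
                for i in range(n_workers)]

class TaskController:
    """Concurrency management: assigns splits to workers."""
    def assign(self, splits, workers):
        return {w: [s] for w, s in zip(workers, splits)}

class FaultTracker:
    """Fault tolerance: hands a failed worker's splits to a healthy one."""
    def reassign(self, assignments, failed, healthy):
        assignments[healthy].extend(assignments.pop(failed))
        return assignments
```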
Input Management
While Hadoop and most MapReduce implementations apportion the input amongst participating nodes and then transfer each chunk to its destination, MARIANE relies on the shared file system it sits on top of for this task. The framework leverages the data visibility that shared-disk file systems offer to cluster nodes. This frees MARIANE from performing data transfers, as that task is built in and optimized within the underlying distributed file system. Input management and split distribution are thus not performed on top of an existing file system (FS), but rather in cooperation with it. This absolves the application from the responsibility of low-level file management and, with it, from the overhead of communicating with the FS through additional system layers. In doing so, MARIANE not only benefits from the file-system and data-transfer optimizations provided by evolving shared-disk file system technology, but can also focus solely on mapping and reducing rather than on lower-level data management.
Input Distribution
Input distribution is operated directly through the shared file system. As the input is deposited by the user, the FS performs caching and pre-fetching to make the data visible to all nodes on-