Magellan Final Report - Office of Science - U.S. Department of Energy
10.4.2 Application Examples
Magellan staff worked with some key scientific groups to port their applications to the streaming model. We describe some of these in this section. Additional case studies are presented in Chapter 11.
BLAST. JGI’s IMG pipeline has been tested in the Hadoop framework to manage a set of parallel BLAST computations. The performance across an HPC machine, virtual machines, and a Hadoop cluster was found to be comparable (within 10%), making this a feasible application for cloud environments. The experience with running the IMG pipeline on virtual machines is described in greater detail in Chapter 11; here we describe the challenges of running it in the Hadoop framework. BLAST takes a multi-line sequence input for each run, but by default MapReduce assumes that inputs are line-oriented and that each line can be processed independently. One could define and write Java classes to handle other input formats; however, difficulties with managing API versions and jar files can make that process cumbersome. Instead, we experimented with a wrapper script that reformatted the input to be line-oriented before it was passed to the Hadoop framework; the streaming mapper script then restored the original format expected by the application. This experience highlights the need for more custom plugins in Hadoop for science applications.
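The reformatting approach described above can be sketched as follows. This is a minimal illustration, not the actual wrapper script from the pipeline; it assumes tab characters never occur inside FASTA records, so they can serve as a reversible stand-in for record-internal newlines:

```python
# Sketch of the line-orienting wrapper for Hadoop streaming (illustrative,
# not the production IMG/BLAST script).

def flatten_fasta(text):
    """Collapse each multi-line FASTA record onto a single line.

    Newlines inside a record are replaced with tabs, which do not occur
    in FASTA data, so the transform is reversible in the mapper.
    """
    records, current = [], []
    for line in text.strip().splitlines():
        if line.startswith(">") and current:
            records.append("\t".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\t".join(current))
    return "\n".join(records)

def restore_record(line):
    """Mapper-side inverse: rebuild the multi-line record before
    handing it to BLAST."""
    return line.replace("\t", "\n")
```

Each flattened line then becomes one independent map input, which is exactly the shape Hadoop streaming expects.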
Tropical storm detection code. Another application that we plugged into Hadoop was TSTORMS, a single-threaded analysis program that finds tropical storms in climate data. TSTORMS takes as input a NetCDF file, which has a binary format. While Hadoop and HDFS can handle binary files, current versions of Hadoop streaming cannot handle binary input and output. The NetCDF libraries provide utilities to manage an ASCII version of NetCDF files and to convert between the ASCII and binary representations. While not the most efficient approach, we explored pre-converting the files to the ASCII format and loading the converted files into HDFS; the streaming mapper function then converts each input back to binary before running the application. In addition, TSTORMS does not read from standard input, so the file must be streamed to disk and the file path passed to the binary. The application enjoyed the scalability and fault-tolerance benefits of the Hadoop framework, but it highlights the difficulties of working with scientific data formats in this environment.
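A mapper along these lines might look like the sketch below. The NetCDF `ncgen` utility (the standard inverse of `ncdump`) rebuilds the binary file from its ASCII CDL form; the TSTORMS executable name `tstorms_driver` is an assumption for illustration only:

```python
# Hypothetical streaming mapper sketch for TSTORMS: CDL (ASCII NetCDF)
# text arrives, is converted back to binary, and the file path is passed
# to the binary since TSTORMS does not read standard input.
import os
import subprocess
import sys
import tempfile

def ncgen_command(cdl_path, nc_path):
    """Build the ncgen invocation that regenerates a binary NetCDF file
    from its ASCII (CDL) representation."""
    return ["ncgen", "-o", nc_path, cdl_path]

def run_tstorms(cdl_text, tstorms_binary="tstorms_driver"):
    """Stream the CDL to disk, convert it to binary NetCDF, and invoke
    TSTORMS on the resulting path (binary name is illustrative)."""
    with tempfile.TemporaryDirectory() as work:
        cdl_path = os.path.join(work, "input.cdl")
        nc_path = os.path.join(work, "input.nc")
        with open(cdl_path, "w") as f:
            f.write(cdl_text)
        subprocess.check_call(ncgen_command(cdl_path, nc_path))
        result = subprocess.run([tstorms_binary, nc_path],
                                capture_output=True, text=True, check=True)
        sys.stdout.write(result.stdout)  # detections flow on to the reducer
```

The round trip through ASCII costs both conversion time and storage, which is part of the inefficiency noted above.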
Atmospheric River Detection. The atmospheric river detection code, like TSTORMS, works on NetCDF input files, and the challenges described for TSTORMS also hold for this application. In addition, this application takes a parameter that selects the day to examine in the NetCDF file, so each map should be differentiated by a combination of file and parameter. This cannot be handled easily in the current Hadoop implementation, where a mapper is differentiated only by the file or data it operates on. As a workaround, we experimented with a wrapper script that sequentially processes all days in a file. Additional parallelism could be achieved for such applications if Hadoop allowed maps to be differentiated not just by a file but by additional parameters.
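The sequential workaround can be sketched as follows. The detector name `ar_detect` and its `-day` flag are hypothetical placeholders, since the report does not give the actual command line:

```python
# Sketch of the per-file wrapper: each map task receives one NetCDF file
# and loops over every day in it, because Hadoop streaming keys map tasks
# on input data alone and cannot pass a per-map day parameter.
import subprocess
import sys

def day_commands(nc_path, n_days, detector="ar_detect"):
    """Enumerate one detector invocation per day; detector name and
    -day flag are illustrative, not the real CLI."""
    return [[detector, "-day", str(day), nc_path] for day in range(n_days)]

def process_file(nc_path, n_days):
    """Run the detector sequentially over all days in the file, the
    workaround used in place of per-(file, day) map tasks."""
    for cmd in day_commands(nc_path, n_days):
        result = subprocess.run(cmd, capture_output=True, text=True,
                                check=True)
        sys.stdout.write(result.stdout)
```

With per-(file, day) map tasks, each of these invocations could instead run as an independent map, recovering the lost parallelism.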
10.5 Benchmarking<br />
In addition to understanding the programming effort required for applications to use the Hadoop framework, we conducted some benchmarking to understand the effects of various Hadoop parameters, underlying file systems, network interfaces, etc.
10.5.1 Standard Hadoop Benchmarks<br />
First, we used the standard Hadoop benchmark tests on the NERSC Magellan Hadoop cluster (described in Chapter 5). The benchmarking data enables us to understand the effect of various parameters and system configurations.