Magellan Final Report - Office of Science - U.S. Department of Energy
processing time between tasks, this will cause load imbalance. This may dramatically impact horizontal scalability.
10.8 Summary<br />
The explosion of sensor data in the last few years has resulted in a class of scientific applications that are loosely coupled and data intensive. These applications are scaling up from desktops and departmental clusters and now require access to larger resources. These are typically high-throughput, serial applications that do not fit into the scheduling policies of many HPC centers. They could also benefit greatly from the features of the MapReduce programming model.
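To make the MapReduce model concrete, the sketch below simulates a Hadoop Streaming-style job in Python: the mapper emits key/value pairs and the reducer aggregates them, while a local driver stands in for the framework's shuffle/sort phase. The sensor-counting task, the instrument identifiers, and the tab-separated input format are illustrative assumptions, not applications taken from this report.

```python
from itertools import groupby
from operator import itemgetter

def map_records(lines):
    # Mapper: emit (key, value) pairs. Here we count readings per
    # instrument from tab-separated "instrument<TAB>reading" records
    # (a hypothetical format chosen for illustration).
    for line in lines:
        instrument, _reading = line.rstrip("\n").split("\t")
        yield instrument, 1

def reduce_counts(pairs):
    # Reducer: Hadoop delivers each key's values together; sorting and
    # grouping here stands in for the framework's shuffle/sort phase.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(value for _, value in group)

if __name__ == "__main__":
    # Local simulation of the map -> shuffle -> reduce pipeline.
    sample = ["ccd1\t0.9", "ccd2\t1.1", "ccd1\t0.7"]
    print(dict(reduce_counts(map_records(sample))))  # -> {'ccd1': 2, 'ccd2': 1}
```

In an actual Hadoop Streaming job, the mapper and reducer would each run as separate executables reading from standard input, with the framework handling partitioning and sorting between them.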
In our early studies, Hadoop, the open source implementation of MapReduce, and HDFS, the associated distributed file system, have proven to be promising for some scientific applications. The built-in replication and fault tolerance in Hadoop are advantageous for managing this class of workloads. In Magellan, we have also experimented with running Hadoop through a batch queue system. This approach can enable users to reap the benefits of Hadoop while running within scheduling policies geared towards large parallel jobs.
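Batch-queue integration of this kind is commonly implemented (for example, in tools such as myHadoop) by generating a transient, per-job Hadoop configuration from the scheduler's node list. The sketch below is a hypothetical illustration of that pattern, not Magellan's actual tooling: the masters/slaves file names follow Hadoop 0.20-era conventions, and the NameNode port is an assumed default.

```python
import pathlib

def write_transient_hadoop_conf(nodefile, conf_dir):
    """Generate a per-job Hadoop configuration from a batch scheduler's
    node list (e.g., the file named by $PBS_NODEFILE).

    Hypothetical sketch: file names follow Hadoop 0.20-era conventions;
    port 9000 is an illustrative default, not a value from the report.
    """
    nodes = [n.strip()
             for n in pathlib.Path(nodefile).read_text().splitlines()
             if n.strip()]
    conf = pathlib.Path(conf_dir)
    conf.mkdir(parents=True, exist_ok=True)
    # The first allocated node acts as NameNode/JobTracker for this job only.
    (conf / "masters").write_text(nodes[0] + "\n")
    # The remaining nodes run the DataNode/TaskTracker daemons.
    (conf / "slaves").write_text("\n".join(nodes[1:] or nodes) + "\n")
    (conf / "core-site.xml").write_text(
        "<configuration>\n"
        "  <property><name>fs.default.name</name>"
        f"<value>hdfs://{nodes[0]}:9000</value></property>\n"
        "</configuration>\n"
    )
    return nodes[0]
```

A batch job script would then point HADOOP_CONF_DIR at the generated directory, start the daemons, run its MapReduce jobs, and tear the transient cluster down before the allocation ends.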
However, programming scientific applications in Hadoop presents a number of gaps and challenges. Thus, there is a need for MapReduce implementations that specifically consider the characteristics of scientific applications. For example, there is a need for MapReduce frameworks that are not closely tied to HDFS and can be used with other POSIX file systems. In addition, MapReduce implementations that account for scientific data access patterns (such as considering data locality across multiple input files) are desired. Our early evaluation of alternate MapReduce implementations (through collaborations) shows promise for addressing the needs of high-throughput and data-intensive scientific applications.