
considerations. More specifically, users need not dive into general parallel application design, nor do they need any knowledge of the hardware actually running the computations: MapReduce indeed provides a scale-agnostic interface ([64]).
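
The word-count job below is a minimal sketch (not taken from the thesis) of what this scale-agnostic interface looks like from the user's side: the user supplies only a map function and a reduce function, and neither mentions the number of machines, data placement, or hardware. The names map_fn and reduce_fn are illustrative.

```python
# Minimal illustration of the user-facing MapReduce contract: the user writes
# two pure functions and never refers to cluster size or hardware.

def map_fn(document_id, text):
    """Map step: emit a (word, 1) pair for every word of one document."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce step: sum all the partial counts emitted for one word."""
    yield word, sum(counts)
```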

Secondly, the framework helps users reason about what the parallel algorithm actually computes. As stated in [81], “concurrent programs are notoriously difficult to reason about”. Because of its elementary design, a MapReduce execution has Leslie Lamport's sequential consistency property², which means the MapReduce result is the same as the one that would have been produced by some reordered sequential execution of the program. This property helps in understanding a MapReduce execution, since it can be thought of as a sped-up sequential run.
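
A toy single-machine runner makes this property concrete: as long as the map and reduce operators are deterministic (see footnote 2), shuffling the order in which input records are mapped, as a real cluster effectively does, cannot change the final result. The sketch below is an assumed simulation with illustrative names, not the actual MapReduce scheduler.

```python
import random
from collections import defaultdict

def map_fn(document_id, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(records, shuffle_seed=None):
    """Toy in-memory runner: optionally reorder the map tasks, then group by
    key and reduce. With deterministic map/reduce the output never changes."""
    records = list(records)
    if shuffle_seed is not None:
        random.Random(shuffle_seed).shuffle(records)  # emulate arbitrary scheduling
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return {k: v for key, values in groups.items() for k, v in reduce_fn(key, values)}

docs = [(1, "the cloud"), (2, "map the data"), (3, "reduce the result")]
assert run_mapreduce(docs) == run_mapreduce(docs, shuffle_seed=42)  # same result, any order
```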

As previously stated, the MapReduce framework also provides strong resilience to failures. Reliable commodity hardware usually has a Mean Time Between Failures (MTBF) of 3 years. In a typical 10,000-server cluster, this implies that roughly ten servers fail every day (10,000 / (3 × 365) ≈ 9 failures per day). Thus, failures in large-scale data centers are a frequent event, and a framework like MapReduce is designed to be resilient to individual machine failures³. Such resilience is hard to achieve since the computing units die silently, i.e. without notifying anyone of their failure. MapReduce includes a monitoring system that pings all the processing units and ensures that all the tasks are eventually completed successfully, as detailed in [47].
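
The monitoring machinery of [47] is not reproduced here; the fragment below is only a hedged sketch of the idea: the master records a heartbeat timestamp per worker, treats any worker that stays silent past a timeout as dead, and puts its tasks back in the pending queue. The names (Worker, HEARTBEAT_TIMEOUT, reclaim_silent_workers) are illustrative assumptions.

```python
import time

HEARTBEAT_TIMEOUT = 60.0  # seconds of silence after which a worker is presumed dead

class Worker:
    """Bookkeeping the master keeps for one processing unit."""
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.last_ping = time.time()   # updated each time the worker answers a ping
        self.assigned_tasks = set()

def reclaim_silent_workers(workers, pending_tasks, now=None):
    """Re-queue the tasks of workers that died silently (no ping within the timeout)."""
    now = time.time() if now is None else now
    for worker in list(workers.values()):
        if now - worker.last_ping > HEARTBEAT_TIMEOUT:
            pending_tasks.extend(worker.assigned_tasks)  # these tasks get rescheduled
            del workers[worker.worker_id]                # forget the dead worker
    return pending_tasks
```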

The initial problems MapReduce was devised to deal with are related to text mining. These problems involve gigantic amounts of data that do not fit in the memory of the processing units dedicated to the computation, as explained in [48]. In such situations, feeding the processing units a continuous flow of data without starving them would require an aggregate bandwidth that is very hard to achieve. MapReduce provides a new solution to this problem by co-locating the data storage and the processing system. Instead of pushing data chunks to the processing units in charge of them, MapReduce pushes the tasks to the processing units whose storage already holds the data chunks to be processed. Such a design requires a distributed file system that can locate where the data are stored, but it removes a lot of stress from the communication devices.
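
A hedged sketch of such locality-aware scheduling: here chunk_locations stands in for the distributed file system's metadata (which nodes hold a replica of which chunk), and the scheduler simply prefers an idle node that already stores the chunk, falling back to a remote node only when no local one is free. The function and variable names are illustrative, not those of an actual implementation.

```python
def assign_tasks(chunks, chunk_locations, idle_nodes):
    """chunk_locations: chunk id -> set of node ids holding a replica (DFS metadata)."""
    assignments = {}
    idle = set(idle_nodes)
    for chunk in chunks:
        local = chunk_locations.get(chunk, set()) & idle
        node = next(iter(local), None) or next(iter(idle), None)  # prefer a node holding the data
        if node is None:
            break                      # no idle node left; remaining chunks wait
        assignments[chunk] = node
        idle.discard(node)
    return assignments

print(assign_tasks(["c1", "c2"], {"c1": {"n1"}, "c2": {"n3"}}, {"n1", "n2"}))
# {'c1': 'n1', 'c2': 'n2'}  -- c2 is read remotely because its only replica (n3) is busy
```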

2. Provided that the map and reduce operators are deterministic functions of their input values.

3. With the noticeable exception of the single orchestration machine referred to as the master.
