Assessing MapReduce for Internet Computing: A Comparison of Hadoop and BitDew-MapReduce

Lu Lu, Hai Jin, Xuanhua Shi
Cluster and Grid Computing Lab
Services Computing Technology and System Lab
Huazhong University of Science and Technology
Wuhan, 430074, China
{llu, hjin, xhshi}@hust.edu.cn

Gilles Fedak
INRIA / University of Lyon
Lyon, 69364, France
gilles.fedak@inria.fr

Abstract—MapReduce is emerging as an important programming model for data-intensive applications. Adapting this model to desktop grids would allow exploiting the vast amount of computing power and distributed storage they offer to execute a new range of applications able to process enormous amounts of data. In 2010, we presented the first implementation of MapReduce dedicated to Internet Desktop Grids, built on the BitDew middleware. In this paper, we present new optimizations to BitDew-MapReduce (BitDew-MR): aggressive task backup, intermediate result backup, task re-execution mitigation and network failure hiding. We also propose a new experimental framework which emulates key fundamental aspects of Internet Desktop Grids. Using this framework, we compare BitDew-MR and the open-source Hadoop middleware on Grid5000. Our experimental results show that 1) BitDew-MR successfully passes all the stress-tests of the framework, while Hadoop is unable to work in a typical wide-area network topology which includes PCs hidden behind firewalls and NATs; 2) BitDew-MR outperforms Hadoop on several aspects: scalability, fairness, resilience to node failures, and network disconnections.

Keywords-desktop grid computing, MapReduce, data-intensive application, cloud computing

I. INTRODUCTION

Researchers in various fields want to use large numbers of computing resources to attack problems of enormous scale. Desktop grids have shown their capability to address this need by harvesting the computing, network and storage resources of idle PCs distributed over multiple LANs or the Internet, especially for CPU-intensive applications. We believe that applications could benefit not only from the vast CPU processing power but also from the huge data storage potential offered by desktop grids [3]. Distributed data processing has been widely used and studied, especially after Google showed the feasibility and simplicity of MapReduce for handling massive amounts of web search data on their internal commodity clusters [11]. Recently, Hadoop [1] has emerged as the industrial standard for parallel data processing on enterprise data centers. Many projects are exploring ways to support MapReduce on different types of environments (e.g.
Mars [16] for GPUs, Phoenix [26] for large SMPs), and for a wider range of applications [9].

Implementing the MapReduce programming model on desktop grids raises many challenges due to the low reliability of these infrastructures. In 2010, we proposed [5] the first implementation of MapReduce for desktop grids based on the BitDew middleware [14]. Typical systems, such as BOINC [2] or XtremWeb [7], are oriented towards Bag-of-Tasks applications and are built on a simple master/slave architecture, where the workers pull tasks from a central server when they are idle. The architecture we propose radically differs from traditional desktop grid systems in many aspects. Following a data-centric approach, files and tasks are scheduled independently by two different schedulers. Communication patterns are more complex than the regular worker-to-master one, because collective communications happen at several steps of the computation: the initial distribution of file chunks, the shuffle, and the final reduction. Because of the tremendous amount of data to process, some components, such as the result checker, are decentralized. As a result, a strict methodology is needed to assess the viability of this complex architecture under realistic Internet conditions.

We summarize the main contributions of this paper as follows:

- We implement a novel, sophisticated MapReduce scheduling strategy for dynamic and volatile environments, which tolerates data transfer and communication faults, avoids unnecessary task re-executions and aggressively backs up slow tasks.

- We propose a new experimental framework which emulates key fundamental aspects of Internet Desktop Grids (faults, host churn, firewalls, heterogeneity, network disconnections, etc.) based on the analysis of real desktop grid traces. In the paper, we present a large variety of execution scenarios which emulate up to 100,000 nodes.


Figure 1. Execution overview of MapReduce.

- Using the emulation framework on Grid5000, we experimentally evaluate BitDew-MapReduce against Hadoop, the reference implementation, which also has fault-tolerance capabilities. Results show that BitDew-MR successfully passes all the tests of the framework, while Hadoop would be unable to run on an Internet Desktop Grid. Moreover, thanks to the new scheduling optimizations, BitDew-MR outperforms Hadoop on several aspects: scalability, fairness, resilience to node failures, and network disconnections.

The rest of the paper is organized as follows. In Section II we give the background of our research. In Section III, we detail the runtime system design and implementation. In Sections IV and V we report the performance evaluation. We discuss related work in Section VI and conclude in Section VII.

II. BACKGROUND

A. MapReduce

The MapReduce programming model [11] borrows concepts from two list-processing combinators, map and reduce, known from Lisp and many other functional languages. This abstraction isolates the expression of users' computations from the details of massively parallel data processing on distributed systems, which are handled by the MapReduce runtime system. The execution process consists of three phases:

- The runtime system reads the input (typically from a distributed file system) and parses it into key/value-pair records. The map function iterates over the records and maps each of them into a set of intermediate key/value pairs.

- The intermediate pairs are partitioned by the partition function, then grouped and sorted according to their keys. An optional combine function can be invoked to reduce the size of the intermediate data.

- The reduce function reduces the results of the previous phase, once per unique key in the sorted list, to arrive at a final result.
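To make the model concrete, the following minimal, framework-independent Java sketch (our illustration, not code from the paper) shows the two user-supplied functions for the word count application used later as a benchmark; splitting, shuffling and sorting are assumed to be handled by the runtime system.

    import java.util.*;

    // Runtime-agnostic sketch of the user-supplied word count functions.
    public class WordCountFunctions {

        // map: (offset, line) -> list of (word, 1)
        static List<Map.Entry<String, Integer>> map(long offset, String line) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
            return out;
        }

        // reduce: (word, [1, 1, ...]) -> (word, count)
        static Map.Entry<String, Integer> reduce(String word, Iterable<Integer> counts) {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            return new AbstractMap.SimpleEntry<>(word, sum);
        }
    }

For this job the optional combine function can simply reuse reduce, since summation is associative; this is what makes the combiner effective at shrinking intermediate data.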
B. Hadoop

Hadoop is the reference MapReduce [11] implementation targeting commodity clusters and enterprise data centers [1]. It consists of two fundamental subprojects: the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce framework.

HDFS is a master/slave distributed file system inspired by GFS [15], which provides high-throughput access to application data. A NameNode daemon, running on the master server, manages the file system metadata, logically splits files into equal-sized blocks and controls the block distribution across the cluster, taking into account the replication factor of each file for fault tolerance. Several DataNode daemons, running on the slave nodes that actually store the data blocks, execute management tasks assigned by the NameNode and serve read/write requests from user clients.

The Hadoop MapReduce framework runs on top of HDFS and is also based on the traditional master/slave architecture. The master node runs a single JobTracker daemon to manage job status and task assignment. On each slave node, a TaskTracker daemon is responsible for launching new JVM processes to execute tasks, while periodically reporting task progress and idle task slots (the slot number is the maximum number of map/reduce tasks that can run concurrently on the slave) to the JobTracker through heartbeat signals. The JobTracker then updates the status of the TaskTracker and assigns new tasks to it, considering slot availability and data locality.

C. BitDew-MapReduce

BitDew [14] is an open-source data management middleware which can easily be integrated as a subsystem within full-fledged desktop grid systems such as XtremWeb [7] and BOINC [2]. It provides simple APIs with a high-level data abstraction, named Attributes, to control the life cycle, distribution, placement, replication and fault tolerance of data in highly dynamic and volatile environments. The BitDew runtime environment adopts a flexible distributed service architecture: 1) it uses an open-source object persistence module and does not rely on a specific relational database for the data catalog; 2) it integrates various asynchronous and synchronous data transfer protocols, including FTP, HTTP and BitTorrent, giving users the freedom to select the most suitable one for their applications.

Our BitDew-MapReduce (BitDew-MR) prototype [5] contains three main components: the API of the MapReduce programming model, the MapReduce library that includes the master and worker daemon programs, and a benchmark MapReduce word count application. Figure 2 illustrates the architecture of the BitDew-MR runtime system. It separates the nodes into two groups: stable nodes run the various independent services which compose the runtime environment, and volatile nodes provide the storage and computing resources to run the map and reduce tasks. Normally, programmers do not use the various services directly; instead they call the API, which encapsulates the complexity of the internal protocols. They can use the BitDew API (or the command-line tool) to upload input data to workers and the MapReduce API to build their applications.
The master and worker daemons of the MapReduce library handle the interaction with the BitDew services for data management.
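For reference on the Hadoop side of the comparison (Section II.B), the word count benchmark is the canonical MapReduce example; a minimal version against the Hadoop 0.21-era MapReduce API looks roughly as follows. This is our sketch, following the standard example shipped with Hadoop, not code from the paper, and minor API details vary between Hadoop versions.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HadoopWordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);          // emit (word, 1)
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));   // emit (word, count)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(HadoopWordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The driver shows where the job-level knobs used in the evaluation (number of reducers, combiner, input/output paths) are wired; in BitDew-MR the equivalent wiring goes through data attributes instead, as described in Section III.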


Figure 2. General overview of the system (BitDew services - DC, DR, DT, DS, DM - over http/ftp/SQL back-ends, with the MapReduce master and worker built on top of the BitDew, TransferManager and ActiveData APIs).

III. SYSTEM DESIGN AND IMPLEMENTATION

In this section we describe the runtime techniques of BitDew-MR. We focus our discussion on the new BitDew attribute processing algorithms and on the implementation of the main software components and their features.

A. Event-driven Task Scheduling

The key feature of BitDew is to leverage data attributes, which are not only used to index and search data files, but also to dynamically control the repartitioning and distribution of data onto the storage nodes. Programmers can also use data transfer events, by manipulating data attributes, to trigger task assignment actions, thereby avoiding building their own scheduling component from scratch. For more details about the six abstractions defined by BitDew, see [14].

Unfortunately, it is not trivial to implement MapReduce task scheduling just by manipulating the BitDew data attributes. We summarize four core functionalities of MapReduce scheduling design: a) data-location-aware task selection; b) idle-worker-pull-based dynamic task assignment; c) fault-tolerant scheduling by re-executing failed tasks; d) speculative execution by backing up slow tasks. Hadoop implements all these functionalities and improves them in two ways: a task slot abstraction to specify the number of concurrent tasks for efficient utilization of multi-core nodes, and conservative task backup with straggler detection by comparing each task's progress score with the average value.

The traditional data processing approach on desktop grids distributes input files just at the beginning of the job execution, which makes data-local scheduling meaningless. We define a new data attribute, MUTAFF - standing for mutual affinity - to support separating input data distribution from the execution process, thus allowing users to cache their data on the worker nodes before launching their jobs. As its name suggests, MUTAFF is the bidirectional version of the original AFFINITY attribute.

An intuitive approach is to use DISTRIB to simulate the task slot abstraction, FT to implement task re-execution and REPLICA to back up tasks. But consider a user who sets the REPLICA value to n for his job input data: at the beginning of the job execution, each worker gets n file chunks. Then, whenever workers finish processing these chunks, they have to un-register the chunks' data to trigger the DS into scheduling new data chunks to them, which makes re-executions of the corresponding tasks impossible. Moreover, REPLICA is mainly designed for result checking by majority voting, and running redundant backups for all tasks would be an unacceptable waste of resources (we do not take result checking into account in this paper because it is a challenging problem addressed in our other work [24]).
We use the combined effect of REPLICA and MUTAFF to implement task backup. The actual control logic we implement inside the data scheduler is a little subtle, because the mutual affinity is not a symmetrical relationship. For convenience of discussion, suppose there are data a and b and that a.mutaff = b; we then refer to data a as the strong MUTAFF data and to data b as the weak MUTAFF data. The DS schedules strong MUTAFF data according to the MUTAFF attribute first, and then schedules the remaining replicas according to the REPLICA attribute, and vice-versa.

Algorithm 1 DATA SCHEDULING ALGORITHM
Require: D, the set of data managed by the scheduler
Require: C, the data cache managed by the reservoir of host k
Require: collection(d), the set of data belonging to the same data collection as d
Require: Ω(d), the set of reservoir hosts owning data d
Ensure: S, the new data set managed by the reservoir of host k
 1: S ← ∅
 2: sched_count ← 0
 3: {Resolve mutual affinity of locally cached data}
 4: for all c ∈ C do
 5:   S ← S ∪ {c}
 6:   if (c.mutaff ≠ ∅ and k ∉ Ω(c.mutaff)) then
 7:     S ← S ∪ {c.mutaff}
 8:   end if
 9: end for
10: main:
11: for all d ∈ (D \ S) do
12:   {Resolve mutual affinity dependences}
13:   for all c ∈ C do
14:     if ((d.mutaff == c) and (d ∉ S)) then
15:       S ← S ∪ {d}
16:       continue main
17:     end if
18:   end for
19:   {Schedule replicas}
20:   if ((|Ω(d)| < d.mutaff.replica) or (|Ω(d)| < d.replica)) then
21:     dist_count ← 0
22:     for all c ∈ S do
23:       if (c ∈ collection(d)) then
24:         dist_count ← dist_count + 1
25:       end if
26:     end for
27:     if (dist_count < d.distrib) then
28:       S ← S ∪ {d}
29:       sched_count ← sched_count + 1
30:     end if
31:   end if
32:   if (sched_count ≥ MaxDataSchedule) then
33:     break
34:   end if
35: end for
36: return S


Algorithm 1 presents the pseudo-code of the modified data scheduling algorithm. Whenever a worker program reports the set of data it holds locally through a heartbeat message, the Data Scheduler iterates over the worker's local data list and the global data list in order to make the scheduling decision according to their attributes, and uses a MaxDataSchedule threshold to limit the number of new data assigned per heartbeat, thereby balancing the data distribution among all the workers. We omit the details that are less relevant to our event-driven task scheduling. The scheduler first adds all the data that should be kept in the worker's local data list to the newly assigned data set S, according to their life-cycle attributes, while checking whether these kept data have mutually affinitive data in the global data list (strong MUTAFF data). If so, the scheduler adds the affinitive data (weak MUTAFF data) to S. The scheduler then finds all the weak MUTAFF data in the worker's local data list and adds their affinitive strong MUTAFF data to S. The remaining strong MUTAFF data in the global data list are assigned according to their DISTRIB and REPLICA attributes, regardless of MUTAFF. As in the original algorithm, the DISTRIB attribute is always stronger than MUTAFF and REPLICA. The old DISTRIB could only limit the number of data simultaneously held by a worker that have the same DISTRIB attribute (the same attr id); we also extend it to restrict the number of data that belong to the same DataCollection [5].

B. The BitDew-MapReduce Runtime

Our previous work [5] mainly aimed at showing the feasibility of building a BitDew-based MapReduce runtime for large-scale and loosely connected Internet Desktop Grids. We rewrite the upper-layer MapReduce API to allow users to isolate their application-specific map and reduce functions from the data management code. Users can use the BitDew command-line tool to submit input data and launch jobs separately. The pre-uploaded input data can be distributed to and cached on worker machines before the execution of the corresponding data processing job. This significantly improves the data locality of subsequently submitted jobs, since the running map tasks already have their input data chunks on local machines. We also re-implement the master and worker daemons using the MUTAFF attribute, and adopt a new event-handler-thread design to cope with worker-side network failures.

Master

If a user uses the command-line tool to upload his input file, it will automatically split the file into equal-sized data chunks and return the id of the corresponding DataCollection. Because we do not need to cache multiple replicas on worker machines to guarantee the accessibility of the input file, all data in this DataCollection are set to attr = {replicat = 1, ft = true}. When a job launches, the master daemon initializes the job configuration, fetches all data of the input chunks from the DC service by the input DataCollection id, and creates task token data that are used to trigger worker daemons to launch the corresponding map/reduce tasks. All map tokens are given a MUTAFF towards their corresponding input chunk: map_token_i.attr = {replicat = 2, distrib = 1, mutaff = input_data_i}. The combination of REPLICA, DISTRIB and MUTAFF lets the scheduler dynamically assign map tokens to workers and balance the load. When the job is close to the end of the map phase, the scheduler replicates the remaining map tasks, as indicated by the REPLICA values of the map tokens, on idle workers, thereby shortening the job make-span. The mutual affinity triggers workers that receive a new map token to download the required task input files.

At the end of the computation, several result files are generated by the reduce tasks and have to be retrieved by the master. The master creates an empty data item, the Collector, and every worker sets an AFFINITY attribute from its Result data to the Collector data. In this way, results are automatically transferred to the master node.
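To make the attribute settings above concrete, here is a small, self-contained Java sketch that builds the attribute sets the master attaches to an input chunk and to its map token, exactly with the values quoted in the text. It uses plain data structures only; the helper attrs() and the chunk count are ours, and this is not the BitDew Attribute API.

    import java.util.HashMap;
    import java.util.Map;

    // Plain-Java illustration of the attribute values described in the text;
    // it mimics, but is not, the BitDew Attribute API.
    public class MasterAttributes {

        static Map<String, Object> attrs(Object... kv) {
            Map<String, Object> m = new HashMap<>();
            for (int i = 0; i < kv.length; i += 2) m.put((String) kv[i], kv[i + 1]);
            return m;
        }

        public static void main(String[] args) {
            int numChunks = 800;   // e.g. a 50 GB input split into 64 MB chunks

            for (int i = 0; i < numChunks; i++) {
                String chunk = "input_data_" + i;
                String token = "map_token_" + i;

                // Input chunks: one cached copy, fault-tolerant rescheduling on loss.
                Map<String, Object> chunkAttr = attrs("replicat", 1, "ft", true);

                // Map tokens: up to 2 replicas (speculative backup near the end of
                // the map phase), at most 1 such token per worker (slot emulation),
                // and mutual affinity with the chunk so the token follows the data.
                Map<String, Object> tokenAttr =
                    attrs("replicat", 2, "distrib", 1, "mutaff", chunk);

                System.out.println(chunk + " -> " + chunkAttr);
                System.out.println(token + " -> " + tokenAttr);
            }
        }
    }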
Figure 3. Worker handles.

Worker

The worker daemon periodically gets data from the DS service using the ActiveData API, and then determines the actions to perform according to the type of data received and their attributes. If a user submits an input file, all the workers download the split chunks assigned to them after getting the corresponding data from the DS. We implement a multi-threaded transfer component which can process several concurrent file transfers, especially for the synchronous protocols. After a user launches the job, the map tokens are sent to the workers according to their MUTAFF attributes, while the reduce tokens are scheduled in a round-robin way. We borrow the task slot abstraction from Hadoop to efficiently utilize the computing capacity of modern multi-core hardware: each slot is assigned to a separate map/reduce execution thread in the worker daemon, and the maximum number of concurrent threads (slot number) for map and reduce tasks can be configured. Once a map task is finished, the worker daemon invokes the unpin method to un-associate the token data from its host, so that the scheduler can assign it new map tokens within the limit set by the DISTRIB attribute. After the output file of the task has been sent to the stable storage, the worker daemon un-registers the task token with the DC and DS services to avoid any future re-execution of this task.

We use two different threads to invoke the ActiveData callbacks and to process file transfers: one for transfer management and the other for data control. The main principle is to avoid putting time-consuming work and blocking I/O procedures in the bodies of the ActiveData callback methods. Otherwise they might block the ActiveData main loop from sending heartbeats to the DS, which in turn could make the DS service mistakenly mark the worker as "dead".


We do not put any actual processing logic into the ActiveData callbacks - they only generate events that are added to the proper thread-safe event queues. To make worker programs resilient to temporary network failures, whenever a thread catches a remote communication exception (or any other kind of exception), it simply skips the event being processed and adds it back to the tail of its queue. The TransferManager API has also been modified to support automatic file retransmission: the TransferManager main loop simply re-initializes a transfer if a failure occurs.
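The short Java sketch below (our own illustration, not the BitDew-MR source) shows the kind of event-handler-thread design just described: callbacks only enqueue events, a separate handler thread drains the queue, and events whose processing hits a communication failure are re-queued instead of being dropped.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative event-handler-thread design (not the actual BitDew-MR code):
    // callbacks stay cheap, slow work happens on a separate thread, and events
    // hit by transient network errors are re-queued for a later retry.
    public class WorkerEventLoop {

        interface Event { void process() throws java.io.IOException; }

        private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

        // Called from the ActiveData-style callback thread: never blocks on I/O,
        // so heartbeats keep flowing and the worker is not marked "dead".
        public void onDataEvent(Event e) {
            queue.add(e);
        }

        // Dedicated handler thread: drains the queue; on a communication failure
        // the event goes back to the tail of the queue instead of being lost.
        public void runHandlerLoop() throws InterruptedException {
            while (!Thread.currentThread().isInterrupted()) {
                Event e = queue.take();
                try {
                    e.process();                 // e.g. download a chunk, upload a result
                } catch (java.io.IOException transientFailure) {
                    queue.add(e);                // retry later; the network may be back
                    Thread.sleep(1000);          // back off briefly before the next event
                }
            }
        }
    }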
IV. EXPERIMENTAL METHODOLOGY

A. Platform and Application

We perform all our experiments on the GdX and NetGdX clusters, which are part of the Grid5000 infrastructure. The two clusters are composed of 356 IBM eServer nodes, each featuring one 2-core 2.0 GHz AMD Opteron CPU and 2 GB of RAM. All nodes run Debian with kernel 2.6.18 and are interconnected by a gigabit Ethernet network. All results described in this paper are obtained using Hadoop version 0.21.0, with data stored at 2 replicas per block in the Hadoop Distributed File System. We perform our experiments by repeatedly executing the word count benchmark on a 50 GB dataset generated by the Hadoop RandomTextWriter application. The block size is set to 64 MB, and the task slots for map and reduce tasks are set to 2 and 1 respectively. We fix the number of reducers per job to 10. The job make-span baselines of Hadoop and BitDew-MR in the normal case are 399 seconds and 246 seconds respectively.

B. Emulation Scenarios

It is difficult to conduct experiments on large-scale distributed systems such as desktop grids and to reproduce the original results, because: 1) the implementation of the system runtime plays an important role in the overall performance; 2) the resources of the different machines of a desktop grid can be heterogeneous, hierarchical, or dynamic; 3) machine failures and users' usage behaviors make the system performance very hard to predict. We design seven experimental scenarios to emulate an Internet-scale desktop grid environment on the Grid5000 platform. This environment emulation is based on the analysis of both desktop grid system implementations and traces representing node availability in real desktop grids.

- Scalability. Volunteer computing projects such as SETI@home may have millions of participants, but the number of server machines which offer the core system management services is relatively small. If a large number of participant clients simultaneously connect to the central servers, a disastrous overload can occur. The first scenario evaluates the scalability of the core master services of Hadoop and BitDew-MR. We run the central service daemons on one master node, and multi-threaded clients that periodically perform remote meta-data creation operations on the master on 100 worker nodes. We tune the number of concurrent threads of each client and the operation interval.

- Fault Tolerance. Google MapReduce uses a straightforward task re-execution strategy to handle frequent but small fractions of machine failures, based on observations of their commodity clusters [11]. However, the major contributor to resource volatility in desktop grids is not machine failure but users' personal behaviors, such as shutting down their machines. Moreover, typical desktop grid systems including BOINC [2] and Condor [27] simply suspend the running tasks when they detect other active jobs through keyboard or mouse events, which in turn aggravates the problem. CPU availability traces of participating nodes gathered from a real enterprise desktop grid [21] show that: a) the independent single-node unavailability rate is about 40% on average; b) up to 90% of the resources can be unavailable simultaneously, which may have a catastrophic effect on the running jobs. We emulate this kind of machine unavailability by killing worker and task processes on 25 worker nodes at different progress points of the map phase of the job execution.

The Hadoop MapReduce runtime system can handle two different kinds of failures: child task process failures and TaskTracker daemon failures. We conduct three experimental scenarios for the different failure modes: 1) kill all child map task processes; 2) kill the TaskTracker processes; 3) kill all Java processes, including the DataNode daemons, to emulate a whole-machine crash. At the same time, considering that the common case in a desktop grid environment is not a process failure but the crash or departure of whole machines, we make the BitDew-MR worker daemon multi-threaded within a single process, which simplifies data sharing between the different system modules. We simply kill the single worker process on each of the 25 chosen nodes to emulate the machine crash.

- Host Churn. The independent arrival and departure of thousands or even millions of peer machines leads to host churn. We periodically kill the MapReduce worker process on one node and launch it on a new node to emulate the host churn effect. To increase the survival probability of Hadoop job completion, we increase the HDFS chunk replica factor to 3 and set the DataNode heartbeat timeout value to 20 seconds.

- Network Connectivity. We set firewall and NAT rules on all the worker nodes to disable all server-initiated and inter-worker network connections.

- CPU Heterogeneity. We emulate CPU heterogeneity by adjusting the CPU frequency of half of the worker nodes to 50% with Wrekavoc [8].

- Straggler. A straggler is a machine that takes an unusually long time to complete one of the last few tasks in the computation [11]. We emulate stragglers by adjusting the CPU frequency of target nodes to 10% with Wrekavoc.
We emulate stragglers by adjustingCPU frequency <strong>of</strong> target nodes to 10% with Wrekavoc.- Network Failure Tolerance. The runtime system mustbe resilient to the temporary network isolation <strong>of</strong> a portion <strong>of</strong>the machines, which is very common in the Internetenvironment for the sake <strong>of</strong> users' behaviors <strong>and</strong> networkhardware failures. We inject temporary <strong>of</strong>f-line 25-secondwindow periods in 25 worker nodes at different job progresspoints. To make sure the system master will mark the <strong>of</strong>flineas dead, we set the worker heartbeat timeout to 10 seconds.


Figure 4. Scalability evaluation of the core system services: (a) Hadoop creates empty files and (b) BitDew creates data.

TABLE I. PERFORMANCE EVALUATION OF THE FAULT TOLERANCE SCENARIO

Job progress at the crash point:               12.5%   25%    37.5%   50%    62.5%   75%    87.5%   100%
Hadoop, kill tasks:   re-executed map tasks     50      50     50      50     50      50     50      50
                      job make-span (sec.)      425     425    423     427    426     429    431     453
Hadoop, kill TTs:     re-executed map tasks     50      100    150     200    250     300    350     400
                      job make-span (sec.)      816     823    809     815    820     819    812     814
Hadoop, kill all:     failed
BitDew-MR, kill all:  re-executed map tasks     50      0      50      0      50      0      50      0
                      job make-span (sec.)      450     411    389     351    331     299    279     247

V. EXPERIMENT RESULTS

A. Scalability

Figure 4 presents the operation throughput when varying the number of concurrent threads and the time interval between operations for Hadoop and BitDew-MR. As shown in Figure 4, increasing the number of concurrent clients results in a dramatic decrease in the number of meta-data operations per second for both Hadoop and BitDew-MR. However, BitDew-MR shows better scalability than Hadoop, as it can achieve acceptable throughput under a typical Desktop Grid configuration in which 1,000,000 PCs create meta-data every few minutes.

At first, we thought that the significant decrease in throughput under 1,000,000 concurrent emulated clients was caused by a disk I/O bottleneck. But considering that the Hadoop NameNode persists its in-memory image into a write-ahead log file to group small random disk I/O operations into large sequential ones, and that we use a pure in-memory database as the BitDew backend data store during the experiments, the actual performance bottleneck of both systems is the synchronization overhead of highly concurrent RPC invocations. The Hadoop team also reported a similar scalability issue when the cluster size reaches tens or hundreds of thousands of nodes. A feasible solution is to replace the thread-based RPC model with an event-driven asynchronous I/O model.
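As a rough back-of-the-envelope figure (our own estimate, assuming each PC creates one meta-data entry every three minutes, which the paper only states as "every few minutes"), that configuration corresponds to

    10^6 clients / 180 s ≈ 5.6 x 10^3 meta-data operations per second,

so the master services only need to sustain a few thousand small RPCs per second, which is the regime the scalability test probes.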
B. Fault Tolerance

Table I shows the job make-span times and the number of re-executed map tasks in the fault tolerance scenario. In the Hadoop case: for the first test, whenever we kill the running child tasks on 25 nodes, the JobTracker simply re-schedules the 50 killed map tasks and prolongs the job make-span by about 6.5% compared to the normal case. For the second test, the JobTracker blindly re-executes all successfully completed and in-progress map tasks of the failed TaskTrackers, which means that the 25 chosen worker nodes contribute nothing to the overall job execution progress; as a result, the job make-span is almost doubled compared to the baseline. Finally, when killing all the Java processes on half of the worker nodes, jobs simply fail due to the permanent loss of input chunks.

On the other hand, BitDew-MR avoids a substantial amount of unnecessary fault-tolerance work. Because, in BitDew-MR, the intermediate outputs of completed map tasks are safely stored on the stable central storage server, the master does not re-execute the successfully completed map tasks of failed workers. The main causes of the additional job make-span in BitDew-MR when worker nodes fail are the loss of half of the total computing resources and the time needed to re-download the input chunks to the surviving worker nodes.

C. Host Churn

As Table II shows, for the tests with host churn intervals of 5, 10, and 25 seconds, Hadoop jobs only progress up to about 80% of the map phase before they fail. The reason is that when the job enters its last stage, a great mass of input file chunks is concentrated on the few remaining old worker nodes. When new nodes join, they can only take over a small fraction of the chunks. Eventually, HDFS cannot maintain the replica level, resulting in permanent loss of data. For the tests with intervals of 30 and 50 seconds, once an old worker leaves, the JobTracker re-assigns all of its completed map tasks to other nodes, which significantly delays the total job execution time.


TABLE II. PERFORMANCE EVALUATION OF THE HOST CHURN SCENARIO

Churn interval (sec.)              5        10       25       30      50
Hadoop job make-span (sec.)        failed   failed   failed   2357    1752
BitDew-MR job make-span (sec.)     457      398      366      361     357

Similar to the fault tolerance scenario, the BitDew-MR runtime does not waste the work completed by the eventually failed worker nodes; therefore host churn has very little effect on job execution performance.

D. Network Connectivity

In this test, Hadoop could not even launch a job, because HDFS needs inter-communication between DataNodes. On the other hand, BitDew-MR works properly, and its performance is almost the same as the baseline under normal network conditions.

E. CPU Heterogeneity and Straggler

As Figure 5 shows, Hadoop's dynamic task scheduling approach works very well when worker nodes have different classes of CPU: nodes from the fast group process 20 tasks on average and the ones from the slow group get about 11 tasks. Although BitDew-MR has the same scheduling heuristic, it does not perform as well, as shown in Figure 6: the nodes from the two groups get almost the same number of tasks. The reason is that we maintain only one chunk copy on the worker nodes, because of the assumption that there is no inter-worker communication and data transfer. Thus, although fast nodes spend half as much time as the slow nodes to process their local chunks, they still need considerable time to download new chunks before launching additional map tasks.

TABLE III. PERFORMANCE EVALUATION OF THE STRAGGLERS SCENARIO

Straggler number                   1      2      5      10
Hadoop job make-span (sec.)        477    481    487    490
BitDew-MR job make-span (sec.)     211    245    267    298

Figure 5. Hadoop map task distribution over 50 workers.

Figure 6. BitDew-MapReduce map task distribution over 50 workers.

F. Network Failure Tolerance

In case of network failures, as shown in Table IV, the Hadoop JobTracker simply marks all the temporarily disconnected nodes as "dead" - although they are still running tasks - and blindly removes all the tasks completed by these nodes from the successful task list. Re-executing these tasks significantly prolongs the job make-span. Meanwhile, BitDew-MR cleanly allows workers to go temporarily offline without any performance penalty.

The key idea behind the map task re-execution avoidance which makes BitDew-MR outperform Hadoop under the machine and network failure scenarios is to allow reduce tasks to re-download map outputs that have already been uploaded to the central stable storage, rather than re-generate them.
Another benefit of this method is that it does not introduce any extra overhead, since all the intermediate data are always transferred through the central storage regardless of whether the fault-tolerance strategy is used. However, the overhead of data transfer to the central stable storage makes desktop grids more suitable for applications which generate little intermediate data, such as Word Count and Distributed Grep. To mitigate the data transfer overhead, we can use a storage cluster with large aggregated bandwidth. The emerging public cloud storage services also provide an alternative solution, which can be considered in our future work.

VI. RELATED WORK

There have been many studies on improving MapReduce performance [17, 18, 19, 28] and on exploring ways to support MapReduce on different architectures [16, 22, 23, 26]. A closely related work is MOON [23], which stands for MapReduce On Opportunistic eNvironments. Unlike our work, MOON limits the system scale to a campus area and assumes that the underlying resources are hybrid, organized by provisioning a fixed fraction of dedicated stable computers to supplement the other volatile personal computers, which is much more difficult to implement in Internet-scale desktop grids. The main idea of MOON is to prioritize new tasks and important data blocks and to assign them to the dedicated stable machines, to guarantee smooth progress of jobs even when many volatile PCs join and leave the system dynamically. MOON also makes some tricky modifications to Hadoop in order to solve the problem that the heartbeat reporting and data serving of the native Hadoop worker daemons can be blocked by the PC users' actions; this is not an issue for systems originally designed for volunteer computing.


TABLE IV. PERFORMANCE EVALUATION OF THE NETWORK FAULT TOLERANCE SCENARIO

Job progress at the crash point:             12.5%   25%    37.5%   50%    62.5%   75%    87.5%   100%
Hadoop:      re-executed map tasks            50      100    150     200    250     300    350     400
             job make-span (sec.)             425     468    479     512    536     572    589     601
BitDew-MR:   re-executed map tasks            0       0      0       0      0       0      0       0
             job make-span (sec.)             246     249    243     239    254     257    274     256

Ko et al. [22] replicate inter- and intra-job intermediate data among workers through low-priority TCP transfers to utilize idle network bandwidth. In this paper we focus on intra-job intermediate data availability, and we only replicate it on the central storage because inter-worker communication is prohibited in desktop grids.

There is existing work on the simulation and emulation of distributed systems. Well-known general-purpose grid simulators include GridSim [4] and SimGrid [25]. OptorSim [6] focuses on studying and validating dynamic replication techniques. Simulation is usually sufficient for designing and validating algorithms, but not for evaluating real large-scale distributed systems. EmBOINC [13] uses a hybrid approach that simulates the population of volunteered BOINC clients. We use the same methodology to evaluate the scalability of the BitDew services, using one hundred nodes to simulate a huge number of concurrent clients. Wrekavoc [8] is a heterogeneity emulator that controls the environment by degrading the nodes' performance, which is similar to our approach; we use it in the heterogeneity scenarios.

Kondo et al. [21] measure a real enterprise desktop grid to analyze how the temporal characteristics of PCs affect the utility of desktop grids. Javadi et al. [20] use clustering methods to identify hosts whose availability is independent and identically distributed, based on availability traces from real systems. Nurmi et al. [12] develop an automatic method for modeling the availability of Internet resources. The traces and models extracted from real environments can be used as workload input for a stricter and more accurate evaluation of availability-aware task-scheduling algorithms, which is also one of our future directions for BitDew-MR.

VII. CONCLUSIONS

Desktop grid computing offers a vast amount of computing resources, which can be efficiently used for running scientific applications. However, as the data generated by scientific instruments keep increasing, many efforts are devoted to utilizing Desktop Grids for data-intensive applications.
Accordingly, in this paper, we extend our BitDew-MR framework and add new features, including aggressive task backup, intermediate result replication, task re-execution avoidance, and a network latency hiding optimization, with the aim of facilitating the usage of large-scale desktop grids. We then design a new experimental framework which emulates key fundamental aspects of Internet desktop grids to validate and evaluate BitDew-MR against Hadoop.

Our evaluation results demonstrate that: 1) BitDew-MR successfully passes all the stress-tests of the framework, while Hadoop is unable to work in a typical wide-area network topology which includes PCs hidden behind firewalls and NATs; 2) BitDew-MR outperforms Hadoop on several aspects: scalability, fairness, resilience to node failures, and network disconnections.

ACKNOWLEDGMENT

Experiments presented in this paper were carried out using the Grid5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

This work is supported by the NSFC under grants No. 61133008 and No. 60973037, the National Science and Technology Pillar Program under grant 2012BAH14F02, the Wuhan Chenguang Program under grant No. 201050231075, the MOE-Intel Special Research Fund of Information Technology under grant MOE-INTEL-2012-01, and the Agence Nationale de la Recherche under contract ANR-10-SEGI-001.

REFERENCES

[1] Apache Hadoop. Available: http://hadoop.apache.org/

[2] D. P. Anderson, "BOINC: a system for public-resource computing and storage," in Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing (GRID'04), 2004.

[3] D. P. Anderson and G. Fedak, "The Computational and Storage Potential of Volunteer Computing," in Proceedings of the 6th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'06), 2006.

[4] R. Buyya and M. M. Murshed, "GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing," in CoRR, 2002.

[5] B. Tang, M. Moca, S. Chevalier, H. He, and G. Fedak, "Towards MapReduce for Desktop Grid Computing," in Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC'10), 2010.

[6] W. H. Bell, D. G. Cameron, L. Capozza, A. P. Millar, K. Stockinger, and F. Zini, "OptorSim - A Grid Simulator for Studying Dynamic Data Replication Strategies," International Journal of High Performance Computing Applications, 17(4), 2003.


[7] F. Cappello, S. Djilali, G. Fedak, T. Herault, F. Magniette, V. Néri, and O. Lodygensky, "Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid," Future Generation Computer Systems, vol. 21, pp. 417-437, 2005.

[8] L. C. Canon and E. Jeannot, "Wrekavoc: a tool for emulating heterogeneity," in Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS'06), 2006.

[9] S. Chen and S. W. Schlosser, "Map-Reduce Meets Wider Varieties of Applications," IRP-TR-08-05, Technical Report, Intel Research Pittsburgh, May 2008.

[10] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce Online," in Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI'10), 2010.

[11] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, Jan 2008.

[12] D. Nurmi, J. Brevik, and R. Wolski, "Modeling Machine Availability in Enterprise and Wide-area Distributed Computing Environments," in Proceedings of the 11th International Euro-Par Conference (EuroPar'05), 2005.

[13] T. Estrada, M. Taufer, K. Reed, and D. P. Anderson, "EmBOINC: An emulator for performance analysis of BOINC projects," in Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing (IPDPS'09), 2009.

[14] G. Fedak, H. He, and F. Cappello, "BitDew: A programmable environment for large-scale data management and distribution," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'08), 2008.

[15] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP'03), 2003.

[16] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a MapReduce framework on graphics processors," in Proceedings of the Seventeenth International Conference on Parallel Architectures and Compilation Techniques (PACT'08), 2008.

[17] X. Huaxia, H. Dail, H. Casanova, and A. A. Chien, "The performance of MapReduce: an in-depth study," in Proceedings of the 36th International Conference on Very Large Data Bases (VLDB'10), 2010.

[18] S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, and S. Wu, "Maestro: Replica-Aware Map Scheduling for MapReduce," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'12), 2012.

[19] S. Ibrahim, H. Jin, L. Lu, B. He, and S. Wu, "Adaptive disk I/O scheduling for MapReduce in virtualized environments," in Proceedings of the 2011 International Conference on Parallel Processing (ICPP'11), 2011.
[20] B. Javadi, D. Kondo, J. Vincent, and D. P. Anderson, "Mining for Statistical Models of Availability in Large-Scale Distributed Systems: An Empirical Study of SETI@home," in Proceedings of the 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'09), 2009.

[21] D. Kondo, M. Taufer, C. L. Brooks, H. Casanova, and A. A. Chien, "Characterizing and evaluating desktop grids: an empirical study," in Proceedings of the 18th IEEE International Symposium on Parallel and Distributed Processing (IPDPS'04), 2004.

[22] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, "Making Cloud Intermediate Data Fault-Tolerant," in Proceedings of the ACM Symposium on Cloud Computing (SOCC'10), 2010.

[23] H. Lin, X. Ma, J. Archuleta, W. C. Feng, M. Gardner, and Z. Zhang, "MOON: MapReduce On Opportunistic eNvironments," in Proceedings of the 19th International Symposium on High Performance Distributed Computing (HPDC'10), 2010.

[24] M. Moca, G. C. Silaghi, and G. Fedak, "Distributed Results Checking for MapReduce in Volunteer Computing," in Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW'11), 2011.

[25] M. Quinson, "SimGrid: a generic framework for large-scale distributed experiments," in Proceedings of the Ninth International Conference on Peer-to-Peer Computing (P2P'09), 2009.

[26] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems," in Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA'07), 2007.

[27] D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience, vol. 17, no. 2-4, pp. 323-356, February-April 2005.

[28] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," in Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI'08), 2008.
