or variable-sized blocks) [8] rather than at the block level, for faster results, but they again suffer from scalability issues in the presence of billions of objects. Post-process de-duplication assumes that the data center has "sleep" times when the application load is significantly lower, which is when de-duplication tasks are best scheduled; but with data centers serving data worldwide, such assumptions are no longer valid. It is inevitable that the de-duplication process competes with IOPS for resources at the controller. We present a scalable distributed de-duplication design that scales out the de-duplication tasks.

2. Design

In our design, we propose a scalable enhancement to current de-duplication systems. The design works best in a clustered storage environment. De-duplication can be sub-divided into two basic tasks: detecting duplicates, and sharing the duplicates (instead of storing multiple instances of the same object). Our design advocates fanning out the compute- and memory-intensive duplicate-detection phase of the de-duplication process to a cluster of compute nodes, which process block fingerprints (typically a hash of the data in the block) and detect candidate duplicate blocks in parallel. The cluster of compute nodes is built from commodity hardware and is part of the storage tier. Leveraging a cluster of commodity nodes allows us to scale out this task of the de-duplication system. The duplicate-detection phase of post-process de-duplication is based entirely on identifying duplicates among the list of collected fingerprints (which are nothing but computed hash indexes of data chunks in the storage system). Fingerprints are typically much smaller than the data chunks they represent. Owing to the much smaller size of the fingerprint database (compared to the size of the original dataset), one does not have to move chunks of the original dataset in order to scale out the duplicate-detection phase. The sharing phase, however, is understandably best done where the actual data resides, since it involves byte-wise comparison of data chunks to ensure they are identical before de-duplicating them. We fan out the fingerprint database for a given volume to our compute cluster for further processing, i.e. duplicate detection, as shown in Fig. 1.

2.1. Benefits

As mentioned in the introduction, it is common practice to limit the number of concurrent de-duplication threads on a controller so as not to compromise the core goal of serving I/O operations. With the proposed design, we can prevent critical storage controller resources from being spent on a workload that can easily be scaled out. Another important aspect of this design is that it leverages cheaper resources which are not at the controller but are still part of the storage tier (commodity hardware). This lowers the overall cost of the solution, while allowing higher IOPS to be served with the resources saved at the controller. A further interesting aspect of this design is that it involves non-shared, coarse-grained parallelism, which enables better scalability. Thus, with our design, the number of concurrent de-duplication processes can be increased by an order of magnitude without consuming additional resources at the controller, simply by adding more commodity hardware to the storage cluster.

[Fig. 1: Distributed de-duplication. The storage controller performs collection of hashes/fingerprints of data and sharing of duplicate data objects; a cluster of commodity compute nodes performs detection of duplicate fingerprints.]

3. Implementation

As explained in the design, our intention is to move the duplicate-detection phase of offline de-duplication to an external cluster of compute nodes built from commodity hardware. Hadoop, an Apache open-source distributed computing framework, is a natural candidate for this task, since duplicate detection is effectively sorting the fingerprints of the blocks and identifying the duplicates among them. This is an ideal use case for the Map-Reduce programming abstraction [2] used in Hadoop. Map-Reduce splits the compute problem into two stages: Map and Reduce. Map is a transformation function and Reduce acts as an aggregation function. Every record in the data is interpreted as a key-value pair in the Map-Reduce programming paradigm.
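To make the key-value interpretation concrete, the sketch below shows how a fingerprint record could be presented to Hadoop as a key-value pair, with the fingerprint as the key and the block attributes as the value. This is an illustration only, not the actual implementation: the on-disk fingerprint format is internal to the storage controller, and here we assume a simple one-record-per-line text encoding with the two fields separated by a tab; the class and field names are ours.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed record encoding (illustrative): "<fingerprint-hex>\t<block attributes>"
public class FingerprintRecordMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2) {
            // key   = fingerprint (hash of the data block)
            // value = block attributes (identify the block on the controller)
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}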


Hadoop Map-Reduce fits our case very well since it is distributed and has an inherent sort capability that sorts all intermediate data on the basis of the intermediate key; this stage is called the Shuffle/Sort phase in Hadoop Map-Reduce. We exploit this feature to detect the duplicate fingerprint records. We also leverage the Hadoop Distributed File System [9], which acts as the source file system for the Map-Reduce jobs that are run.

We used NetApp offline de-duplication [3] to accommodate our Hadoop-based duplicate-detection framework, replacing the duplicate-detection stage of NetApp de-duplication with our mechanism based on Hadoop MapReduce.

The Hadoop-based duplicate-detection workflow that we implemented includes the following stages:
- Receive the fingerprints from the storage controller.
- Generate the fingerprint database and store it persistently on the Hadoop Distributed File System (HDFS). It can be reused for incremental offline de-duplication, and it also aids recovery by serving as a checkpoint: if the current duplicate-detection job fails, de-duplication can resume from the previously saved fingerprint database.
- Generate the duplicate records from the fingerprints and send them back to the storage controller. This triggers the rest of the de-duplication phases in the storage controller.

The following section gives finer implementation details.

3.1. Hadoop-based Duplicate Detection

The structures involved in the data flow of duplicate detection are:

Fingerprint record: [ Fingerprint | Block attributes ]
Duplicate record:   [ Block1 attributes | Block2 attributes ]
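As a rough Java rendering of these records (a sketch on our part: the paper does not specify the concrete fields inside the block attributes, so they are modelled here as opaque strings):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Fingerprint record: the unit ingested by the Hadoop cluster.
public class FingerprintRecord implements Writable {
    public String fingerprint;      // hash of the data block; used as the sort key
    public String blockAttributes;  // locates the block on the storage controller

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fingerprint);
        out.writeUTF(blockAttributes);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fingerprint = in.readUTF();
        blockAttributes = in.readUTF();
    }
}

// A duplicate record would analogously pair the attributes of the two blocks
// (block1 attributes, block2 attributes) that are to be shared during de-duplication.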
The input to our Hadoop cluster is the set of fingerprints of all the blocks, with no inherent ordering (Fig. 2). We sort these ingested fingerprints using the Fingerprint field as the key. The sorted collection is then used to detect the duplicates (fingerprints with the same Fingerprint attribute). The duplicate record shown above contains the block attributes of the two blocks that will undergo sharing during de-duplication. Once we have detected the duplicates using our Map-Reduce jobs, we send them back as a single metafile to the storage controller for the subsequent duplicate-sharing phase of NetApp de-duplication. Meanwhile, we also preserve a copy of the sorted fingerprints, called the fingerprint database (FPDB), within the Hadoop cluster. Fig. 2 outlines the MapReduce (MR) modules we implemented for duplicate detection.

[Fig. 2: Hadoop-based duplicate detection (MR: MapReduce). Fingerprints obtained from the storage controller feed the FPDBSort MR module, which produces the Fingerprint Database (stored in HDFS) and the unsorted duplicates; the DupSort MR module then sorts the duplicates, which are sent back to the storage controller for de-duplication.]

We run two MapReduce jobs serially to perform duplicate detection in the Hadoop cluster.

1) FPDBSort

This MapReduce module takes the fingerprints as input from HDFS and generates the Fingerprint Database (the sorted fingerprint file) and the duplicates file. The duplicates file it generates contains the duplicate records, but not in sorted order. The parameters of the module are as follows; a Java sketch of its reduce step is given after this list, and its Map-Reduce pseudocode in Fig. 3.
- Input: Fingerprints – directory on HDFS which contains the fingerprints.
- Output: FPDB – file which contains the fingerprints in sorted order.
- Output: Dup_unsorted – file which contains the duplicate records, not yet sorted.
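As a rough illustration only, the reduce step of this module might be written against Hadoop's Java API as follows, reusing a mapper like the one sketched in Section 3 so that the fingerprint is the intermediate key. The class names and record encoding are our assumptions; writing the FPDB copy is elided here and could, for instance, go through a secondary output such as Hadoop's MultipleOutputs.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FPDBSortReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text fingerprint, Iterable<Text> blockAttrs, Context context)
            throws IOException, InterruptedException {
        // The shuffle/sort phase has already grouped all block attributes that
        // share this fingerprint, so consecutive values are candidate duplicates.
        String previous = null;
        for (Text attrs : blockAttrs) {
            String current = attrs.toString();
            if (previous != null) {
                // Emit one duplicate record per consecutive pair:
                // (block1 attributes, block2 attributes) -> Dup_unsorted.
                context.write(new Text(previous), new Text(current));
            }
            previous = current;
        }
    }
}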


Map(byte[] fingerprint_record):
    for each fingerprint_record:
        emit(key, value)
        // key   – Fingerprint of the block
        // value – Block attributes

Reduce(byte[] key, Iterator values):
    1) for every value in values: emit(key, value)
       // this produces the FPDB file
    2) for every pair of consecutive values value_i and value_j: emit(value_i, value_j)
       // this populates the Dup_unsorted file

Fig. 3: Map-Reduce algorithm of FPDBSort

2) DupSort

This MapReduce module takes the Dup_unsorted file from HDFS, sorts the duplicate records with a specific comparator, and writes the output to an output stream that sends the data to the storage controller via socket communication. The parameters of the module are:
- Input: Dup_unsorted file from HDFS.
- Output: Sorted duplicates sent over the socket to the storage controller.

Map(byte[] duplicate_record):
    for each duplicate_record:
        emit(key, value)
        // key   – duplicate_record
        // value – NULL

Reduce(byte[] key, Iterator values):
    for every value in values: emit(key, value)
    // output is sent back to the controller over the socket

Fig. 4: Map-Reduce algorithm of DupSort
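The transfer back to the controller is described only as socket communication. Purely as a simplified sketch of that final hop (not necessarily how the actual module streams its output, which writes directly to an output stream), the sorted duplicates could be read from HDFS and forwarded over a plain TCP connection; the host name, port, and HDFS path below are placeholder assumptions.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DuplicateStreamer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder locations: the sorted duplicates produced by DupSort and
        // the controller endpoint that consumes them.
        Path sortedDuplicates = new Path("/dedup/duplicates_sorted/part-r-00000");
        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = fs.open(sortedDuplicates);
             Socket sock = new Socket("storage-controller.example", 9099);
             OutputStream out = sock.getOutputStream()) {
            // Forward the metafile byte-for-byte to the controller, which then
            // runs the duplicate-sharing phase of de-duplication.
            IOUtils.copyBytes(in, out, 64 * 1024, false);
        }
    }
}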
4. Performance

As mentioned previously, we implemented the Hadoop-based scale-out duplicate detection by modifying the offline de-duplication of NetApp. We present performance comparisons between the duplicate detection of NetApp de-duplication and duplicate detection using our Hadoop MapReduce framework. The size of the fingerprint data structure in our experiment is 32 bytes, corresponding to a 4 KB data block.

The following datasets were used for the de-duplication experiments:
- Dataset A – 498 GB (78% duplicates)
- Dataset B – 445 GB (67% duplicates)
- Dataset C – 217 GB (86% duplicates)
- Dataset D – 197 GB (53% duplicates)

We populated the datasets using the Iozone Filesystem Benchmark tool [4]. We set up the Hadoop cluster on an ESX hypervisor with four dedicated 32-bit virtual machines taking up the role of Hadoop nodes. The configuration of each node was:
Number of CPUs: 2
Memory: 4 GB
Operating system: Ubuntu 10.04

The duplicate-detection phase of NetApp de-duplication was carried out on a NetApp storage controller with the following configuration:
Number of CPUs: 4
Memory: 16 GB
NVRAM: 2 GB
Operating system: Data ONTAP

We recorded the time taken by the duplicate detection of NetApp de-duplication and by the distributed duplicate detection using Hadoop MapReduce (including the time taken for network transfer of metadata from the controller to the Hadoop cluster and back), with different Hadoop cluster sizes. Table 1 gives the performance statistics we observed during the experiments. The results in Table 1 are the times taken by the de-duplication process, from the fingerprint-collection stage until the end of duplicate detection. Note that our Hadoop implementation optimizes only the duplicate-detection phase.

In order to simulate a realistic workload environment on the controller, we ran de-duplication alongside a simulated workload in which six threads each performed reads/writes of 1 GB of data on the controller. We measured controller performance with and without the simulated workload. The results (Fig. 5) clearly indicate that our distributed duplicate-detection framework (with four commodity/VM nodes) outperforms the standard duplicate detection implemented in controllers under the simulated workload. As the volume of data increases, the Hadoop framework is expected to perform efficiently, since it is tailor-made for reliable big-data processing. We also observed that scale-out duplicate detection freed up approximately 512 MB of memory while running de-duplication on a single volume. Fig. 5 gives a bar-graph representation of the statistics obtained.

Dataset | Storage controller, without workload | Storage controller, with workload | 2-node/VM Hadoop cluster | 4-node/VM Hadoop cluster
A | 29 | 35 | 38 | 30
B | 28 | 32 | 36 | 23
C | 14 | 21 | 20 | 14
D | 13 | 19 | 15 | 10
Table 1: Total time (in minutes) taken during the fingerprint-collection and duplicate-detection stages, with and without the Hadoop implementation

[Fig. 5: Bar graph of the performance results. X-axis: datasets; Y-axis: time (min).]

5. Conclusion and Future Work

As is evident from the performance results obtained in our experiments, Hadoop-based distributed duplicate detection is a good enhancement to the single-node duplicate detection run on storage controllers. It outperforms the standard offline de-duplication mechanism owing to its scale-out capability. It is also a good use case for leveraging commodity hardware in the present data-storage scenario. Another benefit of this approach is that it frees storage controller resources, which can then be used for other higher-priority housekeeping functions and for serving the main purpose of data storage more effectively. The same set of Hadoop nodes can also be used to run other suitable applications, such as management tasks, though this is beyond the scope of this paper.

Going ahead, we plan to work on increasing the number of concurrent de-duplication streams that can be initiated by the storage controller and to address the bottlenecks we encounter in this context. We also intend to investigate how to achieve end-to-end scale-up in the rate of overall de-duplication in this scenario. We are also interested in evaluating the results using a substantially larger Hadoop cluster and much larger datasets, which could be indicative of the storage scenario in the years to come. Another prospect for future work is assessing how to scale out other stages of de-duplication using commodity hardware: fingerprint (hash value of the data blocks) generation, and even the data-processing modules involved in the duplicate-sharing phase of de-duplication. We would also like to explore whether the memory at the commodity Hadoop nodes could be utilized as a layer of secondary cache for the controller when some of the nodes are idle.

6. Acknowledgements

We would like to acknowledge the key inputs from VR Bharadwaj on modifying NetApp A-SIS for us to carry out our experiments. We thank the members of the NetApp Advanced Technology Group for their guidance. We would also like to thank the Apache Hadoop user community for their timely tips and help.

7. References

[1] Petros Efstathopoulos and Fanglu Guo. Rethinking Deduplication Scalability. HotStorage 2010.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
[3] Carlos Alvarez. NetApp Technical Report: NetApp Deduplication for FAS and V-Series Deployment and Implementation Guide.
[4] William D. Norcott and Don Capps. Iozone Filesystem Benchmark. http://www.iozone.org.
[5] Sean Quinlan and Sean Dorward. Venti: A New Approach to Archival Storage. In FAST '02: Proceedings of the Conference on File and Storage Technologies, Berkeley, CA, USA, 2002, pp. 89–101. USENIX Association.
[6] OpenDedup. A Userspace Deduplication File System (SDFS). March 2010. http://code.google.com/p/opendedup/.
[7] B. Zhu, K. Li, and H. Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In File and Storage Technology Conference, 2008.
[8] Shai Halevi, Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Proofs of Ownership in Remote Storage Systems. CCS 2011.
[9] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. MSST 2010.
