or variable-sized blocks) [8] rather than at the block level, for faster results, but they again suffer from scalability issues in the presence of billions of objects. Post-process de-duplication assumes that the data center has "sleep" times when the application load is significantly lower, which is when de-duplication tasks are best scheduled; but with data centers serving data worldwide, such assumptions are no longer valid. It is inevitable that the de-duplication process competes with IOPS for resources at the controller. We present a scalable distributed de-duplication design that scales out the de-duplication tasks.

2. Design

In our design, we propose a scalable enhancement to current de-duplication systems. The design works best in a clustered storage environment. De-duplication can be sub-divided into two basic tasks: detecting duplicates, and sharing the duplicates (instead of storing multiple instances of the same object). Our design advocates fanning out the compute- and memory-intensive duplicate-detection phase of the de-duplication process to a cluster of compute nodes, which process block fingerprints (typically a hash of the data in the block) and detect candidate duplicate blocks in parallel. The cluster of compute nodes is built from commodity hardware and is part of the storage tier. Leveraging a cluster of commodity nodes allows us to scale out this task of the de-duplication system. The duplicate-detection phase of post-process de-duplication is based entirely on identifying duplicates among the list of collected fingerprints (which are nothing but computed hash indexes of data chunks in the storage system). Fingerprints are typically much smaller than the data chunks they represent. Owing to the much smaller size of the fingerprint database (compared to the size of the original dataset), one does not have to move chunks of the original dataset in order to scale out the duplicate-detection phase. The sharing phase, however, is understandably best done where the actual data resides, since it involves byte-wise comparison of data chunks to ensure they are identical before de-duplicating them. We fan out the fingerprint database for a given volume to our compute cluster for further processing, i.e. duplicate detection, as shown in Fig. 1.

2.1. Benefits

As mentioned in the introduction, it is common practice to limit the number of concurrent de-duplication threads on a controller so as not to compromise the core goal of serving I/O operations. With the proposed design, we can prevent critical storage controller resources from being spent on a workload that can easily be scaled out. Another important aspect of this design is that it leverages cheaper resources which are not at the controller but are still part of the storage tier (commodity hardware). This lowers the overall cost of the solution, while allowing higher IOPS to be served with the resources saved at the controller. A further interesting aspect of this design is that it involves non-shared, coarse-grained parallelism, which enables better scalability. Thus, with our design, the number of concurrent de-duplication processes can be increased by an order of magnitude without consuming additional resources at the controller, simply by adding more commodity hardware to the storage cluster.

[Fig. 1: Distributed de-duplication. The storage controller performs collection of hashes/fingerprints of data and sharing of duplicate data objects; a cluster of commodity compute nodes performs detection of duplicate fingerprints.]

3. Implementation

As explained in the design, our intention is to move the duplicate-detection phase of offline de-duplication to an external cluster of compute nodes built from commodity hardware. Hadoop, an Apache open-source distributed computing framework, is a natural candidate for this task, since duplicate detection is effectively sorting the fingerprints of the blocks and identifying the duplicates among them. This is an ideal use case for the Map-Reduce programming abstraction [2] used in Hadoop. Map-Reduce splits the compute problem into two stages: Map and Reduce. Map is a transformation function and Reduce acts as an aggregation function. Every record in the data is interpreted as a key-value pair in the Map-Reduce programming paradigm.
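To make the key-value interpretation concrete, the sketch below shows how a fingerprint record could be presented to Hadoop as a key-value pair, with the fingerprint as the key and the block attributes as the value. This is an illustration only, not the actual implementation: the on-disk fingerprint format is internal to the storage controller, and here we assume a simple one-record-per-line text encoding with the two fields separated by a tab; the class and field names are ours.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed record encoding (illustrative): "<fingerprint-hex>\t<block attributes>"
public class FingerprintRecordMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2) {
            // key   = fingerprint (hash of the data block)
            // value = block attributes (identify the block on the controller)
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}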


Hadoop Map-Reduce fits our case very well since it is distributed and has an inherent sort capability that sorts all intermediate data on the basis of the intermediate key; this stage is called the Shuffle/Sort phase in Hadoop Map-Reduce. We exploit this feature to detect the duplicate fingerprint records. We also leverage the Hadoop Distributed File System [9], which acts as the source file system for the Map-Reduce jobs that are run.

We used NetApp offline de-duplication [3] to accommodate our Hadoop-based duplicate-detection framework, replacing the duplicate-detection stage of NetApp de-duplication with our mechanism based on Hadoop MapReduce.

The Hadoop-based duplicate-detection workflow that we implemented includes the following stages:
- Receive the fingerprints from the storage controller.
- Generate the fingerprint database and store it persistently on the Hadoop Distributed File System (HDFS). It can be reused for incremental offline de-duplication, and it also aids recovery by serving as a checkpoint: if the current duplicate-detection job fails, de-duplication can resume from the previously saved fingerprint database.
- Generate the duplicate records from the fingerprints and send them back to the storage controller. This triggers the rest of the de-duplication phases in the storage controller.

The following section gives finer implementation details.

3.1. Hadoop-based Duplicate Detection

The structures involved in the data flow of duplicate detection are:

Fingerprint record: [ Fingerprint | Block attributes ]
Duplicate record:   [ Block1 attributes | Block2 attributes ]
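As a rough Java rendering of these records (a sketch on our part: the paper does not specify the concrete fields inside the block attributes, so they are modelled here as opaque strings):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Fingerprint record: the unit ingested by the Hadoop cluster.
public class FingerprintRecord implements Writable {
    public String fingerprint;      // hash of the data block; used as the sort key
    public String blockAttributes;  // locates the block on the storage controller

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fingerprint);
        out.writeUTF(blockAttributes);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fingerprint = in.readUTF();
        blockAttributes = in.readUTF();
    }
}

// A duplicate record would analogously pair the attributes of the two blocks
// (block1 attributes, block2 attributes) that are to be shared during de-duplication.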
The input to our Hadoop cluster is the set of fingerprints of all the blocks, with no inherent ordering (Fig. 2). We sort these ingested fingerprints using the Fingerprint field as the key. The sorted collection is then used to detect the duplicates (fingerprints with the same Fingerprint attribute). The duplicate record shown above contains the block attributes of the two blocks that will undergo sharing during de-duplication. Once we have detected the duplicates using our Map-Reduce jobs, we send them back as a single metafile to the storage controller for the subsequent duplicate-sharing phase of NetApp de-duplication. Meanwhile, we also preserve a copy of the sorted fingerprints, called the fingerprint database (FPDB), within the Hadoop cluster. Fig. 2 outlines the MapReduce (MR) modules we implemented for duplicate detection.

[Fig. 2: Hadoop-based duplicate detection (MR: MapReduce). Fingerprints obtained from the storage controller feed the FPDBSort MR module, which produces the Fingerprint Database (stored in HDFS) and the unsorted duplicates; the DupSort MR module then sorts the duplicates, which are sent back to the storage controller for de-duplication.]

We run two MapReduce jobs serially to perform duplicate detection in the Hadoop cluster.

1) FPDBSort

This MapReduce module takes the fingerprints as input from HDFS and generates the Fingerprint Database (the sorted fingerprint file) and the duplicates file. The duplicates file it generates contains the duplicate records, but not in sorted order. The parameters of the module are as follows; a Java sketch of its reduce step is given after this list, and its Map-Reduce pseudocode in Fig. 3.
- Input: Fingerprints – directory on HDFS which contains the fingerprints.
- Output: FPDB – file which contains the fingerprints in sorted order.
- Output: Dup_unsorted – file which contains the duplicate records, not yet sorted.
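As a rough illustration only, the reduce step of this module might be written against Hadoop's Java API as follows, reusing a mapper like the one sketched in Section 3 so that the fingerprint is the intermediate key. The class names and record encoding are our assumptions; writing the FPDB copy is elided here and could, for instance, go through a secondary output such as Hadoop's MultipleOutputs.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FPDBSortReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text fingerprint, Iterable<Text> blockAttrs, Context context)
            throws IOException, InterruptedException {
        // The shuffle/sort phase has already grouped all block attributes that
        // share this fingerprint, so consecutive values are candidate duplicates.
        String previous = null;
        for (Text attrs : blockAttrs) {
            String current = attrs.toString();
            if (previous != null) {
                // Emit one duplicate record per consecutive pair:
                // (block1 attributes, block2 attributes) -> Dup_unsorted.
                context.write(new Text(previous), new Text(current));
            }
            previous = current;
        }
    }
}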


Map(byte[] fingerprint_record):
    for each fingerprint_record:
        emit(key, value)
        // key   – Fingerprint of the block
        // value – Block attributes

Reduce(byte[] key, Iterator values):
    1) for every value in values: emit(key, value)
       // this produces the FPDB file
    2) for every pair of consecutive values value_i and value_j: emit(value_i, value_j)
       // this populates the Dup_unsorted file

Fig. 3: Map-Reduce algorithm of FPDBSort

2) DupSort

This MapReduce module takes the Dup_unsorted file from HDFS, sorts the duplicate records with a specific comparator, and writes the output to an output stream that sends the data to the storage controller via socket communication. The parameters of the module are:
- Input: Dup_unsorted file from HDFS.
- Output: Sorted duplicates sent over the socket to the storage controller.

Map(byte[] duplicate_record):
    for each duplicate_record:
        emit(key, value)
        // key   – duplicate_record
        // value – NULL

Reduce(byte[] key, Iterator values):
    for every value in values: emit(key, value)
    // output is sent back to the controller over the socket

Fig. 4: Map-Reduce algorithm of DupSort
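The transfer back to the controller is described only as socket communication. Purely as a simplified sketch of that final hop (not necessarily how the actual module streams its output, which writes directly to an output stream), the sorted duplicates could be read from HDFS and forwarded over a plain TCP connection; the host name, port, and HDFS path below are placeholder assumptions.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DuplicateStreamer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder locations: the sorted duplicates produced by DupSort and
        // the controller endpoint that consumes them.
        Path sortedDuplicates = new Path("/dedup/duplicates_sorted/part-r-00000");
        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = fs.open(sortedDuplicates);
             Socket sock = new Socket("storage-controller.example", 9099);
             OutputStream out = sock.getOutputStream()) {
            // Forward the metafile byte-for-byte to the controller, which then
            // runs the duplicate-sharing phase of de-duplication.
            IOUtils.copyBytes(in, out, 64 * 1024, false);
        }
    }
}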
4. Performance

As mentioned previously, we implemented the Hadoop-based scale-out duplicate detection by modifying the offline de-duplication of NetApp. We present performance comparisons between the duplicate detection of NetApp de-duplication and duplicate detection using our Hadoop MapReduce framework. The size of the fingerprint data structure in our experiment is 32 bytes, corresponding to a 4 KB data block.

The following datasets were used for the de-duplication experiments:
- Dataset A – 498 GB (78% duplicates)
- Dataset B – 445 GB (67% duplicates)
- Dataset C – 217 GB (86% duplicates)
- Dataset D – 197 GB (53% duplicates)

We populated the datasets using the Iozone Filesystem Benchmark tool [4]. We set up the Hadoop cluster on an ESX hypervisor with four dedicated 32-bit virtual machines taking up the role of Hadoop nodes. The configuration of each node was:
Number of CPUs: 2
Memory: 4 GB
Operating system: Ubuntu 10.04

The duplicate-detection phase of NetApp de-duplication was carried out on a NetApp storage controller with the following configuration:
Number of CPUs: 4
Memory: 16 GB
NVRAM: 2 GB
Operating system: Data ONTAP

We recorded the time taken by the duplicate detection of NetApp de-duplication and by the distributed duplicate detection using Hadoop MapReduce (including the time taken for network transfer of metadata from the controller to the Hadoop cluster and back), with different Hadoop cluster sizes. Table 1 gives the performance statistics we observed during the experiments. The results in Table 1 are the times taken by the de-duplication process, from the fingerprint-collection stage until the end of duplicate detection. Note that our Hadoop implementation optimizes only the duplicate-detection phase.

In order to simulate a realistic workload environment on the controller, we ran de-duplication alongside a simulated workload in which six threads each performed reads/writes of 1 GB of data on the controller. We measured controller performance with and without the simulated workload. The results (Fig. 5) clearly indicate that our distributed duplicate-detection framework (with four commodity/VM nodes) outperforms the standard duplicate detection implemented in controllers under the simulated workload. As the volume of data increases, the Hadoop framework is expected to perform efficiently, since it is tailor-made for reliable big-data processing. We also observed that scale-out duplicate detection freed up approximately 512 MB of memory while running de-duplication on a single volume. Fig. 5 gives a bar-graph representation of the statistics obtained.

Dataset | Storage controller, without workload | Storage controller, with workload | 2-node/VM Hadoop cluster | 4-node/VM Hadoop cluster
A | 29 | 35 | 38 | 30
B | 28 | 32 | 36 | 23
C | 14 | 21 | 20 | 14
D | 13 | 19 | 15 | 10
Table 1: Total time (in minutes) taken during the fingerprint-collection and duplicate-detection stages, with and without the Hadoop implementation

[Fig. 5: Bar graph of the performance results. X-axis: datasets; Y-axis: time (min).]

5. Conclusion and Future Work

As is evident from the performance results obtained in our experiments, Hadoop-based distributed duplicate detection is a good enhancement to the single-node duplicate detection run on storage controllers. It outperforms the standard offline de-duplication mechanism owing to its scale-out capability. It is also a good use case for leveraging commodity hardware in the present data-storage scenario. Another benefit of this approach is that it frees storage controller resources, which can then be used for other higher-priority housekeeping functions and for serving the main purpose of data storage more effectively. The same set of Hadoop nodes can also be used to run other suitable applications, such as management tasks, though this is beyond the scope of this paper.

Going ahead, we plan to work on increasing the number of concurrent de-duplication streams that can be initiated by the storage controller and to address the bottlenecks we encounter in this context. We also intend to investigate how to achieve end-to-end scale-up in the rate of overall de-duplication in this scenario. We are also interested in evaluating the results using a substantially larger Hadoop cluster and much larger datasets, which could be indicative of the storage scenario in the years to come. Another prospect for future work is assessing how to scale out other stages of de-duplication using commodity hardware: fingerprint (hash value of the data blocks) generation, and even the data-processing modules involved in the duplicate-sharing phase of de-duplication. We would also like to explore whether the memory at the commodity Hadoop nodes could be utilized as a layer of secondary cache for the controller when some of the nodes are idle.

6. Acknowledgements

We would like to acknowledge the key inputs from VR Bharadwaj on modifying NetApp A-SIS for us to carry out our experiments. We thank the members of the NetApp Advanced Technology Group for their guidance. We would also like to thank the Apache Hadoop user community for their timely tips and help.

7. References

[1] Petros Efstathopoulos and Fanglu Guo. Rethinking Deduplication Scalability. HotStorage 2010.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
[3] Carlos Alvarez. NetApp Technical Report: NetApp Deduplication for FAS and V-Series Deployment and Implementation Guide.
[4] William D. Norcott and Don Capps. Iozone Filesystem Benchmark. http://www.iozone.org.
[5] Sean Quinlan and Sean Dorward. Venti: A New Approach to Archival Storage. In FAST '02: Proceedings of the Conference on File and Storage Technologies, Berkeley, CA, USA, 2002, pp. 89–101. USENIX Association.
[6] OpenDedup. A Userspace Deduplication File System (SDFS). March 2010. http://code.google.com/p/opendedup/.
[7] B. Zhu, K. Li, and H. Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In File and Storage Technology Conference, 2008.
[8] Shai Halevi, Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Proofs of Ownership in Remote Storage Systems. CCS 2011.
[9] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. MSST 2010.
