11.07.2015 Views

Index-Based Code Clone Detection - Software and Systems ...

Index-Based Code Clone Detection - Software and Systems ...

Index-Based Code Clone Detection - Software and Systems ...

SHOW MORE
SHOW LESS
  • No tags were found...

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

many clones can cause some of these file lists to requiresubstantially longer processing times, causing this saturationeffect. The algorithm thus scales well up to a certain number ofmachines. It seems that using more than about 30 machines forretrieval does not make sense for a code base of the given size.However, the large job processing 2.9 GLOC demonstrates the(absence of) limits for the index construction part.Since a substantial part of the time is spent on networkcommunication, these times may not be compared with thoseof the previous section. Instead, a better comparison wouldbe with [35], where about 400 MLOC were processed on 80machines within 51 hours. While our code base is about afactor of 5.5 smaller, using 80 machines, index-based detectionis 86 times faster. Using only ten machines, it is still a factorof 16.5 faster (on the smaller code base). For a conclusivecomparison, we would have to run both approaches on thesame code <strong>and</strong> hardware. However, these numbers indicatethat the gained speedup is not only due to reduced code sizeor the use of slightly different hardware, but also because ofalgorithmic improvement.C. Real Time <strong>Clone</strong> <strong>Detection</strong>This case study investigates suitability for real-time clonedetection on large code that is modified continuously.<strong>Index</strong> implementation: We used a persistent clone indeximplementation based on Berkeley DB 14 , a high-performanceembedded database.Design <strong>and</strong> Procedure: To evaluate the support for real timeclone management, we measured the time required to (1) buildthe index, (2) update the index in response to changes to thesystem, <strong>and</strong> (3) query the index. For this, we used code ofversion 3.3 of the Eclipse SDK consisting of 209.312 Java fileswhich comprise 42.693.793 lines of code. Since our approachis incremental, a full indexing has to happen only once. Toassess the performance of the index building, we constructedthe index for the Eclipse source code <strong>and</strong> measured the overalltime required. Furthermore, to evaluate the incremental updatecapabilities of the index, we removed 1,000 r<strong>and</strong>omly selectedfiles from the Eclipse index <strong>and</strong> re-added them afterwards.Likewise, for the query capability, we queried the index forthe clone classes of 1,000 r<strong>and</strong>omly selected files.Results: On the test system, the index creation process forthe Eclipse SDK including writing the clone index to thedatabase took 7 hours <strong>and</strong> 4 minutes. The clone index for theEclipse SDK occupied 5.6 GB on the hard disk. In comparison,the source code itself needed 1.8 GB disk space, i.e., the indexused about 3 times the space of the original system. The testsystem was able to answer a query for a single file at anaverage time of 0.91 seconds, <strong>and</strong> a median query time of0.21 seconds. Only 14 of the 1000 files had a query time ofover 10 seconds. On average they had a size of 3 kLOC <strong>and</strong>350 clones. The update of the clone index including writing it14 http://www.oracle.com/technology/products/berkeley-db/index.htmlTABLE IIICLONE MANAGEMENT PERFORMANCE<strong>Index</strong> creation (complete)<strong>Index</strong> query (per file)<strong>Index</strong> update (per file)7 hr 4 min0.21 sec median0.91 sec average0.85 sec averageto the database took 0.85 seconds on average per file. Table IIIillustrates these performance measurement results at a glance.Discussion: The results from this case study indicate thatour approach is capable of supporting real time clone management.The average time for an index query is, in our opinion,fast enough to support interactive display of clone informationwhen a source file is opened in the IDE. Furthermore, theperformance of index updates allows for continuous indexmaintenance.V. CONCLUSION AND FUTURE WORKThis paper presents a novel clone detection approach thatemploys an index for efficient clone retrieval. To the best ofour knowledge, it is the first approach that is at the same timeincremental <strong>and</strong> scalable to very large code bases.The clone index not only allows the retrieval of all clonescontained in a system, but also supports the efficient retrievalof the clones that cover a specific file. This allows clonemanagement tools to provision developers in real-time withcloning information while they maintain code. For a case studyon 42 MLOC of Eclipse, average retrieval time was below1 second, demonstrating applicability to very large software.Since the clone index can be updated incrementally while thesoftware changes, cloning information can be kept accurate atall times.The clone index can be distributed across different machines,enabling index creation, maintenance <strong>and</strong> clone retrievalto be parallelized. <strong>Index</strong>-based clone detection thusimposes no limits on scalability—it can simply be improvedby adding hardware resources—while retaining its abilityfor incremental updates. For the case study, 100 machinesperformed clone detection in 73 MLOC of open source codein 36 minutes.For the analyzed systems, index-based clone detection (employingan in-memory index) outperforms suffix-tree-basedclone detection. This indicates that the index-based detectionalgorithm can potentially substitute suffix-tree-based algorithms,which are employed by many existing clone detectors.For future work, we plan to develop algorithms that employthe clone index to detect type 3 clones 15 . One approach is touse hash functions that are robust to further classes of changes,such as e.g., n-gram based hashing as proposed by Smith <strong>and</strong>Horwitz in [42]. Another approach is to use locality sensitivehashing, as e.g., employed by [14] <strong>and</strong> adapt the retrieval toalso find similar, not only identical, hashes. Furthermore, based15 Type 3 clones can differ beyond the token level; statements can beinserted, changed or removed [1].

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!