13.02.2013 Views

2 Debian Code Search: An Overview

2 Debian Code Search: An Overview

2 Debian Code Search: An Overview

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

4.7 Duplication in the archive<br />

4.7 Duplication in the archive<br />

It is noteworthy that there is some duplication of soure code in the <strong>Debian</strong> archive. One<br />

example is the WebKit layout engine used by Apple Safari and Google Chrome, for example.<br />

WebKit code can be found in the WebKit package, in the Qt framework, in Chromium and in<br />

PhantomJS.<br />

In general, such duplication is unfortunate because it makes maintenance of the projects<br />

harder, that is, bugs have to be fixed in more than one place. Then again, it sometimes reduces<br />

the potential for bugs, for example when Qt doesn’t use whichever version of WebKit is<br />

installed on the system, but only uses its well-tested, bundled version.<br />

From the perspective of <strong>Debian</strong> <strong>Code</strong> <strong>Search</strong> or any code search engine, duplication is<br />

unfortunate because users might get multiple search results containing the same source code<br />

for a single query (e.g. a user searches for “(?i)bloomfilter” and gets presented with<br />

JavaScriptCore/wtf/BloomFilter.h from qt4, webkit, etc.). This is undesirable. Let us<br />

assume the limit for source code results is fixed at, say, 10 results per page. When the same<br />

result is presented three times, the user effectively is presented with 8 results instead of 10,<br />

and he might get frustrated because he needs to realize that the presented results are identical.<br />

Detecting duplicates (or very similar source code, e.g. a slightly different version of WebKit)<br />

can be done by applying metrics such as the Levenshtein distance 9 . The problem is not<br />

prevalent enough that it would be worthwhile (within this work) to design and apply such a<br />

technique on our entire corpus.<br />

Within <strong>Debian</strong>, there is also a project to reduce the number of duplicates to avoid security<br />

issues from unapplied bugfixes to embedded copies of code 10 .<br />

9 “[…] the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally,<br />

the Levenshtein distance between two words is equal to the number of single-character edits required<br />

to change one word into the other.” [22]<br />

10 http://lists.debian.org/debian-devel/2012/07/msg00026.html<br />

45

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!