Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
4.7 Duplication in the archive<br />
4.7 Duplication in the archive<br />
It is noteworthy that there is some duplication of soure code in the <strong>Debian</strong> archive. One<br />
example is the WebKit layout engine used by Apple Safari and Google Chrome, for example.<br />
WebKit code can be found in the WebKit package, in the Qt framework, in Chromium and in<br />
PhantomJS.<br />
In general, such duplication is unfortunate because it makes maintenance of the projects<br />
harder, that is, bugs have to be fixed in more than one place. Then again, it sometimes reduces<br />
the potential for bugs, for example when Qt doesn’t use whichever version of WebKit is<br />
installed on the system, but only uses its well-tested, bundled version.<br />
From the perspective of <strong>Debian</strong> <strong>Code</strong> <strong>Search</strong> or any code search engine, duplication is<br />
unfortunate because users might get multiple search results containing the same source code<br />
for a single query (e.g. a user searches for “(?i)bloomfilter” and gets presented with<br />
JavaScriptCore/wtf/BloomFilter.h from qt4, webkit, etc.). This is undesirable. Let us<br />
assume the limit for source code results is fixed at, say, 10 results per page. When the same<br />
result is presented three times, the user effectively is presented with 8 results instead of 10,<br />
and he might get frustrated because he needs to realize that the presented results are identical.<br />
Detecting duplicates (or very similar source code, e.g. a slightly different version of WebKit)<br />
can be done by applying metrics such as the Levenshtein distance 9 . The problem is not<br />
prevalent enough that it would be worthwhile (within this work) to design and apply such a<br />
technique on our entire corpus.<br />
Within <strong>Debian</strong>, there is also a project to reduce the number of duplicates to avoid security<br />
issues from unapplied bugfixes to embedded copies of code 10 .<br />
9 “[…] the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally,<br />
the Levenshtein distance between two words is equal to the number of single-character edits required<br />
to change one word into the other.” [22]<br />
10 http://lists.debian.org/debian-devel/2012/07/msg00026.html<br />
45