13.02.2013 Views

2 Debian Code Search: An Overview



3.8 The trigram index

In order to perform a regular expression match, it is impractical to search through the whole corpus for each query. Instead, an inverted n-gram index is required:

“[…] [an] n-gram is a contiguous sequence of n items from a given sequence of text or speech.” [23]

In our case, the text is either the program source code when indexing, or the user’s search terms when processing a query. An n-gram of size 3 is called a trigram, a good size for practical use:

“This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.” [2]

As an example, consider the following simple search query:

/Google.*Search/ (matching Google, then anything, then Search)

Only files that contain both of the terms “Google” and “Search” need to be considered for the actual regular expression search. Therefore, these terms can be translated into the following trigram index query:

Goo AND oog AND ogl AND gle AND Sea AND ear AND arc AND rch<br />
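The translation of the two literal terms into the query above can be sketched in a few lines of Python. This is only an illustration, not the code Debian Code Search uses: the helper names `trigrams` and `trigram_query` are hypothetical, and the sketch handles plain literal strings only, not the full regular expression transformation described in [2].

```python
def trigrams(term):
    """Return the distinct trigrams of term, in order of first occurrence."""
    seen = []
    for i in range(len(term) - 2):
        gram = term[i:i + 3]
        if gram not in seen:
            seen.append(gram)
    return seen


def trigram_query(terms):
    """Join the trigrams of all literal terms into one AND query (hypothetical helper)."""
    grams = []
    for term in terms:
        for gram in trigrams(term):
            if gram not in grams:
                grams.append(gram)
    return " AND ".join(grams)


print(trigram_query(["Google", "Search"]))
# prints: Goo AND oog AND ogl AND gle AND Sea AND ear AND arc AND rch
```

Terms shorter than three characters yield no trigrams in this sketch, so they do not constrain the query at all.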

This process can be performed with any query, but for some queries, the translation might lead to the trigram query matching more documents than the actual regular expression. Since the trigram index does not contain the position of each trigram, these kinds of false positives are to be expected anyway.
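A small sketch can make the false positives concrete. It assumes a toy in-memory inverted index that maps each trigram to a set of file names; the file names, file contents, and the helpers `build_index` and `candidates` are made up for illustration and do not reflect the on-disk index format of the real system.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each trigram to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for i in range(len(text) - 2):
            index[text[i:i + 3]].add(name)
    return index


def candidates(index, grams):
    """Intersect the posting sets of all trigrams (the AND query)."""
    sets = [index.get(g, set()) for g in grams]
    return set.intersection(*sets) if sets else set()


# Hypothetical corpus: b.c contains both terms, but in the wrong order.
docs = {
    "a.c": "/* Google Code Search */",
    "b.c": "/* Search, then Google */",
    "c.c": "/* unrelated */",
}
index = build_index(docs)
grams = ["Goo", "oog", "ogl", "gle", "Sea", "ear", "arc", "rch"]

possible = candidates(index, grams)
# The index only narrows the set of files; the regular expression decides.
matches = {n for n in possible if re.search(r"Google.*Search", docs[n])}

print(sorted(possible))  # ['a.c', 'b.c']
print(sorted(matches))   # ['a.c']
```

Here `b.c` passes the trigram filter because the index records only which trigrams occur in a file, not where, so running the actual regular expression on each candidate remains necessary.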

The full transformation process is described in Russ Cox’s “Regular Expression Matching with a Trigram Index” [2].
