You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
3.8 The trigram index<br />
3.8 The trigram index<br />
In order to perform a regular expression match, it is impractical to search through the whole<br />
corpus for each query. Instead, an inverted n-gram index is required:<br />
“[…] [an] n-gram is a contiguous sequence of n items from a given sequence of<br />
text or speech.” [23]<br />
In our case, the text is either the program source code when indexing, or the user’s search<br />
terms when processing a query. <strong>An</strong> n-gram of size 3 is called a trigram, a good size for<br />
practical use:<br />
“This sounds more general than it is. In practice, there are too few distinct<br />
2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.” [2]<br />
As an example, consider the following simple example of a search query:<br />
/Google.*<strong>Search</strong>/ (matching Google, then anything, then <strong>Search</strong>)<br />
Obviously, only files which contain both the terms “Google” and “<strong>Search</strong>” should be considered<br />
for the actual regular expression search. Therefore, these terms can be translated into<br />
the following trigram index query:<br />
Goo AND oog AND ogl AND gle AND Sea AND ear AND arc AND rch<br />
This process can be performed with any query, but for some, the translation might lead to<br />
the trigram query matching more documents than the actual regular expression. Since the<br />
trigram index does not contain the position of each trigram, these kinds of false positives are<br />
to be expected anyway.<br />
The full transformation process is described in Russ Cox’ “Regular Expression Matching<br />
with a Trigram Index” [2] .<br />
15