13.02.2013 Views

2 Debian Code Search: An Overview


3.8 The trigram index

3.8.2 Looking up trigrams in the index

In order to modify the index data structure in such a way that it is more suited for our purposes, we first need to understand how it works. Figure 3.4 is an illustration of the different parts of the index. On the left, you can see each section within the index file. On the right, an example entry for each of the different sections is provided (except for the list of paths, which is unused in our code). Each section of the index is sorted, and the numbers of entries in the name index and the posting list index are stored in the trailer.

[Figure 3.4 (diagram not reproduced): the sections of the codesearch index file, in order: Header, List of Paths, List of Names, List of Posting Lists, Name Index, Posting List Index, Trailer. Example entries shown: a NUL-terminated filename in the list of names; a trigram ("sna") followed by its file IDs in the list of posting lists; a name offset per file ID in the name index; a trigram ("sna") with its file count and posting list offset in the posting list index.]

Figure 3.4: The Codesearch index format. Trigram lookups are performed as described below.

To obtain the list of all files which contain the trigram “sna”, the following steps have to be performed:

1. Seek to the posting list index and perform a binary search to find the entry for trigram “sna”. The entry reveals the file count and a byte offset (relative to the first byte of the index) pointing at the entry for “sna” in the list of posting lists.

2. Seek to the entry for “sna” in the list of posting lists and decode the varint-encoded [7] list of file IDs.

3. For each file ID, seek to the appropriate position in the name index. The byte offset pointing to the filename in the list of names is now known.

4. For each name offset, seek to the appropriate position in the list of names and read the filename (NUL-terminated).
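The four steps above can be sketched in Python against a toy in-memory index. This is an illustration under stated assumptions, not the actual codesearch on-disk format: the entry layout (3-byte trigram, 4-byte big-endian file count, 4-byte big-endian offset), the 4-byte name index entries, name offsets relative to the start of the list of names, and file IDs stored as plain (not delta-encoded) varints are all assumed for the example. The names `build_toy_index`, `lookup` and the `layout` dictionary (standing in for header and trailer) are hypothetical.

```python
import struct

# Assumed posting list index entry: trigram, file count, posting list offset.
ENTRY = struct.Struct(">3sII")


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varints(buf, pos, count):
    """Decode `count` varints from `buf`, starting at byte offset `pos`."""
    values = []
    for _ in range(count):
        shift = value = 0
        while True:
            byte = buf[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if byte < 0x80:
                break
        values.append(value)
    return values


def build_toy_index(names, postings):
    """Build a toy index. File ID = position in `names`;
    `postings` maps a 3-byte trigram to a list of file IDs."""
    blob = bytearray()
    names_start = len(blob)
    name_offsets = []
    for name in names:
        name_offsets.append(len(blob) - names_start)
        blob += name.encode() + b"\0"
    entries = []
    for trigram in sorted(postings):  # sections are sorted
        ids = postings[trigram]
        entries.append((trigram, len(ids), len(blob)))  # offset from byte 0
        blob += trigram + b"".join(encode_varint(i) for i in ids)
    name_index_start = len(blob)
    for off in name_offsets:
        blob += struct.pack(">I", off)
    post_index_start = len(blob)
    for entry in entries:
        blob += ENTRY.pack(*entry)
    layout = {"names": names_start, "name_index": name_index_start,
              "post_index": post_index_start, "post_count": len(entries)}
    return bytes(blob), layout


def lookup(blob, layout, trigram):
    """Return the filenames whose posting list contains `trigram`."""
    # Step 1: binary search in the posting list index.
    lo, hi = 0, layout["post_count"]
    while lo < hi:
        mid = (lo + hi) // 2
        tri, _, _ = ENTRY.unpack_from(blob, layout["post_index"] + mid * ENTRY.size)
        if tri < trigram:
            lo = mid + 1
        else:
            hi = mid
    if lo == layout["post_count"]:
        return []
    tri, count, offset = ENTRY.unpack_from(blob, layout["post_index"] + lo * ENTRY.size)
    if tri != trigram:
        return []
    # Step 2: skip the 3-byte trigram, then decode the varint file IDs.
    file_ids = decode_varints(blob, offset + 3, count)
    filenames = []
    for file_id in file_ids:
        # Step 3: the name index yields the offset into the list of names.
        (name_off,) = struct.unpack_from(">I", blob, layout["name_index"] + 4 * file_id)
        # Step 4: read the NUL-terminated filename.
        start = layout["names"] + name_off
        end = blob.index(b"\0", start)
        filenames.append(blob[start:end].decode())
    return filenames


blob, layout = build_toy_index(
    ["a.c", "b.c", "c.c"],
    {b"sna": [0, 2], b"abc": [1], b"xyz": [0, 1, 2]},
)
print(lookup(blob, layout, b"sna"))  # ['a.c', 'c.c']
```

Each lookup costs one binary search plus one seek per matching file, which is why the fixed-size entries of the posting list index and name index matter: they allow the position of entry i to be computed directly.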
