3.8 The trigram index
3.8.2 Looking up trigrams in the index
In order to modify the index data structure in such a way that it is more suited for our purposes,
we first need to understand how it works. Figure 3.4 is an illustration of the different parts
of the index. On the left, you can see each section within the index file. On the right, an
example entry for each of the different sections is provided (except for the list of paths, which
is unused in our code). Each section of the index is sorted, and the numbers of entries in the
name index and the posting list index are stored in the trailer.
[Figure 3.4 diagram. Left: the sections of the codesearch index file, in order: Header, List of Paths, List of Names, List of Posting Lists, Name Index, Posting List Index, Trailer. Right: one example entry per section: a NUL-terminated filename (List of Names; the legible part reads "-3/const.c\0"), the trigram "sna" followed by its file IDs (List of Posting Lists), a file ID mapped to a name offset (Name Index), and the trigram "sna" with its file count and posting list offset (Posting List Index).]
Figure 3.4: The Codesearch index format. Trigram lookups are performed as described below.
Assuming the list of all files which contain the trigram “sna” needs to be obtained, the<br />
following steps have to be performed:<br />
1. Seek to the posting list index and perform a binary search to find the entry for trigram<br />
“sna”. The entry reveals the file count and a byte offset (relative to the first byte of the<br />
index) pointing at the entry for “sna” in the list of posting lists.<br />
2. Seek to the entry for “sna” in the list of posting lists and decode the varint-encoded [7]
list of file IDs.
3. For each file ID, seek to the appropriate position in the name index and read the entry
stored there. This yields the byte offset of the filename in the list of names.
4. For each name offset, seek to the appropriate position in the list of names and read<br />
the filename (NUL-terminated).<br />
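
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual codesearch implementation: the entry layout (3-byte trigram, 4-byte file count, 4-byte offset, all big-endian), the 4-byte name index entries, the dictionary standing in for the trailer, and all function names are hypothetical. The toy format also stores each file ID directly as a varint, whereas the real on-disk format may differ in detail (for example by delta-encoding the IDs).

```python
import struct

# Assumed posting list index entry: trigram, file count, posting list offset.
ENTRY = struct.Struct(">3sII")


def encode_varint(n):
    """Encode a non-negative integer as a little-endian base-128 varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf, pos):
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def build_toy_index(files):
    """Build a toy index from (name, text) pairs; return (bytes, trailer)."""
    postings = {}
    for file_id, (_, text) in enumerate(files):
        data = text.encode()
        for i in range(len(data) - 2):
            postings.setdefault(data[i:i + 3], set()).add(file_id)
    buf = bytearray(b"toy index\n")               # header
    name_offsets = []
    for name, _ in files:                         # list of names
        name_offsets.append(len(buf))
        buf += name.encode() + b"\0"
    entries = []
    for tri in sorted(postings):                  # list of posting lists
        ids = sorted(postings[tri])
        entries.append(ENTRY.pack(tri, len(ids), len(buf)))
        buf += tri + b"".join(encode_varint(i) for i in ids)
    trailer = {"name_index": len(buf)}
    buf += b"".join(struct.pack(">I", o) for o in name_offsets)
    trailer["post_index"] = len(buf)
    trailer["post_count"] = len(entries)
    buf += b"".join(entries)                      # posting list index (sorted)
    return bytes(buf), trailer


def lookup(buf, trailer, trigram):
    """Return the names of all files whose contents include the trigram."""
    # Step 1: binary search in the sorted posting list index.
    lo, hi = 0, trailer["post_count"]
    while lo < hi:
        mid = (lo + hi) // 2
        tri, _, _ = ENTRY.unpack_from(buf, trailer["post_index"] + mid * ENTRY.size)
        if tri < trigram:
            lo = mid + 1
        else:
            hi = mid
    if lo == trailer["post_count"]:
        return []
    tri, count, offset = ENTRY.unpack_from(buf, trailer["post_index"] + lo * ENTRY.size)
    if tri != trigram:
        return []
    # Step 2: skip the stored trigram, then decode `count` varint file IDs.
    pos = offset + 3
    file_ids = []
    for _ in range(count):
        file_id, pos = decode_varint(buf, pos)
        file_ids.append(file_id)
    # Steps 3 and 4: file ID -> name offset -> NUL-terminated filename.
    names = []
    for file_id in file_ids:
        (name_off,) = struct.unpack_from(">I", buf, trailer["name_index"] + 4 * file_id)
        end = buf.index(b"\0", name_off)
        names.append(buf[name_off:end].decode())
    return names


files = [("a/const.c", "const char *snack;"),
         ("b/main.c", "int main(void)"),
         ("c/snake.py", "snake = 1")]
index, trailer = build_toy_index(files)
print(lookup(index, trailer, b"sna"))
```

Note that every file ID costs one seek into the name index and one into the list of names, so the cost of a lookup grows with the length of the posting list, not only with the binary search in step 1.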