3 Architecture<br />
3.8.1 Trigram index limitations<br />
This work is based on Russ Cox’s Codesearch tools [2] (see also section 3.2, page 7), which define an<br />
index data structure, index source code, and support searching the previously indexed source<br />
code using regular expressions.<br />
Google Code Search was launched in 2006, but Russ Cox’s Codesearch tools are implemented<br />
in Go, a language that was created in late 2007 [6]. Therefore, they cannot be the original<br />
Google Code Search implementation, and they have not been used for a large-scale public search<br />
engine.<br />
Instead, Cox recommends them for local repositories, presumably no larger than<br />
a few hundred megabytes:<br />
“If you miss Google Code Search and want to run fast indexed regular expression<br />
searches over your local code, give the standalone programs a try.” [2]<br />
One obvious limitation of the index implementation is its maximum size of 4 GiB: the offsets<br />
it uses are 32-bit unsigned integers, which cannot address more than 4 GiB.<br />
Another, less obvious limitation is that even though an index with a size between 2 GiB<br />
and 4 GiB can be created, it cannot be used afterwards, since Go’s mmap implementation only<br />
supports 2 GiB (see footnote 9).<br />
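The two size limits can be illustrated with a little arithmetic. The following Go sketch is not taken from the Codesearch tools; the function name checkIndexSize is hypothetical, and it assumes that the mappable length is bounded by a signed 32-bit integer, as the 2 GiB figure above suggests.

```go
package main

import (
	"fmt"
	"math"
)

const gib = 1 << 30 // one gibibyte in bytes

// checkIndexSize is a hypothetical helper that models the two limits
// described above for an index file of the given size in bytes.
func checkIndexSize(size uint64) (canCreate, canMap bool) {
	// 32-bit unsigned offsets cover positions 0 .. 2^32-1, i.e. 4 GiB.
	canCreate = size <= uint64(math.MaxUint32)+1
	// Assumption: the mapped length must fit a signed 32-bit integer,
	// which caps it just below 2 GiB.
	canMap = size <= uint64(math.MaxInt32)
	return canCreate, canMap
}

func main() {
	for _, n := range []uint64{1, 3, 5} {
		c, m := checkIndexSize(n * gib)
		fmt.Printf("%d GiB: can create = %v, can map = %v\n", n, c, m)
	}
}
```

Under these assumptions, a 3 GiB index falls into the gap the text describes: it can be written, but not mapped for searching.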
To circumvent this limitation while keeping modifications to Russ Cox’s Codesearch tools to<br />
a minimum (see footnote 10), a custom indexing tool has been implemented to replace Codesearch’s cindex.<br />
This replacement, called dcsindex, partitions the index into multiple files. Additionally, it<br />
tries to reduce the index size in the first place by ignoring some files entirely.<br />
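As a rough illustration of partitioning combined with file filtering, the Go sketch below greedily groups files into shards under a size budget. It is not dcsindex's actual algorithm, which the text does not specify: the file type, the skip filter with its example extensions, and the use of summed input size as a stand-in for index size are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// file is a hypothetical description of one source file to index.
type file struct {
	path string
	size int64
}

// skip is a placeholder filter; the extensions are illustrative only.
func skip(f file) bool {
	switch filepath.Ext(f.path) {
	case ".png", ".gz":
		return true
	}
	return false
}

// partition greedily assigns files to shards so that the summed input
// size of each shard stays at or below budget. A single file larger
// than budget ends up in a shard of its own.
func partition(files []file, budget int64) [][]file {
	var shards [][]file
	var cur []file
	var curSize int64
	for _, f := range files {
		if skip(f) {
			continue
		}
		if len(cur) > 0 && curSize+f.size > budget {
			shards = append(shards, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, f)
		curSize += f.size
	}
	if len(cur) > 0 {
		shards = append(shards, cur)
	}
	return shards
}

func main() {
	files := []file{
		{"a/main.c", 40}, {"a/logo.png", 500}, {"b/util.c", 50},
		{"c/parse.c", 30}, {"c/lex.c", 60},
	}
	for i, s := range partition(files, 100) {
		fmt.Printf("shard %d: %d files\n", i, len(s))
	}
}
```

In a real setting the budget would be chosen so that each resulting index file stays safely below the 2 GiB limit.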
Partitioning not only works around the index size limitation, it also helps<br />
when distributing the index across multiple machines. Instead of running six index shards<br />
on one machine, one could run three index shards on each of two machines.<br />
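The distribution of shards across machines can be sketched as a simple round-robin assignment. The function assign and the shard and machine names below are hypothetical and serve only to illustrate the six-shards-on-two-machines example.

```go
package main

import "fmt"

// assign distributes shard names across machines in round-robin order.
// It returns an empty map when no machines are given.
func assign(shards, machines []string) map[string][]string {
	out := make(map[string][]string, len(machines))
	if len(machines) == 0 {
		return out
	}
	for i, s := range shards {
		m := machines[i%len(machines)]
		out[m] = append(out[m], s)
	}
	return out
}

func main() {
	// Placeholder shard and machine names.
	shards := []string{"idx0", "idx1", "idx2", "idx3", "idx4", "idx5"}
	machines := []string{"hostA", "hostB"}
	for _, m := range machines {
		fmt.Println(m, assign(shards, machines)[m])
	}
}
```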
9 This is because len() always returns an int to avoid endless loops in a for i:=0; i