13.02.2013 Views

2 Debian Code Search: An Overview

2 Debian Code Search: An Overview

2 Debian Code Search: An Overview

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

3 Architecture<br />

3.8.1 Trigram index limitations<br />

This work is based on Russ Cox’ <strong>Code</strong>search tools [2] (see also 3.2, page 7), which define an<br />

index data structure, index source code and support searching the previously indexed source<br />

code using regular expressions.<br />

Google <strong>Code</strong> <strong>Search</strong> was launched in 2006, but Russ Cox’ <strong>Code</strong>search tools are implemented<br />

in Go, a language which was created in late 2007 [6] . Therefore, they cannot be the original<br />

Google <strong>Code</strong> <strong>Search</strong> implementation, and have not been used for a large-scale public search<br />

engine.<br />

On the contrary, Russ recommends them for local repositories, presumably not larger than<br />

a few hundred megabytes:<br />

“If you miss Google <strong>Code</strong> <strong>Search</strong> and want to run fast indexed regular expression<br />

searches over your local code, give the standalone programs a try.” [2]<br />

One obvious limitation of the index implementation is its size limitation of 4 GiB, simply<br />

because the offsets it uses are 32-bit unsigned integers which cannot address more than 4 GiB.<br />

<strong>An</strong>other not so obvious limitation is that even though an index with a size between 2 GiB<br />

and 4 GiB can be created, it cannot be used later on, since Go’s mmap implementation only<br />

supports 2 GiB 9 .<br />

To circumvent this limitation while keeping modifications to Russ Cox’ <strong>Code</strong>search tools to<br />

a minimum 10 , a custom indexing tool to replace <strong>Code</strong>search’s cindex has been implemented.<br />

This replacement, called dcsindex, partitions the index into multiple files. Additionally, it<br />

tries to reduce the index size in the first place by ignoring some files entirely.<br />

Partitioning does not only work around the index size limitation, it also comes in handy<br />

when distributing the index across multiple machines. Instead of running six index shards<br />

on one machine, one could run three index shards on each of two machines.<br />

9 This is because len() always returns an int to avoid endless loops in a for i:=0; i

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!