13.02.2013

2 Debian Code Search: An Overview

3.5 Resource requirements and load-balancing

As you can see in figure 3.2, when the user requests http://codesearch.debian.net/,
the browser first needs to resolve the name codesearch.debian.net to an IP address using
the Domain Name System (DNS). This is the first step where the load can be balanced between
multiple servers: the browser will connect to the first IP address it gets, so a DNS server
can just return all IP addresses in a different order for each query (e.g. round-robin). DNS
for debian.net is hosted by the Debian project, so Debian Code Search doesn’t have to set up
or maintain any software or hardware for that.
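
The round-robin behaviour described above can be illustrated with a small toy model. This is only a sketch, not the DNS setup used for debian.net: the class name `RoundRobinResolver` is hypothetical, and the addresses are documentation-only placeholders from RFC 5737. It shows that rotating the answer by one position per query spreads clients that always pick the first address evenly over all servers.

```python
from collections import deque


class RoundRobinResolver:
    """Toy model of a DNS server that rotates its answer on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        # Return all addresses, then rotate so the next query
        # starts with the next address in the list.
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer


# Placeholder addresses (RFC 5737), not the real Debian Code Search servers.
resolver = RoundRobinResolver(["192.0.2.1", "192.0.2.2", "192.0.2.3"])

# A browser connects to the first address it gets.
first_choices = [resolver.resolve()[0] for _ in range(6)]
print(first_choices)
```

Each of the three placeholder servers is chosen twice over six queries, which is the balancing effect the text relies on.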

After resolving the hostname, the browser will open a TCP connection on port 80 to the
resolved IP address and send an HTTP request. This request will be answered by the HTTP
frontend webserver, which is the second step where the load can be balanced and redundancy
can be added: the frontend can split the load between the available backends, and requests
can still be answered if a certain number of backends fail.
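
A minimal sketch of this second balancing step, under stated assumptions: the `Frontend` class, the backend names and the use of `ConnectionError` to signal a failed backend are all hypothetical and only model the idea. A real deployment would use an HTTP frontend webserver rather than code like this.

```python
import itertools


class Frontend:
    """Toy frontend: rotates over backends and skips ones that fail."""

    def __init__(self, backends):
        self._backends = backends  # maps backend name -> callable
        self._order = itertools.cycle(list(backends))

    def handle(self, request):
        # Try each backend at most once per request.
        for _ in range(len(self._backends)):
            name = next(self._order)
            try:
                return name, self._backends[name](request)
            except ConnectionError:
                continue  # this backend is down, try the next one
        raise RuntimeError("all backends failed")


def healthy(request):
    return "results for " + request


def broken(request):
    raise ConnectionError("backend unreachable")


frontend = Frontend({"b1": healthy, "b2": broken, "b3": healthy})
served = [frontend.handle("query")[0] for _ in range(3)]
print(served)
```

With one of three backends down, every request is still answered, and the load alternates between the two remaining backends.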

Furthermore, the backend only has to communicate with the frontend, so the burden
of handling TCP connections (especially slow connections) is entirely on the frontend.
Requests which can be answered from the cache (such as static pages, images, stylesheets
and JavaScript files) can be served directly from the frontend without causing any load on
the backend. The HTTP frontend runs on a Debian Code Search machine.

dcs-web receives actual search requests and runs on a Debian Code Search machine. This
might be the same machine the frontend runs on, or a different, dedicated machine if
demand is so high that this is necessary to maintain good performance. To answer a request,
dcs-web needs to perform the following steps:

1. Query all index backends. The index is sharded into multiple index backend processes
   due to technical limitations, see section 3.8.1, page 16.
2. Rank the results.
3. Send the results to one of the source backends, which performs the actual searching.
4. Format the response.
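
The four steps above can be sketched as a small in-memory pipeline. This is an illustration only, not the dcs-web implementation: all function names are hypothetical, the trigram-based candidate filter is an assumption about how an index backend narrows down files, and the ranking is a placeholder (shorter paths first).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: two shards, each holding some files.
FILES_BY_SHARD = [
    {"pkg-a/main.c": "int main(void) {\n    return run();\n}\n"},
    {"pkg-b/src/loop.c": "void run_loop(void);\nint run(void);\n",
     "pkg-b/doc.txt": "nothing here\n"},
]
ALL_FILES = {p: c for shard in FILES_BY_SHARD for p, c in shard.items()}


def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}


# One index shard per "index backend": path -> set of trigrams.
INDEX_SHARDS = [{p: trigrams(c) for p, c in shard.items()}
                for shard in FILES_BY_SHARD]


def query_index_backend(shard, query):
    """Step 1: return files that might contain the query."""
    wanted = trigrams(query)
    return [path for path, grams in shard.items() if wanted <= grams]


def rank(paths):
    """Step 2: placeholder ranking, shorter paths first."""
    return sorted(paths, key=lambda p: (len(p), p))


def source_backend_search(paths, query):
    """Step 3: the actual search over the candidate files."""
    matches = []
    for path in paths:
        for lineno, line in enumerate(ALL_FILES[path].splitlines(), 1):
            if query in line:
                matches.append((path, lineno, line.strip()))
    return matches


def format_response(matches):
    """Step 4: render the matches as text."""
    return "\n".join(f"{p}:{n}: {line}" for p, n, line in matches)


def handle_search(query):
    # All index backends are queried concurrently.
    with ThreadPoolExecutor() as pool:
        per_shard = pool.map(lambda s: query_index_backend(s, query),
                             INDEX_SHARDS)
        candidates = [p for paths in per_shard for p in paths]
    return format_response(source_backend_search(rank(candidates), query))


print(handle_search("run("))
```

Note that the index step only produces candidates: `pkg-b/src/loop.c` passes the filter, but only its second line is reported, because the source backend step checks the actual file contents.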

Each index backend and source backend corresponds to one process, which typically will
run on the same machine that dcs-web runs on. Should the index size grow so much that it
cannot be held by one machine anymore, index backends can also run on different machines
which are connected by a low-latency network.

Should it turn out that disk bandwidth is a problem, one can run multiple source backends,
one for each disk. These source backend processes can be run on the same machine with
different disks or on different machines, just like the index backend processes.
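
The text does not say how files would be distributed over several source backends. One possible scheme, shown here purely as an assumption, is to pick the backend from a stable hash of the package directory, so that all files of one package end up on the same disk. The backend names and the function `pick_source_backend` are hypothetical.

```python
import zlib

# Hypothetical backend names, one per disk.
BACKENDS = ["source-backend-disk0", "source-backend-disk1"]


def pick_source_backend(path, backends):
    """Choose a backend from a stable hash of the top-level directory."""
    package = path.split("/", 1)[0]
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return backends[zlib.crc32(package.encode("utf-8")) % len(backends)]


for path in ["pkg-a/main.c", "pkg-a/util.c", "pkg-b/src/loop.c"]:
    print(path, "->", pick_source_backend(path, BACKENDS))
```

Because the choice depends only on the package name, dcs-web could compute the responsible backend for any candidate file without keeping a lookup table.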

Index backends, if all deployed on a single machine, need to run on a machine with at
least 8 GiB of RAM. Not keeping the index in RAM means that each request needs to perform
a lot of additional random disk accesses, which are particularly slow when the machine does
not use a solid state disk (SSD) for storing the index [28].

Source backends profit from storing their data on a solid state disk (SSD) for low-latency,
high-bandwidth random file access. Keeping the filesystem metadata in RAM reduces disk
access even further. The more RAM the machine which hosts the source backend has, the
better: unused RAM will be used by Linux to cache file contents [24], so search queries for
popular files might never even hit the disk at all, if the machine has plenty of RAM. 16 GiB
