3.5 Resource requirements and load-balancing<br />
As you can see in figure 3.2, when the user requests http://codesearch.debian.net/,<br />
the browser first needs to resolve the name codesearch.debian.net to an IP address using<br />
the Domain Name System (DNS). This is the first step where the load can be balanced between<br />
multiple servers: the browser will connect to the first IP address it gets, so a DNS server<br />
can just return all IP addresses in a different order (e.g. round-robin). DNS for debian.net is<br />
hosted by the <strong>Debian</strong> project, so <strong>Debian</strong> <strong>Code</strong> <strong>Search</strong> doesn’t have to set up or maintain any<br />
software or hardware for that.<br />
After resolving the hostname, the browser will open a TCP connection on port 80 to the<br />
resolved IP address and send an HTTP request. This request will be answered by the HTTP<br />
frontend webserver, which is the second step where the load can be balanced and redundancy<br />
can be added: the frontend can split the load between the available backends, and requests<br />
can still be answered if a certain number of backends fail.<br />
Furthermore, the backend only has to communicate with the frontend, so the burden<br />
of handling TCP connections — especially slow connections — is entirely on the frontend.<br />
Requests which can be answered from the cache (such as static pages, images, stylesheets<br />
and JavaScript files) can be served directly from the frontend without causing any load on<br />
the backend. The HTTP frontend runs on a <strong>Debian</strong> <strong>Code</strong> <strong>Search</strong> machine.<br />
dcs-web receives actual search requests and runs on a <strong>Debian</strong> <strong>Code</strong> <strong>Search</strong> machine. This<br />
might be the same machine the frontend runs on, or a separate, dedicated machine if the<br />
demand is high enough to require it for good performance. To answer a request,<br />
dcs-web needs to perform the following steps:<br />
1. Query all index backends. The index is sharded into multiple index backend processes<br />
due to technical limitations, see section 3.8.1, page 16.<br />
2. Rank the results.<br />
3. Send the results to one of the source backends, which performs the actual searching.<br />
4. Format the response.<br />
Each index backend and source backend corresponds to one process, which typically will<br />
run on the same machine that dcs-web runs on. Should the index size grow so much that it<br />
cannot be held by one machine anymore, index backends can also run on different machines<br />
which are connected by a low-latency network.<br />
Should it turn out that disk bandwidth is a problem, one can run multiple source backends,<br />
one for each disk. These source backend processes can be run on the same machine with<br />
different disks or on different machines, just like the index backend processes.<br />
If all index backends are deployed on a single machine, that machine needs at<br />
least 8 GiB of RAM. Not keeping the index in RAM means that each request needs to perform<br />
a lot of additional random disk accesses, which are particularly slow when the machine does<br />
not use a solid state disk (SSD) for storing the index [28].<br />
Source backends benefit from storing their data on an SSD for low-latency,<br />
high-bandwidth random file access. Keeping the filesystem metadata in RAM reduces disk<br />
access even further. The more RAM the machine which hosts the source backend has, the<br />
better: unused RAM will be used by Linux to cache file contents [24], so search queries for<br />
popular files might never hit the disk at all. 16 GiB<br />