Foundations of Python Network Programming 978-1-4302-3004-5
CHAPTER 9 ■ HTTP

HTTP Caching
Many elements of a typical web site design are repeated on every page you visit, and your browsing would slow to a crawl if every image and decoration had to be downloaded separately for every page you viewed. Well-configured web servers therefore add headers to every HTTP response that allow browsers, as well as any proxy caches between the browser and the server, to continue using a copy of a downloaded resource for some period of time until it expires.

You might think that adding a simple expiration date to each resource that could be cached and redisplayed would have been a sufficient innovation. However, given the real-world behaviors of servers, caches, and browsers, it was prudent for the HTTP specification to detail a much more complicated scheme involving several interacting headers. Several pages are expended, for example, on the specific question of how to determine how old a cached copy of a page is. I refer you to RFC 2616 for the real details, but I will cover a few of the most common cases here.
There are two basic mechanisms by which servers can support client caching.

In the first approach, an HTTP response includes an Expires: header that formats a date and time using the same format as the standard Date: header:

Expires: Thu, 21 Jan 2010 17:06:12 GMT
However, this requires the client to check its clock—and many computers run clocks that are far ahead of or behind the real current date and time.
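Checking an Expires: date is short work with the Python standard library. This sketch (the is_fresh helper is my own name, not something from this book) parses the RFC 1123 date and compares it against the local clock, which is exactly the comparison that a skewed clock can get wrong:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(expires_value):
    """Return True if the Expires: date still lies in the future
    according to this machine's own (possibly skewed) clock."""
    expires = parsedate_to_datetime(expires_value)
    return expires > datetime.now(timezone.utc)

print(is_fresh('Thu, 21 Jan 2010 17:06:12 GMT'))  # long past, so False
```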
This brings us to a second, more modern alternative, the Cache-Control: header, which depends only on the client being able to correctly count seconds forward from the present. For example, to allow an image or page to be cached for an hour but then insist that it be refetched once the hour is up, a Cache-Control header can be supplied like this:

Cache-Control: max-age=3600, must-revalidate
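Interpreting that header requires nothing more than counting seconds. The following sketch (the function name and structure are my own, and it handles only the max-age directive, ignoring the many others that RFC 2616 defines) computes the moment at which a response received right now would need revalidation:

```python
import time

def revalidate_after(cache_control, received=None):
    """Return the epoch second after which a cached copy must be
    revalidated, or None if no max-age directive is present."""
    if received is None:
        received = time.time()
    for directive in cache_control.split(','):
        name, _, value = directive.strip().partition('=')
        if name == 'max-age' and value.isdigit():
            return received + int(value)
    return None

print(revalidate_after('max-age=3600, must-revalidate', received=0.0))  # one hour after receipt
```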
When the time comes to validate a cached resource, HTTP offers a very nice shortcut: the client can ask the server to retransmit the resource only if a new version has indeed been released. There are two request fields that the client can supply, and either one is sufficient to convince most servers to answer with only a status line and headers, but no body, if the cached resource is still current. One possibility is to send back the value that the Last-Modified: header had in the HTTP response that first delivered the item:

If-Modified-Since: Thu, 21 Jan 2010 14:06:12 GMT
Alternatively, if the server tagged the resource version with a hash or version identifier in an ETag: header—either approach will work, so long as the value always changes between versions of the resource—then the client can send that value back in an If-None-Match: header:

If-None-Match: BFDS2Cpq/BM6w
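A client can attach either validator to its next request and treat a 304 Not Modified answer as permission to keep using its cached copy. Here is one way to sketch that with the standard urllib package (the helper names are my own); note that urllib.request reports a 304 by raising HTTPError, so the "not modified" case surfaces as an exception:

```python
import urllib.error
import urllib.request

def build_conditional_request(url, last_modified=None, etag=None):
    """Attach whichever validators we saved when we cached the resource."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    if etag:
        request.add_header('If-None-Match', etag)
    return request

def fetch_if_changed(url, last_modified=None, etag=None):
    """Return (status, body); body is None when the cached copy is still good."""
    request = build_conditional_request(url, last_modified, etag)
    try:
        with urllib.request.urlopen(request) as response:
            return response.getcode(), response.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return 304, None  # our cached copy is still current
        raise
```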
Note that all of this depends on getting some level of cooperation from the server. If a web server fails to provide any caching guidelines and also does not supply either a Last-Modified: or ETag: header for a particular resource, then clients have no choice but to fetch the resource every time it needs to be displayed to a user.
Caching is such a powerful technology that many web sites go ahead and put HTTP caches like Squid or Varnish in front of their server farms, so that frequent requests for the most popular parts of their site can be answered without loading down the main servers. Deploying caches geographically can also save bandwidth. In a celebrated question-and-answer session with the readers of Reddit about The Onion's then-recent migration to Django, the site maintainers—who use a content delivery network (CDN) to transparently serve local caches of The Onion's web site all over the world—indicated that they were able to reduce their server load by two-thirds by asking the CDN to cache 404 errors! You can read