

HTTP Caching

Many elements of a typical web site design are repeated on every page you visit, and your browsing would slow to a crawl if every image and decoration had to be downloaded separately for every page you viewed. Well-configured web servers therefore add headers to every HTTP response that allow browsers, as well as any proxy caches between the browser and the server, to continue using a copy of a downloaded resource for some period of time until it expires.

You might think that adding a simple expiration date to each resource that could be cached and redisplayed would have been a sufficient innovation. However, given the real-world behaviors of servers, caches, and browsers, it was prudent for the HTTP specification to detail a much more complicated scheme involving several interacting headers. Several pages are expended, for example, on the specific question of how to determine how old a cached copy of a page is. I refer you to RFC 2616 for the real details, but I will cover a few of the most common cases here.

There are two basic mechanisms by which servers can support client caching.

In the first approach, an HTTP response includes an Expires: header that formats a date and time using the same format as the standard Date: header:

Expires: Thu, 21 Jan 2010 17:06:12 GMT

However, this requires the client to check its clock, and many computers run clocks that are far ahead of or behind the real current date and time.
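To make the clock dependency concrete, here is a minimal sketch, using only the Python standard library, of how a client might parse an Expires: header and compare it against its own clock. The URL is a placeholder:

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen

response = urlopen('http://example.com/logo.png')  # placeholder URL
expires = response.headers.get('Expires')
if expires is not None:
    expiry = parsedate_to_datetime(expires)
    # This test is only as trustworthy as the local clock, which is
    # exactly the weakness just described.
    if datetime.now(timezone.utc) < expiry:
        print('Cached copy is still fresh')
    else:
        print('Cached copy has expired; refetch it')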

This brings us to a second, more modern alternative: the Cache-Control header, which depends only on the client being able to correctly count seconds forward from the present. For example, to allow an image or page to be cached for an hour, but then insist that it be refetched once the hour is up, a Cache-Control header could be supplied like this:

Cache-Control: max-age=3600, must-revalidate
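A client can honor max-age by counting seconds forward from the moment of the fetch, with no need to agree with the server about the wall-clock date. Here is a sketch, again with a placeholder URL:

import re
import time
from urllib.request import urlopen

response = urlopen('http://example.com/logo.png')  # placeholder URL
fetched_at = time.monotonic()  # start counting at the moment of the fetch

match = re.search(r'max-age=(\d+)', response.headers.get('Cache-Control', ''))
if match is not None:
    max_age = int(match.group(1))
    # Later, when deciding whether the cached copy may still be shown:
    if time.monotonic() - fetched_at < max_age:
        print('Within max-age; serve from the cache')
    else:
        print('Stale; revalidate with the server')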

When the time comes to validate a cached resource, HTTP offers a very nice shortcut: the client can ask the server to retransmit the resource only if a new version has indeed been released. There are two fields that the client can supply, and either one is sufficient to convince most servers to answer with only HTTP headers, and no body, if the cached resource is still current. One possibility is to send back the value that the Last-Modified: header had in the HTTP response that first delivered the item:

If-Modified-Since: Thu, 21 Jan 2010 14:06:12 GMT
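Here is a sketch of such a conditional request from Python. Note that urllib reports a 304 Not Modified response by raising HTTPError, which in this case simply means that the cached copy may still be used; the URL is a placeholder:

from urllib.error import HTTPError
from urllib.request import Request, urlopen

request = Request('http://example.com/logo.png',  # placeholder URL
                  headers={'If-Modified-Since': 'Thu, 21 Jan 2010 14:06:12 GMT'})
try:
    body = urlopen(request).read()  # 200 OK: a newer version has arrived
except HTTPError as e:
    if e.code == 304:
        body = None  # Not Modified: keep using the cached copy
    else:
        raise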

Alternatively, if the server tagged the resource version with a hash or version identifier in an ETag: header (either approach will work, so long as the value always changes between versions of the resource), then the client can send that value back in an If-None-Match: header:

If-None-Match: BFDS2Cpq/BM6w
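The same exchange can be sketched with http.client, which shows the raw request a bit more plainly; the host, path, and tag value here are placeholders:

import http.client

conn = http.client.HTTPConnection('example.com')  # placeholder host
conn.request('GET', '/logo.png',
             headers={'If-None-Match': 'BFDS2Cpq/BM6w'})
response = conn.getresponse()
if response.status == 304:
    print('Tag still matches; use the cached copy')
else:
    body = response.read()  # a new version, carrying a fresh ETag
    print('New version; its ETag is', response.getheader('ETag'))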

Note that all of this depends on getting some level of cooperation from the server. If a web server fails to provide any caching guidelines and also does not supply either a Last-Modified: or ETag: header for a particular resource, then clients have no choice but to fetch the resource every time it needs to be displayed to a user.
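The client's predicament can be expressed as a small test; this hypothetical helper simply checks whether a response offers any handle for caching or revalidation at all:

def is_cacheable(headers):
    # A hypothetical helper: with none of these headers present, a
    # client has nothing to cache against and nothing to revalidate
    # with, so every display requires a fresh fetch.
    return any(header in headers for header in
               ('Cache-Control', 'Expires', 'Last-Modified', 'ETag'))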

Caching is such a powerful technology that many web sites go ahead and put HTTP caches like Squid or Varnish in front of their server farms, so that frequent requests for the most popular parts of their site can be answered without loading down the main servers. Deploying caches geographically can also save bandwidth. In a celebrated question-and-answer session with the readers of Reddit about The Onion’s then-recent migration to Django, the site maintainers, who use a content delivery network (CDN) to transparently serve local caches of The Onion’s web site all over the world, indicated that they were able to reduce their server load by two-thirds by asking the CDN to cache 404 errors! You can read

