Foundations of Python Network Programming 978-1-4302-3004-5

CHAPTER 9 ■ HTTP

Identifying User Agents and Web Servers

You may have noticed that the HTTP request we opened the chapter with advertised the fact that it was generated by a Python program:

User-Agent: Python-urllib/2.6
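For readers on Python 3, where urllib2 was folded into urllib.request, the default identification can be inspected without sending anything over the network. This is a minimal sketch, assuming a stock CPython standard library; the version suffix follows whichever interpreter runs it rather than the 2.6 shown above.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds the
# headers it attaches to requests by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# urllib spells the header name 'User-agent' internally.
default_agent = default_headers['User-agent']
print(default_agent)  # a string of the form Python-urllib/<major>.<minor>
```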

This header is optional in the HTTP protocol, and many sites simply ignore or log it. It can be useful when sites want to know which browsers their visitors use most often, and it can sometimes be used to distinguish search engine spiders (bots) from normal users browsing a site. For example, here are a few of the user agents that have hit my own web site in the past few minutes:

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.62 Safari/534.3

You will note that, the urllib2 user agent string notwithstanding, most clients choose to identify themselves as some form of the original Netscape browser, whose internal code name was Mozilla. But then, in parentheses, these same browsers secretly admit that they are really some other kind of browser.
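The parenthesized part of an agent string can be pulled apart with a few lines of Python. The following is only a sketch: the helper names are hypothetical, and treating any token that contains "bot" as a spider is an assumption that happens to fit the four sample strings listed earlier, not a general rule.

```python
import re

# Sample strings copied from the listing earlier in this section.
SAMPLES = [
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
    'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
    '.NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.3 '
    '(KHTML, like Gecko) Chrome/6.0.472.62 Safari/534.3',
]

def parenthesized_tokens(agent):
    """Return the semicolon-separated tokens of the first (...) group."""
    match = re.search(r'\(([^)]*)\)', agent)
    if match is None:
        return []
    return [token.strip() for token in match.group(1).split(';')]

def looks_like_bot(agent):
    """Heuristic (an assumption): some token contains the text 'bot'."""
    return any('bot' in token.lower()
               for token in parenthesized_tokens(agent))

for sample in SAMPLES:
    kind = 'bot' if looks_like_bot(sample) else 'browser'
    print(kind, parenthesized_tokens(sample))
```

A real log analyzer would need a maintained list of known agents, since nothing obliges a spider to include "bot" in its name.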

Many web sites are sensitive to the kinds of browsers that view them, most often because their designers were too lazy to make the sites work with anything other than Internet Explorer. If you need to access such sites with urllib2, you can simply instruct it to lie about its identity, and the receiving web site will not know the difference:

>>> import urllib2
>>> url = 'https://wca.eclaim.com/'
>>> # With the default Python-urllib agent, the site answers with a browser-requirements notice
>>> urllib2.urlopen(url).read()
'...The following are...required...Microsoft Internet Explorer...'
>>> # Claim to be Internet Explorer 7 instead
>>> agent = 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)'
>>> request = urllib2.Request(url)
>>> request.add_header('User-Agent', agent)
>>> urllib2.urlopen(request).read()
'\r\n\r\n\r\n\tEclaim.com - Log In...'
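On Python 3 the same idea can be expressed with urllib.request. This is a hedged sketch rather than a transcript: it builds the request and inspects the stored header but deliberately does not contact the site, which may no longer exist or respond as it did when the session above was recorded.

```python
import urllib.request

url = 'https://wca.eclaim.com/'
agent = 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)'

# Headers can be supplied when the Request is built instead of through a
# later add_header() call.
request = urllib.request.Request(url, headers={'User-Agent': agent})

# urllib normalizes header names with str.capitalize(), so the stored key
# is 'User-agent' and that is the spelling get_header() expects.
stored_agent = request.get_header('User-agent')
print(stored_agent)

# To actually send it (not executed here):
# body = urllib.request.urlopen(request).read()
```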

There are databases of possible user agent strings online at several sites that you can reference both when analyzing agent strings that your own servers have received and when concocting strings for your own HTTP requests:

http://www.zytrax.com/tech/web/browser_ids.htm

http://www.useragentstring.com/pages/useragentstring.php

Besides using the agent string to enforce compatibility requirements (usually in an effort to reduce development and support costs), some web sites have started using the string to detect mobile browsers and redirect the user to a miniaturized mobile version of the site for better viewing on phones and iPods. A Python project named mobile.sniffer that attempts to support this technique can be found on the Package Index.
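To make the redirect idea concrete, here is a deliberately naive sketch of server-side mobile detection. It does not use or imitate the mobile.sniffer API; the function names, the hint list, the iPod agent string, and the example.com URLs are all illustrative assumptions.

```python
# Substrings that tend to show up in mobile agent strings. This list is an
# assumption for illustration, not an authoritative database.
MOBILE_HINTS = ('iphone', 'ipod', 'android', 'blackberry', 'opera mini')

def is_mobile(agent):
    """Return True if the agent string contains any known mobile hint."""
    agent = (agent or '').lower()
    return any(hint in agent for hint in MOBILE_HINTS)

def choose_site(agent, desktop_url, mobile_url):
    """Pick the URL a visitor should be sent to, based on the agent."""
    return mobile_url if is_mobile(agent) else desktop_url

# Hypothetical agent string for an iPod, and placeholder site URLs.
ipod_agent = 'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like Mac OS X; en-us)'
target = choose_site(ipod_agent,
                     'http://www.example.com/',
                     'http://m.example.com/')
print(target)
```

Because a missing header is treated as an empty string, clients that send no User-Agent at all fall through to the desktop site.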
