
Web Application Penetration Testing

2. Create the list of directories that are to be avoided by Spiders, Robots, or Crawlers.

How to Test

robots.txt

Web Spiders, Robots, or Crawlers retrieve a web page and then recursively traverse hyperlinks to retrieve further web content. Their accepted behavior is specified by the Robots Exclusion Protocol of the robots.txt file in the web root directory [1].

As an example, the beginning of the robots.txt file from http://www.google.com/robots.txt sampled on 11 August 2013 is quoted below:

User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
Disallow: /catalogs
...

cmlh$ wget http://www.google.com/robots.txt
--2013-08-11 14:40:36-- http://www.google.com/robots.txt
Resolving www.google.com... 74.125.237.17, 74.125.237.18, 74.125.237.19, ...
Connecting to www.google.com|74.125.237.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘robots.txt’

[ ] 7,074 --.-K/s in 0s

2013-08-11 14:40:37 (59.7 MB/s) - ‘robots.txt’ saved [7074]

cmlh$ head -n5 robots.txt
User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
cmlh$

The User-Agent directive refers to a specific web spider/robot/crawler. For example, “User-agent: Googlebot” refers to the spider from Google, while “User-agent: bingbot” [1] refers to the crawler from Microsoft/Yahoo!. “User-agent: *” in the example above applies to all web spiders/robots/crawlers [2], as quoted below:

User-agent: *

The Disallow directive specifies which resources are off-limits to spiders/robots/crawlers. In the example above, the following directories are prohibited:

...
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
Disallow: /catalogs
...
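
These directives can also be inspected programmatically. The following is a minimal sketch (not part of the original example) that uses Python’s standard urllib.robotparser module to apply the User-agent and Disallow rules to candidate URLs:

# Sketch: parse robots.txt with Python's standard library and test
# whether a given user agent may fetch a given URL.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.google.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# can_fetch() evaluates the rules for the named user agent
print(rp.can_fetch("*", "http://www.google.com/search"))  # False: /search is disallowed
print(rp.can_fetch("*", "http://www.google.com/"))        # result depends on the current rules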

Web spiders/robots/crawlers can intentionally ignore the Disallow directives specified in a robots.txt file [3], such as those from Social Networks [2], to ensure that shared links are still valid. Hence, robots.txt should not be considered a mechanism to enforce restrictions on how web content is accessed, stored, or republished by third parties.
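
To illustrate the point, consider the following sketch (an illustration under assumed conditions, not a test case from the guide): a client that simply never reads robots.txt can request a “disallowed” resource directly, and the server processes the request like any other.

# Sketch: requesting a path listed under Disallow succeeds or fails based
# on the server's normal access control only; robots.txt is never consulted.
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://www.google.com/search?q=test",  # /search is disallowed in robots.txt
    headers={"User-Agent": "example-crawler/1.0"},  # hypothetical user agent
)
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status)  # the server answers regardless of robots.txt
except urllib.error.HTTPError as e:
    print(e.code)  # any rejection comes from the server, not from robots.txt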

robots.txt in webroot - with “wget” or “curl”

The robots.txt file is retrieved from the web root directory of the web server. For example, to retrieve the robots.txt from www.google.com using “wget” or “curl”:

cmlh$ curl -O http://www.google.com/robots.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
101  7074    0  7074    0     0   9410      0 --:--:-- --:--:-- --:--:-- 27312

cmlh$ head -n5 robots.txt
User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
cmlh$
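
If neither “wget” nor “curl” is available, the same retrieval can be scripted. A minimal sketch in Python (assuming only the standard library):

# Sketch: fetch robots.txt and print its first five lines,
# equivalent to the wget/curl plus head -n5 steps above.
import urllib.request

with urllib.request.urlopen("http://www.google.com/robots.txt") as resp:
    body = resp.read().decode("utf-8", errors="replace")

print("\n".join(body.splitlines()[:5]))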

robots.txt in webroot - with rockspider<br />

“rockspider”[3] automates the creation of the initial scope for Spiders/<br />

Robots/Crawlers of files and directories/folders of a web site.<br />

For example, to create the initial scope based on the Allowed: directive<br />

from www.google.com using “rockspider”[4]:<br />

cmlh$ ./rockspider.pl -www www.google.com
“Rockspider” Alpha v0.1_2
Copyright 2013 Christian Heinrich
Licensed under the Apache License, Version 2.0
1. Downloading http://www.google.com/robots.txt
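
The idea rockspider automates can be sketched as follows (an illustration of the concept, not rockspider’s actual code): collect the paths named in the Allow and Disallow directives into a list of candidate entry points for the initial scope.

# Sketch: build an initial testing scope from the Allow/Disallow paths
# declared in robots.txt.
import urllib.request

with urllib.request.urlopen("http://www.google.com/robots.txt") as resp:
    lines = resp.read().decode("utf-8", errors="replace").splitlines()

scope = []
for line in lines:
    directive, _, path = line.partition(":")
    if directive.strip().lower() in ("allow", "disallow") and path.strip():
        scope.append(path.strip())  # each path is a candidate starting point

print(len(scope), "paths collected for the initial scope")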
