17.12.2012 Views

Programmation PYTHON - Zenk - Security - Repository


The standard library

PART THREE

Links are not followed; they are left as-is.

Topics covered

urllib2, SGMLParser, file creation.

Solution

Site downloader

#!/usr/bin/python
# -*- coding: utf8 -*-
import sys
import os
import urllib2
import logging
from urlparse import urlsplit
from urlparse import urlunsplit
from os.path import join
from HTMLParser import HTMLParser
from sgmllib import SGMLParser

class PageParser(SGMLParser):
    """Parses a web page and collects its links."""

    def __init__(self, on_attribute_visited, tags_to_remove=('base',)):
        SGMLParser.__init__(self)
        self.on_attribute_visited = on_attribute_visited
        self.tags_to_remove = tags_to_remove

    def unknown_starttag(self, tag, attrs):
        if tag.lower() in self.tags_to_remove:
            return None
        # The markup inside this string was lost in extraction. Rebuilding the
        # opening tag from its name and attributes is an assumption, as is the
        # (tag, name, value) signature of the on_attribute_visited callback.
        final_tag = '<%s' % tag
        for name, value in attrs:
            value = self.on_attribute_visited(tag, name, value)
            final_tag += ' %s="%s"' % (name, value)
        final_tag += '>'
        self._result.append(final_tag)

    def unknown_endtag(self, tag):
        if tag.lower() in self.tags_to_remove:
            return None
        # Re-emit the closing tag (the '</%s>' literal was stripped in extraction)
        self._result.append('</%s>' % tag)

    def parse(self, data):
        self._result = []
        self.feed(data)
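The listing above targets Python 2, and `sgmllib` no longer exists in Python 3. As a rough illustration of the same idea (re-emit each tag, drop the unwanted ones, and pass every attribute through a callback), here is a minimal sketch built on Python 3's `html.parser`. It is a swapped-in technique, not the book's code: the class name `PageRewriter`, the helper `_render`, and the `(tag, name, value)` callback signature are assumptions made for this example. Comments and the doctype are dropped by this sketch.

```python
from html import escape
from html.parser import HTMLParser


class PageRewriter(HTMLParser):
    """Re-emit a page tag by tag, passing each attribute through a callback.

    Hypothetical Python 3 counterpart of PageParser; the callback is assumed
    to take (tag, name, value) and return the attribute value to write out.
    """

    def __init__(self, on_attribute_visited, tags_to_remove=('base',)):
        # convert_charrefs=False routes &amp; and friends to the
        # handle_entityref / handle_charref hooks instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.on_attribute_visited = on_attribute_visited
        self.tags_to_remove = tags_to_remove
        self._result = []

    def _render(self, tag, attrs, closing='>'):
        parts = ['<' + tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as <input disabled>
                parts.append(' ' + name)
                continue
            value = self.on_attribute_visited(tag, name, value)
            # The parser hands over decoded values, so escape them again.
            parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        return ''.join(parts) + closing

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_remove:
            return
        self._result.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag in self.tags_to_remove:
            return
        self._result.append(self._render(tag, attrs, ' />'))

    def handle_endtag(self, tag):
        if tag in self.tags_to_remove:
            return
        self._result.append('</%s>' % tag)

    def handle_data(self, data):
        self._result.append(data)

    def handle_entityref(self, name):
        self._result.append('&%s;' % name)

    def handle_charref(self, name):
        self._result.append('&#%s;' % name)

    def parse(self, data):
        """Parse `data` and return the rewritten markup as one string."""
        self._result = []
        self.reset()
        self.feed(data)
        self.close()
        return ''.join(self._result)
```

A downloader could supply a callback that rewrites `href` and `src` values to local paths while returning every other attribute unchanged; what that callback does with each link is left open here, since the excerpt does not show it.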
