Advanced Online Media
Dr. Cindy Royal
Texas State University - San Marcos
School of Journalism and Mass Communication

Web Scraping

Scraping is a process by which you can extract data from an html page into a CSV or other format, so you can work with it in Excel and use it in your visualizations. Sometimes you can just copy and paste data from an html table into Excel, and all will be fine. Other times, not. So you have to create a computer program that allows you to identify and scrape the data you want. There are numerous ways in which you can do this, some that involve programming and some that don't.

Chrome Scraper Extension

We used this when we were working with the SXSW data to make your infographics. You download the Scraper extension and try to get it to find the data you want on the page. You may have to modify the XPath (as we did) to have it select just the right data. Chrome has another extension called xpathOnClick that can help you determine the XPath. This is quick and easy and often provides the right solution.

Review: Remember, we used the following XPath to generate the list of items on the page after we search for something:

//a[@class="link_item link_itemInteractive"]

I found that by using the xpathOnClick extension on an item. You install it; then you see an XPath tool on the top right. Before using the tool, choose View, Developer, JavaScript Console. Then click the XPath button and click on one of the elements (you may have to click twice). The console will show you the XPath, and you can try variations in the Scraper window until you get what you need.
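To see what an XPath like that actually matches, here is a small sketch (not part of the original handout) that evaluates the same expression with lxml, the Python library used later in this handout. The HTML snippet is made up for illustration, and the code uses Python 2 print syntax to match the scraper examples below.

import lxml.html

html = """
<html><body>
<a class="link_item link_itemInteractive" href="/one">First item</a>
<a class="link_item link_itemInteractive" href="/two">Second item</a>
<a class="other_link" href="/three">Not selected</a>
</body></html>
"""

root = lxml.html.fromstring(html)

# The same XPath as above: every <a> whose class attribute is exactly
# "link_item link_itemInteractive". The third link does not match.
for link in root.xpath('//a[@class="link_item link_itemInteractive"]'):
    print link.text_content(), link.get("href")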


OutWitHub

This is a Firefox extension as well as a desktop app: https://www.outwit.com/. When you choose Download, use the Firefox Extension version. It's very powerful. You can use it to scrape various types of data from a page. Download the extension, then go to a page with data and open OutWit Hub (it can be found under Tools). It will first show you the page, but if you click on "table", you should be able to see the results of what is in the table. The column on the right allows you to export to CSV. The free, light version limits you to 100 files per download, so you can either do multiple requests or pay for an upgrade.

Try it with this site: http://en.wikipedia.org/wiki/List_of_blogs

It works pretty well with Wikipedia pages. It works nicely to find data in tables, but you can also use it to find lists and images on a page.

Programming a Scraper

These techniques are very powerful and may be all that you need. But let's take a quick look at how you might program your own scraper. You can use any Web programming language to write a scraper, like Python, Ruby or PHP. We are going to look at an example using Python. You can often find a scraper and modify it for your own purposes.

We are going to use ScraperWiki (scraperwiki.com) as the environment for the scraper. ScraperWiki offers a code simulator that keeps you from having to install these programs on your computer, and it provides all sorts of other features that allow you to save and export the data and keep the scraper for later.

1. Go to ScraperWiki and set up an account. It's free. Choose SignUp near the top of the page.
2. You should be able to click on New Scraper to start a scraper. We will be using the following tutorial: https://scraperwiki.com/docs/python/python_intro_tutorial/
3. Follow the instructions for copying each segment of code to the screen, then choose Run. Use the Copy button. You should see it run and not encounter any errors. Do each part separately, then add the next, so you can troubleshoot any errors.
4. We are scraping data from this page: http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm
   It is on the Web Archive site, providing the number of years students spend in school in various countries.

import scraperwiki

html = scraperwiki.scrape("http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm")
print html
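If you want to experiment outside ScraperWiki, you can fetch the same page with Python's standard library. This sketch is not from the handout; it assumes Python 2 (matching the print statements in the scraper code) and simply reproduces what scraperwiki.scrape() does here: an HTTP GET that returns the raw HTML as a string.

import urllib2

url = "http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm"

# Fetch the page and read the response body into a string of raw HTML.
html = urllib2.urlopen(url).read()
print html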


import lxml.html

root = lxml.html.fromstring(html)

for tr in root.cssselect("div[align='left'] tr"):
    tds = tr.cssselect("td")
    if len(tds) == 12:
        data = {
            'country': tds[0].text_content(),
            'years_in_school': int(tds[4].text_content())
        }
        print data

Take a look at the code. There are several elements.

• Import the scraperwiki library. That allows the program to open the URL.
• Scrape the html page.
• Print the html page. If this works, then you know you have access to the page.
• Import the lxml library that does the scraping.
• Establish the root of your document at the html tag.
• For loop: this loop goes through all the tr elements that you identify in the root.cssselect area. You have to look at the page's code to find the marker elements that will let you find the data. In this case, there is a div with a certain alignment, and below that, we want the tr.
• tds is an array that selects the td elements in the table.
• The if statement further refines it to the table that has 12 columns.
• Then the data variable (hash) is set up to extract elements in the tds array. Remember, arrays start with 0. We get the text_content() of the cell. The second one converts the cell to an integer, if you need it to do that.
• The final line prints what is in the data variable (hash).

You should see the data now in the console, if you created the program properly. Save the scraper.

To save the data in the ScraperWiki datastore, you can do so by replacing the print data command with this:

scraperwiki.sqlite.save(unique_keys=['country'], data=data)

You should be able to go to the scraper profile and download the data. This information can be changed for different pages from which you are trying to extract. If you go back to the scraper profile page, you can choose Download to get a CSV of the data.
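Putting those pieces together, the whole scraper reads as follows. This is just the handout's code assembled into one script, with the print data line already swapped for the datastore save described above.

import scraperwiki
import lxml.html

# Fetch the archived UN statistics page as a string of raw HTML.
html = scraperwiki.scrape("http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm")

root = lxml.html.fromstring(html)

# Walk every table row under the left-aligned div and keep only rows
# from the 12-column data table.
for tr in root.cssselect("div[align='left'] tr"):
    tds = tr.cssselect("td")
    if len(tds) == 12:
        data = {
            'country': tds[0].text_content(),
            'years_in_school': int(tds[4].text_content())
        }
        # Save each row to the ScraperWiki datastore, keyed by country.
        scraperwiki.sqlite.save(unique_keys=['country'], data=data)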


Python is space sensitive, so you have to indent things properly. If you get any error messages in the console, read them. They will help you troubleshoot any problems.

A New Example

Now, let's try this with some other data. We will reuse the code from this scraper. On your scraper profile, choose Copy. This is also known as forking. It will create an exact copy of the code in a new scraper. You should click on the Edit pencil next to the name to give it a new name. Call this one "music."

• Let's go to the Texas Music Office site: http://governor.state.tx.us/music/musicians/talent/talent
• The first page lists all the A musicians.
• First, replace the URL in the second line of the scraper.
• Now, take a look at the code. View Source. Try to find where the data begins. How can we identify this area?
• I found that both of these worked to identify the root.cssselect:
  "div[style='text-align: center'] tr"
  "table tr"
• The tds will be the same. We are looking for table cell data.
• The if statement should be changed to the number of columns in this table.
• Adjust the data items you are pulling out. You can drop the integer conversion if your info is all text.
• Modify the save command with the name of your data's unique identifier.
• Make sure the scraper is saved. You should be able to go back to the scraper profile page and download the CSV. (A sketch of what the forked scraper might look like appears after this section.)

There are ways to have the scraper start again and do the next page in the sequence. You would set up a loop and identify the pattern. We aren't going to do this in class, but it is possible.
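Here is one way the forked "music" scraper might look. This is a sketch, not code from the handout: the column count and field names are assumptions (marked in the comments) that you would adjust after viewing the page source, and the URL list shows the loop pattern for scraping several pages, with only the first, real URL filled in.

import scraperwiki
import lxml.html

# The handout gives only the first (A musicians) page. To scrape more
# pages, add their URLs here once you identify the site's pattern.
urls = [
    "http://governor.state.tx.us/music/musicians/talent/talent",
]

for url in urls:
    html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(html)
    # "table tr" was one of the selectors that worked for this page.
    for tr in root.cssselect("table tr"):
        tds = tr.cssselect("td")
        # ASSUMPTION: replace 3 with the table's actual column count.
        if len(tds) == 3:
            data = {
                # ASSUMPTION: field names and cell positions are
                # illustrative; match them to the columns you see.
                'name': tds[0].text_content(),
                'city': tds[1].text_content(),
                'genre': tds[2].text_content()
            }
            # The unique key is now this data's identifier, not 'country'.
            scraperwiki.sqlite.save(unique_keys=['name'], data=data)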


Exercise:

Now try it yourself. Go to this page: http://www.infoplease.com/ipea/A0004420.html

It has the top 100 newspapers and their circulation. Copy and modify the Python program above to extract the rank, newspaper name and circulation from the page.

Hint: You will need to change the following:
• URL
• root.cssselect items
• if statement number of columns
• data elements (you will have three: rank, newspaper, circulation)
• unique_keys in the sqlite save statement

As you can see, you don't have to be an experienced programmer to use these techniques if you have some idea of what the program is doing. This is just the beginning. Each page has unique challenges in scraping the data and may require more advanced techniques. But this is a good introduction to how you can use a basic program to scrape from a site.
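If you get stuck on the exercise, here is a skeleton to check your work against. It is a sketch only: the selector and column count are placeholders (marked ASSUMPTION) that you have to work out from the page source, which is the point of the exercise.

import scraperwiki
import lxml.html

html = scraperwiki.scrape("http://www.infoplease.com/ipea/A0004420.html")
root = lxml.html.fromstring(html)

# ASSUMPTION: replace "table tr" with the selector you find by viewing
# the page source, just as we did for the music example.
for tr in root.cssselect("table tr"):
    tds = tr.cssselect("td")
    # ASSUMPTION: replace 3 with the real number of columns in the table.
    if len(tds) == 3:
        data = {
            'rank': int(tds[0].text_content()),
            'newspaper': tds[1].text_content(),
            # Circulation figures often contain commas, so keep the text
            # as-is (or strip the commas before converting to an integer).
            'circulation': tds[2].text_content()
        }
        # Rank works as the unique key: every row has a different one.
        scraperwiki.sqlite.save(unique_keys=['rank'], data=data)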
