Advanced Online Media
Dr. Cindy Royal
Texas State University - San Marcos
School of Journalism and Mass Communication

Web Scraping

Scraping is a process by which you can extract data from an html page into a CSV or other format, so you can work with it in Excel and use it in your visualizations. Sometimes you can just copy and paste data from an html table into Excel, and all will be fine. Other times, not. So you have to create a computer program that allows you to identify and scrape the data you want. There are numerous ways in which you can do this, some that involve programming and some that don't.

Chrome Scraper Extension

We used this when we were working with the SXSW data to make your infographics. You download the Scraper extension and try to get it to find the data you want on the page. You may have to modify the XPath (as we did) to have it select just the right data. Chrome has another extension called xpathOnClick that can help you determine the XPath. This is quick and easy and often provides the right solution.

Review: Remember, we used the following XPath to generate the list of items on the page after we search for something:

//a[@class="link_item link_itemInteractive"]

I found that by using the xpathOnClick extension on an item. You install it; then you see an XPath tool on the top right. Before using the tool, choose View, Developer, JavaScript Console. Then click the XPath button and click on one of the elements (you may have to click twice). The console will show you the XPath, and you can try variations in the Scraper window until you get what you need.
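To see what an XPath like that actually matches, here is a small sketch (not part of the original handout) that evaluates the same expression with lxml, the Python library used later in this handout. The HTML snippet is made up for illustration, and the code uses Python 2 print syntax to match the scraper examples below.

import lxml.html

html = """
<html><body>
<a class="link_item link_itemInteractive" href="/one">First item</a>
<a class="link_item link_itemInteractive" href="/two">Second item</a>
<a class="other_link" href="/three">Not selected</a>
</body></html>
"""

root = lxml.html.fromstring(html)

# The same XPath as above: every <a> whose class attribute is exactly
# "link_item link_itemInteractive". The third link does not match.
for link in root.xpath('//a[@class="link_item link_itemInteractive"]'):
    print link.text_content(), link.get("href")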


OutWitHub

This is a Firefox extension as well as a desktop app: https://www.outwit.com/. When you choose Download, use the Firefox Extension version. It's very powerful. You can use it to scrape various types of data from a page. Download the extension, then go to a page with data and open OutWit Hub (it can be found under Tools). It will first show you the page, but if you click on "table", you should be able to see the results of what is in the table. The column on the right allows you to export to CSV. The free, light version limits you to 100 files per download, so you can either do multiple requests or pay for an upgrade.

Try it with this site: http://en.wikipedia.org/wiki/List_of_blogs

It works pretty well with Wikipedia pages. It works nicely to find data in tables, but you can also use it to find lists and images on a page.

Programming a Scraper

These techniques are very powerful and may be all that you need. But let's take a quick look at how you might program your own scraper. You can use any Web programming language to write a scraper, like Python, Ruby or PHP. We are going to look at an example using Python. You can often find a scraper and modify it for your own purposes.

We are going to use ScraperWiki (scraperwiki.com) as the environment for the scraper. ScraperWiki offers a code simulator that keeps you from having to install these programs on your computer, and it provides all sorts of other features that allow you to save and export the data and keep the scraper for later.

1. Go to ScraperWiki and set up an account. It's free. Choose SignUp near the top of the page.
2. You should be able to click on New Scraper to start a scraper. We will be using the following tutorial: https://scraperwiki.com/docs/python/python_intro_tutorial/
3. Follow the instructions for copying each segment of code to the screen, then choose Run. Use the Copy button. You should see it run and not encounter any errors. Do each part separately, then add the next, so you can troubleshoot any errors.
4. We are scraping data from this page: http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm
   It is on the Web Archive site, providing the number of years students spend in school in various countries.

import scraperwiki

html = scraperwiki.scrape("http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm")
print html
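If you want to experiment outside ScraperWiki, you can fetch the same page with Python's standard library. This sketch is not from the handout; it assumes Python 2 (matching the print statements in the scraper code) and simply reproduces what scraperwiki.scrape() does here: an HTTP GET that returns the raw HTML as a string.

import urllib2

url = "http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm"

# Fetch the page and read the response body into a string of raw HTML.
html = urllib2.urlopen(url).read()
print html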


import lxml.html

root = lxml.html.fromstring(html)

for tr in root.cssselect("div[align='left'] tr"):
    tds = tr.cssselect("td")
    if len(tds) == 12:
        data = {
            'country': tds[0].text_content(),
            'years_in_school': int(tds[4].text_content())
        }
        print data

Take a look at the code. There are several elements.

• Import the scraperwiki library. That allows the program to open the URL.
• Scrape the html page.
• Print the html page. If this works, then you know you have access to the page.
• Import the lxml library that does the scraping.
• Establish the root of your document at the html tag.
• For loop: this loop goes through all the tr elements that you identify in the root.cssselect area. You have to look at the page's code to find the marker elements that will let you find the data. In this case, there is a div with a certain alignment, and below that, we want the tr.
• tds is an array that selects the td elements in the table.
• The if statement further refines it to the table that has 12 columns.
• Then the data variable (hash) is set up to extract elements in the tds array. Remember, arrays start with 0. We get the text_content() of the cell. The second one converts the cell to an integer, if you need it to do that.
• The final line prints what is in the data variable (hash).

You should see the data now in the console, if you created the program properly. Save the scraper.

To save the data in the ScraperWiki datastore, you can do so by replacing the print data command with this:

scraperwiki.sqlite.save(unique_keys=['country'], data=data)

You should be able to go to the scraper profile and download the data. This information can be changed for different pages from which you are trying to extract. If you go back to the scraper profile page, you can choose Download to get a CSV of the data.
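Putting those pieces together, the whole scraper reads as follows. This is just the handout's code assembled into one script, with the print data line already swapped for the datastore save described above.

import scraperwiki
import lxml.html

# Fetch the archived UN statistics page as a string of raw HTML.
html = scraperwiki.scrape("http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm")

root = lxml.html.fromstring(html)

# Walk every table row under the left-aligned div and keep only rows
# from the 12-column data table.
for tr in root.cssselect("div[align='left'] tr"):
    tds = tr.cssselect("td")
    if len(tds) == 12:
        data = {
            'country': tds[0].text_content(),
            'years_in_school': int(tds[4].text_content())
        }
        # Save each row to the ScraperWiki datastore, keyed by country.
        scraperwiki.sqlite.save(unique_keys=['country'], data=data)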


Python is space sensitive, so you have to indent things properly. If you get any error messages in the console, read them. They will help you troubleshoot any problems.

A New Example

Now, let's try this with some other data. We will reuse the code from this scraper. On your scraper profile, choose Copy. This is also known as forking. It will create an exact copy of the code in a new scraper. You should click on the Edit pencil next to the name to give it a new name. Call this one "music."

• Let's go to the Texas Music Office site: http://governor.state.tx.us/music/musicians/talent/talent
• The first page lists all the A musicians.
• First, replace the URL in the second line of the scraper.
• Now, take a look at the code. View Source. Try to find where the data begins. How can we identify this area?
• I found that both of these worked to identify the root.cssselect:
  "div[style='text-align: center'] tr"
  "table tr"
• The tds will be the same. We are looking for table cell data.
• The if statement should be changed to the number of columns in this table.
• Adjust the data items you are pulling out. You can drop the integer conversion if your info is all text.
• Modify the save command with the name of your data's unique identifier.
• Make sure the scraper is saved. You should be able to go back to the scraper profile page and download the CSV. (A sketch of what the forked scraper might look like appears after this section.)

There are ways to have the scraper start again and do the next page in the sequence. You would set up a loop and identify the pattern. We aren't going to do this in class, but it is possible.
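Here is one way the forked "music" scraper might look. This is a sketch, not code from the handout: the column count and field names are assumptions (marked in the comments) that you would adjust after viewing the page source, and the URL list shows the loop pattern for scraping several pages, with only the first, real URL filled in.

import scraperwiki
import lxml.html

# The handout gives only the first (A musicians) page. To scrape more
# pages, add their URLs here once you identify the site's pattern.
urls = [
    "http://governor.state.tx.us/music/musicians/talent/talent",
]

for url in urls:
    html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(html)
    # "table tr" was one of the selectors that worked for this page.
    for tr in root.cssselect("table tr"):
        tds = tr.cssselect("td")
        # ASSUMPTION: replace 3 with the table's actual column count.
        if len(tds) == 3:
            data = {
                # ASSUMPTION: field names and cell positions are
                # illustrative; match them to the columns you see.
                'name': tds[0].text_content(),
                'city': tds[1].text_content(),
                'genre': tds[2].text_content()
            }
            # The unique key is now this data's identifier, not 'country'.
            scraperwiki.sqlite.save(unique_keys=['name'], data=data)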


Exercise:

Now try it yourself. Go to this page: http://www.infoplease.com/ipea/A0004420.html

It has the top 100 newspapers and their circulation. Copy and modify the Python program above to extract the rank, newspaper name and circulation from the page.

Hint: You will need to change the following:
• URL
• root.cssselect items
• if statement number of columns
• data elements (you will have three: rank, newspaper, circulation)
• unique_keys in the sqlite save statement

As you can see, you don't have to be an experienced programmer to use these techniques if you have some idea of what the program is doing. This is just the beginning. Each page has unique challenges in scraping the data and may require more advanced techniques. But this is a good introduction to how you can use a basic program to scrape from a site.
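If you get stuck on the exercise, here is a skeleton to check your work against. It is a sketch only: the selector and column count are placeholders (marked ASSUMPTION) that you have to work out from the page source, which is the point of the exercise.

import scraperwiki
import lxml.html

html = scraperwiki.scrape("http://www.infoplease.com/ipea/A0004420.html")
root = lxml.html.fromstring(html)

# ASSUMPTION: replace "table tr" with the selector you find by viewing
# the page source, just as we did for the music example.
for tr in root.cssselect("table tr"):
    tds = tr.cssselect("td")
    # ASSUMPTION: replace 3 with the real number of columns in the table.
    if len(tds) == 3:
        data = {
            'rank': int(tds[0].text_content()),
            'newspaper': tds[1].text_content(),
            # Circulation figures often contain commas, so keep the text
            # as-is (or strip the commas before converting to an integer).
            'circulation': tds[2].text_content()
        }
        # Rank works as the unique key: every row has a different one.
        scraperwiki.sqlite.save(unique_keys=['rank'], data=data)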
