13.07.2015 Views

Web Scraping Scraping is a process by which you can ... - Cindy Royal

Python is whitespace-sensitive, so you have to indent things properly. If you get any error messages in the console, read them. They will help you troubleshoot any problems.

A New Example

Now, let's try this with some other data. We will reuse the code from this scraper. On your scraper profile, choose Copy. This is also known as forking. This will create an exact copy of the code in a new scraper. Click the Edit pencil next to the name to give it a new name. Call this one "music."

• Let's go to the Texas Music Office site: http://governor.state.tx.us/music/musicians/talent/talent
• The first page lists all the "A" musicians.
• First, replace the URL in the second line of the scraper.
• Now, take a look at the page's code. View Source. Try to find where the data begins. How can we identify this area?
• I found that both of these selectors worked in root.cssselect:
    • "div[style='text-align: center'] tr"
    • "table tr"
• The tds will be the same. We are looking for table cell data.
• The if statement should be changed to the number of columns for this table.
• Adjust the data items you are pulling out. You can modify the integer conversion if your info is all text.
• Modify the save command with the name of your data's unique identifier.
• Make sure the scraper is saved. You should be able to go back to the scraper profile page and download the CSV.

There are ways to have the scraper start again and do the next page in the sequence. You would set up a loop and identify the pattern. We aren't going to do this in class, but it is possible.
