19.05.2014 Views

Measuring the Goals and Incentives of Local Chinese Officials



county-level counties (xian), and 39 are county-level districts (qu). Thirty-four counties are located in West China, 34 in Central China, and 35 in East China (see Appendix A for names of selected counties). 5

Web pages for the 100 counties were identified by starting at the home page URL and following all internal links. 6 This yielded a total of 1,927,412 links for the 100 sites, with 1,469,715 internal web pages. The number of links per website ranged from 116 to 333,321, and the number of internal HTML web pages ranged from 18 to 129,646 for each website (see Table 2). Only links with HTML, XML, or plain text were included in the

Figure 2: Summary of Links Retrieved per Root URL

[Density plot: Distribution of Links Retrieved from Root URLs. Horizontal axis: number of links (0 to 300,000); vertical axis: density. Series: HTML links (median = 5,015), internal links (median = 6,989), all links (median = 8,652).]

analysis. 7 For sites with more than 1,000 text links, 1,000 links were randomly selected for web scraping. For websites with 1,000 or fewer text links, all internal text links were scraped. Data from a total of 80,161 web pages were collected. For all textual analysis conducted in this paper, only counties with more than 100 web pages containing Chinese content are included, resulting in 71 counties.
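The sampling and inclusion rules described above can be illustrated with a minimal Python sketch. It is not the code used for the paper: the function names, the input structure, and in particular the test for "Chinese content" (here approximated as the presence of at least one CJK Unified Ideograph) are assumptions made for illustration.

```python
import random
import re

MAX_LINKS_PER_SITE = 1000  # cap on scraped text links per site, as stated in the text
MIN_CHINESE_PAGES = 100    # a county needs MORE than this many Chinese pages

# Assumption: a page "contains Chinese content" if it has at least one
# character in the basic CJK Unified Ideographs range.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def select_links(text_links, rng):
    """Return all links if there are 1,000 or fewer, else a random sample of 1,000."""
    links = list(text_links)
    if len(links) > MAX_LINKS_PER_SITE:
        return rng.sample(links, MAX_LINKS_PER_SITE)
    return links


def has_chinese(text):
    """Hypothetical helper: True if the text contains any CJK ideograph."""
    return CJK_PATTERN.search(text) is not None


def counties_for_analysis(pages_by_county):
    """Keep counties with more than 100 scraped pages containing Chinese text.

    pages_by_county maps a county name to a list of page texts (assumed layout).
    """
    return [
        county
        for county, pages in pages_by_county.items()
        if sum(has_chinese(page) for page in pages) > MIN_CHINESE_PAGES
    ]
```

Passing a seeded generator such as `random.Random(0)` to `select_links` makes the sample reproducible; the text does not say whether a seed was fixed in the original analysis.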

For these 71 counties, I collect data on county characteristics as well as data on the province to which the county belongs. County characteristics include whether the

5 Regional designations based on official Chinese government definition.

6 Internal web pages are web pages with the same root URL as the home page of the website. Links were obtained using Python and Scrapy.

7 Links to PDFs, Microsoft Office documents such as Word, Excel, and PowerPoint, images, Flash, JavaScript, and audio were excluded because of the difficulty of parsing these variable formats in an automated fashion.
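The definitions in footnotes 6 and 7 can be made concrete with a short sketch. It uses only the Python standard library rather than Scrapy, and it rests on two assumptions not stated in the paper: "same root URL" is approximated as the same host name, and excluded formats are detected by file extension (the extension list is illustrative, and the original crawl may have inspected content types instead). The URL `www.example.gov.cn` is a placeholder.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Assumption: illustrative extensions for the excluded formats in footnote 7.
EXCLUDED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",  # documents
    ".jpg", ".jpeg", ".png", ".gif", ".bmp",                    # images
    ".swf", ".js",                                              # Flash, JavaScript
    ".mp3", ".wav", ".wma",                                     # audio
}


def is_internal(link, home_url):
    """True if the link resolves to the same host as the home page."""
    absolute = urljoin(home_url, link)
    return urlparse(absolute).netloc.lower() == urlparse(home_url).netloc.lower()


def is_text_link(link):
    """True unless the URL path ends in one of the excluded extensions."""
    path = urlparse(link).path.lower()
    return posixpath.splitext(path)[1] not in EXCLUDED_EXTENSIONS


def internal_text_links(links, home_url):
    """Resolve links against the home page and keep internal, text-like ones."""
    return [
        urljoin(home_url, link)
        for link in links
        if is_internal(link, home_url) and is_text_link(link)
    ]


home = "http://www.example.gov.cn/"  # placeholder home page URL
found = ["/news/1.html", "report.pdf", "http://other.example.com/a.html",
         "info/index.htm", "/js/app.js"]
kept = internal_text_links(found, home)
```

In this example `kept` holds only the two internal HTML pages; the PDF, the script, and the link to another host are dropped.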

