19.05.2014 Views

Measuring the Goals and Incentives of Local Chinese Officials


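import sys  # used by sys.exc_info() below; its import is not shown in this excerpt
import BeautifulSoup
import re
import chardet
import mechanize  # assumption: `br` below is a mechanize browser

# Assumption: the lines that create `br` and open the header list fall
# before this excerpt; they are reconstructed here in the usual mechanize idiom.
br = mechanize.Browser()
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '
                  'Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]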

# Matches a line break together with the whitespace that follows it.
Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # Fetch the raw bytes and decode them with the encoding chardet detects.
    data = br.open(url).read()
    dataD = data.decode(chardet.detect(data)['encoding'], 'ignore')
    bs = BeautifulSoup.BeautifulSoup(dataD,
        convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES,
        fromEncoding=chardet.detect(data)['encoding'])
    # Drop <script> elements so that only page text remains.
    for s in bs.findAll('script'):
        s.replaceWith('')
    # Join the text nodes of <body> with '|' and return UTF-8 bytes.
    txt = bs.find('body').getText('|')
    txtD = txt.encode('utf-8')
    return Newlines.sub('|', txtD)

# Read one URL from standard input.
link = raw_input()

import time
import random

# Sleep for a random interval between 1 and 5000 seconds before the request.
sec = random.uniform(1,5000)
time.sleep(sec)

text = {}
try:
    text[link] = getPageText(link)
except:
    # Store the exception type in place of the page text.
    text[link] = "||||ERROR||||" + str(sys.exc_info()[0])
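The effect of the `Newlines` pattern in the listing above can be checked in isolation. The following is a minimal Python 3 sketch, not part of the original script; the function name `collapse_newlines` is hypothetical.

```python
import re

# Same pattern as in the listing: one line break followed by at least one
# further whitespace character (spaces, tabs, or more line breaks).
NEWLINES = re.compile(r'[\r\n]\s+')

def collapse_newlines(text, sep='|'):
    """Replace each line break plus its trailing whitespace with `sep`."""
    return NEWLINES.sub(sep, text)

print(collapse_newlines("a\n  b\r\n\r\nc"))
```

Because the pattern requires whitespace after the line break, a lone line break that is immediately followed by text is left untouched.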

C Data Cleaning Methods

The raw content of these web pages was prepared for text analysis by removing non-Chinese characters, segmenting the Chinese characters to identify words, and removing stopwords.
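As a rough illustration of the segmentation and stopword steps, the sketch below uses greedy forward maximum matching against a word list. This technique is swapped in for illustration only; the text here does not say which segmenter or stopword list was used. The names `segment` and `remove_stopwords`, the toy `VOCABULARY`, and the toy `STOPWORDS` are all hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in `vocabulary`; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words

def remove_stopwords(words, stopwords):
    """Keep only the words that are not in the stopword set."""
    return [w for w in words if w not in stopwords]

# Toy data, invented for this example only.
VOCABULARY = {"我们", "经济", "发展"}
STOPWORDS = {"的", "和"}

tokens = segment("我们的经济和发展", VOCABULARY)
print(tokens)
print(remove_stopwords(tokens, STOPWORDS))
```

Dictionary-based matching of this kind is simple to reason about but depends entirely on the coverage of the word list, so a statistical segmenter may behave differently on the same input.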

All non-Chinese characters, including alphanumeric characters, were removed from the raw output of the web scraping to make parsing and tokenization easier. Content in other languages was removed as well, such as Tibetan; Uyghur, a Turkic language used in Xinjiang; and Korean, which is used in Jilin province near the border with Korea. 12

Chinese words can be composed of single or multiple characters, but there are no

12 Existing methods of multilingual processing remain limited, especially for languages that differ as much as Chinese, Korean, Tibetan, and Uyghur.
