Measuring the Goals and Incentives of Local Chinese Officials
# Python 2 listing (raw_input, BeautifulSoup 3).
import sys
import re
import time
import random

import BeautifulSoup
import chardet
import mechanize

# The browser setup is missing from this listing; mechanize is assumed here
# because the code calls br.open() and ends a header list.
br = mechanize.Browser()
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '
                  'Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # Download the page and decode it with the detected character encoding.
    data = br.open(url).read()
    encoding = chardet.detect(data)['encoding']
    dataD = data.decode(encoding, 'ignore')
    bs = BeautifulSoup.BeautifulSoup(dataD,
        convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES,
        fromEncoding=encoding)
    # Drop script elements, then join the remaining body text with '|'.
    for s in bs.findAll('script'):
        s.replaceWith('')
    txt = bs.find('body').getText('|')
    txtD = txt.encode('utf-8')
    return Newlines.sub('|', txtD)

link = raw_input()
# Wait a random interval before requesting the page.
sec = random.uniform(1, 5000)
time.sleep(sec)
text = {}
try:
    text[link] = getPageText(link)
except:
    # Record the failure instead of aborting the crawl.
    text[link] = "||||ERROR||||" + str(sys.exc_info()[0])
C Data Cleaning Methods
The raw content of these web pages was prepared for text analysis by removing non-
Chinese characters, segmenting the Chinese characters to identify words, and removing
stopwords.
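The segmentation and stopword steps can be illustrated with a minimal Python 3 sketch. The segmenter and stopword list actually used are not named in this section, so the example substitutes a toy forward-maximum-matching segmenter; `DICTIONARY`, `STOPWORDS`, `segment`, and `remove_stopwords` are hypothetical names and data chosen only for illustration.

```python
# Toy illustration only: DICTIONARY and STOPWORDS are hypothetical, and
# forward maximum matching stands in for whatever segmenter was actually used.
DICTIONARY = {"经济", "发展", "我们", "促进"}
STOPWORDS = {"的", "我们"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)


def segment(text):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words


def remove_stopwords(words):
    """Drop every token that appears in the stopword list."""
    return [w for w in words if w not in STOPWORDS]


tokens = remove_stopwords(segment("我们促进经济的发展"))
print(tokens)
```

A production pipeline would normally rely on a trained segmenter and a curated stopword list; the greedy matcher here only shows why word boundaries have to be inferred before stopwords can be filtered.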
All non-Chinese characters, including alphanumeric characters, were removed from the raw
output of the web scraping to simplify parsing and tokenization. Content in other
languages, such as Tibetan, Uyghur (a Turkic language used in Xinjiang), and Korean
(used in Jilin province near the border with Korea), was also removed. 12
Chinese words can be composed of single or multiple characters, but there are no
12 Existing methods of multilingual processing remain limited, especially for languages that differ as
much as Chinese, Korean, Tibetan, and Uyghur.