Measuring the Goals and Incentives of Local Chinese Officials


    import BeautifulSoup
    import re
    import chardet
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    # Matches a line break followed by whitespace
    Newlines = re.compile(r'[\r\n]\s+')

    def getPageText(url):
        data = br.open(url).read()
        # Decode using the detected encoding, ignoring undecodable bytes
        dataD = data.decode(chardet.detect(data)['encoding'], 'ignore')
        bs = BeautifulSoup.BeautifulSoup(
            dataD,
            convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES,
            fromEncoding=chardet.detect(data)['encoding'])
        # Drop script elements before extracting the text
        for s in bs.findAll('script'):
            s.replaceWith('')
        txt = bs.find('body').getText('|')
        txtD = txt.encode('utf-8')
        return Newlines.sub('|', txtD)

    link = raw_input()

    import time
    import random

    # Sleep for a random interval between 1 and 5000 seconds before fetching
    sec = random.uniform(1, 5000)
    time.sleep(sec)

    text = {}
    try:
        text[link] = getPageText(link)
    except:
        text[link] = "||||ERROR||||" + str(sys.exc_info()[0])

C Data Cleaning Methods

The raw content of these web pages was prepared for text analysis by removing non-Chinese characters, segmenting the Chinese characters to identify words, and removing stopwords. All non-Chinese and alphanumeric characters were removed from the raw output of the web scraping to ease parsing and tokenization. Content in languages such as Tibetan, Uyghur (a Turkic language used in Xinjiang), and Korean (used in Jilin province near the border with Korea) was removed.¹²

¹² Existing methods of multilingual processing remain limited, especially for languages that differ as much as Chinese, Korean, Tibetan, and Uyghur.

Chinese words can be composed of single or multiple characters, but there are no
white spaces to delineate the boundaries between words. As a result, word segmentation is often the first step in Chinese language processing. Although the Chinese character corpus since antiquity comprises well over 20,000 characters, only around 10,000 are commonly in use today. However, estimates of the total number of Chinese words range from 50,000 to 500,000. The Hanyu Da Zidian, a compendium of Chinese characters, includes 54,678 entries for characters. The CC-CEDICT project contains 97,404 contemporary entries, including idioms, technology terms, and names of political figures, businesses, and products. The 2006 SIGHAN Chinese Language Processing Bakeoff training data contained 509,000 words (Chang, Galley and Manning, 2008).¹³

This large range in the number of Chinese words results from the difficulty of identifying words in Chinese text: almost all characters can be uni-gram words that form multi-gram words when joined to other characters, which makes segmentation ambiguous. Even for human readers who are native speakers, agreement on segmentation is only 75% (Wu and Fung, 1994; Sproat et al., 1996).

There are two general approaches to Chinese word segmentation, lexicon-based and probabilistic,¹⁴ though in recent years hybrids of the two approaches have been used (Chang, Galley and Manning, 2008; Chen and Liu, 1992; Cheng, Young and Wong, 1999; Lafferty, McCallum and Pereira, 2001; Peng, Feng and McCallum, 2004; Teahan et al., 2000; Tseng et al., 2005). Comparison of lexicon-based and probabilistic segmentation has found that the two methods yield similar results in terms of precision and in-vocabulary word recall (Chang, Galley and Manning, 2008).

For this analysis, simple and complex maximum matching with pre-defined rules to resolve ambiguities is used for segmentation. Let represent the white space between words and let $C_n$, where $n = 1, \ldots, 6$, represent characters in a string.
The unsegmented text appears as:

$C_1 C_2 C_3 C_4 C_5 C_6$

Simple, forward maximum matching starts on the left of the string with $C_1$ and checks

¹³ SIGHAN is a Special Interest Group of the Association for Computational Linguistics. Its annual Bakeoffs evaluate Chinese word segmentation.
¹⁴ The lexicon-based method is also referred to as the dictionary-based method.
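The cleaning and segmentation steps described in this appendix can be illustrated with a minimal Python 3 sketch. It is not the code used in the analysis. The Unicode range taken to mean "Chinese characters", the toy lexicon, the function names keep_chinese and forward_max_match, and the four-character cap on word length are all assumptions made for illustration. The matcher is the textbook greedy longest-match form of simple forward maximum matching; it omits complex maximum matching and any ambiguity-resolution rules.

```python
import re

# Assumption: "Chinese characters" means the CJK Unified Ideographs
# block U+4E00..U+9FFF; the paper does not state the range it used.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def keep_chinese(text):
    """Delete every run of characters outside the assumed CJK range."""
    return NON_CHINESE.sub("", text)


def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation (hypothetical helper).

    At each position, take the longest substring (up to max_len
    characters) found in the lexicon; if none matches, emit the
    single character at that position as its own word.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon and input, invented for this example only.
lexicon = {"中国", "人民", "政府", "人民政府"}
raw = "2009年 中国人民政府 website"

cleaned = keep_chinese(raw)
segments = forward_max_match(cleaned, lexicon)
print(cleaned)
print("|".join(segments))
```

With this toy lexicon the sample string reduces to 年中国人民政府 and is segmented as 年|中国|人民政府. Because the matcher commits to the longest word at each position, it picks 人民政府 over 人民 followed by 政府; resolving cases where that greedy choice is wrong is the job of the ambiguity rules, which this sketch does not attempt.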
