19.05.2014 Views

Measuring the Goals and Incentives of Local Chinese Officials



this character against the lexicon to see if C1 is a uni-gram word. Then, it checks to see if C1C2 is a bi-gram word, and continues to do so until the sequence of characters is longer than the longest sequence containing that string of characters in the lexicon. The most plausible word is the longest, or maximum, match. If this match were C1C2, the algorithm would repeat the process starting with C3 (Chen and Liu, 1992; Cheng, Young and Wong, 1999).
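The simple maximum matching procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this paper: the function name `simple_max_match`, the toy lexicon, and the use of Latin letters in place of Chinese characters are all assumptions, as is the fallback of treating an unmatched character as a one-character word.

```python
def simple_max_match(text, lexicon):
    """Greedy forward maximum matching (illustrative sketch).

    At each position, grow the candidate one character at a time and
    keep the longest candidate found in the lexicon. A character with
    no match is emitted as a one-character word (an assumption).
    """
    max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: the single character
        # Try C1, C1C2, ... up to the longest entry in the lexicon.
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in lexicon:
                match = text[i:j]  # a longer match replaces a shorter one
        words.append(match)
        i += len(match)  # restart right after the matched word
    return words


# Hypothetical toy lexicon; letters stand in for characters.
toy_lexicon = {"AB", "ABC", "CDE", "FG"}
print(simple_max_match("ABCDEFG", toy_lexicon))  # ['ABC', 'D', 'E', 'FG']
```

Because the match is greedy, taking "ABC" first leaves "D" and "E" stranded as single characters, even though the lexicon also allows the split "AB", "CDE", "FG".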

Complex maximum matching finds all possible three-word sequences starting with C1 and takes as the most plausible word the first word of the three-word sequence containing the most characters in total. For example, in the string given above, let’s say there are three possible three-word chunks:

C1 C2 C3 C4

C1 C2 C3 C4 C5

C1 C2 C3 C4 C5 C6

The longest three-word sequence is the third combination, and as a result, C1C2 will be considered the correct word. The algorithm will then move to C3 and restart this process.
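The three-word chunk rule can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions rather than the implementation used here: the function name `complex_max_match` and the toy lexicon are hypothetical, a single character is always allowed as a candidate word, chunks may contain fewer than three words at the end of the string, and ties in total length are broken in favor of the chunk with fewer words (the text above does not specify a tie-breaking rule).

```python
def complex_max_match(text, lexicon):
    """Three-word-chunk maximum matching (illustrative sketch)."""
    max_len = max((len(w) for w in lexicon), default=1)

    def candidates(start):
        # Assumption: a single character is always a possible word.
        words = [text[start]]
        for end in range(start + 2, min(len(text), start + max_len) + 1):
            if text[start:end] in lexicon:
                words.append(text[start:end])
        return words

    def chunks(start, depth):
        # Yield every sequence of up to `depth` words beginning at `start`.
        if depth == 0 or start >= len(text):
            yield ()
            return
        for word in candidates(start):
            for rest in chunks(start + len(word), depth - 1):
                yield (word,) + rest

    def score(chunk):
        # Primary: total characters covered. Tie-break (assumption): fewer words.
        return (sum(len(w) for w in chunk), -len(chunk))

    words = []
    i = 0
    while i < len(text):
        best = max(chunks(i, 3), key=score)
        words.append(best[0])  # keep only the first word of the best chunk
        i += len(best[0])      # then restart after that word
    return words


# Hypothetical toy lexicon; letters stand in for characters.
toy_lexicon = {"AB", "ABC", "CDE", "FG"}
print(complex_max_match("ABCDEFG", toy_lexicon))  # ['AB', 'CDE', 'FG']
```

On this toy input, the chunk "AB", "CDE", "FG" covers all seven characters, so "AB" is chosen as the first word even though "ABC" is the longer single match.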

Based on the segmentation, a dictionary of 77,855 unique words emerged, and a total of 1,612 stopwords were identified. Stopwords included parts of speech that do not convey substantive content, for example particles, prepositions, pronouns, conjunctions, and noun classifiers. This list is based on stopwords used by baidu.com, China’s largest internet search engine.¹⁵ Noun classifiers show the conceptual classification of the referent of the noun. For example, in Chinese, the noun classifier for humans is “ge,” such that “3 teachers” is “3-ge teachers”; the noun classifier for birds is “zhi,” so that “3 birds” is “3-zhi birds”; and the noun classifier for things that are large and thin, such as pieces of paper or a table top, is “zhang,” such that “3 tables” is “3-zhang tables.” There are hundreds of noun classifiers in Chinese. Words that appeared in more than 95% (49,641) of documents were also removed. Perhaps removing these extremes is redundant with removing stopwords, which are words that appear at high frequency, but I erred on the side of conservatism to ensure that words appearing frequently on websites, such as “web,” “web page,” “back,”

15 Downloaded from http://wenku.baidu.com/view/982a25c608a1284ac85043fa.html.
