
International Journal of Advanced Intelligence

Volume 3, Number 2, pp. 287-294, July 2011.

© AIA International Advanced Information Institute

A Study on Features of the CRFs-based Chinese Named Entity Recognition

Huanzhong Duan, Yan Zheng

Beijing University of Posts and Telecommunications, Beijing, 100876 China

hzhduan@gmail.com, yanzheng@bupt.edu.cn

Received (20 February 2011)

Revised (31 March 2011)

This paper studies the features used in Chinese Named Entity Recognition (CNER) based on Conditional Random Fields (CRFs). These features, which include common attributes, feature templates of varying window size and sequence label sets, are very important for CNER, and using them individually or in combination can greatly improve CNER performance. The paper aims to provide a reference for selecting CNER features through a series of experiments. The experimental results show that appropriate features or combinations of them, such as the single Chinese character, part-of-speech (POS), and prefix and suffix information, raise the F-measure of a CRFs-based system. The results also indicate that selecting a suitable feature template and sequence label set not only improves CNER performance but also shortens model training and reduces system resource consumption.

Keywords: Chinese named entity recognition; feature template; Conditional Random Fields.

1. Introduction

Named Entity Recognition (NER) was first introduced as a subtask at the Sixth Message Understanding Conference (MUC-6) in November 1995 [1]. At that conference, named entities were defined as the proper nouns that people are interested in, together with specific numerical expressions. The Named Entity (NE) is a basic information unit in Natural Language Processing (NLP), and recognizing it is one of the fundamental problems in many NLP applications, such as information extraction, question answering, machine translation, automatic summarization and data mining. In a narrow sense, NEs consist of person names, locations, organizations, etc.; in a broader sense, they also include time and numerical expressions.

NER poses many difficult problems. First, NEs form an open class with so many members that it is very hard to enumerate them all. Second, the class is not stable: previously unknown entities emerge frequently over time. In addition, there are not yet commonly agreed criteria for defining NEs in this field.

Correctly identifying all named entities is a very difficult task in any language, and the level of difficulty depends on the diversity of language settings. Chinese has many complicating properties, for example complex composition forms, uncertain length and boundaries, and nested NE definitions. Therefore, CNER is a difficult task, and there is still a long way to go in this field.

At present, there are two commonly used approaches to NER: rule-based and statistics-based. Each has advantages and disadvantages. The rule-based approach is fast and performs well on a small test corpus, but it is difficult to port and hard to apply universally. The statistics-based approach is easier to port and depends only weakly on the language, but it needs a large training corpus and more system resources. CRFs [2] are a statistics-based sequence labeling model with a strong ability to integrate arbitrary kinds of features, and are thus widely used in NLP and other fields. Previous SIGHAN [3] NER evaluation results show that systems based on CRFs achieve better performance. This is the main reason why CRF++ [4] is chosen as the tool for the CNER task.

Because CRFs can integrate arbitrary kinds of features, and these features play an important role during training, the features used with CRFs become one of the key factors affecting NER performance. The features of CNER include not only internal features from the context, such as character information, POS and boundary, but also external features based on statistical results, such as the prefix of Chinese person names (the surname) and the suffixes of location and organization names. In addition, the feature template is also found to play an important role in CNER.

2. Feature Template and Feature Set

2.1. Feature Template

To define the feature functions, a set of real-valued observation features b(x, i) is first constructed. This set describes the distribution observed in the training data and should also be reflected by the model distribution. Each feature function takes the value of one of these observation features b(x, i) when the current state (for a state function) or the previous and current states (for a transition function) take particular values, as follows:

$$
f(y_{i-1}, y_i, x, i) =
\begin{cases}
b(x, i), & \text{if } y_{i-1} \text{ is the previous tag and } y_i \text{ is the current tag} \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

where the feature function takes the real value of the observation feature $b(x, i)$ if $y_{i-1}$ is the previous tag and $y_i$ is the current tag, and is zero otherwise.
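As an illustration of Eq. (1), the following minimal Python sketch builds one such feature function from a pair of tags and an observation feature. The names `make_transition_feature` and `is_surname_li`, the example sentence and the tag values are hypothetical and are not taken from the paper or from CRF++.

```python
def make_transition_feature(prev_tag, cur_tag, b):
    """Build f(y_prev, y_cur, x, i) as in Eq. (1) for one (prev_tag, cur_tag) pair."""
    def f(y_prev, y_cur, x, i):
        # The observation feature b(x, i) fires only for the chosen tag pair.
        if y_prev == prev_tag and y_cur == cur_tag:
            return b(x, i)
        return 0.0
    return f


def is_surname_li(x, i):
    """Hypothetical observation feature: 1.0 if the i-th character is '李'."""
    return 1.0 if x[i] == "李" else 0.0


# Hypothetical example: the character after an outside tag 'S' starts a person name.
f = make_transition_feature("S", "B-PER", is_surname_li)
x = list("见李明")
print(f("S", "B-PER", x, 1))  # tags match and x[1] is '李'
print(f("S", "S", x, 1))      # tags do not match
```

In a trained model each such function would carry its own learned weight; the sketch only shows how the value of a single function is determined.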

CRF++ requires a feature template file in addition to a training data file in its specific format. The choice of feature template can strongly affect the experimental results, and the system resources required for training vary with the template. Following the results of Jun Yu [5] and other references that compare character-based and word-based CNER, character-level feature templates are investigated in this study. Empirically, the window size of the feature template can be set to 3 (the previous, current and next tokens), 5 (the two previous, the current and the two next tokens) or 7 (the three previous, the current and the three next tokens). The form of a feature template is typically defined as:

(i) $C_n$; (ii) $C_n C_{n+1}$; (iii) $C_{-n} C_n$; where n is an integer.
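The three template forms can be written in the `%x[row,col]` macro syntax of a CRF++ template file, where `row` is the offset from the current token and `col` is the attribute column. The sketch below generates unigram (`U`) template lines for a given window size. It is only an illustration of forms (i) to (iii): the paper does not list its nine template sets, and the function name `crfpp_unigram_templates` is hypothetical.

```python
def crfpp_unigram_templates(window, column=0):
    """Generate CRF++ unigram template lines for an odd window size.

    Covers the three forms from the text for one attribute column:
    (i) C_n, (ii) C_n C_{n+1}, (iii) C_{-n} C_n.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = window // 2
    macros = []
    # (i) single tokens C_n for every offset in the window
    for n in range(-half, half + 1):
        macros.append(f"%x[{n},{column}]")
    # (ii) adjacent pairs C_n C_{n+1}
    for n in range(-half, half):
        macros.append(f"%x[{n},{column}]/%x[{n + 1},{column}]")
    # (iii) symmetric pairs C_{-n} C_n around the current token
    for n in range(1, half + 1):
        macros.append(f"%x[{-n},{column}]/%x[{n},{column}]")
    # CRF++ expects a unique identifier after the 'U' prefix on each line.
    return [f"U{k:02d}:{m}" for k, m in enumerate(macros)]


for line in crfpp_unigram_templates(5):
    print(line)
```

A window of 5 yields 11 lines under this scheme (5 single tokens, 4 adjacent pairs, 2 symmetric pairs), and a window of 7 yields 16, which gives a rough sense of how template complexity grows with window size.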

According to the different attributes and the number of label sets required by the training file format, nine feature template sets are chosen here for the experiments and comparisons. The experimental results show that CNER performance can be improved by selecting a suitable template and an appropriate window size for each feature, as discussed later in this paper.

2.2. Feature Set

The training feature set is a very important parameter of CRFs and directly affects the NER results. The main features can be divided into two categories: internal and external. Internal features include the character or word, POS, boundary and other context information. Although these basic internal features have a certain effect on the results, external features are needed to obtain better results. External features come mainly from general statistical information about the corpus, and consist of the prefixes of person names (common Chinese family names), the suffixes of place names and organizational names, and so on. Here, the surname or prefix of a person name is labeled ‘PP’, the suffix of a location ‘LS’ and the suffix of an organizational name ‘OS’.

First, the recognition accuracy of person names can be increased by adding the prefix of Chinese family names (mainly common surnames in China, such as ‘李’, ‘王’, ‘赵’, etc.). For example, the top 200 Chinese surnames are used as the name prefix feature, and person names whose length is between 2 and 4 characters are selected as the Chinese name feature, in accordance with Chinese naming conventions. Second, the suffix of place names (such as ‘国’, ‘村’, ‘路’, ‘港’, etc.) is added as a feature for location recognition: 50 common Chinese characters are chosen as location suffixes by statistics, and place names of at least two Chinese characters are selected as the location formation feature, in accordance with Chinese place-naming conventions. Organizational names, however, have lower recognition accuracy than the previous two, mainly because of their complex structure, which may include a person name or place name, and because their length varies. Therefore, we not only add the suffix of organizational names, such as ‘局’, ‘厅’, ‘司’, ‘院’, ‘部’, etc., but also require the length to be greater than 3. Meanwhile, if the tagged result contains a person name or place name (the result of local labeling) in addition to an organizational suffix, it is treated as an organizational name to obtain a globally consistent result.
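The prefix/suffix attribute described above can be sketched as a simple per-character lookup. The character sets below contain only the examples quoted in the text, not the full lists of 200 surnames and 50 location suffixes, and the `N` value for characters without an affix label is an assumption, since the paper does not name a default value. The function name `affix_label` is hypothetical.

```python
# Only the example characters quoted in the text; the paper's full lists are larger.
SURNAME_PREFIXES = set("李王赵")
LOCATION_SUFFIXES = set("国村路港")
ORG_SUFFIXES = set("局厅司院部")


def affix_label(ch):
    """Return the prefix/suffix attribute for one character.

    'PP', 'LS' and 'OS' follow the paper; 'N' is an assumed placeholder
    for characters that belong to none of the lists.
    """
    if ch in SURNAME_PREFIXES:
        return "PP"
    if ch in LOCATION_SUFFIXES:
        return "LS"
    if ch in ORG_SUFFIXES:
        return "OS"
    return "N"


sentence = "李明在教育部"
print([(ch, affix_label(ch)) for ch in sentence])
```

The length constraints (2 to 4 characters for person names, at least 2 for place names, more than 3 for organizational names) are not modeled here; they would need span-level information that this per-character lookup does not have.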

In addition, three label sets are compared in the experiments. The common segmentation label sets have 3 tags (B, I, S), 4 tags (B, M, E, S) and 6 tags (B, B2, B3, M, E, S). This paper mainly studies the three common named entity types, person name, place name and organizational name, which are tagged ‘PER’, ‘LOC’ and ‘ORG’ respectively. The corresponding CNER label sets are generated by combining the segmentation label sets with the entity tags. For example, for person names the 3-tag set is (B-PER, I-PER, S), the 4-tag set is (B-PER, M-PER, E-PER, S) and the 6-tag set is (B-PER, B2-PER, B3-PER, M-PER, E-PER, S).
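The sketch below shows one way the three label sets could be applied to a character sequence. It rests on two assumptions that the paper does not spell out: the untyped `S` tag marks characters outside any entity, and every entity has at least two characters (consistent with the length norms in Section 2.2). The 6-tag pattern follows the usual convention in which the first three characters receive B, B2 and B3 and the last character always receives E. The function names are hypothetical.

```python
def entity_tags(length, entity, scheme):
    """Tags for one entity of `length` characters under a 3-, 4- or 6-tag scheme."""
    if length < 2:
        raise ValueError("this sketch assumes entities of at least two characters")
    if scheme == 3:
        inner = ["B"] + ["I"] * (length - 1)
    elif scheme == 4:
        inner = ["B"] + ["M"] * (length - 2) + ["E"]
    elif scheme == 6:
        # The last character is always E, so at most length - 1 head tags are used.
        heads = ["B", "B2", "B3"][: length - 1]
        inner = heads + ["M"] * (length - 1 - len(heads)) + ["E"]
    else:
        raise ValueError("scheme must be 3, 4 or 6")
    return [f"{t}-{entity}" for t in inner]


def tag_sentence(n_chars, spans, scheme):
    """Tag a sentence of n_chars characters; spans are (start, end_exclusive, type)."""
    tags = ["S"] * n_chars  # assumption: S marks characters outside any entity
    for start, end, entity in spans:
        tags[start:end] = entity_tags(end - start, entity, scheme)
    return tags


# Hypothetical five-character sentence with a PER span and a LOC span.
for scheme in (3, 4, 6):
    print(scheme, tag_sentence(5, [(0, 2, "PER"), (3, 5, "LOC")], scheme))
```

For short entities the three schemes differ little; the extra tags of the 6-tag set only come into play for entities of three or more characters.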

3. Experiments and Discussions

3.1. Experimental preparation

Since the main objective of this paper is to study CNER for person names, place names and organizational names, the open-source CRFs software CRF++ (current version CRF++ 0.54) is used as the training and test tool in the experiments.

The corpus used in the experiments is from the People’s Daily of January 1998 and has a size of about 5.11 MB after processing. To obtain less biased results, the ratio of training to test corpus is about 7:3.

Because CRF++ has strict requirements on the corpus format, all corpora are preprocessed at each stage of the experiments with CorpusPreConv, a tool developed by the authors. Furthermore, an evaluation tool written by the authors, NEREvaluator, is used to evaluate the Precision, Recall and F-measure of person names, place names and organizational names respectively.
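The internals of NEREvaluator are not described in the paper. As a rough sketch of what an entity-level evaluation typically computes, the following Python function scores predicted entity spans against gold spans, counting a prediction as correct only when both boundaries and type match exactly. The exact-match criterion, the function name `precision_recall_f1` and the toy spans are assumptions.

```python
def precision_recall_f1(gold, pred):
    """Entity-level scores; each entity is a (start, end, type) tuple."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)  # exact match on boundaries and type (assumed criterion)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: four gold entities, three predictions, two exact matches.
gold = [(0, 2, "PER"), (5, 8, "ORG"), (10, 12, "LOC"), (15, 17, "PER")]
pred = [(0, 2, "PER"), (5, 8, "ORG"), (10, 13, "LOC")]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Scoring each entity type separately, as the paper reports, amounts to filtering both lists by type before calling the function.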

3.2. Experimental results and discussions

In this study, three commonly used attributes (the single Chinese character (char) itself, POS and prefix/suffix (p/s)) and their combinations, together with three segmentation label sets (3 tags: BIS; 4 tags: BMES; 6 tags: B B2 B3 M E S), are investigated in a series of CNER experiments covering person (PER) name recognition, location (LOC) name recognition and organization (ORG) name recognition, with three window sizes of the feature template (size = 3, 5, 7). Precision, recall and F-measure are used as the evaluation indicators. Owing to limited space, only the results for a feature template window size of 5 are given in Table 1, in which P denotes precision, R recall and F1 the F-measure.
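To make the notion of attribute columns concrete, the sketch below writes a sentence in the column layout that CRF++ reads: one token per line, whitespace-separated columns with the answer tag last, and an empty line between sentences. The example sentence, its POS values, the `N` placeholder in the prefix/suffix column and the use of `S` for characters outside entities are all hypothetical; the paper's actual output from CorpusPreConv is not shown.

```python
def to_crfpp_rows(sentences):
    """Serialize sentences as CRF++ training data.

    Each sentence is a list of (char, pos, affix, tag) tuples; the last
    column is the label to be learned.
    """
    blocks = []
    for sentence in sentences:
        blocks.append("\n".join("\t".join(columns) for columns in sentence))
    # CRF++ separates sentences with one empty line.
    return "\n\n".join(blocks) + "\n"


# Hypothetical char+POS+p/s example with the 3-tag label set.
example = [
    ("李", "nr", "PP", "B-PER"),
    ("明", "nr", "N", "I-PER"),
    ("在", "p", "N", "S"),
    ("北", "ns", "N", "B-LOC"),
    ("京", "ns", "N", "I-LOC"),
]
data = to_crfpp_rows([example])
print(data)
```

Dropping the second or third element of each tuple would give the char+p/s and char+POS configurations respectively, and keeping only the character and the tag gives the char-only configuration.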

The F-measure values of PER, LOC and ORG under the different conditions are shown in Fig. 1(a) through (f). Fig. 1(a), (b) and (c) show the F1 of PER, LOC and ORG respectively for the same feature template setting with a window size of 5, while Fig. 1(d), (e) and (f) show the corresponding F1 under the same attribute condition (i.e., char+POS+p/s).

Table 1. Results (%) for a feature template window size of 5, by common attributes and their combinations.

NER  PRF | char              | char+POS          | char+p/s          | char+POS+p/s
         | 3tags 4tags 6tags | 3tags 4tags 6tags | 3tags 4tags 6tags | 3tags 4tags 6tags
PER  P   | 96.7  94.9  94.9  | 94.8  94.7  94.0  | 96.5  93.7  92.9  | 95.4  95.2  95.2
PER  R   | 73.0  72.8  72.0  | 75.9  75.3  75.8  | 80.5  78.6  77.8  | 85.7  85.6  85.4
PER  F1  | 83.2  82.4  81.8  | 84.3  83.9  83.9  | 87.8  85.5  84.7  | 90.3  90.1  90.0
LOC  P   | 91.5  91.5  91.5  | 91.3  90.9  91.1  | 94.9  95.1  94.8  | 95.7  95.6  95.2
LOC  R   | 83.6  84.0  83.1  | 84.7  85.0  84.6  | 86.7  87.1  86.8  | 89.3  89.2  89.0
LOC  F1  | 87.4  87.6  87.1  | 87.9  87.9  87.7  | 90.6  90.9  90.6  | 92.4  92.3  92.0
ORG  P   | 88.5  85.8  85.1  | 90.5  90.3  90.4  | 92.6  92.2  91.0  | 93.7  93.6  93.6
ORG  R   | 77.7  76.0  75.7  | 77.2  77.0  77.2  | 85.0  84.9  84.3  | 87.6  87.5  87.5
ORG  F1  | 82.7  80.6  80.1  | 83.3  83.1  83.3  | 88.7  88.4  87.5  | 90.5  90.4  90.4

Table 1 and Fig. 1(a) through (f) show that, with the same feature template and window size (e.g., size = 5) and only the single Chinese character as the attribute column, the precision of PER, LOC and ORG is relatively high, but the recall is low (in some cases below 80%). Combining the Chinese character with POS, or with prefix/suffix information, increases both the recall and the F-measure to some extent. In particular, with all three commonly used attributes, the recall of PER, LOC and ORG reaches about 85%, 89% and 87% respectively, and the precision reaches about 95% in all cases. From the case of a single attribute column (i.e., char) to the combination of three attribute columns, the recall of PER, LOC and ORG increases by about 12, 6 and 10 percentage points respectively, and the corresponding F-measure values rise by about 7, 5 and 8 percentage points. These results indicate that POS and prefix/suffix information play very important roles in training the CRFs model and can serve as useful references for CNER.
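The paper does not state the F-measure formula. Assuming the balanced F1 (the harmonic mean of precision and recall), the reported values can be reproduced from the P and R entries of Table 1, as this worked check for PER with the 3-tag set shows:

```latex
F_1 = \frac{2PR}{P + R}

% PER, char only, 3 tags: P = 96.7, R = 73.0
F_1 = \frac{2 \times 96.7 \times 73.0}{96.7 + 73.0}
    = \frac{14118.2}{169.7} \approx 83.2

% PER, char+POS+p/s, 3 tags: P = 95.4, R = 85.7
F_1 = \frac{2 \times 95.4 \times 85.7}{95.4 + 85.7}
    = \frac{16351.56}{181.1} \approx 90.3
```

Both results agree with the tabulated F1 values, and their difference (about 7.1) corresponds to the F-measure gain for PER quoted in the text. Because precision changes little while recall rises by more than 12 points, almost all of this gain comes from recall.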

Furthermore, Fig. 1(a) through (c) show that CNER results differ when the feature template and attributes are the same but the label set differs. The results of PER, LOC and ORG recognition with the 3-tag set (BIS) as the segmentation label set are generally better than those with the other two sets (4 tags: BMES; 6 tags: B B2 B3 M E S), which implies that a larger tag set is not beneficial here: the more tags, the greater the training size, the slower the training and the more system resources required.
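The link between the number of tags and training cost can be made concrete by counting labels. According to the CRF++ documentation, a unigram template generates L × N feature functions and a bigram template L × L × N, where L is the number of output labels and N the number of distinct strings expanded from the template. The sketch below counts L for each label set. Whether the paper trained one joint model for all three entity types or one model per type is not stated, so the entity types are a parameter; the function name `label_set` is hypothetical.

```python
SEGMENT_TAGS = {
    3: ["B", "I"],
    4: ["B", "M", "E"],
    6: ["B", "B2", "B3", "M", "E"],
}


def label_set(scheme, entity_types=("PER", "LOC", "ORG")):
    """All output labels for a scheme: typed segment tags plus the untyped S."""
    labels = [f"{t}-{e}" for e in entity_types for t in SEGMENT_TAGS[scheme]]
    return labels + ["S"]


for scheme in (3, 4, 6):
    joint = len(label_set(scheme))              # one model for PER, LOC and ORG
    single = len(label_set(scheme, ("PER",)))   # one model per entity type
    print(f"{scheme} tags: L={joint} joint (L*L={joint * joint}), L={single} per type")
```

Under the joint assumption the label count grows from 7 to 10 to 16, so the unigram feature space of the 6-tag set is more than twice that of the 3-tag set for the same template, and the bigram space more than five times.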

Finally, Fig. 1(d) through (f) show that CNER results also differ when the number of tags and the attributes are the same but the window size of the feature template varies. Among these cases, the result for a window size of 5 seems to be better than the others (size = 3, 7). Moreover, the precision, recall and F1 of CNER do not always increase with the window size of the feature template. For example, when the window size increases from 5 to 7, the CNER results are not clearly improved, while the values of some indicators gradually decrease. Similarly, the more complex the feature template is, the longer the training time and the more system resources are needed. Therefore, an appropriately selected feature template is also an important factor for CNER.

Fig. 1. F1 of (a) PER, (b) LOC and (c) ORG under the same feature template with a window size of 5; F1 of (d) PER, (e) LOC and (f) ORG under the same attributes (char+POS+p/s).



4. Conclusions

A series of experiments is conducted on the features of CNER, including the commonly used attributes, feature templates of varying window size and the label sets. The experimental results show that different attributes, or combinations of them, have different effects on CNER under the same feature template and label set. In particular, better results are achieved by combining the commonly used attributes. In the experiments of this paper, for example, the results are better when the 3-tag set (BIS) and a template with a window size of 5 are used. Therefore, appropriately selected attributes, or combinations of them, and an appropriate label set can clearly improve CNER performance. Furthermore, a feature template with a suitable window size is another important factor affecting CNER results. Consequently, considering these factors together can greatly improve CNER performance and correspondingly reduce the system resources consumed.

It is important to note that there are so many features available for CNER that those considered in this study are not sufficient to cover all possible situations. This study is preliminary, and more investigation is needed in future work.

Acknowledgments

This research work was supported by the Fundamental Research Funds for the Central Universities (2009RC0206).

References

1. MUC-6, http://cs.nyu.edu/faculty/grishman/muc6.html
2. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pp. 282-290.
3. SIGHAN, http://sighan.cs.uchicago.edu/, 2010.
4. Taku Kudo. “CRF++: Yet another CRF tool kit”, http://crfpp.sourceforge.net/, 2010.
5. Jun Yu, Xiaoou Chen. Named entity recognition: One-at-a-time Or All-at-once Word-based Or Character-based. Seventh International Conference on Chinese Information Processing, 2007.
6. Chang-Ning Huang and Hai Zhao. Which Is Essential for Chinese Word Segmentation: Character versus Word (Invited paper). The 20th Pacific Asia Conference on Language, Information and Computation (PACLIC-20), pp. 1-12, Wuhan, China, November 1-3, 2006.
7. Hai Zhao and Chunyu Kit. Unsupervised Segmentation Helps Supervised Learning of Character Tagging for Word Segmentation and Named Entity Recognition. The Sixth SIGHAN Workshop on Chinese Language Processing (SIGHAN-6), pp. 106-111, Hyderabad, India, January 11-12, 2008.
8. Hua-Ping ZHANG, Qun LIU. Chinese Named Entity Recognition Using Role Model. Computational Linguistics and Chinese Language Processing, Vol. 8, No. 2, August 2003.
9. HU Wen-bo, DU Yun-cheng, LV Xue-qiang, SHI Shui-cai. A Study on Chinese named entity recognition based on cascaded conditional random fields. Computer Engineering and Application, 45(1), pp. 163-165, 2009.
10. Shumin Shi, Zhiqiang WANG, Lang ZHOU, Chong FENG, Heyan HUANG. Chinese Named Entity Recognition Using Conditional Random Fields Model. Third Academic Conference of Computational Linguistics (ACCL), 2006.
11. Zeng Guanming, Zhang Chuang, Xiao Bo, Lin Zhiqing. CRFs-Based Chinese Named Entity Recognition with improved Tag Set. World Congress on Computer Science and Information Engineering, 2009.

Huanzhong Duan

He is currently a master's student at Beijing University of Posts and Telecommunications (BUPT). His main research interests include Natural Language Processing and Text Mining. No. 10 XiTuCheng Road, Haidian District, Beijing, P.R. China, 100876.

Yan Zheng

She received the Ph.D. degree in 2003 from the Faculty of Computer, Jilin University, China. Since 2003 she has been an associate professor in the Faculty of Computer Sciences, Beijing University of Posts and Telecommunications. Her research interests include Data Mining, Text Mining, Natural Language Processing and Artificial Intelligence. No. 10 XiTuCheng Road, Haidian District, Beijing, P.R. China, 100876.
