A Study on Features of the CRFs-based Chinese Named Entity ...
International Journal of Advanced Intelligence
Volume 3, Number 2, pp.287-294, July, 2011.
© AIA International Advanced Information Institute
A Study on Features of the CRFs-based Chinese Named Entity Recognition
Huanzhong Duan, Yan Zheng
Beijing University of Posts and Telecommunications, Beijing, 100876 China
hzhduan@gmail.com, yanzheng@bupt.edu.cn
Received (20 February 2011)
Revised (31 March 2011)
This paper studies the features used in Chinese Named Entity Recognition (CNER) based on Conditional Random Fields (CRFs). These features, which include common attributes, feature templates of varying window size, and sequence label sets, are very important for CNER. Taking advantage of these features or their combinations can greatly improve CNER performance. The paper aims to provide a reference for selecting CNER features through a series of experiments. The experimental results show that appropriate features or their combinations, such as the single Chinese character, part-of-speech (POS), and prefix and suffix, can raise the F-measure of a CRFs-based system. The results also indicate that selecting suitable feature templates and sequence label sets not only improves CNER performance but also shortens model training and reduces system resource consumption.
Keywords: Chinese named entity recognition; feature template; Conditional Random Fields.
1. Introduction
Named Entity Recognition (NER) was first introduced as a subtask at the Sixth Message Understanding Conference (MUC-6) in November 1995 [1]. At that conference, named entities were defined as the proper nouns that people are interested in, together with specific numeral nouns. The Named Entity (NE) is the basic information unit in Natural Language Processing (NLP), and recognizing it is one of the fundamental problems in many NLP applications, such as information extraction, question answering, machine translation, automatic summarization and data mining. In a narrow sense, NEs consist of person names, locations, organizations, etc., while in a broader sense they also include time and numerical expressions.
NER poses many difficult problems. First, NEs form an open class whose members are so numerous that it is very hard to enumerate them all. Second, NEs are not a stable class: previously unknown members emerge frequently over time. In addition, there are not yet commonly agreed criteria for defining NEs in this field.
Correctly identifying all named entities is a very difficult task for any language, and the level of difficulty depends on the diversity of language settings.
Chinese has many complicating properties, for example complex composition forms, uncertain length and boundaries, nested NE definitions, and so on. Therefore, CNER is a difficult task, and there is still a long way to go in this field.
At present, there are two commonly used approaches to NER: rule-based and statistics-based. Each has advantages and disadvantages. The rule-based approach is fast and performs well on a small test corpus, but it is difficult to port and hard to apply universally. The statistics-based approach ports more easily and depends only weakly on the language, but it needs a large training corpus and more system resources. CRFs [2] are a statistics-based sequence labeling model with a strong ability to integrate arbitrary features, and are thus widely used in NLP and other fields. Previous SIGHAN [3] NER evaluation results show that systems based on CRFs can achieve better performance. This is the main reason why CRF++ [4] is chosen as the tool for the CNER task.
Because features play an important role during training, the choice of features integrated into CRFs becomes one of the key factors affecting NER performance. The features of CNER include not only internal features from the context, such as character information, POS and boundary, but also external features based on statistical results, such as surnames as the prefix of Chinese person names, the suffixes of locations and organizations, and so on. In addition, the feature template is also found to play an important role in CNER.
2. Feature Template and Feature Set
2.1. Feature Template
To define the feature functions, a set of real-valued observation features b(x, i) is first constructed. This feature set not only reflects the empirical (prior) distribution of the training data, but also the model distribution. Each feature function takes the value of an observation feature b(x, i) when the current state takes a specified value (a state function), or when the previous and current states take specified values (a transition function), as follows:
\[
f(y_{i-1}, y_i, x, i) =
\begin{cases}
b(x, i), & \text{if } y_{i-1} \text{ is the previous tag and } y_i \text{ is the current tag} \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]
where b(x, i) is the value of a real-valued observation feature, returned when $y_{i-1}$ is the previous tag and $y_i$ is the current tag; otherwise the feature function is zero.
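As a minimal sketch of Eq. (1), the feature function can be written as a closure over a tag pair and an observation feature. The names `make_transition_feature` and `b_is_wang`, and the example tags and sentence, are hypothetical and chosen only for illustration; they are not taken from the paper or from CRF++.

```python
from typing import Callable, Sequence

Observation = Callable[[Sequence[str], int], float]


def make_transition_feature(prev_tag: str, cur_tag: str, b: Observation):
    """Build f(y_prev, y_cur, x, i) following Eq. (1).

    The returned function yields b(x, i) only when the tag pair matches
    (prev_tag, cur_tag); in every other case it yields 0.
    """
    def f(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
        if y_prev == prev_tag and y_cur == cur_tag:
            return b(x, i)
        return 0.0
    return f


def b_is_wang(x: Sequence[str], i: int) -> float:
    # Hypothetical observation feature: fires when the current character
    # is the surname character 王.
    return 1.0 if x[i] == "王" else 0.0


f = make_transition_feature("S", "B-PER", b_is_wang)
x = list("记者王明")
print(f("S", "B-PER", x, 2))  # tag pair matches and x[2] is 王
print(f("S", "S", x, 2))      # tag pair does not match
```

A state function is the special case in which the previous tag is ignored, so the same closure pattern applies with only `cur_tag` checked.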
In accordance with the requirements of CRF++, the experiments require a feature template file in addition to a training data file that satisfies the specified format.
The selection of the feature template can largely affect the experimental results. Meanwhile, the system resources required for training vary with the feature template. According to the results of Jun Yu [5] and other references that compared character-based and word-based CNER performance, the character-based feature template is chosen for investigation in this study. Empirically, the window size of the feature template can be set to 3 (the previous, the current and the next token), 5 (the two previous, the current and the two next tokens), or 7 (the three previous, the current and the three next tokens). The form of the feature template is typically defined as:
(i) $C_n$; (ii) $C_n C_{n+1}$; (iii) $C_{-n} C_n$; where n is an integer.
According to the different attributes and the number of label sets required by the training file format, nine feature template sets are chosen here for the experiments and comparisons. The experimental results show that CNER performance can be improved by selecting a suitable template and an appropriate window size for each feature, as discussed later in this paper.
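The three template forms can be illustrated with a small generator that writes them in the CRF++ template syntax (`%x[row,col]` macros, `U` prefixes for unigram templates and a final `B` line for tag bigrams). This is only a sketch: the function name `build_template` and the numbering of the identifiers are assumptions, and the output is not one of the nine template sets used in the paper, which are not listed.

```python
def build_template(window: int, n_columns: int = 1) -> str:
    """Generate a CRF++-style template for an odd window size.

    For every attribute column it emits:
      (i)   single tokens          C_n
      (ii)  adjacent pairs         C_n C_{n+1}
      (iii) symmetric pairs        C_{-n} C_n
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    lines = []
    idx = 0
    for col in range(n_columns):
        # (i) one template per position in the window
        for off in range(-half, half + 1):
            lines.append(f"U{idx:02d}:%x[{off},{col}]")
            idx += 1
        # (ii) each token combined with its right neighbour
        for off in range(-half, half):
            lines.append(f"U{idx:02d}:%x[{off},{col}]/%x[{off + 1},{col}]")
            idx += 1
        # (iii) tokens at the same distance on both sides of the current one
        for n in range(1, half + 1):
            lines.append(f"U{idx:02d}:%x[{-n},{col}]/%x[{n},{col}]")
            idx += 1
    lines.append("B")  # bigram of output tags
    return "\n".join(lines) + "\n"


print(build_template(3))
```

Under these assumptions a window of 3 yields 6 unigram templates per column and a window of 5 yields 11, which gives a rough sense of how template size, and with it training cost, grows with the window.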
2.2. Feature Set
The training feature set of CRFs is a very important parameter that directly affects the NER results. The main features of CRFs can be divided into two categories: internal and external. The former includes the character or word, POS, boundary and other context information. Although these basic internal features have some effect on the results, external features are needed to obtain better results. The external features come mainly from general statistical information about the corpus, and consist of the prefixes of common Chinese family names, the suffixes of place names and organizational names, and so on. Here, the surname or prefix of a person name is labeled ‘PP’, the suffix of a location ‘LS’ and the suffix of an organizational name ‘OS’.
First, the recognition accuracy of person names can be increased by adding the prefix of Chinese family names (mainly common surnames in China, such as ‘李’, ‘王’, ‘赵’, etc.). For example, the top 200 Chinese surnames are used as the name prefix feature, and person names with a length between 2 and 4 characters are selected as the Chinese name feature, according to Chinese naming norms. Second, the suffix of place names (such as ‘国’, ‘村’, ‘路’, ‘港’, etc.) is added as a feature for location recognition: 50 common Chinese characters are chosen as location suffixes by statistics, and place names of at least two Chinese characters are selected as the location formation feature, according to Chinese place-naming norms. Organizational names, however, have lower recognition accuracy than the previous two, mainly because of their complex structure, which may include a person name or place name, and their varying length. Therefore, we not only add the suffix of organizational names, such as ‘局’, ‘厅’, ‘司’, ‘院’, ‘部’, etc., but also require the length to be greater than 3. Meanwhile, if the tagged result contains a person name or place name (the results of local labeling) in addition to the suffix of an organizational name, it should be treated as an organizational name so that the result is consistent at the global level.
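The prefix/suffix attribute can be pictured as an extra column in the CRF++ training file. The sketch below assumes a simple per-character lookup against the example characters quoted above; the full lists (200 surnames, 50 location suffixes) are not given in the paper, the placeholder value `N` for characters without an affix label is an assumption, and the length constraints described above are not modelled.

```python
# Tiny illustrative lexicons: only the example characters quoted in the text.
SURNAMES = {"李", "王", "赵"}
LOC_SUFFIXES = {"国", "村", "路", "港"}
ORG_SUFFIXES = {"局", "厅", "司", "院", "部"}


def affix_column(chars):
    """Map each character to PP, LS, OS, or the placeholder N (assumed)."""
    column = []
    for ch in chars:
        if ch in SURNAMES:
            column.append("PP")
        elif ch in LOC_SUFFIXES:
            column.append("LS")
        elif ch in ORG_SUFFIXES:
            column.append("OS")
        else:
            column.append("N")
    return column


def to_crfpp_rows(chars, labels):
    """Build tab-separated rows: character, affix attribute, output label."""
    affixes = affix_column(chars)
    return [f"{c}\t{a}\t{y}" for c, a, y in zip(chars, affixes, labels)]


chars = list("王明在教育部")
print(affix_column(chars))
```

A POS column would be added in the same way, giving the char+POS+p/s configuration with three attribute columns before the label.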
In addition, three label sets are compared in the experiments. The common segmentation label sets include 3 tags (B, I, S), 4 tags (B, M, E, S), and 6 tags (B, B2, B3, M, E, S). This paper mainly studies three common types of named entity, namely person names, place names and organizational names, which are tagged ‘PER’, ‘LOC’ and ‘ORG’ respectively. The corresponding CNER label collections are generated by combining the segmentation label sets with the entity tags. For example, the person name label collection is (B-PER, I-PER, S) for the 3-tag set, (B-PER, M-PER, E-PER, S) for the 4-tag set, and (B-PER, B2-PER, B3-PER, M-PER, E-PER, S) for the 6-tag set.
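To make the three label sets concrete, the following sketch labels a sentence from a list of entity spans. Two points are assumptions, because the text does not spell them out: the bare `S` tag is used for every character outside an entity, and a single-character entity receives only the `B-` tag. The helper names and the example sentence are hypothetical.

```python
def entity_tags(length: int, entity: str, scheme: int):
    """Tags for one entity of `length` characters under a 3-, 4- or 6-tag set."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if scheme == 3:
        core = ["B"] + ["I"] * (length - 1)
    elif scheme == 4:
        core = ["B"] if length == 1 else ["B"] + ["M"] * (length - 2) + ["E"]
    elif scheme == 6:
        if length == 1:
            core = ["B"]
        else:
            # B, B2, B3 mark the first three characters; E marks the last.
            head = ["B", "B2", "B3"][: min(3, length - 1)]
            core = head + ["M"] * (length - 1 - len(head)) + ["E"]
    else:
        raise ValueError("scheme must be 3, 4 or 6")
    return [f"{t}-{entity}" for t in core]


def label_sentence(chars, spans, scheme: int):
    """spans: (start, end, entity) with end exclusive; other characters get S."""
    labels = ["S"] * len(chars)
    for start, end, entity in spans:
        labels[start:end] = entity_tags(end - start, entity, scheme)
    return labels


chars = list("王明在北京大学")
spans = [(0, 2, "PER"), (3, 7, "ORG")]
for scheme in (3, 4, 6):
    print(scheme, label_sentence(chars, spans, scheme))
```

The number of distinct output labels grows with the tag set (for three entity types, up to 7, 10 and 16 labels under these assumptions), which is consistent with the later observation that larger tag sets cost more training time and resources.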
3. Experiments and Discussions
3.1. Experimental preparation
Since the main objective of this paper is to study CNER for person names, location names and organizational names, the CRFs-based open-source software CRF++ (version 0.54 at the time of writing) is used as the training and test tool in the experiments.
The corpus used in the experiments is from the People’s Daily of January 1998, with a size of about 5.11 MB after processing. To obtain less biased results, the ratio of training to test corpus is about 7:3.
Because CRF++ has strict requirements on the corpus format, all corpora are preprocessed at each stage of the experiments with CorpusPreConv, a tool developed by the authors. Furthermore, NEREvaluator, an evaluation tool written by the authors, is used to compute the Precision, Recall and F-measure of person names, place names and organizational names respectively.
3.2. Experimental results and discussions
In this study, three commonly used attributes (the single Chinese character (char) itself, POS, and prefix/suffix (p/s)) and their combinations, together with three segmentation label sets (3tags: BIS, 4tags: BMES, 6tags: BB2B3MES), are investigated in a series of CNER experiments covering person (PER), location (LOC) and organizational (ORG) name recognition, using three different feature template window sizes (3, 5 and 7). Precision, recall and F-measure are used as the evaluation indicators. Due to limited space, only the results for a feature template window size of 5 are provided in Table 1, in which P denotes precision, R recall, and F1 the F-measure.
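For reference, the three indicators can be computed as in the sketch below. It assumes exact-match scoring over (start, end, type) entity spans; the paper does not describe how NEREvaluator matches entities, so this criterion and the function names are assumptions rather than a description of that tool.

```python
def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def prf(gold: set, pred: set):
    """Precision, recall and F1 over sets of (start, end, type) spans."""
    tp = len(gold & pred)  # spans predicted with exact boundaries and type
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, f1_score(p, r)


gold = {(0, 2, "PER"), (3, 7, "ORG"), (9, 11, "LOC")}
pred = {(0, 2, "PER"), (3, 6, "ORG")}
print(prf(gold, pred))

# Consistency check against one cell of Table 1 (PER, char, 3tags):
print(round(f1_score(96.7, 73.0), 1))  # 83.2
```

The last line reproduces the reported F1 from the reported P and R, which confirms that F1 in Table 1 is the balanced harmonic mean.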
The F-measure values of PER, LOC and ORG under the different conditions are shown in Fig. 1(a) through (f). Fig. 1(a), (b) and (c) show the F1 of PER, LOC and ORG respectively under the same feature template setting with a window size of 5, while Fig. 1(d), (e) and (f) show the corresponding F1 under the same attribute condition (i.e., char+POS+p/s).

Table 1. Precision (P), recall (R) and F1 (%) for the common attributes and their combinations when the window size of the feature template is 5.

| NER | PRF | char 3tags | char 4tags | char 6tags | char+POS 3tags | char+POS 4tags | char+POS 6tags | char+p/s 3tags | char+p/s 4tags | char+p/s 6tags | char+POS+p/s 3tags | char+POS+p/s 4tags | char+POS+p/s 6tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PER | P | 96.7 | 94.9 | 94.9 | 94.8 | 94.7 | 94.0 | 96.5 | 93.7 | 92.9 | 95.4 | 95.2 | 95.2 |
| PER | R | 73.0 | 72.8 | 72.0 | 75.9 | 75.3 | 75.8 | 80.5 | 78.6 | 77.8 | 85.7 | 85.6 | 85.4 |
| PER | F1 | 83.2 | 82.4 | 81.8 | 84.3 | 83.9 | 83.9 | 87.8 | 85.5 | 84.7 | 90.3 | 90.1 | 90.0 |
| LOC | P | 91.5 | 91.5 | 91.5 | 91.3 | 90.9 | 91.1 | 94.9 | 95.1 | 94.8 | 95.7 | 95.6 | 95.2 |
| LOC | R | 83.6 | 84.0 | 83.1 | 84.7 | 85.0 | 84.6 | 86.7 | 87.1 | 86.8 | 89.3 | 89.2 | 89.0 |
| LOC | F1 | 87.4 | 87.6 | 87.1 | 87.9 | 87.9 | 87.7 | 90.6 | 90.9 | 90.6 | 92.4 | 92.3 | 92.0 |
| ORG | P | 88.5 | 85.8 | 85.1 | 90.5 | 90.3 | 90.4 | 92.6 | 92.2 | 91.0 | 93.7 | 93.6 | 93.6 |
| ORG | R | 77.7 | 76.0 | 75.7 | 77.2 | 77.0 | 77.2 | 85.0 | 84.9 | 84.3 | 87.6 | 87.5 | 87.5 |
| ORG | F1 | 82.7 | 80.6 | 80.1 | 83.3 | 83.1 | 83.3 | 88.7 | 88.4 | 87.5 | 90.5 | 90.4 | 90.4 |
Table 1 and Fig. 1(a) through (f) show that, with the same feature template and window size (e.g., size=5) and only the single Chinese character as the attribute column, the precision of PER, LOC and ORG is relatively high but the recall is low (below 80% for PER and ORG). Combining the Chinese character with POS, or with prefix/suffix information, increases both the F-measure and the recall to some extent. In particular, with all three commonly used attributes combined, the recall of PER, LOC and ORG reaches about 85%, 89% and 87% respectively, and the precision is about 94% to 96% in all cases. From the case of a single attribute column (char) to the combination of three attribute columns, the recall of PER, LOC and ORG increases by about 12, 6 and 10 percentage points respectively, and the corresponding F-measure values rise by about 7, 5 and 8 percentage points. These results indicate that POS and prefix/suffix information play very important roles in training the CRFs model and can provide useful references for CNER.
Furthermore, Fig. 1(a) through (c) show that CNER results differ between cases with the same feature template and attributes but different label sets. The PER, LOC and ORG recognition results with the 3-tag set (BIS) as the segmentation label set are generally better than those with the other two (4tags: BMES, 6tags: BB2B3MES). In addition, the more tags there are, the greater the training size, the slower the training speed and thus the more system resources required.
Finally, Fig. 1(d) through (f) demonstrate that CNER results also differ between cases with the same number of tags and identical attributes but different feature template window sizes. Among these cases, the result for a window size of 5 seems to be better than the others (size = 3, 7). Moreover, it can be found in Table 1 that the precision, recall and F1 of CNER do not always increase with the window size of the feature template. For example, when the window size increases from 5 to 7, the CNER results are not clearly improved, while the values of other indicators gradually decrease. Similarly, the more complex the feature template, the longer the training time and the more system resources needed. Therefore, an appropriately selected feature template is also an important factor for CNER.

Fig. 1. F1 under the same feature template with a window size of 5 for (a) PER, (b) LOC and (c) ORG; F1 under the same attributes (char+POS+p/s) for (d) PER, (e) LOC and (f) ORG.
4. Conclusions
A series of experiments is conducted on the features of CNER, including the commonly used attributes, feature templates of varying window size, and the label sets. The experimental results show that different attributes or their combinations, under the same feature template and label set, have different effects on CNER. In particular, better results can be achieved by combining the commonly used attributes. For example, in the experiments of this paper, the results are better when the 3-tag (BIS) set and the template with a window size of 5 are used. Therefore, appropriately selected attributes or their combinations, together with the label set, can clearly improve CNER performance. Furthermore, a feature template with a suitable window size is another important factor affecting CNER results. Consequently, comprehensive consideration of these factors can greatly improve CNER performance and correspondingly reduce the system resources consumed.
It is important to note that there are so many available features in CNER that those considered in this study are not enough to express all possible situations. As this study is preliminary, more investigation is needed in future work.
Acknowledgments
This research work was supported by the Fundamental Research Funds for the Central Universities (2009RC0206).
References
1. MUC-6, http://cs.nyu.edu/faculty/grishman/muc6.html
2. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data, In Proc. of ICML, pp.282-290,
3. SIGHAN, http://sighan.cs.uchicago.edu/, 2010.
4. Taku Kudo. “CRF++: Yet another CRF tool kit”, http://crfpp.sourceforge.net/, 2010.
5. Jun Yu, Xiaoou Chen. Named entity recognition: One-at-a-time Or All-at-once Word-based Or Character-based. Seventh International Conference on Chinese Information Processing, 2007.
6. Chang-Ning Huang and Hai Zhao. Which Is Essential for Chinese Word Segmentation: Character versus Word (Invited paper), The 20th Pacific Asia Conference on Language, Information and Computation (PACLIC-20), pp.1-12, Wuhan, China, November 1-3, 2006.
7. Hai Zhao and Chunyu Kit. Unsupervised Segmentation Helps Supervised Learning of Character Tagging for Word Segmentation and Named Entity Recognition. The Sixth SIGHAN Workshop on Chinese Language Processing (SIGHAN-6), pp.106-111, Hyderabad, India, January 11-12, 2008.
8. Hua-Ping ZHANG, Qun LIU. Chinese Named Entity Recognition Using Role Model. Computational Linguistics and Chinese Language Processing. Vol.8, No.2, August 2003.
9. HU Wen-bo, DU Yun-cheng, LV Xue-qiang, SHI Shui-cai. A Study on Chinese named entity recognition based on cascaded conditional random fields. Computer Engineering and Application 45(1): pp.163-165, 2009.
10. Shumin Shi, Zhiqiang WANG, Lang ZHOU, Chong FENG, Heyan HUANG. Chinese Named Entity Recognition Using Conditional Random Fields Model. Third Academic Conference of Computational Linguistics (ACCL), 2006.
11. Zeng Guanming, Zhang Chuang, Xiao Bo, Lin Zhiqing. CRFs-Based Chinese Named Entity Recognition with improved Tag Set. World Congress on Computer Science and Information Engineering, 2009.
Huanzhong Duan
He is currently a master's student at Beijing University of Posts and Telecommunications (BUPT). His main research interests include Natural Language Processing and Text Mining. No.10 XiTuCheng Road, HaiDian District, Beijing, P.R.China, 100876.
Yan Zheng
She received the Ph.D. degree in 2003 from the Faculty of Computer, Jilin University, China. Since 2003 she has been an associate professor in the Faculty of Computer Sciences, the University of Posts and Telecommunications. Her research interests include Data Mining, Text Mining, Natural Language Processing and Artificial Intelligence. No.10 XiTuCheng Road, HaiDian District, Beijing, P.R.China, 100876.