
International Journal of Advanced Intelligence 

Volume 3, Number 2, pp.287-294, July, 2011. 

© AIA International Advanced Information Institute 

A Study on Features of the CRFs-based Chinese Named Entity Recognition

Huanzhong Duan, Yan Zheng 

Beijing University of Posts and Telecommunications, Beijing, 100876 China 

hzhduan@gmail.com, yanzheng@bupt.edu.cn 

Received (20 February 2011) 

Revised (31 March 2011) 

This paper studies the features used in Chinese Named Entity Recognition (CNER) based on Conditional Random Fields (CRFs). These features, which include common attributes, feature templates of varying window size, and sequence label sets, are very important for CNER. Exploiting these features or their combinations can greatly improve CNER performance. The paper aims to provide a reference for selecting CNER features through a series of experiments. The experimental results show that appropriate features or their combinations, such as single Chinese characters, part-of-speech (POS) tags, and prefixes and suffixes, raise the F-measure of a CRFs-based system. The results also indicate that selecting suitable feature templates and sequence label sets not only improves CNER performance but also shortens model training and reduces system resource consumption.

Keywords: Chinese named entity recognition; feature template; Conditional Random 

Fields. 

1. Introduction 

Named Entity Recognition (NER) was first introduced as a subtask at the Sixth Message Understanding Conference (MUC-6) in November 1995 [1]. At that conference, named entities were defined as the proper nouns and specific numerical expressions that people are interested in. The Named Entity (NE) is a basic information unit in Natural Language Processing (NLP), and recognizing it is one of the fundamental problems in many NLP applications, such as information extraction, question answering, machine translation, automatic summarization and data mining. In a narrow sense, NEs consist of person names, locations, organizations, etc., while in a broader sense they can also include time and numerical expressions.

NER poses many difficult problems. First, NEs form an open class whose members are so numerous that they are very hard to enumerate. Second, the class is not stable: previously unknown members emerge frequently over time. In addition, there are not yet commonly agreed criteria for defining NEs in this field.

Correctly identifying all named entities is a very difficult task in any language, and the level of difficulty varies with the language setting.


Chinese has many complicating properties, for example complex composition forms, uncertain length and boundaries, nested NE definitions, and so on. Therefore, CNER is a difficult task and there is still a long way to go in this field.

At present, there are two commonly used approaches to NER: rule-based and statistics-based. Each has advantages and disadvantages. The rule-based approach is fast and performs well on a small test corpus, but it is difficult to port and hard to apply universally. The statistics-based approach is easier to port and depends only weakly on the language, but it needs a large training corpus and more system resources. CRFs [2] are a statistics-based sequence labeling model with a strong ability to integrate many kinds of features, and are thus widely used in NLP and other fields. Previous SIGHAN [3] NER evaluation results show that systems based on CRFs achieve better performance. This is the main reason why CRF++ [4] is chosen as the tool for the CNER task.

Because CRFs can integrate many kinds of features, and features play an important role during training, feature selection becomes one of the key factors affecting NER performance. The features of CNER include not only internal features from context, such as character information, POS and boundary, but also external features based on statistical results, such as common surnames as the prefix of Chinese person names, the suffixes of locations and organizations, and so on. In addition, the feature template is also found to play an important role in CNER.

2. Feature Template and Feature Set 

2.1. Feature Template 

To define the feature functions, a set of real-valued observation features b(x,i) is first constructed. This set expresses characteristics of the empirical distribution of the training data that should also hold for the model distribution. Each feature function takes the value of one of these observation features b(x,i) if the current state (for a state function) or the previous and current states (for a transition function) take particular values, as follows:

$$
f(y_{i-1}, y_i, x, i) =
\begin{cases}
b(x, i), & \text{if } y_{i-1} \text{ is the previous tag and } y_i \text{ is the current tag} \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

where f takes the value of the real-valued observation feature b(x,i) if y_{i−1} is the previous tag and y_i is the current tag, and zero otherwise.
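As a hedged illustration of Eq. (1), the following minimal Python sketch builds one transition feature function from an observation test. The names `make_feature` and `is_surname_char`, the three-surname set and the example tags are assumptions made for illustration; they are not part of CRF++ or of the tools used in this paper.

```python
from typing import Callable, Sequence

# An observation feature b(x, i): maps the sequence x and position i to a real value.
Observation = Callable[[Sequence[str], int], float]


def make_feature(prev_tag: str, cur_tag: str, b: Observation):
    """Return f(y_prev, y_cur, x, i): b(x, i) when both tags match, else 0."""
    def f(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
        if y_prev == prev_tag and y_cur == cur_tag:
            return b(x, i)
        return 0.0
    return f


def is_surname_char(x: Sequence[str], i: int) -> float:
    """Hypothetical observation test: is the current character a common surname?"""
    return 1.0 if x[i] in {"李", "王", "赵"} else 0.0


# Feature that fires when a surname character starts a person name after tag S.
f = make_feature("S", "B-PER", is_surname_char)
```

The feature returns a non-zero value only when the tag pair and the observation both match, which is how the indicator in Eq. (1) restricts each feature to one tag transition.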

In accordance with the requirements of CRF++, the experiments need a feature template file in addition to a training data file in the specified format.


The selection of the feature template can strongly affect the experimental results. Meanwhile, the system resources required for training vary with the feature template. According to the results of Jun Yu [5] and other references that compared character-based and word-based CNER performance, character-level feature templates are chosen for investigation in this study. Empirically, the window size of the feature template can be set to 3 (the previous, current and next tokens), 5 (the two previous, the current and the two next tokens), or 7 (the three previous, the current and the three next tokens). The feature template typically takes the following forms:

(i) C_n; (ii) C_n C_{n+1}; (iii) C_{-n} C_n; where n is an integer.

According to the different attributes and the number of labels in the training file format, nine feature template sets are chosen to carry out the experiments and make comparisons. The experimental results show that CNER performance can be improved by selecting a suitable template and an appropriate window size for each feature, as discussed later in this paper.
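The three template forms can be sketched as a small generator of CRF++ template lines. This is only an illustration under stated assumptions: the function name `build_template` is hypothetical, only attribute column 0 (the character) is used, and the nine template sets actually used in the experiments are not listed in the paper, so this is not a reproduction of them. The `U..:%x[row,col]` macros and the bare `B` line follow the CRF++ template file syntax.

```python
from typing import List


def build_template(window: int) -> List[str]:
    """Build CRF++ template lines over column 0 for an odd window size."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    lines: List[str] = []
    # Form (i): single characters C_n for n in [-half, half].
    for n in range(-half, half + 1):
        lines.append(f"U{len(lines):02d}:%x[{n},0]")
    # Form (ii): adjacent pairs C_n C_{n+1}.
    for n in range(-half, half):
        lines.append(f"U{len(lines):02d}:%x[{n},0]/%x[{n + 1},0]")
    # Form (iii): symmetric pairs C_{-n} C_n for n >= 1.
    for n in range(1, half + 1):
        lines.append(f"U{len(lines):02d}:%x[{-n},0]/%x[{n},0]")
    # A bare "B" line requests bigram features over the output tags.
    lines.append("B")
    return lines


if __name__ == "__main__":
    print("\n".join(build_template(5)))
```

Under these assumptions the number of template lines grows with the window size, which is consistent with the observation that larger templates need more training time and resources.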

2.2. Feature Set 

The training feature set of CRFs is a very important parameter that directly affects the NER results. The main features can be divided into two categories: internal and external. The former include the character or word, POS, boundary and other context information. Although these basic internal features have some effect on the results, external features are needed to obtain better results. The external features come mainly from general statistical information about the corpus, and consist of the prefixes of common Chinese person names (surnames), the suffixes of place names and organizational names, and so on. Herein, the surname or prefix of a person name is labeled ‘PP’, the suffix of a location ‘LS’ and the suffix of an organizational name ‘OS’.

First, the recognition accuracy of person names can be increased by adding the prefix of Chinese family names (mainly common surnames in China, such as ‘李’, ‘王’, ‘赵’, etc.). Specifically, the top 200 Chinese surnames are used as the name prefix feature, and person names with a length between 2 and 4 characters are selected as the Chinese name feature according to Chinese naming norms. Second, the suffix of place names (such as ‘国’, ‘村’, ‘路’, ‘港’, etc.) is added as a feature for location recognition: 50 common Chinese characters are chosen as location suffixes by statistics, and place names of at least two Chinese characters are selected as the location formation feature according to Chinese place naming norms. Organizational names, however, have lower recognition accuracy than the previous two, mainly because of their complex structure, which may include a person name or place name, and their varying length. Therefore, we not only add the suffix of organizational names, such as ‘局’, ‘厅’, ‘司’, ‘院’, ‘部’, etc., but also require the length to be greater than 3. Meanwhile, if a tagged result contains a person name or place name (a local label) in addition to the suffix of an organizational name, it is treated as an organizational name so as to obtain a globally consistent result.
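A minimal sketch of how the external prefix/suffix attribute column might be produced is given below. The character sets are tiny stand-ins taken from the examples in the text, not the 200-surname or 50-suffix lists; the placeholder value `N` for characters without an affix role, and the function names, are assumptions. The length constraints on names are not applied here.

```python
SURNAMES = {"李", "王", "赵"}                  # stand-in for the top-200 surname list
LOC_SUFFIXES = {"国", "村", "路", "港"}         # stand-in for the 50 location suffixes
ORG_SUFFIXES = {"局", "厅", "司", "院", "部"}    # examples of organization suffixes


def affix_column(chars):
    """Return one external-feature value per character: PP, LS, OS or N (none)."""
    column = []
    for ch in chars:
        if ch in SURNAMES:
            column.append("PP")
        elif ch in LOC_SUFFIXES:
            column.append("LS")
        elif ch in ORG_SUFFIXES:
            column.append("OS")
        else:
            column.append("N")  # assumed placeholder for "no affix role"
    return column


def to_crfpp_rows(chars, labels):
    """Format one token per line as 'char affix label', with the label last."""
    return [f"{c} {a} {y}" for c, a, y in zip(chars, affix_column(chars), labels)]
```

For example, `to_crfpp_rows(["王", "明"], ["B-PER", "I-PER"])` yields one whitespace-separated row per character, with the answer tag in the last column as CRF++ expects for training data.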

In addition, three label sets are compared in the experiments. The common segmentation label sets include 3 tags (B, I, S), 4 tags (B, M, E, S), and 6 tags (B, B2, B3, M, E, S). This paper mainly studies the three common named entity types, person name, place name and organizational name, which are tagged ‘PER’, ‘LOC’ and ‘ORG’ respectively. The label sets for CNER can therefore be generated by combining the segmentation label sets with the entity tags. For example, the labels for person name entities are (B-PER, I-PER, S) in the 3-tag set, (B-PER, M-PER, E-PER, S) in the 4-tag set, and (B-PER, B2-PER, B3-PER, M-PER, E-PER, S) in the 6-tag set.
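The combination of segmentation tags and entity tags can be sketched as follows. This is an illustration under assumptions: the rule for placing B2 and B3 follows common usage of the 6-tag set in Chinese word segmentation work, and the paper does not spell out how S or non-entity characters are labeled, so the sketch simply attaches the entity type to every tag of a single entity span. The function names are hypothetical.

```python
from typing import List


def segment_tags(length: int, scheme: int) -> List[str]:
    """Position tags for one span of `length` characters under a 3-, 4- or 6-tag set."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if scheme not in (3, 4, 6):
        raise ValueError("scheme must be 3, 4 or 6")
    if length == 1:
        return ["S"]
    if scheme == 3:
        return ["B"] + ["I"] * (length - 1)
    if scheme == 4:
        return ["B"] + ["M"] * (length - 2) + ["E"]
    # 6-tag set: up to three distinct leading tags, then M, then a final E.
    head = ["B", "B2", "B3"][: min(3, length - 1)]
    middle = ["M"] * (length - 1 - len(head))
    return head + middle + ["E"]


def entity_labels(length: int, entity: str, scheme: int) -> List[str]:
    """Attach an entity type such as PER, LOC or ORG to each position tag."""
    return [f"{tag}-{entity}" for tag in segment_tags(length, scheme)]
```

A three-character person name thus receives three labels under every scheme, but the number of distinct labels the model must learn grows from the 3-tag to the 6-tag set.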

3. Experiments and Discussions 

3.1. Experimental preparation 

Since the main objective of this paper is to study CNER for person names, place names and organizational names, the open source CRFs toolkit CRF++ (current version CRF++ 0.54) is used as the training and test tool in the experiments.

The corpus used in the experiments is from the People’s Daily of January 1998 and has a size of about 5.11 MB after processing. To obtain less biased results, the ratio of training to test corpus is about 7:3.

Because CRF++ has strict requirements on the corpus format, all corpora are preprocessed with CorpusPreConv, a tool developed by the authors, at each stage of the experiments. Furthermore, an evaluation tool written by the authors, NEREvaluator, is used to compute the Precision, Recall and F-measure of person names, place names and organizational names respectively.
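The three evaluation measures can be sketched as below. This is not the NEREvaluator tool, whose implementation is not described; it is a minimal sketch assuming exact-match scoring, where an entity counts as correct only if its start, end and type all agree with the gold annotation. The tuple layout `(start, end, type)` and the function names are assumptions.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, entity type)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over two sets of entities."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)
```

Because F1 is a harmonic mean, a high precision cannot compensate for a much lower recall, which is why recall gains dominate the F-measure improvements reported in the experiments.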

3.2. Experimental results and discussions 

In this study, three commonly used attributes (the single Chinese character (char) itself, POS, and prefix/suffix (p/s)) and their combinations, together with three segmentation label sets (3 tags: BIS; 4 tags: BMES; 6 tags: BB2B3MES), are investigated in a series of CNER experiments. The experiments cover person (PER) name recognition, location (LOC) name recognition and organizational (ORG) name recognition, with three different feature template window sizes (size = 3, 5, 7). Precision, recall and F-measure are used as the evaluation indicators. Due to limited space, only the results for a feature template window size of 5 are given in Table 1, in which P denotes precision, R recall, and F1 the F-measure.

The F-measure values of PER, LOC and ORG under the different conditions are shown in Fig. 1(a) through (f). Fig. 1(a), (b) and (c) show the F1 of PER, LOC and ORG respectively under the same feature template setting with a window size of 5, while Fig. 1(d), (e) and (f) show the corresponding F1 under the same attribute condition (i.e., char+POS+p/s).

Table 1. Results (%) for the common attributes and their combinations; the window size of the feature template is 5.

NER  PRF  char               char+POS           char+p/s           char+POS+p/s
          3tags 4tags 6tags  3tags 4tags 6tags  3tags 4tags 6tags  3tags 4tags 6tags
PER  P    96.7  94.9  94.9   94.8  94.7  94.0   96.5  93.7  92.9   95.4  95.2  95.2
     R    73.0  72.8  72.0   75.9  75.3  75.8   80.5  78.6  77.8   85.7  85.6  85.4
     F1   83.2  82.4  81.8   84.3  83.9  83.9   87.8  85.5  84.7   90.3  90.1  90.0
LOC  P    91.5  91.5  91.5   91.3  90.9  91.1   94.9  95.1  94.8   95.7  95.6  95.2
     R    83.6  84.0  83.1   84.7  85.0  84.6   86.7  87.1  86.8   89.3  89.2  89.0
     F1   87.4  87.6  87.1   87.9  87.9  87.7   90.6  90.9  90.6   92.4  92.3  92.0
ORG  P    88.5  85.8  85.1   90.5  90.3  90.4   92.6  92.2  91.0   93.7  93.6  93.6
     R    77.7  76.0  75.7   77.2  77.0  77.2   85.0  84.9  84.3   87.6  87.5  87.5
     F1   82.7  80.6  80.1   83.3  83.1  83.3   88.7  88.4  87.5   90.5  90.4  90.4

It can be seen from Table 1 and Fig. 1(a) through (f) that, with the same feature template and window size (e.g., size = 5) and only the single Chinese character as the attribute column, the precision of PER, LOC and ORG is relatively high, but the recall is low (below 80% for PER and ORG, and about 84% for LOC). Combining the Chinese character with POS, or with prefix/suffix information, increases both the recall and the F-measure to some extent. In particular, when all three commonly used attributes are combined, the recall of PER, LOC and ORG reaches about 85%, 89% and 87% respectively, and the precision is roughly 94% to 96% in all cases. From the case of a single attribute column (i.e., char) to the combination of three attribute columns, the recall of PER, LOC and ORG increases by about 12, 6 and 10 percentage points respectively, and the corresponding F-measure values rise by about 7, 5 and 8 percentage points. Therefore, the results indicate that POS and prefix/suffix information play very important roles in training the CRFs model and can provide useful references for CNER.
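The gains quoted above can be checked directly against Table 1. The short sketch below copies the recall and F1 values from the 3-tag columns (window size 5) and computes the difference between the char and char+POS+p/s settings in percentage points; the dictionary layout and the function name `gain` are illustrative assumptions.

```python
# Recall and F1 (in %) copied from Table 1, 3-tag columns, window size 5.
TABLE1_3TAGS = {
    "PER": {"char": {"R": 73.0, "F1": 83.2}, "char+POS+p/s": {"R": 85.7, "F1": 90.3}},
    "LOC": {"char": {"R": 83.6, "F1": 87.4}, "char+POS+p/s": {"R": 89.3, "F1": 92.4}},
    "ORG": {"char": {"R": 77.7, "F1": 82.7}, "char+POS+p/s": {"R": 87.6, "F1": 90.5}},
}


def gain(entity: str, metric: str) -> float:
    """Percentage-point gain from char alone to char+POS+p/s, rounded to 0.1."""
    row = TABLE1_3TAGS[entity]
    return round(row["char+POS+p/s"][metric] - row["char"][metric], 1)


if __name__ == "__main__":
    for entity in TABLE1_3TAGS:
        print(entity, "recall gain:", gain(entity, "R"), "F1 gain:", gain(entity, "F1"))
```

The differences are absolute (percentage points), not relative increases, and they refer to the 3-tag label set only.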

Furthermore, the results in Fig. 1(a) through (c) show that CNER performance differs between cases with the same feature template and attributes but different label sets. The 3-tag (BIS) set generally gives PER, LOC and ORG results at least as good as the other two sets (4 tags: BMES; 6 tags: BB2B3MES); the only exceptions in Table 1 are LOC with 4 tags under the char and char+p/s attributes. In addition, the more tags there are, the larger the training size, the slower the training speed, and the more system resources are required.

Finally, the results in Fig. 1(d) through (f) demonstrate that, for cases with the same number of tags and identical attributes but feature templates of different window sizes, CNER performance also differs. Among these cases, the template with a window size of 5 appears to give better results than the others (size = 3, 7). Moreover, it can be seen from Table 1 that the precision, recall and F1 of CNER do not always increase with the window size of the feature template. For example, when the window size increases from 5 to 7, the CNER results are not clearly improved, and some indicators gradually decrease. Similarly, the more complex the feature template, the longer the training time and the more system resources are needed. Therefore, an appropriately selected feature template is also an important factor for CNER.

Fig. 1. F1 under different conditions. (a), (b), (c): F1 of PER, LOC and ORG respectively, with the same feature template and a window size of 5. (d), (e), (f): F1 of PER, LOC and ORG respectively, with the same attributes (char+POS+p/s).



4. Conclusions 

A series of experiments is conducted on the features of CNER, including the commonly used attributes, feature templates of varying window size, and the label sets. The experimental results show that different attributes or their combinations, under the same feature template and label set, have different effects on CNER. In particular, better results can be achieved by combining the commonly used attributes. For example, in the experiments of this paper, the results are better when the 3-tag (BIS) set and the template with a window size of 5 are used. Therefore, appropriately selected attributes or their combinations and label sets can clearly improve CNER performance. Furthermore, a feature template with a suitable window size is another important factor affecting the results. Consequently, considering these factors together can greatly improve CNER performance and correspondingly reduce the system resources consumed.

It is important to note that there are many features available for CNER, and those considered in this study are not enough to cover all possible situations. This study is preliminary, and more investigation is needed in future work.

Acknowledgments 

This research work was supported by the Fundamental Research Funds for the 

Central Universities (2009RC0206). 

References 

1. MUC-6, http://cs.nyu.edu/faculty/grishman/muc6.html 

2. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pp.282-289, 2001.

3. SIGHAN, http://sighan.cs.uchicago.edu/, 2010. 

4. Taku Kudo. “CRF++: Yet another CRF tool kit”, http://crfpp.sourceforge.net/, 2010. 

5. Jun Yu, Xiaoou Chen. Named entity recognition: One-at-a-time Or All-at-once Word-based Or Character-based. Seventh International Conference on Chinese Information Processing, 2007.

6. Chang-Ning Huang and Hai Zhao. Which Is Essential for Chinese Word Segmentation: Character 

versus Word (Invited paper), The 20th Pacific Asia Conference on Language, Information 

and Computation (PACLIC-20),pp.1-12,Wuhan, China, November 1-3,2006. 

7. Hai Zhao and Chunyu Kit. Unsupervised Segmentation Helps Supervised Learning of Character 

Tagging for Word Segmentation and Named Entity Recognition. The Sixth SIGHAN Workshop 

on Chinese Language Processing (SIGHAN-6), pp.106-111, Hyderabad, India, January 11-12, 

2008.

8. Hua-Ping ZHANG, Qun LIU. Chinese Named Entity Recognition Using Role Model. Computational 

Linguistics and Chinese Language Processing. Vol.8, No.2, August 2003. 

9. HU Wen-bo, DU Yun-cheng, LV Xue-qiang, SHI Shui-cai. A Study on Chinese named entity recognition based on cascaded conditional random fields. Computer Engineering and Application 45(1):pp.163-165, 2009.

10. Shumin Shi, Zhiqiang WANG, Lang ZHOU, Chong FENG, Heyan HUANG. Chinese Named 

Entity Recognition Using Conditional Random Fields Model. Third Academic Conference of 

Computational Linguistics (ACCL), 2006.


11. Zeng Guanming, Zhang Chuang, Xiao Bo, Lin Zhiqing. CRFs-Based Chinese Named Entity 

Recognition with improved Tag Set. World Congress on Computer Science and Information 

Engineering, 2009. 

Huanzhong Duan 

He is currently a master's student at Beijing University of Posts and Telecommunications (BUPT). His main research interests include Natural Language Processing and Text Mining. No.10 XiTuCheng Road, Haidian District, Beijing, P.R. China, 100876.

Yan Zheng 

She received the Ph.D. degree in 2003 from the Faculty of Computer, Jilin University, China. From 2003 she was an associate professor in the Faculty of Computer Sciences, Beijing University of Posts and Telecommunications. Her research interests include Data Mining, Text Mining, Natural Language Processing and Artificial Intelligence. No.10 XiTuCheng Road, Haidian District, Beijing, P.R. China, 100876.
