Journal of Networks - Academy Publisher

Journal of Networks 

ISSN 1796-2056 

Volume 6, Number 12, December 2011 


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1655 

Botnet Detection Architecture Based on 

Heterogeneous Multi-sensor Information Fusion 

HaiLong Wang and Jie Hou 

National University of Defense Technology, Changsha, 410073, China 

Email: {hlwang1981, jhou1983}@gmail.com 

ZhengHu Gong 

National University of Defense Technology, Changsha, 410073, China 

Email: gzh@nudt.edu.cn 

Abstract—As technology has been developed rapidly, botnet 

threats to the global cyber community are also increasing. 

Botnet detection has recently become a major research topic in the field of network security. Most current detection approaches rely on evidence from a single information source, which cannot capture all the traces of a botnet and therefore rarely achieves high accuracy. In this paper, a novel botnet detection architecture based on heterogeneous multi-sensor information fusion is proposed. The architecture is designed to integrate information at three fusion levels: data, feature, and decision. A feature extraction module is designed as the core component, and an extended algorithm of the Dempster-Shafer (D-S) theory is proved and adopted for decision fusion. Furthermore, a representative case illustrates that the detection architecture can effectively fuse complicated information from various sensors and thus achieve a better detection effect.

Index Terms—botnet, botnet detection, network security, 

information fusion, D-S theory 

I. INTRODUCTION 

Internet threats have recently shifted from highly visible, disruptive attacks to stealthy attacks carried out for profit, and botnets are at the center of this change [1]. Botnets have been the workhorses of many disastrous attacks, such as information theft [2], distributed denial of service (DDoS) [3], and spam [4]. These threats can disable infrastructure and cause financial damage, which poses a severe challenge to global network security. To detect botnet attacks effectively, we need a correct and comprehensive understanding of them. In particular, we must fuse all the information related to botnet activities that is gathered from heterogeneous multi-source sensors, and then carry out further analysis for decision-making. Information fusion is therefore a necessary component of botnet detection [5].

Manuscript received January 1, 2011; revised June 1, 2011; accepted July 1, 2011.

© 2011 ACADEMY PUBLISHER

doi:10.4304/jnw.6.12.1655-1661

A botnet is a network of computers on which software called a ‘bot’ is installed automatically, without user intervention, and which is remotely controlled through a command and control (C&C) channel for malicious purposes [6]. Botnet activities share the following characteristics. First, they have more action phases and forms than traditional malware attacks. The activity cycle of a botnet attack usually consists of four stages: propagation, infection, communication, and attack [7]. Even within the same stage, different botnet attacks can take different forms, such as propagating through system vulnerabilities or through email. Second, botnet activities range widely, from a private host or a local area network to the backbone [8]. Third, botnet activities are always hidden. Because the network traffic they generate is small, bots can upgrade themselves and remain unexposed for a long time [9]. These three characteristics make botnet detection very difficult. However, it is well known that a botnet leaves traces as it acts over a wide range [10]. Diverse types of information sources can be retrieved, such as network packets, network flows, system logs, alerts from anti-virus software or intrusion detection systems, and the analysis results of botnet detection tools. Although this information can be used to identify the traces of a botnet, it is usually large-scale, uncertain, and redundant.

Despite the importance of information fusion for botnet detection, most existing work does not focus on it. To the best of our knowledge, the existing botnet detection schemes can discover bots to some extent, but they do not make full use of the varied information related to botnet activities and cannot handle the entire situation of botnet infiltration. In recent years, multi-sensor information fusion has developed rapidly and has been applied in many sophisticated areas, especially network security [11]. To integrate complicated information from heterogeneous, multi-source sensors efficiently, we propose a botnet detection architecture based on information fusion techniques. In the architecture, we design a novel feature extraction module and adopt an extended decision fusion algorithm, which enables detection to achieve fusion at three levels: data, feature, and decision.

The remainder of the paper is organized as follows. 

Section 2 discusses background technologies and related 

work. Section 3 presents the botnet detection architecture 

based on heterogeneous multi-sensor information fusion. 

Section 4 introduces a fusion algorithm used in the 

architecture and gives the proof of the algorithm. Section 

5 shows an illustration of botnet detection. Finally, 

section 6 concludes the paper. 

II. RELATED WORK 

A botnet is a new type of attack that has developed from, and combines, network worms, Trojans, backdoor tools, and other traditional forms of malicious code [12]. Compared with these traditional attacks, the major difference is that a botnet has a one-to-many control relationship between the attacker and the bots [13]. This feature makes botnets more covert, flexible, and efficient than other malicious programs.

As botnets have evolved, so have the techniques for detecting them. Many diverse schemes for botnet detection have been proposed, such as honeypots or honeynets for capture and analysis [14], correlation analysis of malicious behaviors [15], detection approaches for different C&C mechanisms (e.g., IRC, HTTP, DNS, or P2P) [16-19], and identification of bots from DDoS and spam [20, 21]. However, these techniques mainly focus on network traffic and obtain evidence of botnet activities indirectly. For example, evidence of a bot upgrade is obtained by identifying the upgrade binaries in the traffic, rather than derived directly from the code server that logs the download event. Relying on a single information source and on indirect evidence causes three problems for botnet detection. First, it usually produces false positives and false negatives. Second, it extends the detection cycle, because multiple rounds of observation are generally required to reach a correct result. Third, because information collection is inadequate, it is very difficult to become aware of new botnets or botnet variants. Therefore, detection architectures that can integrate heterogeneous multi-sensor information deserve more attention.

Robert et al. [22] design a multi-layered architecture for detecting a wide range of existing and new botnets. The architecture can integrate many detection techniques and gathers information from all the available network information sources: network traffic data, system process information, and file system information. Napoleon et al. [23] introduce a risk-aware, network-centric management framework to detect and prevent targeted botnet attacks as well as propagation attempts within the network. The framework systematically collects network traffic and software vulnerabilities, comprehensively analyzes and discovers the characteristics and unique behaviors of bots, and dynamically determines the associated risks and generates corresponding detection rules. Zhang et al. [24] develop a top-down analytical framework as a basis for critically evaluating the existing countermeasures. The framework correlates and integrates the observations and reports of anti-botnet tools at different layers (Internet, intranet, and host) to obtain a complete snapshot of the botnet. Alireza et al. [25] propose an architecture called “Visual Threat Monitor” that combines data mining and visualization to enhance botnet traffic detection. Its processing pipeline consists of correlation, statistical analysis, clustering, aggregation, and visualization. Building on the studies in [15, 26, 27], Gu et al. [28] present a general detection framework for more accurate botnet detection over a local area network. It analyzes traffic and network flows and correlates them with multiple alerts or events from an intrusion detection system.

The aforementioned detection architectures have several problems with respect to information fusion. First, the types of information source are incomplete, and the sources are not properly divided according to botnet activities, which leads to highly redundant information and poorly targeted collection. Second, although these schemes adopt some correlation analysis methods, they lack a powerful algorithm for fusing large-scale information from different sources and for obtaining the correlation between attackers and their botnets. Third, most of the existing frameworks have no independent feature extraction module, or their feature extraction function is too simple.

III. ARCHITECTURE OVERVIEW 

Information fusion is an information processing method that uses information from multiple sensors, together with related information from associated databases, to achieve improved accuracy and more specific inferences than could be achieved with a single sensor alone [29]. Network security is a recent application area of information fusion, and the existing applications mainly aim at improving intrusion detection systems (IDS) [30]. Information fusion processes are often categorized as data, feature, or decision level fusion, depending on the processing stage at which fusion takes place [31]. Data level fusion combines several sources of raw information to produce new information that is expected to be more informative and synthetic than the inputs. Feature level fusion combines various features into a feature map that may be used by further processing. Decision level fusion combines decisions derived from several sources of expert knowledge. Following these fusion processes, we present the botnet detection architecture based on heterogeneous multi-sensor information fusion, which consists of the parts shown in Figure 1.

A. Information Collection 

This part adopts a role-based collaborative information collection model, which is our recent work in [32]. It includes the software and hardware of the information collection system, and its main task is to collect all the information related to botnet activities from heterogeneous multi-source sensors. The information can be gathered from computers in the network, from network security equipment such as firewalls and intrusion detection systems (IDS), and from network equipment such as routers and switches. The function of this part is implemented by the information collection agent.

Figure 1. Botnet detection architecture based on heterogeneous multi-sensor information fusion.


B. Pre-processing 

Pre-processing is needed to increase the effectiveness and efficiency of the subsequent information fusion. The pre-processing module is composed of classification and refinement. Classification divides the information sources into original information, indirect information, and direct information. Original information is the real record of network and system behaviors without any security analysis, such as packet payloads and system process information. Indirect information is the alarm information from general security software, such as anti-virus software, firewalls, and honeypots. Indirect information, always combined with original information, can serve as indirect evidence for botnet detection. Direct information is the analysis result of specialized botnet detection tools (e.g., BotHunter [15]), which can serve as direct evidence for botnet detection. Refinement filters out unwanted information, detects suspicious information according to preset rules, unifies the presentation, and stores the result in the information database.
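The classification and refinement steps can be sketched as follows. This is only a minimal illustration under stated assumptions: the sensor names, the `Record` fields, the keyword-based preset rule, and the default class for unknown sensors are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical mapping from sensor type to the three information classes
# described above; the sensor names are illustrative only.
SENSOR_CLASS = {
    "packet_capture": "original",
    "process_monitor": "original",
    "antivirus": "indirect",
    "firewall": "indirect",
    "honeypot": "indirect",
    "bothunter": "direct",
}


@dataclass
class Record:
    sensor: str   # type of the reporting sensor
    host: str     # host the record refers to
    message: str  # raw text of the record


def classify(record):
    """Classification: map a record to original, indirect, or direct."""
    # Assumption: unknown sensors are treated as raw (original) information.
    return SENSOR_CLASS.get(record.sensor, "original")


def refine(records, suspicious_keywords):
    """Refinement: filter, match preset rules, unify, and store."""
    database = {"original": [], "indirect": [], "direct": []}
    for record in records:
        text = record.message.strip().lower()   # unify the presentation
        if not text:                             # filter out empty records
            continue
        # Preset rule (assumed): keep records that mention a keyword.
        if not any(keyword in text for keyword in suspicious_keywords):
            continue
        database[classify(record)].append(
            {"sensor": record.sensor, "host": record.host, "message": text}
        )
    return database


records = [
    Record("bothunter", "10.0.0.5", "Bot profile: outbound SPAM scan"),
    Record("firewall", "10.0.0.5", "Blocked port 25 burst, possible spam"),
    Record("packet_capture", "10.0.0.7", "ordinary web traffic"),
]
info_db = refine(records, suspicious_keywords=("spam", "irc"))
```

Here `info_db` plays the role of the information database, already split into the three categories that the later fusion stages consume.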

C. Feature Extraction 

Feature extraction is the core module of the 

architecture. The existing detection techniques use the 

following two extraction modes: 

• Utilizing the botnet samples captured by a honeypot or honeynet (including bots, message contents, etc.). Because the sample data is relatively pure, the extracted data features can be adopted directly as the essential features (signatures or patterns) of a botnet.

• Utilizing general information (such as flow data and logs) and indirect information. The main process is as follows: first, try to discover data features; then, compare them with the results found by a proven botnet detection system; finally, verify whether the data features belong to the essential features of botnets.

Figure 2. Feature extraction.

Our feature extraction covers both of the above modes. As shown in Figure 2, the feature extraction module consists of four parts: attribute selection, data feature analysis, validation, and scheme management. Data feature analysis integrates data mining methods such as statistical data analysis, pattern recognition, artificial neural networks, and support vector machines. Its goal is to provide a mechanism for identifying new features in the data sets produced by attribute selection. Features extracted from general and indirect information must be verified before being stored in the feature database as signatures or patterns. Meanwhile, based on the extracted features, scheme management gives feedback to the data feature analysis and attribute selection modules for dynamic optimization. In addition, this part divides the analytic results into four main categories according to the stages of botnet activity (propagation, infection, communication, and attack) as the inputs of the fusion analysis.
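The last step, dividing the analytic results into the four stage categories, might look like the sketch below. The `Stage` enum, the `group_by_stage` helper, and the example features are hypothetical names and data chosen for illustration; the paper does not specify a data model.

```python
from enum import Enum


class Stage(Enum):
    """The four stages of botnet activity named in the text."""
    PROPAGATION = "propagation"
    INFECTION = "infection"
    COMMUNICATION = "communication"
    ATTACK = "attack"


def group_by_stage(features):
    """Group (stage_name, feature) pairs into the four fusion inputs."""
    grouped = {stage: [] for stage in Stage}
    for stage_name, feature in features:
        # Stage(...) raises ValueError for a stage name outside the four.
        grouped[Stage(stage_name)].append(feature)
    return grouped


# Illustrative features only; not results from the paper.
features = [
    ("communication", "periodic IRC PING/PONG to one server"),
    ("attack", "burst of outbound SMTP connections"),
    ("communication", "DNSBL lookups from a workstation"),
]
fusion_inputs = group_by_stage(features)
```

Keeping an entry for every stage, even an empty one, lets the fusion step treat a missing stage as "no evidence" rather than as an error.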

D. Fusion Analysis 

Fusion analysis is the other key part of the architecture and the main module for producing high-level, qualified alerts. It gives the final conclusion for decision-making, reflecting the real situation of the botnet activities. The detailed process is described in Section 4.

E. Database 

The information database stores the results produced by the pre-processing module. The results are divided into three categories, which is useful for the subsequent fusion steps. The feature database stores the signatures and patterns from feature extraction. The knowledge database includes the vulnerability database, security policies, client configuration records, etc., which provide strong data support for decision-making.

F. Control and Collaboration Mechanism 

The control mechanism is used to react to offending events taking place on or within the detection system. Depending on the results of analysis and synthesis, this part applies response measures to the main modules. Some responses can be carried out automatically by the control mechanism through adjusting system parameters. The collaboration mechanism provides communication and function calls among detection systems or with other security products.

IV. METHOD OF FUSION ANALYSIS 

Decision fusion algorithms in the fusion analysis must meet three critical requirements:

• Flexibility. The algorithm should require no prior 

probability and conditional probability. Since the 

botnet behaviors are often random and uncertain, 

it is difficult to obtain the prior knowledge. 

• Compatibility. It can effectively integrate 

heterogeneous multi-sensor information, and in 

particular with the accumulation of evidences, the 

decision will be more accurate. 

• Scalability. It has the ability to easily fuse new 

evidences from the emerging sensors without 

changing the framework of algorithm. 

Among the main candidate algorithms, the Dempster-Shafer (D-S) theory meets these requirements. The D-S theory is a mathematical theory of evidence, introduced in the 1960s by Arthur Dempster [33] and developed in the 1970s by Glenn Shafer [34]. It is viewed as a mechanism for reasoning under epistemic (knowledge) uncertainty. We first give a brief introduction to the D-S theory [35]. We then introduce the extended D-S theory proposed in [36], which our architecture uses to fuse the results of feature extraction, and give a proof of the extended theory, which was not proved in [36].

D-S Theory 

Frame of discernment (Θ) is a finite set of mutually exclusive propositions and hypotheses about some problem domain. Basic probability assignment (bpa) is stated in [34] as: “If Θ is a frame of discernment, then a function m: 2^Θ → [0, 1] is called a basic probability assignment whenever

$$m(\emptyset) = 0 \tag{1}$$

$$\sum_{A \subset \Theta} m(A) = 1 \tag{2}$$

The mass value of A (m(A)) is also called A’s basic probability number, and it is understood to be the measure of the belief that is committed exactly to A.”

The belief function (Bel) of A is the total mass committed to the subsets of A.

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B) \tag{3}$$

Plausibility function (Pl) takes into account all the elements related to A (either supported by evidence or unknown).

$$\mathrm{Pl}(A) = 1 - \mathrm{Bel}(\neg A) \tag{4}$$

For the subset A, Bel(A) and Pl(A) represent the lower and upper belief bounds, respectively, and the interval [Bel(A), Pl(A)] represents the belief range.
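As a minimal sketch of definitions (1) to (4), a mass function can be represented as a dictionary from subsets of Θ to mass values. The two-element frame, the function names, and the numbers below are assumptions made for illustration; they are not data from the paper.

```python
def is_bpa(m, frame, tol=1e-9):
    """Check conditions (1) and (2) for a mass function m over frame."""
    theta = frozenset(frame)
    if any(not subset <= theta for subset in m):
        return False                      # mass outside the frame
    empty_ok = m.get(frozenset(), 0.0) == 0.0          # condition (1)
    sums_to_one = abs(sum(m.values()) - 1.0) <= tol    # condition (2)
    return empty_ok and sums_to_one


def bel(m, a):
    """Equation (3): total mass committed to subsets of A."""
    return sum(mass for b, mass in m.items() if b <= a)


def pl(m, a, frame):
    """Equation (4): Pl(A) = 1 - Bel(not A)."""
    return 1.0 - bel(m, frozenset(frame) - a)


# Hypothetical frame and masses for a single host.
THETA = frozenset({"bot", "normal"})
BOT = frozenset({"bot"})
m = {BOT: 0.6, frozenset({"normal"}): 0.1, THETA: 0.3}

belief_range = (bel(m, BOT), pl(m, BOT, THETA))
```

With these assumed masses, the mass 0.3 left on Θ is the uncommitted part: it does not count toward Bel({bot}) but does count toward Pl({bot}), so the belief range for "bot" runs from 0.6 to 0.9.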

Dempster’s rule of combination can be used to combine the mass values of all features from each individual sensor into overall summary mass values for that sensor. These summary values from all sensors are then combined to give the summary mass values for the system. Initially, the bpas assign mass values to the appropriate hypotheses. The resulting mass values are then used to calculate the belief in each hypothesis. Finally, all beliefs are combined with Dempster’s rule of combination to obtain the overall belief in the appropriate hypothesis, as shown in (5).

$$m_{12}(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{\sum_{B \cap C \neq \emptyset} m_1(B)\, m_2(C)} \tag{5}$$

Extended D-S Theory

Dempster’s rule of combination gives equal trust to the evidence provided by different sensors, as shown in (5). In practice this is not the case. Observations show that, for the same type of sensor, local sensors provide more credible information than remote ones; the same sensor installed at different locations in the network has different detection capacity; and different types of sensors may differ in detection capability and accuracy for the same type of attack. The evidence provided therefore differs greatly in importance and reliability. To solve these problems, the extended D-S theory assigns different weights to different sensors, which means that different sensors are given different degrees of trust. As shown in (6), a weighted index method is used, and the evidence after combination should still satisfy the basic property of a bpa, namely (2).

$$m_{12}(A) = \frac{\sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \tag{6}$$
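Rules (5) and (6) differ only in the exponents, so one function can sketch both. The code below is an illustrative reading of the two formulas, not the authors' implementation; the function name `combine`, the two-element frame, and all mass and weight values are assumptions.

```python
def combine(m1, m2, w1=1.0, w2=1.0):
    """Combine two mass functions given as {frozenset: mass} dicts.

    With w1 = w2 = 1 this is Dempster's rule (5); other weights give the
    weighted rule (6), where each mass is raised to its sensor's weight.
    """
    fused = {}
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b & c
            if not a:      # B and C do not intersect: pair is excluded
                continue
            fused[a] = fused.get(a, 0.0) + (mass_b ** w1) * (mass_c ** w2)
    total = sum(fused.values())   # denominator of (5) and (6)
    if total == 0.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {a: value / total for a, value in fused.items()}


# Hypothetical evidence from two sensors about one host.
THETA = frozenset({"bot", "normal"})
BOT = frozenset({"bot"})
m1 = {BOT: 0.6, THETA: 0.4}
m2 = {BOT: 0.7, THETA: 0.3}

classic = combine(m1, m2)                    # rule (5)
weighted = combine(m1, m2, w1=1.0, w2=0.5)   # rule (6), assumed weights
```

Because the result is divided by the sum over all non-conflicting pairs, the fused masses always sum to 1 regardless of the weights, which is the property the proof that follows establishes.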

In (6), m1 and m2 are mass functions over Θ, and $m_{12}(\emptyset) = 0$. We just need to prove $\sum_{A \subset \Theta} m_{12}(A) = 1$, which is shown in (7). So m12 is also a mass function, and the evidence after combination by rule (6) does satisfy the basic property of a bpa.

$$
\begin{aligned}
\sum_{A \subset \Theta} m_{12}(A)
&= m_{12}(\emptyset) + \sum_{A \subset \Theta,\, A \neq \emptyset} m_{12}(A) \\
&= \sum_{A \subset \Theta,\, A \neq \emptyset} \frac{\sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \\
&= \frac{\sum_{A \subset \Theta,\, A \neq \emptyset} \sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \\
&= \frac{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} = 1
\end{aligned}
\tag{7}
$$

In the extended D-S theory, the weights can be obtained from training samples based on maximum entropy or minimum mean square error, or set to empirical values obtained from several tests.

Weight Assignment

In addition to the condition of the sensors, our research shows that weight assignment should take the stages of botnet activity into account. Features extracted from the communication and attack stages are more credible than those from the propagation and infection stages. Other factors, such as vulnerability, might also affect the weight assignment.

V. SCENARIO

Figure 3 shows a typical environment for botnet attacks. The environment contains the local area network and the backbone network, involving three application servers (EMAIL, WWW, and DNS), one management server, three firewalls, an attacker, a honeynet, several zombies, etc. The attacker sends commands to the zombies through the command and control channel. According to the commands, the bots on the zombies carry out actions such as propagation, information theft, DDoS attacks, and spam. The thin one-way arrows in Figure 3 show the process of command and control communication. In this environment, BotHunters are deployed for two local area networks; a spam filter is used on the EMAIL server; servers and hosts are equipped with terminal monitors and log analysis tools; network traffic monitors collect flow and traffic information; and vulnerability scanning systems collect vulnerability, configuration, and port information for servers and hosts. All the sensors send their information through the collection agents to the management server for fusion analysis. The management server then gives the final results to the administrator. The thick one-way arrows show this process.

Figure 3. Illustration of botnet detection.

To show how the botnet detection architecture works, Figure 3 includes an example of sending spam. The attacker discovers the zombies online and commands the bots on the zombies to send spam to victim hosts A and B. On the one hand, the BotHunters can detect the bots in the network by monitoring the traffic. On the other hand, the other sensors can capture every stage of the spam botnet's activities. The log analysis tools on the DNS server can discover suspicious hosts, because spam bots usually perform DNSBL lookups on the DNS server to determine whether they are blacklisted [37]. The terminal and traffic monitors can retrieve direct evidence of anomalies from the communications. In short, all the sensors send suspicious information to the management server. The management server then carries out pre-processing, feature extraction, and fusion analysis to integrate and analyze the received information, so that the administrator can obtain full evidence of the interactions between the attacker and the zombies within a short time. This fusion process can also identify the zombies [38] as well as the position of the attacker [39]. If the administrator only monitored the traffic or only checked email records to identify the spam activities, it might take more time and cause more false-positive alarms. In theory, the results of fusion analysis are more accurate than those of the BotHunters alone.
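A small worked calculation illustrates why fused evidence can be stronger than any single sensor's. The numbers are purely hypothetical and are not measurements from this scenario. Suppose Θ = {bot, normal}, a BotHunter assigns m1({bot}) = 0.6 and m1(Θ) = 0.4 to a host, and the DNS log analysis assigns m2({bot}) = 0.7 and m2(Θ) = 0.3. No pair of focal elements has an empty intersection, so the denominator of (5) equals 1 and

$$
\begin{aligned}
m_{12}(\{\text{bot}\}) &= 0.6 \cdot 0.7 + 0.6 \cdot 0.3 + 0.4 \cdot 0.7 = 0.88, \\
m_{12}(\Theta) &= 0.4 \cdot 0.3 = 0.12 .
\end{aligned}
$$

Under these assumed values, the fused mass on "bot" (0.88) exceeds what either sensor reports alone (0.6 and 0.7), and the uncommitted mass shrinks from 0.4 and 0.3 to 0.12.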


VI. CONCLUSIONS 

In this paper, we have introduced a botnet detection 

architecture based on heterogeneous multi-sensor 

information fusion. Also, we described functionalities 

and features of each component in the architecture, 

highlighting the module of feature extraction. In addition, 

we introduced an extended algorithm of D-S theory and 

gave its proof. Finally, we elaborated a representative 

case of detecting spam botnet to demonstrate the 

feasibility of our architecture. 

In future work, we will implement the prototype, deploy it in a real network, and evaluate the accuracy of the fusion algorithm against existing detection methods. Our ultimate goal is to develop a practical botnet detection system that follows the architecture proposed in this paper, integrates multiple information fusion techniques, and eventually provides identification, evaluation, and prediction for botnets.

ACKNOWLEDGMENT 

The authors would like to thank Tao Li for his helpful 

comments for improving this paper. This material is 

based upon work supported in part by the National 

Natural Science Foundation of China under Grant 

No.61070200 and No.61003303, the National Science 

and Technology Support Program of China under Grant 

No.2008BAH37B03, the National High-Tech Research 

and Development Plan of China under Grant 

No.2009AA01Z432, and the National Grand 

Fundamental Research 973 Program of China under 

Grant No.2009CB320503. 

REFERENCES 

[1] K. Singh, A. Srivastava, J. Giffin, and W. Lee, “Evaluating 

Email’s Feasibility for Botnet Command and Control,” 

Proc. 38th Annual IEEE/IFIP International Conference on 

Dependable Systems and Networks (DSN 2008), USA, 

2008, pp. 376-385. 

[2] K. Bohn, “Teen questioned in computer hacking probe,” CNN [Online], 2004, Available: http://www.cnn.com/2007/TECH/11/29/fbi.botnets/index.html.

[3] J. Davis, “Hackers take down the most wired country in Europe,” WIRED MAGAZINE: ISSUE 15.09 [Online], 2007, Available: http://www.wired.com/politics/security/magazine/15-09/ff_estonia.

[4] T. Holz, M. Steiner, and F. Dahl, “Measurements and mitigation of peer-to-peer-based botnets: A case study on storm worm,” Proc. 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET’08), 2008.

[5] H. Wang and Z. Gong, “Collaboration-based botnet 

detection architecture,” Proc. 2nd International Conference 

on Intelligent Computation Technology and Automation, 

Zhangjiajie, China, 2009. 

[6] Zhaosheng Zhu, Guohan Lu, and Yan Chen, “Botnet 

Research Survey”, Proc. 32nd Annual IEEE International 

Computer Software and Applications Conference, Finland, 

2008. 

[7] J. Govil, “Examining the criminology of bot zoo,” Proc. 

6th International Conference on Information, 

Communications and Signal Processing, Singapore, 2007. 


[8] J. Govil, “Criminology of botnets and their detection and 

defense methods,” Proc. 2007 IEEE International 

Conference on Electro/Information Technology (EIT’07), 

2007. 

[9] D. Geer, “Malicious bots threaten network security,” IEEE 

Computer, vol. 38, no. 1, pp. 18-20, 2005. 

[10] M. Rajab, J. Zarfoss, and F. Monrose, “A multi-faceted 

approach to understanding the Botnet phenomenon,” Proc. 

ACM SIGCOMM/USENIX Internet Measurement 

Conference (IMC’06), Brazil, 2006. 

[11] G. Giorgio, R. Fabio, and S. Carlo, “Information fusion in 

computer security,” Information Fusion, vol. 10, no. 4, pp. 

272-273, 2009. 

[12] J. Zhuge, X. Han, Y. Zhou, Z. Ye, and W. Zou, “Research 

and Development of Botnets,” Journal of Software, vol. 19, 

no. 3, pp. 702-715, 2008. 

[13] J. Zhuge, X. Han, Z. Ye, and W. Zou, “Discover and track 

botnets,” Proc. Chinese Symposium on Network and 

Information Security (NetSec), Beijing, 2005. 

[14] J. Cheng, J. Yin, Y. Liu, and J. Zhong, “Advances in the 

Honeypot and Honeynet Technologies,” Journal of 

Computer Research and Development, vol. 45, no. 1, pp. 

375-378, 2008. 

[15] G. Gu, P. Porras, V. Yegneswaran, M. Fong, and W. Lee, “BotHunter: Detecting malware infection through IDS-driven dialog correlation,” Proc. 16th USENIX Security Symposium (Security’07), 2007.

[16] J. R. Binkley and S. Singh, “An algorithm for anomaly-based Botnet detection,” Proc. USENIX SRUTI’06, 2006, pp. 43–48.

[17] J. Lee, H. Jeong, J. Park, M. Kim, and B. Noh, “The 

activity analysis of malicious http-based botnets using 

degree of periodic repeatability,” Proc. 2008 International 

Conference on Security Technology, 2008, pp. 83-86. 

[18] H. Choi, H. Lee, and H. Lee, “Botnet detection by 

monitoring group activities in DNS traffic,” Proc. 7th IEEE 

International Conference on Computer and Information 

Technology, Aizu-Wakamatsu City, Japan, 2007. 

[19] S. Matthew and I. Igor, Detection of peer-to-peer botnets, 

Masters Thesis, University of Amsterdam, 2008. 

[20] F. Freiling, T. Holz, and G. Wicherski, “Botnet Tracking: Exploring a Root-cause Methodology to Prevent Denial of Service Attacks,” Proc. 10th European Symposium on Research in Computer Security (ESORICS’05), 2005.

[21] Z. Duan, P. Chen, F. Sanchez, Y. Dong, M. Stephenson, and J. Barker, “Detecting Spam Zombies by Monitoring Outgoing Messages,” Proc. IEEE INFOCOM’09 Conference, Brazil, 2009.

[22] E. Robert, C. Adele, and B. Pranab, “A Multi-Layered 

Approach to Botnet Detection,” Proc. 2008 International 

Conference on Security and Management (SAM’08), 

USA, 2008. 

[23] N. Paxton, G.J. Ahn, and B. Chu, “Towards practical 

framework for collecting and analyzing network-centric 

attacks,” Proc. IEEE International Conference on 

Information Reuse and Integration, USA, 2007. 

[24] Z. Zhang and Y. Kadobayashi, “A holistic perspective on 

understanding and breaking botnets: Challenges and 

countermeasures,” Journal of the National Institute of 

Information and Communications Technology, vol. 55, no. 

2, pp. 43-59, 2008. 

[25] S. Alireza, F. Maryam, and A. Rodina, “Architecture for 

applying data mining and visualization on network flow for 

botnet traffic detection,” Proc. 2009 International 

Conference on Computer Technology and Development, 

Cairo, Egypt, 2009.


[26] G. Gu, J. Zhang, and W. Lee, “BotSniffer: Detecting 

Botnet command and control channels in network traffic,” 

Proc. 15th Annual Network and Distributed System 

Security Symposium (NDSS’08), USA, 2008. 

[27] G. Gu, J. Zhang, and R. Perdisci, “Botminer: Clustering 

analysis of network traffic for protocol- and structureindependent 

Botnet detection,” Proc. 17th USENIX 

Security Symposium (Security’08), USA, 2008. 

[28] G. Gu, Correlation-based Botnet Detection in Enterprise 

Networks, PhD Thesis, Georgia Institute of Technology, 

USA, 2008. 

[29] B.V. Dasarathy, “A special issue on information fusion in 

computer security,” Information Fusion, vol. 10, no. 4, pp. 

271, 2009. 

[30] Y. Niu, Q. Zheng, and H. Peng, “Network security 

management based on data fusion technology,” Proc. 7th 

International Conference on Computer-Aided Industrial 

Design and Conceptual Design, China, 2006. 

[31] B.V. Dasarathy, “Decision Fusion,” IEEE Computer 

Socienty Press, 1994. 

[32] H. Wang and Z. Gong, “Role-based collaborative 

information collection model for botnet detection,” Proc. 

2010 International Symposium on Collaborative 

Technologies and Systems (CTS 2010), Chicago, USA, 

2010. 

[33] A.P. Dempster, “Upper and lower probabilities induced by 

a multivalued mapping,” Ann. Math. Statist., 1967, pp. 

325-339. 

[34] G. Shafer, A Mathematical Theory of Evidence, Princeton 

University Press, Princeton and London, 1976. 

[35] Qi Chen, Uwe Aickelin, “Dempster-Shafer for Anomaly 

Detection,” Proc. the International Conference on Data 

Mining (DMIN 2006), Las Vegas, USA, 2006, pp. 232- 

238. 

[36] L. Ma, L. Yang, and J. Wang, “Research on Security 

Information Fusion from Multiple Heterogeneous 

Sensors,” Journal of System Simulation, vol. 20, no. 4, pp. 

181-187, 2008. 

[37] A. Ramachandran, N. Feamster, and D. Dagon, “Revealing 

Botnet membership using DNSBL counterintelligence,” 

Proc. USENIX SRUTI’06, 2006. 

[38] S. Kondo and N. Sato, “Botnet traffic detection techniques 

by C&C session classification using SVM,” Proc. 2nd 

International Workshop on Security, Japan, 2007. 

[39] M. Rajab, J. Zarfoss, and F. Monrose, “My botnet is bigger 

than yours (maybe, better than yours): Why size estimates 

remain challenging,” Proc. 1st Workshop on Hot Topics in 

Understanding Botnets (HotBots 2007), Boston, USA, 


2007.J. Clerk Maxwell, A Treatise on Electricity and 

Magnetism, 3 rd ed., vol. 2. Oxford: Clarendon, 1892, 

pp.68–73. 

HaiLong Wang is from Jilin Province, China, and was born in May 1981. He received the B.E. degree in electronic engineering from the Department of Electronic Engineering, Naval University of Engineering, Wuhan, China, in 2004. His research interests include network and information security, distributed computing, and computer network architecture.

He is currently working toward the Ph.D. degree at the School of Computer, National University of Defense Technology, Changsha, China.

Jie Hou is from Hebei Province, China, and was born in July 1983. She graduated in communication engineering from the Chinese People’s Armed Police Force Institute of Engineering, Xi’an, China, in 2005. Her research interests include next-generation computer network architecture and network and information security.

She is currently working toward the Ph.D. degree at the School of Computer, National University of Defense Technology, Changsha, China.

ZhengHu Gong is from Hunan Province, China, and was born in August 1945. He graduated in electronic engineering from Tsinghua University, Beijing, China, in 1970. His research interests include computer networks and communication, network security, and computer network architecture.

He is currently a Professor with the School of Computer, National University of Defense Technology, Changsha, China.


XOEM plus OWL-based STEP Product 

Information Uniform Description and 

Implementation 

Chengfeng Jian 

Zhejiang University of Technology, Hangzhou, 310023, China 

Email: jiancf@zjut.edu.cn 

Haizhong Meng 

Zhejiang University of Technology, Hangzhou, 310023, China 

Email: mhz_1986@126.com 

Abstract—Aimed at the current inconsistencies in OWL-based STEP descriptions, this paper establishes mapping rules between EXPRESS and OWL on the basis of a uniform semantic model named XOEM+OWL. An implementation method for a STEP-OWL converter is then put forward, and corresponding examples are shown.

Index Terms—OWL, STEP, XOEM, EXPRESS 


I. INTRODUCTION

With the development of the semantic web and the semantic grid, knowledge sharing and exchange of product information over the Internet has become a main research focus. Many studies currently try to realize a semantic description of STEP [1] by means of semantic web languages such as RDF, DAML, and OWL [2], among others [3-5]. Their approach resembles the XML data representation of STEP: they try to use RDF or OWL in place of the EXPRESS description. The limitation of this approach is that, unlike data format conversion, a semantic description in OWL can be written in many ways. For the same kind of product information, OWL allows many different descriptions of the internal semantics, and even within a single OWL sublanguage it is difficult to standardize the description so that the semantics are understood consistently. It is therefore difficult to achieve a unified semantic description of product information through OWL description alone, which is why these studies have had difficulty progressing further. Overall, although STEP has been expressed in OWL, this has mainly been done by syntactic matching between EXPRESS and OWL. The resulting mappings between ontology definitions and descriptions lack consistency, and there is no unified model with appropriate constraints.

project number: 60603087 


doi:10.4304/jnw.6.12.1662-1667 

In this paper, aiming at OWL-based STEP semantic description, we establish mapping rules between EXPRESS and OWL on the basis of a uniform semantic model named XOEM+OWL. We then put forward an implementation method for a STEP-OWL converter and show corresponding examples.

II. XOEM+OWL-BASED SCHEMA MAPPING 

XOEM [6] is the data model of the XML-based STEP representation. A direct mapping between XOEM and OWL is difficult to realize because OWL belongs to the semantic layer while XOEM belongs to the data layer. XOEM is strong at describing data objects but weak at reasoning about constraints. It is therefore necessary to build a model that extends XOEM with the schema graph introduced from OWL. This model is called XOEM+OWL [7]. From an object-oriented (OO) point of view, Table I shows the comparison.

The XOEM+OWL model is based on the XOEM model. Following XOEM, we obtain the definitions below:

Object := Atomic Object | Complex Object

Atomic Object := (oid, label_name, attribute_type, attribute_value)

Complex Object := (oid, label_name, Reference)

Reference := (link, oid, label_name)
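The object definitions above can be read as a small recursive data structure. The following Java sketch is one possible rendering, assuming string-typed fields; the class names (`XoemObject`, `AtomicObject`, `ComplexObject`, `Reference`) and the field types are illustrative choices and are not taken from XOEM [6].

```java
import java.util.Objects;

/** Object := Atomic Object | Complex Object (class names are illustrative). */
abstract class XoemObject {
    final String oid;
    final String labelName;

    XoemObject(String oid, String labelName) {
        this.oid = Objects.requireNonNull(oid);
        this.labelName = Objects.requireNonNull(labelName);
    }

    abstract boolean isAtomic();
}

/** Atomic Object := (oid, label_name, attribute_type, attribute_value) */
final class AtomicObject extends XoemObject {
    final String attributeType;
    final String attributeValue;

    AtomicObject(String oid, String labelName, String attributeType, String attributeValue) {
        super(oid, labelName);
        this.attributeType = attributeType;
        this.attributeValue = attributeValue;
    }

    @Override
    boolean isAtomic() { return true; }
}

/** Reference := (link, oid, label_name); oid and labelName identify the target object. */
final class Reference {
    final String link;
    final String oid;
    final String labelName;

    Reference(String link, String oid, String labelName) {
        this.link = link;
        this.oid = oid;
        this.labelName = labelName;
    }
}

/** Complex Object := (oid, label_name, Reference) */
final class ComplexObject extends XoemObject {
    final Reference reference;

    ComplexObject(String oid, String labelName, Reference reference) {
        super(oid, labelName);
        this.reference = Objects.requireNonNull(reference);
    }

    @Override
    boolean isAtomic() { return false; }
}
```

In this reading, a complex object such as an entity points at its attribute objects through `Reference` values, which is what allows the model to be viewed as a directed graph.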


TABLE I.
EXPRESS-XML-OWL

OO               EXPRESS                        XML           OWL
Object           Entity                         Element type  Class
Object instance  Entity instance                Element       Individual
Object property  Entity attribute               Element       ObjectProperty/DatatypeProperty
Method           ENTITY Function                Element       Class
Declaration      Entity Schema                  DTD           Ontologies
Relationship     Complex constraint expression  Hierarchy     Complex constraint expression

Definition1.
Given a directed graph $G = (V, E)$.
Assumption: $v_0, v_1, \ldots, v_i, \ldots, v_n \in V$ and $e_1, e_2, \ldots, e_n \in E$.
There exists $r$ with $d > 0$, $r \in V$, and $v_i \in (V - \{r\})$, $i = 0, 1, \ldots, n$.

Definition2.
Given a directed graph $G(V, E, r)$.
There exists $G(V^i, E^i)$, $v_i \in V$, with
$V^i = \{ v_j \mid v_j \in V \wedge v_k, v_j \}$;
$E^i = \{ \langle v_j, v_k \rangle \mid v_j \in V^i \wedge v_k \in V^i \wedge \langle v_j, v_k \rangle \in E \}$.

Rule1.
For an XOEM+OWL object, a Node of the directed graph represents the Object. It is mapped into a Class of OWL.

Rule2.
For a property of an XOEM+OWL object, an Edge of the directed graph represents the Property. It corresponds either to a property of the Class or to the “hasSubClass” relation among classes in OWL.

Convention1: If a concept or data type of EXPRESS can be expressed directly in OWL, it is expressed with the OWL keyword by preference, so that reasoning tools can ensure accuracy.

Convention2: If a concept or data type of EXPRESS cannot be expressed directly in OWL but can be expressed for the same purpose by combining relevant OWL concepts while keeping the semantics accurate, the combination approach is preferred.

Convention3: If the above rules cannot be applied or are difficult to apply, the original (literal) translation is used.

According to the above description, Table II shows the corresponding schema graph relations.

TABLE II.
XOEM+OWL-BASED SCHEMA GRAPH DESCRIPTION

OWL                                                 Schema Graph
Class                                               Node
Property with basic datatypes as range (Attribute)  Node with edge joining it to the class with name “hasProperty”
Property with other class as range (Attribute)      Edge between the two class nodes
Individual                                          Node with edge joining it to the class with name “hasInstance”
Class-subclass relationship                         Edge between class node and subclass node with name “hasSubClass”

III. MAPPING RELATIONSHIP BETWEEN OWL AND EXPRESS EXPRESSION

A. SCHEMA definition

A SCHEMA is defined as a collection of STEP ENTITY and type definitions, which can refer to each other for the purpose of type reuse. The definition of a SCHEMA corresponds to the Ontology in OWL, as part #1 of Figure 1 shows.

B. Basic data type definition

The EBNF expression of the basic data types is shown in Figure 2. OWL uses the built-in data types of XML Schema, so the mapping is as follows:

For simple data types, map directly into the xsd data types of XML Schema.

For constructed data types, map into owl:oneOf.

For aggregate data types, map into an owl:Class named aggregate with the attributes lower boundary, upper boundary, repetitiveness, whether ordered, and storage type.

TypeDecl ::= ''
TYPE_HEAD ::= TYPE_ID+ TYPE_ID::MarkupDecl*
TYPE_BODY ::= TYPE_DECLARATION+ S MarkupDecl*
TYPE_DECLARATION ::= ''
BASE_TYPE ::= SimpleTypes | ConstructedTypes | AggregationTypes | TypeRef
WHERE CLAUSE ::= 'WHERE' | RuleDecl

Figure2. The EBNF expression of basic data type in EXPRESS
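The mapping cases for simple and constructed data types can be sketched as a small lookup. The Java below is a minimal illustration under stated assumptions: the paper does not list the individual EXPRESS-to-xsd pairs, so the table entries, the class name `BasicTypeMapper`, and the fallback to `xsd:string` are all hypothetical choices.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class BasicTypeMapper {
    // Assumed EXPRESS simple type -> xsd datatype pairs (not taken from the paper).
    private static final Map<String, String> SIMPLE;
    static {
        Map<String, String> m = new HashMap<String, String>();
        m.put("INTEGER", "xsd:integer");
        m.put("REAL", "xsd:double");
        m.put("NUMBER", "xsd:decimal");
        m.put("STRING", "xsd:string");
        m.put("BOOLEAN", "xsd:boolean");
        m.put("BINARY", "xsd:hexBinary");
        SIMPLE = Collections.unmodifiableMap(m);
    }

    /** Simple data types map directly to an xsd datatype. */
    static String mapSimple(String expressType) {
        String xsd = SIMPLE.get(expressType.toUpperCase(Locale.ROOT));
        return xsd != null ? xsd : "xsd:string"; // assumed fallback
    }

    /** An enumeration (a constructed type) becomes a class closed by owl:oneOf. */
    static String mapEnumeration(String typeName, List<String> items) {
        StringBuilder sb = new StringBuilder();
        sb.append("<owl:Class rdf:ID=\"").append(typeName).append("\">\n");
        sb.append("  <owl:oneOf rdf:parseType=\"Collection\">\n");
        for (String item : items) {
            sb.append("    <owl:Thing rdf:about=\"#").append(item).append("\"/>\n");
        }
        sb.append("  </owl:oneOf>\n");
        sb.append("</owl:Class>\n");
        return sb.toString();
    }
}
```

Keeping the simple-type table in one place means that a different choice of xsd types only changes the map, not the converter logic.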

© 2011 ACADEMY PUBLISHER


SCHEMA config_control_design;      #1
ENTITY action;                     #2
  name: label;                     #3
  description: OPTIONAL text;      #4
  …
DERIVE
  scl: REAL := NVL(scale, 1);      #5
;

Figure1. Mapping Example between EXPRESS and OWL

C. Entity definition 

ENTITY is an important concept in EXPRESS, so the mapping of entities is the most important one. The EXPRESS definition of an entity is shown in Figure 3 (the EBNF expression of entity in EXPRESS). The concept of class in OWL is equivalent to that of entity, so in this paper we map each entity to a class. In OWL, however, attributes are defined separately from classes, while in EXPRESS they are defined together with the entity. To resolve property name conflicts, we prefix each attribute name with the entity name.

①Entity name

An entity name is mapped into an owl:Class; see Figure1 #2.

②Entity inheritance

We adopt rdfs:subClassOf to represent ‘SUPERTYPE OF’ and ‘SUBTYPE OF’.

③Simple Attribute

We call an attribute a Simple Attribute if its type is a simple data type or another entity. If the type is a simple data type, the attribute is mapped into an owl:DatatypeProperty; otherwise it is mapped into an owl:ObjectProperty, as in Figure1 #3.

④Aggregate Attribute

For an Aggregate Attribute, first define the attribute class as a subclass of the aggregate class introduced for aggregate data types in Section III-B, then set the attribute’s lower boundary, upper boundary, repetitiveness, order, and storage type.

⑤OPTIONAL Attribute

An optional attribute is not only mapped into an owl:DatatypeProperty or owl:ObjectProperty, but is also given an attribute cardinality constraint (maxCardinality). It is shown in Figure1 #4.

⑥DERIVE Attribute

A DERIVE attribute is not only mapped into an owl:DatatypeProperty or owl:ObjectProperty, but is also given an attribute constraint (allValuesFrom). It is shown in Figure1 #5.

EntityDecl ::= ''
ENTITY_HEAD ::= ENTITY_ID S INHERITANCE?
ENTITY_BODY ::= ENTITY_DECLARATION+ S MarkupDecl*
ENTITY_DECLARATION ::= ENTITY_AttrDecl* S ENTITY_ClauseDecl?
ENTITY_ClauseDecl ::= INVERSE_CLAUSE | UNIQUE_CLAUSE | WHERE_CLAUSE
WHERE_CLAUSE ::= 'WHERE' | RuleDecl
UNIQUE_CLAUSE ::= 'UNIQUE' S Unique_Rule+
ENTITY_AttrDecl ::= Explicit_AttrDecl | Derive_AttrDecl | Inverse_AttrDecl

Figure3. The EBNF expression of entity in EXPRESS
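To make the entity mapping concrete, the following Java sketch emits OWL (RDF/XML) fragments for an entity name, a simple attribute, and an OPTIONAL attribute. It is an illustration under assumptions rather than the converter's actual output: the class name `EntityToOwl`, the underscore used to join the entity name and the attribute name, and the cardinality value 1 are hypothetical.

```java
final class EntityToOwl {
    private static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    /** Property ID = entity name + "_" + attribute name (the separator is an assumption). */
    static String propertyId(String entity, String attribute) {
        return entity + "_" + attribute;
    }

    /** Entity name -> owl:Class; a supertype, if any, becomes rdfs:subClassOf. */
    static String owlClass(String entity, String superEntity) {
        if (superEntity == null) {
            return "<owl:Class rdf:ID=\"" + entity + "\"/>\n";
        }
        return "<owl:Class rdf:ID=\"" + entity + "\">\n"
             + "  <rdfs:subClassOf rdf:resource=\"#" + superEntity + "\"/>\n"
             + "</owl:Class>\n";
    }

    /** Simple attribute whose type is a simple data type -> owl:DatatypeProperty. */
    static String datatypeProperty(String entity, String attribute, String xsdType) {
        return "<owl:DatatypeProperty rdf:ID=\"" + propertyId(entity, attribute) + "\">\n"
             + "  <rdfs:domain rdf:resource=\"#" + entity + "\"/>\n"
             + "  <rdfs:range rdf:resource=\"" + XSD + xsdType + "\"/>\n"
             + "</owl:DatatypeProperty>\n";
    }

    /** OPTIONAL attribute -> extra maxCardinality restriction (value 1 is assumed). */
    static String optionalRestriction(String entity, String attribute) {
        return "<owl:Class rdf:about=\"#" + entity + "\">\n"
             + "  <rdfs:subClassOf>\n"
             + "    <owl:Restriction>\n"
             + "      <owl:onProperty rdf:resource=\"#" + propertyId(entity, attribute) + "\"/>\n"
             + "      <owl:maxCardinality rdf:datatype=\"" + XSD
             + "nonNegativeInteger\">1</owl:maxCardinality>\n"
             + "    </owl:Restriction>\n"
             + "  </rdfs:subClassOf>\n"
             + "</owl:Class>\n";
    }
}
```

For the `action` entity of Figure 1, `owlClass("action", null)` would correspond to marker #2, `datatypeProperty("action", "name", "string")` to #3 (assuming `label` resolves to a string), and `optionalRestriction("action", "description")` to the constraint at #4.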

D. Function and rule definition 

Functions and rules contain a wealth of mathematical operations and constraint mechanisms on objects, but OWL's ability to express these is limited, so we adopt a literal translation with SWRL, following Convention 3.

Besides the above, EXPRESS has many other concepts, whose mapping methods are similar.

Ⅳ. DESIGN AND IMPLEMENTATION OF STEP-OWL 

CONVERTER 

A. Conversion of EXPRESS-OWL 

The method of mapping EXPRESS to an OWL file has been described in detail in the preceding sections, so the most important task in implementing the EXPRESS-OWL file conversion is lexical analysis. We adopt a two-step conversion consisting of a pre-converter and a post-converter.

① Pre-converter

The pre-converter parses the EXPRESS file into Java classes (Figure 4) according to a predefined EXPRESS keyword vocabulary file (Figure 5).


ENTITY 

entityName : String 

superEntity : List 

subEntity : List 

SimpleAttribute : Map 

deriveAttribute : Map 

condition : Map 

Figure4. ENTITY class diagram 

public interface Vocabulary
{
    // EXPRESS reserved words recognized by the pre-converter
    public final static String ABSTRACT = "ABSTRACT";
    public final static String AGGREGATE = "AGGREGATE";
    public final static String ALIAS = "ALIAS";
    public final static String AS = "AS";
    public final static String BAG = "BAG";
    public final static String BEGIN = "BEGIN";
    public final static String BINARY = "BINARY";
    public final static String BOOLEAN = "BOOLEAN";
    public final static String CASE = "CASE";
    // …
}

Figure5. Part of EXPRESS keyword vocabulary

The conversion method is roughly the same for all constructs, so we take entity conversion as an example. From the EBNF description of entities and the characteristics of EXPRESS entity definitions in Figure 3, the keywords involved in an EXPRESS entity definition are ENTITY, END_ENTITY, DERIVE, INVERSE, WHERE, SUPERTYPE OF, and SUBTYPE OF. The pre-converter's processing flow is shown in Figure 6.

Figure6. Pre-converter’s physical process (flowchart: init; read a line of the EXPRESS file; if the line has the keyword ENTITY, save the information and continue reading lines until the keyword END_ENTITY appears; then split the string at ';' and write it to the Java class; repeat until the end of the document)
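The flow of Figure 6 can be sketched in a few lines of Java. This is a simplified illustration, not the authors' implementation: the class name `PreConverter`, the reduced `Entity` holder (a stand-in for the ENTITY class of Figure 4 with only a name and a list of raw statements), and the assumption that ENTITY and END_ENTITY each start a line are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PreConverter {
    /** Reduced stand-in for the ENTITY class of Figure 4. */
    static final class Entity {
        final String entityName;
        final List<String> statements = new ArrayList<String>();

        Entity(String entityName) { this.entityName = entityName; }
    }

    /** Reads the lines, collects text between ENTITY and END_ENTITY, splits it at ';'. */
    static List<Entity> parse(List<String> lines) {
        List<Entity> entities = new ArrayList<Entity>();
        StringBuilder buffer = null; // non-null while inside an entity definition
        for (String raw : lines) {
            String line = raw.trim();
            String upper = line.toUpperCase(Locale.ROOT);
            if (buffer == null) {
                if (upper.startsWith("ENTITY ")) {
                    buffer = new StringBuilder(line.substring("ENTITY".length()));
                }
                continue; // lines outside entity definitions are skipped
            }
            if (upper.startsWith("END_ENTITY")) {
                entities.add(toEntity(buffer.toString()));
                buffer = null;
            } else {
                buffer.append(' ').append(line);
            }
        }
        return entities;
    }

    private static Entity toEntity(String body) {
        String[] parts = body.split(";");
        String head = parts[0].trim();
        int space = head.indexOf(' ');
        Entity entity = new Entity(space < 0 ? head : head.substring(0, space));
        if (space >= 0) {
            // Header clauses such as SUBTYPE OF (...) are kept as the first statement.
            entity.statements.add(head.substring(space + 1).trim());
        }
        for (int i = 1; i < parts.length; i++) {
            String statement = parts[i].trim();
            if (!statement.isEmpty()) {
                entity.statements.add(statement);
            }
        }
        return entity;
    }
}
```

A fuller pre-converter would go on to sort each statement into the attribute maps of Figure 4 by checking it against keywords such as DERIVE, INVERSE, and WHERE.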

② Post-converter 

The post-converter is a document writer that builds on the output of the pre-converter and generates OWL documents according to the mapping method described above.

Figure 7 shows part of the STEP-OWL converter's result for STEP AP203, as displayed in Protégé.

Figure7. AP203 converted entity relationship results in Protégé 

B. STEP Part21 file conversion 

A STEP Part21 [8] [9] file is divided into two parts: HEADER and DATA. The HEADER section describes the file name and the application protocol that the file references. The DATA section is composed of a number of data instances, each consisting of an ID, "=", and a function statement. Although every data instance has this single structure, the statements describe a wide variety of functions, so the focus of the conversion is a design that copes with the diversity of data instances.

①Lexical Analysis 

Read the STEP file from left to right, scan the character stream, and identify words according to the word formation rules. This step divides the words into data instance identifiers ("#" plus a number), variable values (integer, string, data value), and reserved words (the special characters and other reserved symbols of the Part21 physical file).

②Syntax Analysis

On the basis of the lexical analysis, syntax analysis combines the word sequence into grammatical phrases such as "program", "statement", and "expression". It checks whether the STEP file is structurally correct and analyzes the expression phrases hierarchically.

③Semantic Analysis

Semantic analysis translates the parsed syntax into its mapping, based on the lexical and syntax analysis. Using the keywords produced by syntax analysis, we look them up in the STEP Application Protocol library and insert the result into the output file at the appropriate location according to the conversion rules. Take the data instance #5 = AXIS2_PLACEMENT_3D ('NONE', #6, #7, #8); as an example.

Step 1. Divide the data instance into #5, =, AXIS2_PLACEMENT_3D, (, 'NONE', #6, #7, #8).

Step 2. Decompose the words from Step 1 hierarchically to get #5 -> AXIS2_PLACEMENT_3D -> 'NONE', #6, #7, #8.

Step 3. The semantic processor looks up AXIS2_PLACEMENT_3D in the AP203 definition and maps the values to it: 'NONE' is the value of the property ‘name’, which is inherited from representation_item; #6, #7, and #8 are the values of ‘location’ (inherited from placement), ‘axis’, and ‘ref_direction’.

Through the above three steps, the data instance #5 = AXIS2_PLACEMENT_3D ('NONE', #6, #7, #8); is converted to an OWL file by the STEP-OWL Converter, as shown in Figure 8.

Figure8. Data instance #5
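The three analysis steps can be illustrated on the same data instance with a short Java sketch. It is a simplified reading, not the converter's code: the class name `Part21Instance`, the token pattern, and the restriction to flat argument lists are assumptions, and the attribute names passed to `bind` stand in for the lookup in the AP203 definition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Part21Instance {
    // Instance ids (#n), quoted strings, keywords, plain numbers, and punctuation.
    private static final Pattern TOKEN = Pattern.compile(
        "#\\d+|'[^']*'|[A-Za-z_][A-Za-z0-9_]*|-?\\d+(?:\\.\\d*)?|[=(),;]");

    final String id;
    final String entityName;
    final List<String> arguments;

    private Part21Instance(String id, String entityName, List<String> arguments) {
        this.id = id;
        this.entityName = entityName;
        this.arguments = arguments;
    }

    /** Lexical analysis: split one data instance into words. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Syntax analysis: accepts  #id = NAME ( arg, arg, ... ) ;  with flat arguments only. */
    static Part21Instance parse(String line) {
        List<String> t = tokenize(line);
        int n = t.size();
        if (n < 6 || !t.get(0).startsWith("#") || !t.get(1).equals("=")
                || !t.get(3).equals("(") || !t.get(n - 2).equals(")")
                || !t.get(n - 1).equals(";")) {
            throw new IllegalArgumentException("not a simple data instance: " + line);
        }
        List<String> args = new ArrayList<String>();
        for (int i = 4; i < n - 2; i++) {
            if (!t.get(i).equals(",")) {
                args.add(t.get(i));
            }
        }
        return new Part21Instance(t.get(0), t.get(2), args);
    }

    /** Semantic analysis: pair each value with the attribute name given by the schema. */
    Map<String, String> bind(List<String> attributeNames) {
        if (attributeNames.size() != arguments.size()) {
            throw new IllegalArgumentException("attribute count does not match " + entityName);
        }
        Map<String, String> bound = new LinkedHashMap<String, String>();
        for (int i = 0; i < arguments.size(); i++) {
            bound.put(attributeNames.get(i), arguments.get(i));
        }
        return bound;
    }
}
```

Calling `parse` on the example and then `bind` with the names name, location, axis, and ref_direction reproduces the pairing described in Step 3; nested lists and typed parameters, which real Part21 files contain, are outside this sketch.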

C. Example 

Examples of the results that the STEP-OWL converter produces for a STEP Part21 file are shown in Figure 9 and Figure 10.


Figure9. AP203 in OWL format 

Figure10. STEP Part21 in OWL format 

V. CONCLUSION 

The XOEM+OWL-based STEP product information description realizes a semantic description while maintaining semantic consistency and effectiveness. Some issues still need further study, such as the semantic consistency of Entity Functions and Procedures expressed with SWRL.

This work was supported by the National Natural Science Foundation of China (No. 60603087) and the Project of the Science and Technology Department of Zhejiang Province (No. 2009C320076).

REFERENCES 

[1] ISO 10303-28, Industrial automation systems and integration, Product data representation and exchange, Part 28: Implementation methods: XML representations of EXPRESS schemas and data.

[2] OWL Web Ontology Language, http://www.w3.org/TR/owl-features/

[3] Pan, Wen-Lin, “A formal EXPRESS-to-OWL mapping algorithm,” Key Engineering Materials, vol. 419, pp. 689-692, 2010.

[4] Zhao, W., Liu, J.K., “OWL/SWRL representation methodology for EXPRESS-driven product information model. Part I. Implementation methodology,” Computers in Industry, vol. 59, pp. 580-589, August 2008.

[5] Ricardo Jardim-Goncalves, Nicolas Figay, Adolfo Steiger-Garcao, “Enabling interoperability of STEP Application Protocols at meta-data and knowledge level,” International Journal of Technology Management, vol. 36, pp. 402-421, April 2006.

[6] Jian Cheng-Feng, Tan Jian-Rong, “Description and Identification of STEP Product Data with XML,” Journal of Computer-Aided Design and Computer Graphics, vol. 13, pp. 983-990, November 2001.

[7] Jian Cheng-Feng, Zhang Mei-yu, “A Uniform Product Knowledge Representation Semantic Model,” 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong, pp. 953-956, December 2006.

[8] ISO 10303-11, Industrial automation systems and integration, Product data representation and exchange, Part 11: Description methods: The EXPRESS language reference manual.

[9] ISO 10303-21, Industrial automation systems and integration, Product data representation and exchange, Part 21: Clear text encoding of the exchange structure.


Chengfeng Jian is from Zhejiang Province, China, and was born in June 1973. He received the Ph.D. degree from Zhejiang University. His research interests include CAD/PDM and the Semantic Web/Semantic Grid.

He is an associate professor in the Department of Computer Science and Technology, Zhejiang University of Technology.

Haizhong Meng is from Zhejiang Province, China, and was born in July 1986. He received the B.A. degree from Zhejiang Sci-Tech University. His research interests include the Semantic Web and STEP.

He is currently a postgraduate student in the Department of Computer Science and Technology, Zhejiang University of Technology.


Design of Greenhouse Control System Based on 

Wireless Sensor Networks and AVR 

Microcontroller 

Yongxian Song 

The Institute of Electronic Engineering Huaihai Institute of Technology, Lianyungang , 222005,China 

Email: soyox@126.com 

Chenglong Gong, Yuan Feng, Juanli Ma and Xianjin Zhang 

The Institute of Electronic Engineering Huaihai Institute of Technology, Lianyungang, 222005, China 

Email: soyox@163.com 

Abstract—To accurately determine the growth conditions of greenhouse crops, a system based on an AVR single-chip microcontroller and wireless sensor networks is developed. It transfers data through wireless transceiver devices without any electric wiring, so the system structure is simple. The monitoring and management center can control the temperature and humidity of the greenhouse, measure the carbon dioxide content, and collect information such as illumination intensity. In addition, the system adopts multilevel energy storage and combines energy management with energy transfer, so that the energy collected by the solar cells is used reasonably and a self-managing energy supply system is established. The system has the advantages of low power consumption, low cost, good robustness, and flexible extensibility. It provides an effective tool for monitoring the greenhouse environment and for analysis and decision-making.

Index Terms—wireless sensor networks, AVR, greenhouse 


I. INTRODUCTION

A greenhouse is a facility that can change the environment in which plants grow, create the best conditions for plant growth, and shield plants from the influence of changing seasons and severe weather [4-5]. A greenhouse measurement and control system aims to increase crop yield, improve quality, regulate the growth period, and improve economic efficiency. It obtains the optimum conditions for crop growth by changing greenhouse environment factors such as temperature, humidity, light, and CO2 concentration while making full use of natural resources. Such a system is complex: it must provide automatic monitoring, information processing, real-time control, and online optimization for the various greenhouse parameters. In developed countries, greenhouse measurement and control systems have made considerable progress and have reached the level of comprehensive multi-factor control, but importing existing foreign systems is very expensive and their maintenance is inconvenient. In recent years, China has launched many studies on greenhouse structure and control and has made many achievements, but most greenhouse measurement and control systems are cable-based, so their wiring is complex and it is hard to improve system efficiency. With the rapid development of low-cost, low-power sensors and wireless communication technology, the conditions for constructing a wireless greenhouse measurement and control system have matured, and such a system is important for realizing agricultural modernization [1-3]. To meet the need for quick and accurate acquisition of greenhouse environment information, in this paper we further study the collection, processing, and transmission of greenhouse environment information, and we develop a greenhouse measurement and control system based on an AVR microcontroller and wireless sensor networks. The system has high practical value for realizing informatization and automation of large-scale greenhouse monitoring and for improving work efficiency.

Manuscript received March 5, 2011; revised March 25, 2011; accepted April 10, 2011.

doi:10.4304/jnw.6.12.1668-1674

II. THE GENERAL STRUCTURE OF THE SYSTEM 

The greenhouse measurement and control system is composed of the monitoring center, sensor nodes, and control equipment. Sensor nodes are deployed throughout the greenhouse; they are responsible for periodically acquiring greenhouse environment information and sending it to the monitoring center. The monitoring center analyzes the data obtained, makes the relevant decisions, and sends control messages to the greenhouse control equipment, which regulates the greenhouse environment parameters to obtain the best growth environment for the crops. A modern greenhouse is very large, so the system adopts a hierarchical structure. Supposing that the greenhouse is a rectangular area, the overall structure of the measurement system is shown in Fig.1.


Figure.1. The system structure of Greenhouse WSN measurement and control

In Fig.1, the rectangular greenhouse is divided into several measurement and control areas of equal size. Each measurement and control area is managed by a base station and is divided into many virtual grids that have the same size and do not overlap. A number of sensor nodes are deployed in each virtual grid and form a cluster; each cluster includes a cluster head (sink node) and some cluster member nodes. The cluster head is generated from the member nodes through a cluster head election algorithm. The cluster member nodes consist of sensor nodes, which collect environmental data, and control nodes, which control actuators and adjust environmental parameters. A control node does not participate in cluster head election; it obtains the commands sent by the monitoring center from the cluster head node and executes the corresponding control operations. The cluster head node, sensor nodes, and control nodes form a star network, which mainly completes data acquisition and control of the greenhouse environment. Collected data is transmitted directly from the sensor nodes to the cluster head; the cluster heads transfer the data to the base station over multiple hops; finally, the base station packages the data from each cluster head node and transfers it to the monitoring center. The base station is the relay station between the monitoring center and the greenhouse WSN nodes, and it realizes network control by managing all the nodes of a single greenhouse measurement and control area. The monitoring center is both the master console of the multi-greenhouse network and the data center of its measurement and control system, and it takes charge of the control and management of the entire system.
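The division of a rectangular measurement and control area into equal, non-overlapping virtual grids can be sketched as a simple coordinate-to-cell mapping. The Java below is only an illustration: the class name `VirtualGrid`, the use of metric coordinates with the origin at one corner of the area, and the row-major cell numbering are assumptions that the paper does not specify.

```java
final class VirtualGrid {
    private final double areaWidth;
    private final double areaHeight;
    private final int columns;
    private final int rows;

    VirtualGrid(double areaWidth, double areaHeight, int columns, int rows) {
        if (areaWidth <= 0 || areaHeight <= 0 || columns <= 0 || rows <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        this.areaWidth = areaWidth;
        this.areaHeight = areaHeight;
        this.columns = columns;
        this.rows = rows;
    }

    /** Row-major index of the virtual grid (cluster) that contains position (x, y). */
    int cellIndex(double x, double y) {
        if (x < 0 || x > areaWidth || y < 0 || y > areaHeight) {
            throw new IllegalArgumentException("position outside the area");
        }
        // Points on the far edges are assigned to the last column or row.
        int col = Math.min(columns - 1, (int) (x * columns / areaWidth));
        int row = Math.min(rows - 1, (int) (y * rows / areaHeight));
        return row * columns + col;
    }
}
```

With such a mapping, the location set on a node's position switch determines which cluster the node belongs to, so every node falls into exactly one virtual grid.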

III. GREENHOUSE WIRELESS SENSOR NETWORK NODE 

DESIGN 

The greenhouse wireless sensor network measurement and control system consists of two types of nodes, namely sensor nodes and sink nodes. A sensor node is composed of a CPU module, a wireless communication module, a power supply module, a sensor module, and a position switch for setting its physical location information. A sink node contains four modules: a CPU module, a wireless communication module, a continuous power supply module, and a serial communication module.

A. Sensor node module design 

A sensor node is composed of a CPU module, a wireless communication module, a sensor module, a position switch, and an energy supply module. Its structure is shown in Fig.2. The sensor module is responsible for information collection and data transfer in the monitored area; depending on the application requirements, it can use temperature, humidity, light, and carbon dioxide concentration sensors, among others. The processor module is responsible for controlling the operation of the sensor node and for storing and processing the data collected by the node itself and the data forwarded by other nodes. The wireless communication module is responsible for wireless communication with other nodes, exchanging control information and sending and receiving collected data. The position setting switch is used to set the specific physical location of a sensor node in the greenhouse. The energy supply module provides the energy the sensor node needs to work; in this paper, we adopt a solar self-supply module to power the node.

Figure.2 Sensor node structure chart 

Figure.3. Sink node structure chart 

B. Sink node module design 

The sink node mainly gathers and fuses the data of the sensor nodes within the communication network and performs the protocol conversion between upstream and downstream communication. It releases the monitoring tasks of the management node and forwards the collected data to the external network through a serial port. It is both an enhanced sensor node and a special gateway device that has no monitoring function and only wireless communication interfaces. Its structure is shown in Fig.3. It is composed of a power supply module, a storage module, a processor module, a node communication module, a serial interface communication module, and so on. Because the sink node needs to process the data of many sensor nodes, it works for longer periods and its dormancy time is short, so battery energy alone cannot satisfy its energy consumption. The sink node therefore also adopts a solar self-supply module for its power supply in this paper.

C. Power supply module 

To solve the energy supply problem of sensor nodes, we adopt a solar energy supply system, whose structure is shown in Fig. 4. As Fig.4 shows, the power supply module has an energy collector, an energy storage, a backup energy memory, and a power management and control section. The energy collector consists of solar panels and is responsible for transforming solar energy into electrical energy. The energy storage is the main-level energy store; it consists of a super capacitor and is responsible for storing the energy collected by the solar battery and providing energy for the wireless sensor network nodes. The backup energy memory consists of a lithium battery and provides energy for the system in an emergency. The power management and control section is responsible for monitoring the status of the energy memories, selecting which of them supplies power according to its energy status, and using the solar cell to replenish the energy memories.

Figure.4. Solar self-supply module structure 
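The role of the power management and control section can be illustrated with a minimal decision sketch in Java. Everything numeric here is hypothetical: the paper gives no thresholds, so the class name `PowerManager`, the voltage-based status check, and the threshold passed to the constructor are assumptions used only to show the two-level (super capacitor first, lithium battery as backup) selection.

```java
final class PowerManager {
    enum Source { SUPER_CAPACITOR, LITHIUM_BATTERY }

    // Assumed minimum super capacitor voltage at which it can still power the node.
    private final double minCapacitorVolts;

    PowerManager(double minCapacitorVolts) {
        if (minCapacitorVolts <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.minCapacitorVolts = minCapacitorVolts;
    }

    /** Main store while it holds enough charge, backup store otherwise. */
    Source selectSource(double capacitorVolts) {
        return capacitorVolts >= minCapacitorVolts
                ? Source.SUPER_CAPACITOR
                : Source.LITHIUM_BATTERY;
    }

    /** The solar panel can replenish the store only while its voltage is higher. */
    boolean shouldCharge(double panelVolts, double storeVolts) {
        return panelVolts > storeVolts;
    }
}
```

In a real node this decision would run periodically on the microcontroller using measured voltages; the sketch only captures the priority between the main and backup energy stores.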

IV. THE DESIGN OF MONITORING CENTER 

The monitoring center controls the operation of the whole network through the base stations of all measurement and control areas. Its main tasks include sending control commands to the network, collecting and processing the monitoring data of each node in the network, storing the data in a database, and querying and analyzing historical data. The monitoring center is mainly composed of a PC and a wireless communication module. The hardware structure is shown in Fig. 5.

In Fig.5, the PC serves as the upper computer and a CC2430 serves as the wireless communication module; the two communicate through a serial port. The main functions of the monitoring center are described below.

1. Network management and control. Examples are starting or stopping network operation and configuring network parameters. Network parameters include the data acquisition frequency of sensor nodes, the frequency of submitting data to the base station, the length of each task time slot, the routing probability vector, and so on. The monitoring center can also query the operation state and environmental data and send control commands to control nodes.

2. Data storage. The monitoring center needs to preserve historical monitoring data for later queries; this function is realized through the database.

3. Data analysis and decision support. The monitoring data is analyzed by an agricultural expert system to establish the most suitable greenhouse environment control strategy.

The base station of a measurement and control area not only controls all nodes of the area but is also the communication hub between the monitoring center and the area, mainly providing data forwarding and data buffering.

Figure.5.The monitoring center hardware structure 

V. SYSTEM SOFTWARE

A. System software design

A modular design is adopted for the system software, which is mainly composed of the greenhouse data collection system and the wireless control system. In the data acquisition system, each wireless sensor node acquires information about its surrounding environment and transfers it to the sink node over the wireless network. The sink node sends the fused data message to the controller. Meanwhile, the sink node receives instructions from the controller and forwards them to the sensor nodes. The flowchart of the system software is shown in Fig. 6.

Figure 6. System software flowchart

B. The software design of monitoring center 

The monitoring center sends the system start command in the spare time slot ($T_{idle}$) and receives the network monitoring data of each node in the inter-cluster communication time slot ($T_{inter}$). If necessary, other management and control commands can be sent in the spare time slot and the routing time slot. In the network formation time slot and the intra-cluster communication time slot, each node is busy with networking and does not listen for commands from the control center, so the monitoring center need not send management and control commands and can instead complete some data processing tasks. We adopt Microsoft Access for the monitoring center database. The program flowchart of the monitoring center in the spare time slot is shown in Fig. 7.

Figure 7. The program flowchart of monitoring center spare time slot

In the spare time slot, the monitoring center mainly completes the system start-up functions. If the system is starting for the first time, it must first connect to the database. The monitoring center then sends the start command to the base stations of all measurement and control areas in the greenhouse. If no confirmation is received from a base station and the number of retransmissions has not been exceeded, the start command is resent. If the number of retransmissions is exceeded, the fault diagnosis module is run. If the confirmation frame returned by the base station is received and the spare time slot is not over, the monitoring center can carry out other management and control commands.
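The start-up handshake described above can be sketched as a bounded retry loop. This is a minimal sketch under stated assumptions: `send_start` is a hypothetical callable that transmits the start command and returns `True` when the base station's confirmation frame arrives, and the retransmission limit of 3 is an illustrative value not given in the paper.

```python
from typing import Callable


def start_base_station(send_start: Callable[[], bool],
                       max_retries: int = 3) -> str:
    """Send the start command and resend it while it stays unconfirmed.

    Returns "acknowledged" once a confirmation is received, or
    "fault_diagnosis" when the retransmission limit is exceeded.
    """
    retries = 0
    while True:
        if send_start():              # confirmation frame received
            return "acknowledged"
        if retries >= max_retries:    # limit exceeded: hand over to diagnosis
            return "fault_diagnosis"
        retries += 1                  # resend the start command
```

After an `"acknowledged"` result the caller would go on to other management and control commands for as long as the spare time slot lasts.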

In the inter-cluster communication time slot, the main task of the monitoring center is to collect the data submitted by the greenhouse WSN and store it in the database. If users have management or control requests, these may be executed with priority. The program flowchart of the monitoring center in the inter-cluster communication time slot is shown in Fig. 8.

Figure 8. The program flowchart of monitoring center inter-cluster communication time slot

C. The node deployment algorithm of the WSN-based greenhouse measurement and control system

In the greenhouse WSN measurement and control system, the sensor nodes deployed in the greenhouse periodically collect various environmental data and send it to the control center by multi-hop communication, so the system is a typical centralized data collection network. In such a system, the nodes near the base station forward large quantities of data and die prematurely, so the network becomes partitioned or even completely paralyzed. This energy consumption hotspot results from the imbalance of load distribution between nodes, and the phenomenon is called the funnel effect [6-7]. This article addresses the funnel effect of the greenhouse WSN measurement and control system through redundant node technology. Taking a single measurement and control area of the greenhouse as the research object, we use each node's next-hop routing probability as a fuzzy edge weight and introduce fuzzy graph theory to calculate the probability that data travels from a source cluster head to a destination cluster head in m hops. From this we obtain the network data load distribution in the measurement and control area and design the redundant nodes deployed algorithm (RNDA) based on cluster load balancing. To balance the network load, the algorithm adopts three means: multi-path routing, redundant node deployment and cluster head election. The key of RNDA is to determine the routing probability vector $P_v$ of each cluster head, from which the network topology can be constructed. In the greenhouse WSN measurement and control system, $P_v$ of cluster head $v$ is preset according to the geographical location of the nodes. $P_v$ thus becomes the basis of the routing algorithm: when the network begins to run, all nodes of the same kind communicate using the same preset $P_v$. If a neighboring cluster that a cluster head can communicate with cannot produce a cluster head because the energy of all its nodes is exhausted, the cluster head topology changes and the $P_v$ of that cluster head should be adjusted. The inter-cluster communication model is shown in Fig. 9. For convenience of description, the monitoring area is divided into a 5 × 5 grid; the grid number can be set automatically in simulation.

$P_v = \{p_{v(e_{v1})}, p_{v(e_{v2})}, p_{v(e_{v3})}, p_{v(e_{v4})}, p_{v(e_{v5})}, p_{v(e_{v6})}, p_{v(e_{v7})}, p_{v(e_{v8})}\}$

Figure 9. Inter-cluster communication model

Fig. 9(b) shows that each cluster head has at most eight routing directions, so $P_v$ has 8 components. Depending on the cluster head category, routing probability values are assigned to one or several of these directions. The routing probabilities $P_v(e)$ can be chosen freely, provided that they sum to 1. In Fig. 9(a), according to geographical position, the cluster heads are divided into hot cluster heads H (black dots), boundary cluster heads, general cluster heads (colorless circles) and so on. In simulation, we consider the impact on network lifetime of cluster heads that adopt a data fusion strategy and those that do not. The main purpose of WSN data fusion is to reduce the quantity of network data by merging the redundant information of the sensor nodes. In the simulation experiments, data fusion is performed at the cluster head nodes, and the data fusion coefficient is taken as 1 ($\sigma = 1$) when the data fusion strategy is not executed. If the data fusion strategy is adopted, the data fusion coefficient is chosen according to the degree of fusion. Because the sensor nodes here are homogeneous, the type of information collected is consistent, and from statistical knowledge the environmental parameters do not differ greatly over a small range. We therefore fuse the data of all child nodes in one grid into a single datum that describes the environmental information of the grid (e.g. temperature, humidity). In the simulation experiments, the data fusion coefficient is taken as $\sigma = 1/a$ when the data fusion strategy is adopted, where $a$ is the number of active nodes inside the grid; $a$ is set to 5 in all the following simulation experiments. An M-file program was written in Matlab 7.0 according to the algorithm process to study the performance of RNDA and compare it with uniform deployment. In the uniform deployment mode, the redundant nodes are evenly distributed among the clusters, and the network is operated in the three-task-slot mode.
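The two quantities introduced above, the routing probability vector and the data fusion coefficient, can be illustrated with a short sketch. It is not the authors' Matlab program: the function names, the example weights, and the assumption that each active node contributes one packet per round are introduced here for illustration only.

```python
import random


def normalize(weights):
    """Scale non-negative direction weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one direction needs a positive weight")
    return [w / total for w in weights]


def choose_direction(p_v, rng=random):
    """Draw one of the (up to 8) routing directions according to p_v."""
    return rng.choices(range(len(p_v)), weights=p_v, k=1)[0]


def forwarded_packets(active_nodes, fusion=True):
    """Packets a cluster head forwards per round for its own grid.

    sigma = 1/a with data fusion, sigma = 1 without it.
    """
    sigma = 1.0 / active_nodes if fusion else 1.0
    return sigma * active_nodes


# Hypothetical cluster head that only uses three of its eight directions.
p_v = normalize([0, 0, 1, 1, 2, 0, 0, 0])
next_hop = choose_direction(p_v, random.Random(0))
```

With $a = 5$, fusion reduces the five node readings of a grid to one forwarded packet, which is the mechanism behind the longer lifetimes reported for $\sigma = 1/a$.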

1. Fig. 10 shows the results for a 4 × 4 grid with $d = 25$ m, where $d_{CI}$ is the communication distance within the cluster and $d_{CO}$ is defined in terms of $d_{CI}$. In Fig. 10(a) the data fusion coefficient is $\sigma = 1/a$; in Fig. 10(b) it is $\sigma = 1$.

Figure 10. The impact of redundant nodes on the network lifetime (4 × 4 grid): (a) data fusion coefficient $\sigma = 1/a$; (b) data fusion coefficient $\sigma = 1$. Axes: redundant nodes vs. network lifetime/round; curves: uniform deployment and RNDA deployment.

2. Fig. 11 shows the results for a 5 × 5 grid with $d = 20$ m. In Fig. 11(a) the data fusion coefficient is $\sigma = 1/a$; in Fig. 11(b) it is $\sigma = 1$.

Figure 11. The impact of redundant nodes on the network lifetime (5 × 5 grid): (a) data fusion coefficient $\sigma = 1/a$; (b) data fusion coefficient $\sigma = 1$.

As can be seen from the figures above, the network lifetime when the data fusion strategy is adopted is roughly 2 to 3 times that when it is not. The number of virtual grids also affects the network lifetime: the more virtual grids the monitoring area is divided into, the greater the quantity of network data and the shorter the network lifetime. Comparing RNDA with the uniform deployment mode, Fig. 11(a) shows that the network lifetime is improved by 35.8 percent at points A and B. For the same extension of network lifetime, RNDA saves a large number of redundant nodes: compared with the uniform mode, RNDA deploys only 24% of the redundant nodes when the network lifetime is $3.5 \times 10^4$ rounds.


VI. CONCLUSION 

In light of the characteristics of modern greenhouse production, this paper introduces wireless sensor network technology into the greenhouse wireless measurement and control system; by combining wireless sensor network technology with greenhouse control technology, the whole greenhouse system can be adjusted automatically. In hardware, the WSN nodes are mainly composed of the Atmega128L and the wireless transceiver chip CC2420. In software, a modular design is adopted and the deployment of sensor nodes is analyzed in depth; the simulation results show that the algorithm can effectively prolong the network lifetime.

REFERENCES 

[1] Du Xiaoming, Chen Yan. The Realization of Greenhouse Controlling System Based on Wireless Sensor Network[J]. Journal of Agricultural Mechanization Research, 2009(6): 141-144.

[2] Qiao Xiaojun, Zhang Xin, Wang Cheng, et al. Application of the wireless sensor networks in agriculture[J]. Transactions of the CSAE, 2005, 9(21): 232-234.

[3] S.L. Speetjens, H.J.J. Janssen, et al. Methodic design of a measurement and control system for climate control in horticulture[J]. Computers and Electronics in Agriculture, 2008, (64): 162-172.

[4] Wang Linji. The Design of Realizing Change Temperature Control in Greenhouse by PLC[J]. Electrical Engineering, 2008, 5: 81-83.

[5] Liu Yanzheng, Teng Guanghui, Liu Shirong. The problem of the control system for Greenhouse Climate[J]. Chinese Agricultural Science Bulletin, 2007, 23: 154-157.

[6] C. Y. Wan, S. B. Eisenman, A. T. Campbell, et al. Overload traffic management for sensor networks[J]. ACM Transactions on Sensor Networks, 2007, 3, Article No. 18.

[7] G. S. Ahn, E. Miluzzo, A. T. Campbell, et al. Funneling-MAC: A Localized, Sink-Oriented MAC For Boosting Fidelity in Sensor Networks[C]. Proceedings of the 4th international conference on Embedded networked sensor systems. New York: ACM, 2006: 293-306.

[8] Li Nan, Liu Chengliang, Li Yanming, Zhang Jiabao, Zhu Anning. Development of remote monitoring system for soil moisture based on 3S technology alliance[J]. Transactions of the CSAE, 2010, 26(4): 169-173.

[9] P. Santi, J. Simon. Silence Is Golden with High Probability: Maintaining a Connected Backbone in Wireless Sensor Network[C]. 1st European Workshop on Wireless Sensor Networks. Berlin: Wireless Sensor Networks, Proceedings, 2004: 106-121.

[10] F. Chen, P. Jiang, Q. He. Phased waking coverage scheme based on hibernation of redundant nodes for wireless sensor networks[C]. Proceedings-International Symposium on Computer Science and Computational Technology. NJ: Institute of Electrical and Electronics Engineers Computer Society, 2008: 709-713.

[11] Z.M. Li, L. Lei. Sensor Node Deployment in Wireless Sensor Networks Based on Improved Particle Swarm Optimization[C]. Proceedings of 2009 IEEE International Conference on Applied Superconductivity and Electromagnetic Devices. 2009: 25-27.

[12] J.H. Tarng, B.W. Chuang, P.C. Liu. A relay node deployment method for disconnected wireless sensor networks: Applied in indoor environments[J]. Journal of Network and Computer Applications, 2009, 32: 652-659.

[13] Y.C. Wang, Y.C. Tseng. Distributed Deployment Schemes for Mobile Wireless Sensor Networks to Ensure Multilevel Coverage[J]. IEEE Transactions on Parallel and Distributed Systems, 2008, 19(9): 1280-1294.

[14] P. Gajbhive, A. Mahajan. A Survey of Architecture and Node deployment in Wireless Sensor Network[C]. 1st International Conference on the Applications of Digital Information and Web Technologies, ICADIWT 2008: 426-430.

[15] W.T. Xu, X.H. Hao, C.L. Dang. Connectivity Probability Based on Star Type Deployment Strategy for Wireless Sensor Networks[C]. Proceedings of the 7th World Congress on Intelligent Control and Automation. 2008: 1738-1742.

Yongxian Song was born in Xuzhou on April 1, 1975. He received the B.S. degree in Applied Electronic Technology from Huaihai Institute of Technology, Lianyungang, China, in 1997, and the M.S. degree in Control Theory and Control Engineering from Jiangsu University, Zhenjiang, China, in 2006. Since 2009, he has been studying for the Ph.D. degree in Control Theory and Control Engineering at Jiangsu University, Zhenjiang, China. Since 2006, he has been a teacher at Huaihai Institute of Technology, Lianyungang, China. His current research interests include signal processing, intelligent control, and industrial control.

Chenglong Gong was born in 1964. He received the B.S. degree in Automatic Control from University of Electronic Science and Technology, Chengdu, China, in 1984, and the M.S. degree in Automation Control from China University of Mining and Technology, Xuzhou, China, in 1988. He is currently a professor in the Department of Electronic Engineering of Huaihai Institute of Technology, Lianyungang 222005, China. His main research interests are automatic measurement, control and system theory, and computer network applications.

Yuan Feng was born in Lianyungang on March 28, 1978. He received the B.S. degree in Computer Hardware and Application from Huaihai Institute of Technology, Lianyungang, China, in 1999, and the M.S. degree in Industrial Control from Nanjing University of Science, Nanjing, China, in 2007. Since 1999, he has been a teacher at Huaihai Institute of Technology, Lianyungang, China. His current research interests include signal processing and computer control technology.

Juanli Ma, a lecturer, was born in 1976. She studied electrical automation at Gansu University of Technology from 1995 to 1999 and obtained a bachelor's degree. From 2004 to 2007 she studied control theory and control engineering at Northwestern Polytechnical University and obtained a Master's degree in Engineering. Since 1999, she has been working at Huaihai Institute of Technology.

Xianjin Zhang was born in Suqian in 1975. He received the B.S. degree in Applied Electronic Technology from Guilin University of Electronic Technology, Guilin, China, in 1998, and the M.S. degree in Power Electronic and Control Engineering from Nanjing University of Aeronautics & Astronautics, Nanjing, China, in 2005. Since 2005, he has been a teacher at Huaihai Institute of Technology, Lianyungang, China. His current research interests include electric and electronic conversion techniques.


Simulation of Networked Control System based 

on Smith Compensator and Single Neuron 

Incomplete Differential Forward PID 

Haitao Zhang 

Electronic and Information Engineering College, Henan University of Science and Technology, Luoyang, China 

Email: zhang_haitao@163.com 

Zhen Li 

Electronic and Information Engineering College, Henan University of Science and Technology, Luoyang, China 

Email: lizhenzhen1228@163.com 

Abstract—In the networked control system with random 

time delay in the forward and feedback channels, a controller based on a Smith compensator and single neuron incomplete differential forward PID is presented. First, using the root locus method and the Simulink simulation software, the influence of network time delay on system stability and dynamic performance is analyzed. Then, combined with the incomplete differential forward PID control algorithm, a Smith compensation model is established. Compared with the existing Smith compensator, the proposed control model is easy to implement and achieves better control performance when the compensator model is mismatched. Finally, simulation research on a DC motor is carried out, and the simulation results show the effectiveness of the proposed method.

Index Terms—networked control system; Smith 

compensator; incomplete differential forward PID; single neuron


I. INTRODUCTION

With the extensive application of large-scale control systems, the networked control system (NCS) has attracted the attention of many researchers. In a networked control system, the communication among controllers, sensors and actuators is performed through the network. Compared with traditional control, it offers resource sharing, high reliability, low cost, and ease of maintenance and extension. However, since the carrying capacity and communication bandwidth of the network are fixed and limited, collisions and retransmissions of information are inevitable, which causes network-induced delay in the process of information transmission. Network-induced delay degrades the real-time capability of the system and can even lead to instability. At present, two main research methods are adopted in NCS design for random network-induced delay: the deterministic method and the stochastic method. The deterministic method converts the random delay to a fixed delay by introducing a data buffer, and then uses existing methods to design the controller [1][2]. However, this approach artificially extends the delay seen by the controller and lowers the control performance of the system. The stochastic method is based on a random discrete-time model. Nilsson discusses the design of an LQG optimal controller within the framework of a discrete control system in which the independent random delay is less than one sampling period and the time delay obeys a Markov distribution [3]. This method, however, requires the probability characteristics of the time delay, including the mean, variance and other properties, to be known in advance, and the amount of computation is so large that it is not easy to implement. Hu proposes the use of stochastic optimal control and optimal state estimation methods [4]; this method is mainly used when the time delay is more than one sampling period. Bauer uses a Smith predictor to compensate the time delay in the networked control system; the control structure is simple, but the exact value of the network delay must be known in advance [5].

Manuscript received Mar. 1, 2011; revised Apr. 1, 2011; accepted Apr. 12, 2011.

Project number: 61040010

doi:10.4304/jnw.6.12.1675-1681

The rest of the paper is organized as follows. In Section II, we present the system structure of the NCS and analyze the influence of network-induced delay on system stability and dynamic performance. Section III presents a design method for the NCS based on a Smith compensator and a single neuron incomplete differential forward PID controller. Section IV gives simulation results for a DC motor model, which show the effectiveness of the proposed method. Finally, a brief summary is given in Section V.

II. SYSTEM DESCRIPTION 

A. System Structure 

The basic structure of the networked control system is shown in Figure 1. The controller, actuator and sensor transmit data over the network, so there are essentially three kinds of delay in the system: the communication delay $\tau^{sc}$ between the sensor and the controller, the computational delay $\tau^{c}$ in the controller, and the communication delay $\tau^{ca}$ between the controller and the actuator. Because $\tau^{c}$ is very small, it is usually merged into $\tau^{ca}$, so the system delay is expressed as $\tau = \tau^{sc} + \tau^{ca}$. To analyze the system under the effect of network delay, we use the continuous-time approach; a typical block diagram of the networked control system is shown in Figure 2, where $R(s)$, $U(s)$, $Y(s)$ and $E(s) = R(s) - Y(s)$ are the reference, control, output, and error signals in the s domain, respectively. $G_c(s)$ is the transfer function of the controller, and $G_p(s)$ is the transfer function of the controlled object.

Figure 1. The basic structure of networked control system

Figure 2. A typical block diagram of networked control systems (forward path: $G_c(s)$, $e^{-\tau^{ca} s}$, $G_p(s)$; feedback path: $e^{-\tau^{sc} s}$)

The transfer function of the closed-loop system shown in Figure 2 can be expressed as follows:

$$\frac{Y(s)}{R(s)} = \frac{G_c(s) G_p(s) e^{-\tau^{ca} s}}{1 + G_c(s) G_p(s) e^{-\tau^{sc} s} e^{-\tau^{ca} s}} \qquad (1)$$

In Figure 2, $\tau^{ca}$ and $\tau^{sc}$ are the time delays of the forward and feedback channels, respectively. $\tau^{ca}$ prevents the control signal from acting on the controlled object in time, so the response of the system lags behind its input. $\tau^{sc}$ prevents the system from producing a new control signal in time.

To analyze the closed-loop control system under the effects of network delay, a typical approach is to approximate the delays by a rational function [6]:

$$e^{-\tau s} \cong \left(1 + \frac{\tau s}{n}\right)^{-n} \qquad (2)$$

where $\tau$ may be $\tau^{ca}$ or $\tau^{sc}$.

Because the primary branches of the root locus of a control system usually contain the dominant eigenvalues of the system, this approximation is adequate for practical applications.
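The quality of approximation (2) can be checked numerically on the imaginary axis $s = j\omega$. The sketch below is illustrative only; the function names and the sample frequencies are assumptions introduced here, while $n = 4$ matches the value the paper uses later.

```python
import cmath


def delay_exact(s: complex, tau: float) -> complex:
    """Exact delay term e^{-tau*s}."""
    return cmath.exp(-tau * s)


def delay_approx(s: complex, tau: float, n: int = 4) -> complex:
    """Rational approximation (1 + tau*s/n)^(-n) from formula (2)."""
    return (1 + tau * s / n) ** (-n)


def approx_error(omega: float, tau: float, n: int = 4) -> float:
    """Magnitude of the approximation error at s = j*omega."""
    s = 1j * omega
    return abs(delay_exact(s, tau) - delay_approx(s, tau, n))


low_band_error = approx_error(omega=1.0, tau=0.1)    # omega*tau = 0.1
high_band_error = approx_error(omega=20.0, tau=0.5)  # omega*tau = 10
```

The error is small when $\omega\tau$ is small and grows once $\omega\tau$ becomes large, which is consistent with restricting attention to the primary root-locus branches near the origin.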

B. System Stability Analysis 

In this paper, we use a DC motor as the controlled object to analyze the system stability. The transfer function of the controlled plant is expressed as follows [7]:

$$G_p(s) = \frac{2029.826}{(s + 26.29)(s + 2.296)} \qquad (3)$$


The controller uses PID control, whose transfer function is expressed as follows:

$$G_c(s) = \frac{K_p\left((K_d/K_p)s^2 + s + (K_i/K_p)\right)}{s} = K_p\left(T_d s + 1 + \frac{1}{T_i s}\right) \qquad (4)$$

where $K_p$, $T_d$ and $T_i$ are the proportional gain, differential time constant and integral time constant, respectively.

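Using formulas (1) to (4) and selecting the controller parameters $K_p = 0.1701$, $T_d = 0$, $T_i = 0.45$ and $n = 4$, the open-loop transfer function is expressed as follows:

$$\begin{aligned}
G_c(s) e^{-\tau^{ca} s} G_p(s) e^{-\tau^{sc} s} &= G_c(s) G_p(s) \left(1 + \frac{\tau s}{n}\right)^{-n} \\
&= K_p\left(T_d s + 1 + \frac{1}{T_i s}\right) G_p(s) \frac{1}{\left(\frac{\tau s}{n} + 1\right)^{n}} \\
&= \frac{0.1701 (s + 2.222)}{s} \cdot \frac{2029.826}{(s + 26.29)(s + 2.296)} \cdot \frac{1}{\left(\frac{\tau s}{4} + 1\right)^{4}} \\
&= \frac{345.2734 (s + 2.222)}{s (s + 26.29)(s + 2.296)\left(\frac{\tau s}{4} + 1\right)^{4}}
\end{aligned} \qquad (5)$$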

As formula (5) shows, the delay adds a fourth-order open-loop pole on the negative real axis at $s = -4/\tau$, which moves from negative infinity toward 0 as $\tau$ changes from 0 to positive infinity. This pole raises the system order, changes the distribution of the root locus on the real axis and shifts the root locus to the right, which is detrimental to the stability of the system.
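The constants in formula (5) and the location of the delay-induced pole can be reproduced with a few lines. This is a sketch for checking the arithmetic, not the authors' Matlab root-locus program; the function names are introduced here for illustration.

```python
KP, TI, N = 0.1701, 0.45, 4          # controller parameters and order n
PLANT_GAIN = 2029.826                # numerator of G_p(s), formula (3)


def open_loop(s: complex, tau: float) -> complex:
    """Open-loop transfer function built from formulas (2) to (4), T_d = 0."""
    gc = KP * (1 + 1 / (TI * s))
    gp = PLANT_GAIN / ((s + 26.29) * (s + 2.296))
    delay = (1 + tau * s / N) ** (-N)
    return gc * gp * delay


def open_loop_factored(s: complex, tau: float) -> complex:
    """Factored form given as the last line of formula (5)."""
    return 345.2734 * (s + 2.222) / (
        s * (s + 26.29) * (s + 2.296) * (tau * s / 4 + 1) ** 4)


def delay_pole(tau: float) -> float:
    """Location of the fourth-order pole introduced by the delay."""
    return -N / tau


loop_gain = KP * PLANT_GAIN          # about 345.2734
pi_zero = -1 / TI                    # about -2.222
```

For the delays plotted in Figure 3, `delay_pole` gives $-40$, $-20$ and $-8$ for $\tau = 0.1$, $0.2$ and $0.5$, so the added pole approaches the plant poles as the delay grows.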

Figure 3. Primary branches of root locus with different delay ($\tau = 0.1$, $0.2$, $0.5$; real axis vs. imaginary axis)

We select different time delays to analyze the networked control system and to provide a reference for analyzing networked control systems under the effect of delay. The primary branches of the root locus for different time delays, obtained by programming in Matlab, are shown in Figure 3.

As Figure 3 shows, the stability region of the system changes with the time delay: the greater the delay, the smaller the stability region.

C. Dynamic Performance Analysis of System 

From the previous subsection, we know that the stable region shrinks as the time delay increases. In this subsection, we analyze the influence of random time delay on the performance of networked control systems in Simulink. The simulation model is shown in Figure 4, and the parameters are set as follows: the input signal is 50 rad/s, the sampling period is 10 ms, the control algorithm is formula (4), the model of the controlled object is formula (3), and the time delay obeys a uniform distribution, which is simulated by a Gaussian random number generator and a network delay module.

Figure 4. Simulation model of network control system

Under a typical input signal, the performance criteria that reflect the time response of a control system fall into two parts: static performance criteria and dynamic performance criteria. We choose the mean square error $MSE$, the overshoot $M_P$ and the adjusting time $t_s$ to reflect the tracking error, control accuracy, stability and rapidity of the response of the control system. The performance cost function is as follows [8]:

$$J = \omega_1 J_1 + \omega_2 J_2 + \omega_3 J_3 \qquad (6)$$

$$\begin{aligned}
J_1 &= \begin{cases} (MSE - MSE_0)^2, & MSE > MSE_0 \\ 0, & MSE \le MSE_0 \end{cases} \\
J_2 &= \begin{cases} (M_P - M_{P0})^2, & M_P > M_{P0} \\ 0, & M_P \le M_{P0} \end{cases} \\
J_3 &= \begin{cases} (t_s - t_{s0})^2, & t_s > t_{s0} \\ 0, & t_s \le t_{s0} \end{cases}
\end{aligned} \qquad (7)$$

where

$$MSE = \frac{1}{N} \sum_{k=0}^{N} e^2(k) \qquad (8)$$

$MSE$ represents the mean square error of the system, and $e(k) = y(k) - r(k)$ represents the output error of the system when $t = kh$, where $k$ represents the sampling sequence and $s$ represents the sampling period. $MSE_0$, $M_{P0}$ and $t_{s0}$ are the nominal mean square error, nominal overshoot and nominal adjusting time when the system has no time delay. $J_1$, $J_2$ and $J_3$ are the performance criteria measuring the deviation of $MSE$, $M_P$ and $t_s$ from their nominal values, and satisfy $J_1 = J_2 = J_3 = 0$ when the system has no time delay. $\omega_1$, $\omega_2$ and $\omega_3$ are the weight coefficients of $J_1$, $J_2$ and $J_3$ respectively; they range from 0 to 1 and satisfy $\omega_1 + \omega_2 + \omega_3 = 1$.
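Formulas (6) to (8) translate directly into code. The sketch below is a minimal illustration: the function names are introduced here, the overshoot is represented as a fraction (0.05 for 5%), and `mse` divides by the number of samples supplied, which are assumptions rather than details stated in the paper. The default nominal values are the ones the paper reports for the delay-free system.

```python
def mse(errors):
    """Mean of the squared tracking errors e(k) = y(k) - r(k)."""
    return sum(e * e for e in errors) / len(errors)


def penalty(value, nominal):
    """Squared excess over the nominal value, zero if not exceeded (eq. 7)."""
    return (value - nominal) ** 2 if value > nominal else 0.0


def cost(mse_value, overshoot, settling,
         weights=(1.0, 0.0, 0.0),
         nominal=(0.00595, 0.05, 0.309)):
    """Weighted cost J = w1*J1 + w2*J2 + w3*J3 (eq. 6)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    measured = (mse_value, overshoot, settling)
    return sum(w * penalty(m, n0)
               for w, m, n0 in zip(weights, measured, nominal))


# Delay-free system: every criterion sits at its nominal value, so J = 0.
j_nominal = cost(0.00595, 0.05, 0.309)
```

Setting one weight to 1 and the others to 0 isolates a single criterion, which is how the three curves of Figure 5 are defined.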

When the system has no time delay, we obtain the step response curve of the system by running the simulation model of Figure 4 and compute the nominal values of $MSE$, $M_P$ and $t_s$: $MSE_0 = 0.00595$, $M_{P0} = 5\%$, $t_{s0} = 0.309$, $J = 0$.

When $\omega_1 = 1$ and $\omega_2 = \omega_3 = 0$, then $J = J_1$, and the cost function reflects the response process of the system and the relative stability at steady state. Its variation with time delay is shown in Figure 5(a).

When $\omega_2 = 1$ and $\omega_1 = \omega_3 = 0$, then $J = J_2$, and the cost function reflects the stability of the system. Its variation with time delay is shown in Figure 5(b).

When $\omega_3 = 1$ and $\omega_1 = \omega_2 = 0$, then $J = J_3$, and the cost function reflects the rapidity of the system response. Its variation with time delay is shown in Figure 5(c).

As Figure 5 shows, time delay lowers the stability and dynamic performance of the system. If $\tau < 12s$, the control accuracy decreases slightly, but the system still has good stability and dynamic performance. If $\tau \ge 12s$, the dynamic performance of the system becomes poor, and the stability and control accuracy also deteriorate. If the time delay reaches $40s$, the system becomes unstable.

Figure 5. The waveforms of $J_1$, $J_2$ and $J_3$ with delay change

In the following section, we improve the performance of the networked control system by introducing a Smith compensator and single neuron incomplete differentiation on the basis of classical PID control.

III. SMITH COMPENSATOR AND SINGLE NEURON INCOMPLETE DIFFERENTIAL FORWARD PID CONTROLLER


A. The Principle of Smith Compensator 

The control system model with a Smith compensator is shown in Figure 6.

Figure 6. Structure of Smith predictor (forward path: $G_c(s)$, $G_p(s) e^{-\tau_p s}$; compensator around the controller: $G_p'(s)(1 - e^{-\tau_p' s})$)

$G_c(s)$ is the transfer function of the controller, $G_p(s) e^{-\tau_p s}$ is the transfer function of the controlled object with pure time delay, and $G_p'(s) e^{-\tau_p' s}$ is the compensation function introduced by the Smith compensator. The closed-loop transfer function is then expressed as follows:

$$\frac{Y(s)}{R(s)} = \frac{G_c(s) G_p(s) e^{-\tau_p s}}{1 + G_c(s) G_p'(s) + G_c(s)\left(G_p(s) e^{-\tau_p s} - G_p'(s) e^{-\tau_p' s}\right)} \qquad (9)$$

When the system satisfies $G_p'(s) = G_p(s)$ and $\tau_p' = \tau_p$, formula (9) simplifies to the following relation:

$$\frac{Y(s)}{R(s)} = \frac{G_c(s) G_p(s)}{1 + G_c(s) G_p(s)} e^{-\tau_p s} \qquad (10)$$

Its characteristic equation is expressed as follows:

$$1 + G_c(s) G_p(s) = 0 \qquad (11)$$

Formula (11) does not include the pure time delay, so the Smith compensator eliminates the influence of the pure time delay on system stability, which could otherwise make the control system unstable.
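The reduction from formula (9) to formula (10) can be verified numerically at a sample frequency. The sketch below reuses the DC motor plant of formula (3) and the PI parameters chosen in Section II; the function names, the test frequency $s = 3j$ and the delay values are illustrative assumptions introduced here.

```python
import cmath


def gp(s: complex) -> complex:
    """DC motor plant, formula (3)."""
    return 2029.826 / ((s + 26.29) * (s + 2.296))


def gc(s: complex, kp: float = 0.1701, ti: float = 0.45) -> complex:
    """PI controller, formula (4) with T_d = 0."""
    return kp * (1 + 1 / (ti * s))


def smith_closed_loop(s, tau, tau_model, gp_model=gp):
    """Closed-loop transfer function with Smith compensator, formula (9)."""
    delay = cmath.exp(-tau * s)
    delay_model = cmath.exp(-tau_model * s)
    den = (1 + gc(s) * gp_model(s)
           + gc(s) * (gp(s) * delay - gp_model(s) * delay_model))
    return gc(s) * gp(s) * delay / den


def matched_closed_loop(s, tau):
    """Simplified form for a perfectly matched model, formula (10)."""
    return gc(s) * gp(s) * cmath.exp(-tau * s) / (1 + gc(s) * gp(s))


s = 3j
matched_gap = abs(smith_closed_loop(s, 0.2, 0.2) - matched_closed_loop(s, 0.2))
mismatch_gap = abs(smith_closed_loop(s, 0.2, 0.1) - matched_closed_loop(s, 0.2))
```

With a matched delay model the two expressions agree to rounding error, while a mismatched `tau_model` leaves a delay-dependent term in the denominator, which is the sensitivity to model error that motivates the improved compensators discussed later.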

B. System Structure of NCS 

The Smith compensator is a control algorithm commonly used in systems with time delay. To reduce the influence of time delay on the performance of the networked control system, the Smith compensator is introduced into the networked control system, as shown in Figure 7.

Figure 7. The structure of networked control system with Smith compensator

In Figure 7, $G_p(s)$ is the transfer function of the controlled object, $G_p'(s)$ is the predicted model of the controlled object, and $\tau^{cam}$ and $\tau^{scm}$ are the predicted models of the time delays $\tau^{ca}$ and $\tau^{sc}$, respectively. The closed-loop transfer function of the system is then expressed as follows:

$$\frac{Y(s)}{R(s)} = \frac{G_c(s) e^{-\tau^{ca} s} G_p(s)}{1 + G_c(s) G_p'(s) + G_c(s)\left(e^{-\tau^{ca} s} G_p(s) e^{-\tau^{sc} s} - e^{-\tau^{cam} s} G_p'(s) e^{-\tau^{scm} s}\right)} \qquad (12)$$

When the system designed with the Smith compensator has no model mismatch, i.e. $G_p'(s) = G_p(s)$, $\tau_{scm} = \tau_{sc}$ and $\tau_{cam} = \tau_{ca}$, the transfer function of the system shown in Figure 7 can be simplified as follows:

$$\frac{Y(s)}{R(s)} = \frac{G_c(s)G_p(s)}{1 + G_c(s)G_p(s)}\,e^{-\tau_{ca}s} \tag{13}$$

In this case, the networked control system can be simplified to Figure 8. The figure shows that, when the mathematical model of the object is exact, no pure time delay remains inside the closed loop after the Smith compensator is added, so the delay no longer affects the characteristic equation of the system. Compared with a control system without network-induced delay, the output is simply postponed by the time delay $\tau_{ca}$. For this reason, adding the Smith compensator improves the control quality and ensures the stability of the system.
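The effect described above can be illustrated with a small discrete-time simulation. The sketch below is not the authors' Simulink model: it assumes a hypothetical first-order plant `y[k+1] = a*y[k] + b*u[k-delay]`, lumps the network delays into a single input delay, and uses a PI controller with illustrative gains `kp` and `ki`. With a matched model, the plant output equals the delay-free model output shifted by the delay.

```python
def simulate_smith(steps=300, r=1.0, a=0.9, b=0.1,
                   delay=5, model_delay=5, kp=2.0, ki=0.5):
    """Simulate a PI loop with a Smith predictor (all parameters are assumptions)."""
    y, ym, integ = 0.0, 0.0, 0.0      # plant output, delay-free model output, PI integral
    u_log, ym_log, y_log = [], [], []
    for k in range(steps):
        ym_log.append(ym)
        y_log.append(y)
        # Output of the delayed model: delay-free model output shifted by model_delay.
        ym_delayed = ym_log[k - model_delay] if k >= model_delay else 0.0
        # Smith predictor feedback: measured output plus model correction.
        feedback = y + ym - ym_delayed
        e = r - feedback
        integ += e
        u = kp * e + ki * integ
        u_log.append(u)
        # The real plant only sees the control signal after the delay.
        u_delayed = u_log[k - delay] if k >= delay else 0.0
        y = a * y + b * u_delayed
        ym = a * ym + b * u
    return y_log, ym_log


if __name__ == "__main__":
    y_log, ym_log = simulate_smith()
    print("final output:", round(y_log[-1], 6))
```

Setting `model_delay` different from `delay` reproduces the model-mismatch situation discussed in the next subsection, in which the delay term no longer cancels in the loop.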

Figure 8. The simplified diagram when the model is matched

C. Related Research of Smith Compensator 

The Smith compensator relies on an accurate mathematical model of the controlled object and of the network delay, so random network-induced delays and disturbances cause the model in the Smith compensator to mismatch the controlled object.

For these reasons, it is difficult to compensate network delay well using the Smith compensator alone. To overcome the impact of these factors, additional effective means of control must be introduced. Researchers have presented many improved methods, which fall into two categories: structural improvement[9] and parameter tuning[10].

To apply the Smith predictor to networked control systems with a satisfactory control effect, Du Feng presents two new kinds of Smith compensator. The first is a double dynamic Smith compensator for the pure time delay of the controlled object and the network delay. This structure does not need to measure, predict or identify the time delay online, and it adapts to networked systems with random, time-varying and uncertain time delay[11]. The second moves the pure time delay of the controlled object and the time delay between the controller and the actuator in the forward path out of the closed loop, and completely eliminates the time delay between the transducer and the actuator in the feedback control loop. The feedback channel then need not be scheduled to adjust network flow, so the network bandwidth can be utilized effectively, and the robustness of the system to packet loss in the feedback control loop is raised[12].

Sujuan Wang et al. regard the network and the controlled object as a time-varying controlled system, estimate the time delay of the system using fading-memory LSM, and compensate it with a Smith compensator. By combining immune feedback control with PI control, they obtain a fuzzy immune PI controller that adjusts the parameters of the PI controller according to the change of the control amount, and thereby overcome the control error caused by the error in the time-delay estimate[13-14].

For networked systems with long time delay, Huiying Chen et al. combine the Smith compensator with a T-S fuzzy model and obtain a PI controller with Smith compensator. The approach gives the closed-loop system better stability and robustness[15].

Peng Chen et al. construct an adaptive Smith compensator that compensates the long time delay of an NCS based on an IP network by adding a first-order filter to the feedback loop. This approach efficiently eliminates the negative effects caused by time delay and achieves better robustness[16-17].

Reiquan Lin et al. give a design method for a neuron adaptive controller based on the Smith compensator. They study its application to an electric heating furnace by simulation, and show that this controller efficiently makes up for the poor robustness and poor anti-jamming capability of the conventional Smith compensator[18-19].

As the basic unit of neural networks, the neuron has self-learning ability and adaptability. A control algorithm constructed with a neuron is simple, easy to realize and has better robustness. Its most prominent characteristic is that it needs neither accurate identification of the controlled object nor knowledge of the object's structure and parameters, so the design of a single neuron adaptive controller does not require system modeling. Considering these characteristics of the single neuron, we present an incomplete differential forward PID algorithm based on a single neuron and apply it to the Smith compensator, so as to balance ease of implementation against the control performance of the existing Smith compensator.

D. Single neuron incomplete differential forward PID 

The single neuron model is shown in Figure 9. The single neuron is an information processing cell with many inputs and a single output. $x_1, x_2, \ldots, x_n$ are the inputs of the neuron, and $\omega_1, \omega_2, \ldots, \omega_n$ are the respective weight values of the inputs $x_1, x_2, \ldots, x_n$. $\theta$ is the threshold of the neuron, $f[\cdot]$ is the excitation function, and $y_P$ is the output of the neuron. $y_P$ can be expressed by the following formula:

$$y_P = f\left[\sum_{i=1}^{n} \omega_i x_i - \theta\right] \tag{14}$$

Figure 9. Model of single neuron

The single neuron model has been applied to PID 

control systems. Figure 10 is the model structure of 

single neuron PID control systems. 

Figure 10. Structure of single neuron PID

In Figure 10, $r$ and $y$ are respectively the input and output of the system, and the error is $e(k) = r(k) - y(k)$. $x_1$, $x_2$ and $x_3$ are the inputs of the neuron, and they satisfy the following relation:

$$\begin{aligned} x_1(k) &= e(k) - e(k-1) \\ x_2(k) &= e(k) \\ x_3(k) &= e(k) - 2e(k-1) + e(k-2) \end{aligned} \tag{15}$$

$\omega_1$, $\omega_2$ and $\omega_3$ are respectively the weight values of the inputs $x_1$, $x_2$ and $x_3$.

Suppose that the proportional coefficient is $K$, with $K > 0$; then the output of the controller can be expressed as follows:

$$u(k) = u(k-1) + K\sum_{i=1}^{3} \omega_i(k)\,x_i(k) \tag{16}$$

In the control algorithm of single neuron, the 

coefficient K reflects the adjusting amplitude. Generally, 

if the error is bigger, the adjusting amplitude is also 

bigger so as to satisfy the requirement of rapidity of the 

system; if the error is smaller, the adjusting amplitude is 

also smaller so as to satisfy the requirement of stability 

of the system. 

We use the Delta learning rule as the learning method for the weight values. It can be expressed as follows:

$$\Delta\omega_{ij}(k) = \eta\,[d_i(k) - o_i(k)]\,o_j(k) \tag{17}$$

where $\Delta\omega_{ij}$ is the weight increment from the $i$-th to the $j$-th neuron, $\eta$ is the learning rate, $o_i$ and $o_j$ are respectively the activation values of $i$ and $j$, and $d_i$ is the expected output value.

The single neuron PID control method implements adaptive control of the system by adjusting the weight coefficients. To ensure the convergence and robustness of the learning method, we normalize formulas (16) and (17) and obtain the following expressions:

$$u(k) = u(k-1) + K\sum_{i=1}^{3} w_i'(k)\,x_i(k) \tag{18}$$

$$w_i'(k) = \frac{w_i(k)}{\sum_{i=1}^{3} |w_i(k)|} \tag{19}$$

$$\begin{aligned} w_1(k+1) &= w_1(k) + \eta_P\,e(k)\,x_1(k) \\ w_2(k+1) &= w_2(k) + \eta_I\,e(k)\,x_2(k) \\ w_3(k+1) &= w_3(k) + \eta_D\,e(k)\,x_3(k) \end{aligned} \tag{20}$$

where $\eta_P$, $\eta_I$ and $\eta_D$ are respectively the learning rates of the proportional, integral and differential terms.
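The update rules (15) and (18)-(20) can be sketched as one controller step. The class below is a minimal illustration rather than the authors' implementation; the class and attribute names are hypothetical, and the default parameter values are the ones quoted in the simulation section (K = 0.2, learning rates 5, 0.03 and 1.5, initial weights 0.1).

```python
class SingleNeuronPID:
    """Incremental single neuron PID following equations (15) and (18)-(20)."""

    def __init__(self, K=0.2, eta_p=5.0, eta_i=0.03, eta_d=1.5, w0=0.1):
        self.K = K
        self.eta = (eta_p, eta_i, eta_d)
        self.w = [w0, w0, w0]   # w1, w2, w3
        self.e1 = 0.0           # e(k-1)
        self.e2 = 0.0           # e(k-2)
        self.u = 0.0            # u(k-1)

    def step(self, r, y):
        e = r - y
        # Equation (15): neuron inputs built from the error sequence.
        x = (e - self.e1, e, e - 2.0 * self.e1 + self.e2)
        # Equation (19): normalize the weights by the sum of absolute values.
        norm = sum(abs(w) for w in self.w) or 1.0
        w_norm = [w / norm for w in self.w]
        # Equation (18): incremental control law.
        self.u += self.K * sum(wn * xi for wn, xi in zip(w_norm, x))
        # Equation (20): weight update with separate learning rates.
        self.w = [w + eta * e * xi
                  for w, eta, xi in zip(self.w, self.eta, x)]
        self.e2, self.e1 = self.e1, e
        return self.u


if __name__ == "__main__":
    controller = SingleNeuronPID()
    print(controller.step(r=1.0, y=0.0))
```

The guard `or 1.0` in the normalization is an added safety measure for the all-zero weight case; the paper does not discuss it.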


The single neuron PID controller improves on the traditional PID controller and overcomes the sensitivity of traditional PID control to parameter changes. It has better learning ability and easily ensures real-time capability[20]. In addition, the system achieves a better control effect when the object model is mismatched.

The differentiation term is sensitive to changes of the input value and to random disturbance; the incomplete differential forward PID controller remedies this deficiency.

The incomplete differential forward PID differentiates only the feedback value and adds a first-order filter, so a change of the input value does not affect the differentiation term, and a change of the output value does not produce a very large control value. We can therefore combine the merits of the single neuron and of incomplete differential forward PID, and design the single neuron incomplete differential forward PID controller.

The model of single neuron incomplete differential forward PID control is shown in Figure 11.

Figure 11. Structure of incomplete differential forward PID control

Since we use the incomplete differential, $x_1$, $x_2$ and $x_3$ should satisfy the following relation:

$$\begin{aligned} x_1(k) &= e(k) - e(k-1) \\ x_2(k) &= e(k) \\ x_3(k) &= y_1(k) - 2y_1(k-1) + y_1(k-2) = \Delta^2 y_1(k) \end{aligned} \tag{21}$$

$w_1$, $w_2$ and $w_3$ are the weight values of the neuron and still satisfy equation (20). The control formula still satisfies equations (18)-(20).

The neuron controller uses an incremental algorithm, so the differential time constant and the new learning algorithm satisfy:

$$\frac{w_3}{w_1} = \frac{w_3'}{w_1'} = \frac{T_d}{h} \tag{22}$$

In this paper, single neuron control and incomplete differential forward PID control, both widely applied in practice, are introduced into the control system with Smith compensator so as to improve the robustness of the controller.

IV. SIMULATION 

To verify the effectiveness of the method, we simulate a DC motor as the controlled object in the Matlab/Simulink environment. The sampling period is T = 10 ms, the reference input is r = 50 rad/s, and the network delays in the forward and feedback channels are produced by the Gaussian random generator in the Simulink toolbox. The initial neuron weights are $w_1(0) = w_2(0) = w_3(0) = 0.1$, the learning rates of the neuron are $\eta_P = 5$, $\eta_I = 0.03$ and $\eta_D = 1.5$, the proportional coefficient is $K = 0.2$, and the incomplete differential coefficient is $\gamma = 0.1$. We apply simple PID control, PID control with Smith compensator, and single neuron incomplete differential forward PID with Smith compensator in turn, and observe the step responses under different random delays and under model mismatch. The results are shown in Figures 12 to 14.

Figure 12. The response when the mean of delay is 5ms
1-PID; 2-PID control with Smith compensator; 3-Single neuron incomplete differential forward PID with Smith compensator

Figure 13. The response when the mean of delay is 30ms
1-PID; 2-PID control with Smith compensator; 3-Single neuron incomplete differential forward PID with Smith compensator


Figure 14. The response when the object model is $2500/(s^2+30s+80)$
1-PID; 2-PID control with Smith compensator; 3-Single neuron incomplete differential forward PID with Smith compensator


The simulation results show that, when the delay is small, all of the algorithms achieve stable control performance.

As the delay increases, the control effect of the methods differs significantly. With simple PID control, the response curve shows obvious oscillation. The system still reaches the steady state, but the response time becomes longer, the response becomes slower, and the overshoot becomes larger. The other two methods still reach the steady state quickly, and the speed of their response is not affected by the delay. This demonstrates the validity of the Smith compensator.

When the model in the Smith compensator does not fully match the object model, the simple PID control system oscillates seriously and does not reach the steady state within the simulation time. The PID control system with Smith compensator reaches the steady state, but its response time is longer and it shows small oscillation. Model mismatch, however, makes no great difference to the single neuron incomplete differential forward PID with Smith compensator. This shows the validity of the proposed method.

V. CONCLUSION 

By combining the Smith compensator with the single neuron incomplete differential forward PID algorithm, the static and dynamic performance of networked control systems is improved. The proposed method is easy to implement, and the simulation results show that it achieves a better control effect than the conventional Smith compensator.


This work was supported by Project 61040010 of the 

National Science Foundation of China. 

REFERENCES 

[1] R. Luck, A. Ray, “An observer based compensator for 

distributed delays”, Automatica, vol.26, No.5, pp903-908, 

May 1990. 

[2] Z. X. Yu, H. T. Chen, Y. J. Wang, “Research on Markov 

Delay Characteristic-Based Closed Loop Network Control 

System”, Control Theory and Applications, vol.19, No.2, 

pp263-267, February 2002. 

[3] J. Nilsson, “Real-time Control Systems with Delays”, 

Lund. Sweden: Lund Institute of Technology, 1998. 

[4] S. S. Hu, Q. X. Zhu, “Stochastic Optimal Control and 

Analysis of Stability of Networked Control Systems with 

long delay”, Automatica, vol.39, No.11, pp1877-1884, 

July 2003. 

[5] P. H. Bauer, M. Sichitiu, C. Lorand, etc., “Total Delay 

Compensation in LAN Control Systems and Implications 

for Scheduling”, Proc. of the American Control 

Conference, Arlington, vol.6, pp4300-4305, 2001. 

[6] Y. Tipsuwan, M. Y. Chow, “Gain Scheduler Middleware: 

a Methodology to Enable Existing Controllers for 

Networked Control and Teleoperation-Part I: Networked 

control”, IEEE Transactions on Industrial Electronics, 

vol.5, No.6, pp1218-122, 2004 

[7] Y. Tipsuwan, M. Y. Chow, “Control Methodologies in 

Networked Control Systems”, Control Engineering 

Practice, vol.11, No.10, pp1099-1111, October, 2003. 

[8] Y. Tipsuwan, M. Y. Chow, “On the Gain Scheduling for 

Networked PI Controller Over IP Network”, 

Mechatronics, vol.9, No.3, pp491-498, 2004 


[9] K. Watanabe, “A process-model control for linear system 

with delay”, IEEE Transaction on Automatic Control, 

vol.26,No.6, pp261-1269, 1981 

[10] J. J. Liu, W. M. Ni, Y. P. Yang, “New method for 

designing robust Smith predictor”, Journal of TsingHua 

University (Science and Technology), vol.39, No.9, pp54- 

57, 1999. 

[11] F. Du, Q. Q. Qian,W. C. Du, “Networked Control 

Systems Based on New Smith Predictor”, Journal of 

Southwest Jiaotong University, vo.4, No.1, pp65-69, 2010 

[12] F. Du, “Research of Networked Control Systems Based on 

New Smith Predictor”, Chengdu, Southwest Jiaotong 

University, 2008. 

[13] T. N. Shi, S. J. Wang, H. W. Fang, “Fuzzy Immune PI 

Control of Networked Control System Based on Prediction 

Compensation”, Journal of TianJin University, vol.42, 

No.11, pp959-964, 2009. 

[14] S. J. Wang, “Fuzzy Immune PI Control of Networked 

Control System Based on Prediction Compensation”, 

Tianjin: Tianjin University, 2008.

[15] H. Y. Chen, Q. Guan, W. L. Wang, “Design of a fuzzy PI 

controller with Smith predictor for networked control 

systems with long time delay”, vol.33, No.4, pp418-420, 

2005. 

[16] P. Chen, L. K. Dai, “Adaptive Smith compensator for 

NCSs over IP networks”, Control Theory and Application, 

vol.23, No.1, pp115-118, 2006. 

[17] P. Chen, “Modeling and Controller Design for NCS over 

IP Network”, Zhejiang: Zhejiang University, 2005.

[18] R. Q. Lin, G. W. Lin, “Models and simulation of neuron 

PID applied in electric oven based on MATLAB 

language”, vol. 30, No.1, pp55-58, 2002. 

[19] R. Q. Lin, F. W. Yang, “Realization of a Class of Neuron 

Controller Based on Smith Predictor”, Information and 

Control, vol.33, No.2, pp137-140, 2004. 

[20] Y. H. Tao, “New PID Control and Application”, Beijing: 

Mechanic Industry Press, 1998. 

Haitao Zhang was born in Henan Province, China, in November 1972. He holds a Ph.D. in Control Theory and Control Engineering from the Institute of Automation, Chinese Academy of Sciences. His research interests include intelligent control and computer application technology.

He is an associate professor at the Electronic and Information Engineering College, Henan University of Science and Technology.

Zhen Li was born in Henan Province, China, in January 1987. She holds a B.S. in Automation from the Electronic and Information Engineering College, Henan University of Science and Technology, China.

She is a graduate student at the Electronic and Information Engineering College, Henan University of Science and Technology. Her research interests include networked control systems.


A Web Crawler System Design Based on 

Distributed Technology 

Shaojun Zhong 

Jiangxi University of Science and Technology/ Faculty of Science, Ganzhou, China 

infor2000@qq.com 

Zhijuan Deng 


66162815@qq.com 

Abstract—A practical distributed web crawler architecture 

is designed. The distributed cooperative grasping algorithm 

is put forward to solve the problem of distributed Web 

Crawler grasping. Log structure and Hash structure are 

combined and a large-scale web store structure is devised, 

which can meet not only the need of a large amount of 

random accesses, but also the need of newly added pages. 

Experiment results have shown that the distributed Web 

Crawler's performance, scalability, and load balance are 

better. 

Index Terms—Search Engine, Web Crawler, Grasping 

Strategy, Distributed System 


I. INTRODUCTION

The production, transmission, collection and query of information are among the most basic human activities. For information carried in writing, libraries, their cataloguing systems and professionals have traditionally helped us quickly find what we need at the granularity of a “book” or an “article”. With the development of computer and information technology came the field of Information Retrieval (IR) and full-text retrieval systems for books and literature, which make it convenient to obtain relevant information at the granularity of “keywords”. The openness of the World Wide Web and the widespread accessibility of the information on it greatly encourage people to create content, while bringing new development opportunities and technological challenges for information retrieval on the World Wide Web.

The scale of traditional IR is relatively limited, and the retrieved objects usually undergo careful screening and pretreatment. The number of queries it responds to is generally not very large. An information query system that provides services on the web (here, a search engine) differs from traditional IR in both scale and response time. A search engine has to deal with large-scale information (information floods in, and some of it is even fake) and a great number of accesses, while still responding quickly.

doi:10.4304/jnw.6.12.1682-1689

A search engine is an application system that is developed on the basis of IR, suits the features of the web (or WWW) and provides an information query service. It is generally defined as a software system used on the web that collects and discovers information with certain strategies, processes and organizes the information, and finally offers a web information query service to users. How does such a software system work? The data it operates on includes not only unpredictable user queries but also a huge, dynamically changing number of web pages, and these web pages do not come to the system automatically; the system has to grasp them. Faced with a large number of user queries, the system cannot “search” the web online whenever a query arrives. The basis of a large-scale search engine must therefore be a batch of web pages gathered beforehand [1].

The component that gathers these pages, the web page catcher, is also called the Web Crawler. As a foremost part of a search engine, it is an all-important object of study. Like the propulsion system of a carrier rocket in aerospace, the Web Crawler is the basis of the search engine: all of the data the search engine collects comes from the Web Crawler working in a smart, reasonable and powerful way.

The search engine is one of the most advanced and complex Internet technologies, and all companies keep the core technology to themselves. Some big companies already have mature solutions for large web crawlers and have put them into use. However, these large search engines provide ordinary users only with common, non-customized search services. They cannot take the various requirements of different users into consideration, and single web crawlers fail at their jobs in many cases. The flexible customization and the information acquisition speed and scale of distributed web crawlers can satisfy people's growing demand for user-oriented web information. This paper therefore presents a distributed design method for a web crawler, and strives to achieve a robust, scalable and efficient hybrid strategy for a distributed search engine.

II. CORE TECHNOLOGY OF DISTRIBUTED WEB CRAWLER


A. Priority Strategy of Webpage Grasping 

The priority strategy of webpage grasping determines the grasping efficiency. Grasping strategies can be roughly divided into three kinds: depth-first, breadth-first and best-first. A depth-first strategy can be employed when the amount of information is not huge. However, given the rapid development of the Internet and the massive amount of web data, a depth-first algorithm will inevitably run into huge volumes of data. Search engines therefore generally use breadth-first and best-first grasping strategies, as well as some improved variants of them [2].

B. Diameter of the World Wide Web 

The diameter of the World Wide Web, or “Web Diameter”, is defined as follows: if d denotes the length of the shortest path from web page u to web page v, then the average of d over all pairs of connected pages on the World Wide Web is called the Web Diameter. According to this definition and calculations over large-scale web page collections, the Web Diameter is about 17 [3]. The Web Diameter is calculated as

d = 0.35 + 2.06 log(N) (1)

where N is the number of web pages. A study shows that the diameter of China’s World Wide Web is 16.26 [4]; that is, if a path exists between two web pages, one can be reached from the other in fewer than 17 clicks on average, as shown in Figure 1.

Figure 1. Diagram of Diameter of the World Wide Web 
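Formula (1) can be evaluated directly. The sketch below assumes that the logarithm is base 10 and that N counts web pages; neither is stated explicitly next to the formula, and the function name is hypothetical.

```python
import math


def web_diameter(n_pages):
    """Evaluate d = 0.35 + 2.06 * log10(N), assuming a base-10 logarithm."""
    return 0.35 + 2.06 * math.log10(n_pages)


if __name__ == "__main__":
    # With N = 10^8 pages: 0.35 + 2.06 * 8 = 16.83
    print(round(web_diameter(1e8), 2))
```

Under these assumptions a collection of about 10^8 pages gives a diameter just under 17, which is the order of magnitude quoted in the text.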

After analyzing the Diameter of the World Wide Web, 

the following two conclusions are obtained: 

(1) The traversal algorithm affects the crawler’s efficiency to a large extent. The page structure of the World Wide Web is not as deep as we might imagine, but unexpectedly wide. The crawler therefore generally adopts breadth-first traversal. The importance of web pages is a further reason: breadth-first traversal helps to grasp the more important web pages.

(2) The World Wide Web is so complex that a chosen grasping path cannot always be guaranteed to be the best one. To prevent this problem, the diameter of the web needs to be fully considered, and a "depth-first strategy" should be adopted to control the grasping depth. In this way, the problem can be solved [5].

Let’s look at the following example: 

Suppose that, starting from seed site A, seed site B and seed site C, there are three paths to web page P, with lengths 3, 19 and 127 respectively (CostAP=3; CostBP=19; CostCP=127). Grasping web page P from seed site A is very quick, whereas seed sites B and C reach P only after a long path, which is clearly not economical.

Figure 2. Path cost diagram of different seed sites

To prevent the Crawler from unlimited breadth-first grasping, the grasping depth must be limited. Once this depth is reached, grasping stops. The depth limit is set to the diameter of the World Wide Web. Pages deeper than the limit that are left un-grasped are expected to be reached from other seed sites in a more economical way. For example, the Crawlers starting from seed sites B and C stop once they reach a depth of 17, leaving web page P to be grasped by the Crawler starting from seed site A. Limiting the grasping depth also removes the conditions for infinite loops: any loop stops after a finite number of iterations. Moreover, combining the depth limit with the breadth-first strategy effectively guarantees locality in the course of grasping: most grasped web pages fall under the same domain name, while web pages under other domain names are rare[6].
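The combination of breadth-first traversal with a depth limit can be sketched on an in-memory link graph. This is an illustration under simplifying assumptions, not the paper's crawler: `links` is a hypothetical dictionary mapping each page to its outgoing links, no pages are downloaded, and all seeds share one queue.

```python
from collections import deque


def crawl_bfs(links, seeds, max_depth=17):
    """Breadth-first traversal from several seeds, stopping at max_depth."""
    depth = {}
    queue = deque()
    for seed in seeds:
        if seed not in depth:
            depth[seed] = 0
            queue.append(seed)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        if depth[page] >= max_depth:
            continue  # depth limit reached: do not follow links any further
        for target in links.get(page, ()):
            if target not in depth:  # each page is visited once, so loops end
                depth[target] = depth[page] + 1
                queue.append(target)
    return order, depth


if __name__ == "__main__":
    graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
    print(crawl_bfs(graph, ["A"], max_depth=2))
```

Because all seeds start at depth 0 in the same queue, a page such as P in the example above is recorded at its shortest distance from any seed, and the long paths from the other seeds are cut off by the limit.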

C. Judgement of the Web Importance 

While maintaining the priority strategy of web page grasping, important web pages should be grasped first, so that the more important pages are collected with the limited resources available. Which web pages are more important, and how is importance measured?

Importance is measured by the following three quantities: IB(P), IL(P) and ID(P).

1) IB(P) 

It is mainly decided by the number and quality of back links. First, the more back links a web page has, the more it is recognized by other pages; it also has more opportunities to be visited by users, so its importance is more evident. Second, the more important the pages that point to it, the more important it is. Quality matters because of cheating web pages, which artificially create many back links pointing to themselves to inflate their importance. If quality is not considered, the result is a local optimum rather than a global optimum.

2) IL (P) 

It is a function of the URL string that examines only the string itself. IL(P) is realized mainly through simple patterns; for example, it attaches more importance to URLs containing ‘com’ or ‘home’. It also regards URLs with fewer slashes as more important.

3) ID(P) 

ID(P) reflects the link distance from the seed site set: under breadth-first traversal, it is the number of links that must be followed from a seed site to arrive at the web page. ID(P) is another important index of web pages. The closer a page is to a seed site, the more opportunities it has to be visited, and therefore the more important it is; the seed sites are where the most important web pages are. The farther a page is from the seed sites, the less important it is.
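The URL-string measure IL(P) described above can be sketched as a small scoring function. The weights `keyword_bonus` and `slash_penalty`, the function name and the linear form of the score are illustrative assumptions; the paper only states that ‘com’ or ‘home’ raises importance and that fewer slashes are better.

```python
from urllib.parse import urlparse


def il_score(url, keywords=("com", "home"),
             keyword_bonus=1.0, slash_penalty=0.5):
    """Score a URL from its string alone: keyword bonus minus slash penalty."""
    parsed = urlparse(url)
    text = (parsed.netloc + parsed.path).lower()
    bonus = sum(keyword_bonus for kw in keywords if kw in text)
    slashes = parsed.path.count("/")  # only slashes in the path are counted
    return bonus - slash_penalty * slashes


if __name__ == "__main__":
    for u in ("http://example.com/home",
              "http://www.example.com/",
              "http://www.example.org/a/b/c/page.html"):
        print(u, il_score(u))
```

In a crawler, such a score would typically be one input to the priority queue of URLs waiting to be grasped, alongside IB(P) and ID(P).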

D. Non-Repeated Grasping Strategy 

The large number of mirrored (duplicated) web pages is another important characteristic of the web. According to statistics over 24 million pages by the Google system, 22% of the web pages are mirrors. The existence of many duplicated web pages is unfavorable to users’ queries. It not only wastes the storage space of search engines, but also decreases system efficiency.

There are two reasons for repeated collection. On the one hand, the collecting program does not clearly record the visited URLs. On the other hand, domain names and IP addresses can correspond to each other in multiple ways. The first problem can be solved by keeping a record of the visited URLs and comparing every new URL with the visited ones. The second problem is more complex, because different URLs may refer to the same IP.

There are four kinds of correspondence between domain names and IP addresses: one-to-one, one-to-many, many-to-one and many-to-many. A one-to-one relationship does not cause repeated collection, but the others are likely to do so.
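Both remedies, recording visited URLs and collapsing domain names that point to the same IP, can be sketched with a hash set. This is a simplified illustration: `resolve_host` is a hypothetical caller-supplied function (a lookup table here, a DNS query in practice), and treating equal IP plus equal path as the same page is a heuristic that fails for virtual hosts sharing one IP.

```python
from urllib.parse import urlsplit


def make_dedup_filter(resolve_host):
    """Return is_new(url), which is True only the first time a page key is seen."""
    seen = set()

    def is_new(url):
        parts = urlsplit(url)
        host = parts.hostname or ""
        # Key the page by resolved address instead of by domain name.
        key = (resolve_host(host), parts.port, parts.path or "/", parts.query)
        if key in seen:
            return False
        seen.add(key)
        return True

    return is_new


if __name__ == "__main__":
    table = {"a.example": "10.0.0.1", "mirror.example": "10.0.0.1"}
    is_new = make_dedup_filter(table.get)
    print(is_new("http://a.example/x"), is_new("http://mirror.example/x"))
```

The set lookup is a constant-time operation on average, which matters because the check runs once for every extracted link.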

1) Algorithm Based on B-tree 

Because the number of web pages is huge, web page grasping requires network bandwidth, machines, time and so on. Repeated grasping of the same web page greatly reduces the efficiency of the system, so the Crawler system needs a strategy that avoids repeated grasping and ensures that a web page is grasped only once in a certain period of time [7].

A B-tree is a kind of balanced multiway search tree. The file systems of operating systems use B-tree search, and it can also be used to design the URL matching algorithm that avoids repeated grasping in the Crawler. A B-tree can be empty or a multiway tree. A B-tree of order m must meet the following requirements:

(1) Each node has at most m subtrees;

(2) If the root node is not a leaf node, it has at least two subtrees;

(3) All non-terminal nodes except the root have at least ⌈m/2⌉ subtrees;

(4) All non-terminal nodes contain the following information: (n, A0, K1, A1, K2, A2, …, Kn, An)

Each node includes n pointers pointing to each 

keyword record. Ki(i=1, …, n) is keyword and 

Ki


make sure whether it has been grasped by looking into 

the Hash table. 

E. Webpage Revisiting Strategy 

The popularity of the web results from the information it brings. Information is constantly changing, so webpage information is inevitably updated, and information grasped earlier may be out of date or of no use at all. A strategy is thus needed to solve the problem of the timeliness of information; it is called the webpage revisiting strategy. Through revisiting, the stored webpages keep pace with the changes of the World Wide Web.

In 2000, Cho and Garcia-Molina of Stanford University randomly chose 500,000 web page samples and found that 23% of the web pages were updated on a daily basis, while 40% of the web pages with .com as the suffix of their domain names were updated every day. The half-life of web pages is 10 days. In addition, the study shows that the process by which web pages change can be modeled as a Poisson process [8].

To describe the Poisson process model, let $X(t)$ denote the number of changes of a web page in the period $(0, t)$. A Poisson process with parameter $\lambda$ has the following property: for $s \ge 0$ and $t \ge 0$, the random variable $X(s+t) - X(s)$ follows a Poisson distribution, namely

$$\Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda t)^k}{k!}\,e^{-\lambda t} \tag{1}$$

in which $k = 0, 1, 2, \ldots$

The expected value of the random variable $X(s+t) - X(s)$ is $\lambda t$:

$$E[X(s+t) - X(s)] = \lambda t \tag{2}$$

(2) 

This can be verified in a simple way. Suppose that the time interval is 1; then

$$E[X(t+1) - X(t)] = \sum_{k=0}^{\infty} k\,\Pr\{X(t+1) - X(t) = k\} = \sum_{k=1}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!} = \lambda \tag{3}$$

in which $\sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = e^{\lambda}$. With (2), it can be obtained that

$$E[X(t+1) - X(t)] = \lambda t = \lambda \tag{4}$$

Through the trace analysis of 500,000 random web pages, Cho and Garcia-Molina came to the important conclusion that the updates of most web pages follow a Poisson distribution [9].
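The Poisson model gives a simple way to reason about revisit intervals. The sketch below evaluates the distribution above and the probability that a page has changed at least once since the last visit, which is 1 minus the probability of zero changes. The function names and the example rate are hypothetical; the paper does not prescribe a revisit rule.

```python
import math


def poisson_pmf(k, lam, t=1.0):
    """Probability of exactly k changes in an interval of length t."""
    mu = lam * t
    return mu ** k * math.exp(-mu) / math.factorial(k)


def prob_changed(lam, t):
    """Probability of at least one change in an interval of length t."""
    return 1.0 - math.exp(-lam * t)


if __name__ == "__main__":
    lam = 0.5  # assumed average number of changes per day
    for days in (1, 2, 7):
        print(days, round(prob_changed(lam, days), 3))
```

A crawler could, for instance, revisit a page once `prob_changed` exceeds a chosen threshold, so that frequently changing pages are revisited more often.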

F. Robots Protocol 

The Robots Protocol is a standard that a Web Crawler should conscientiously observe; its main content is the Robots.txt file. Crawler writers generally observe this protocol. A Crawler can still acquire web information without observing the Robots.txt standard, but if a webmaster finds that a Crawler causes problems, he will contact its owner through its identifier, or even prevent this Web Crawler from extracting web pages by other means. Crawler developers should therefore observe this protocol [10].

After entering a website, the web spider first visits the text file containing the Robots Protocol, which is usually in the root directory of the web server, such as www.163.com/Robots.txt. With the protocol file Robots.txt, webmasters can define the directories that no Web Crawler may visit, or the specific directories that certain Web Crawlers may not visit [11]. For instance, if a website does not want its executable directory and temporary file directory to be searched by search engines, the webmaster can define these two directories as directories to which access is denied.

The format of the Robots.txt file is as follows.

User-agent:

It is the name of the crawler. If the file “Robots.txt” contains more than one User-agent record, several crawlers are limited by this protocol; the file must have at least one User-agent record. If the value of this record is set to *, the protocol applies to any crawler. The file “Robots.txt” can contain only one “User-agent:*” record.

Disallow:

It describes a URL that should not be visited. This URL can be a complete path or a prefix of one. Any URL that starts with the value of Disallow cannot be visited by a robot [12].

For example:

A: “Disallow:/help” means that crawlers may fetch neither /help.html nor /help/index.html.

B: “Disallow:/help/” means that crawlers may fetch /help.html but not /help/index.html.

C: If the value of a Disallow record is empty, all pages of the website can be fetched by crawlers. The file “/robots.txt” must contain at least one Disallow record. If “/robots.txt” is an empty file, the website is open to all crawlers.
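Examples A to C can be reproduced with the robots.txt parser in the Python standard library. This is only an illustrative sketch: the agent name `ExampleCrawler` is hypothetical, and the three rule sets are the minimal files implied by the examples, not files taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rule_a = build_parser("User-agent: *\nDisallow: /help\n")   # example A
rule_b = build_parser("User-agent: *\nDisallow: /help/\n")  # example B
rule_c = build_parser("User-agent: *\nDisallow:\n")         # example C (empty value)

AGENT = "ExampleCrawler"  # hypothetical crawler name
for name, parser in (("A", rule_a), ("B", rule_b), ("C", rule_c)):
    for path in ("/help.html", "/help/index.html"):
        print(name, path, parser.can_fetch(AGENT, path))
```

A crawler would call `can_fetch` before requesting each URL and skip any URL for which it returns `False`.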

Apart from observing the Robots Protocol, a crawler should plan its crawling intensity reasonably, reducing it during the daytime and moderately increasing it at night, when the number of visits to the web host is low. Because of time differences, when it is daytime in the Eastern Hemisphere it is night in the Western Hemisphere. So the crawler can crawl American and European websites more intensively during the day and crawl websites of its own country more intensively at night [13].

Even so, a crawler inevitably places some burden on the web hosts it visits. So a program that monitors website crawling is indispensable. This program records the crawling traffic of every website to avoid the problems caused when the crawling intensity is occasionally excessive.


III. A DISTRIBUTED DESIGN OF WEB CRAWLER SYSTEM 

Thousands of WWW servers on the web form a mass of information through the links between them, with the connection to each host being relatively independent. A single-processor system is restricted by its CPU capacity, disk storage capacity, network bandwidth and other resources. It cannot handle such a huge amount of information, let alone keep up with the rapid growth of web information. Distributed technology therefore becomes the natural choice. The design of the distributed system pursues the following goals: (1) The crawling ability of a single machine should not decrease much when the number of crawling machines increases, i.e. the communication and management overhead of the system should be kept to a minimum while pursuing load balance. (2) For practical operation, dynamic configuration of the system should be considered, i.e. one or more machines can be added or removed during operation.

A. A Distributed Structure Design of Web Crawler 

System 

To design a robust and efficient web crawler, the task must be distributed across multiple machines that process it concurrently. The huge number of web pages is distributed independently across the network, which makes concurrent access both feasible and reasonable. Meanwhile, concurrent distribution saves network bandwidth resources. Besides, in order to improve the recall ratio, precision and search speed of the whole system, the internal search algorithm should have a certain degree of intelligence. Therefore, the distributed web crawler adopts the structure design shown in Figure 3.


The core of system distribution is data distribution. The chief dispatcher is responsible for distributing URLs to every distributed crawler. The distributed crawlers fetch web pages over the HTTP protocol. To improve speed, hundreds of distributed crawlers can usually be launched simultaneously. The distributed crawlers concurrently analyze and process the collected web pages, extract URL links and other relevant information, and submit them to their respective dispatchers, which in turn submit them to the chief dispatcher.

B. Basic Process Design for a Distributed Web Crawler 

Grasping 

Figure 4 is a brief flow chart which only shows how pages are processed when no errors occur. In this process, the web crawler starts working when one URL is added to the waiting queue. As long as there is a web page in the waiting queue or the crawler is processing a web page, the crawler program continues working. When the waiting queue is empty and no web page is being processed, the web crawler stops working.
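The error-free process described above can be sketched as a simple queue loop. This is a single-threaded illustration under stated assumptions: `fetch_links` is a hypothetical callback standing in for downloading a page and extracting its links, and because only one page is processed at a time, "no page being processed" coincides with the loop body having finished.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seed_urls: Iterable[str],
          fetch_links: Callable[[str], list[str]]) -> list[str]:
    """Process URLs from a waiting queue until the queue is empty."""
    waiting = deque(dict.fromkeys(seed_urls))  # waiting queue, seeds de-duplicated
    seen = set(waiting)                        # URLs already queued or processed
    visited = []
    while waiting:                             # stop when the waiting queue is empty
        url = waiting.popleft()
        visited.append(url)
        for link in fetch_links(url):          # hypothetical download-and-parse step
            if link not in seen:
                seen.add(link)
                waiting.append(link)
    return visited


# Usage with a tiny in-memory link graph instead of real HTTP requests.
graph = {"a": ["b", "c"], "b": ["c", "a"], "c": []}
print(crawl(["a"], lambda url: graph.get(url, [])))
```

In a multi-threaded crawler the stop condition would additionally need a counter of pages currently being processed, as the text notes.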

C. The Design of a Cooperative Grasping Algorithm of 

the Distributed Web Crawler 

When multiple crawlers work together, how to decompose the workload becomes the major problem. If the division is not clear, multiple crawlers are likely to crawl the same web pages, causing additional overhead. There are two schemes to solve this.

Scheme 1: Decompose by the web host's IP address, so that a given crawler crawls only the web pages in a certain range of addresses.

Scheme 2: Decompose by domain name, so that a given crawler crawls only the web pages of a certain set of domain names.

The World Wide Web locates a host by its IP address in the network infrastructure, but because an IP address is written in dotted decimal notation, it is hard to remember. Domain names are therefore used to map to IP addresses. Because domain names are friendly to people, the following situation arises: many domain names correspond to the same IP address. Small and medium-sized websites usually use this method to provide different Web services, mainly for economic reasons, since only one server is needed. Large websites such as Sina, Sohu and other portals generally adopt load-balancing IP multicast technology, which means that the same domain name corresponds to many IP addresses. In this way, the robustness of the system is enhanced and load balance is achieved.

Figure 3. A distributed structure design of web crawler system

Figure 4. Basic process design for a distributed web crawler grasping

Given that many domain names may correspond to the same IP address and that one domain name may correspond to many IP addresses, a fairly good approach is to decompose tasks by domain name. As long as the web pages of large websites are not crawled repeatedly, the allocation strategy is acceptable even if some small websites are crawled repeatedly. This method allocates domain names to different crawlers, and each crawler crawls only the web pages of its “appointed” set of domain names. For example, sina.com.cn is “appointed” to spider1, jxust.cn to spider2 and sim.jx.cn to spider3.

The main differences between the two schemes can be further understood through the following two examples.

Suppose that we have 3 spiders to analyze 2 websites, www.jxust.cn and www.sim.jx.cn. They have different domain names but the same IP address (218.87.136.5). The homepages are http://www.jxust.cn/index.html and http://www.sim.jx.cn/index.html. After DNS resolution, both are actually http://218.87.136.5/index.html. The domain decomposition scheme makes spider2 and spider3 crawl this page twice. However, since this site does not hold much information, the loss caused by repeated crawling can be tolerated.

The IP decomposition scheme allocates crawling tasks differently. For example, sina.com.cn (71.5.7.138) is “appointed” to spider1, sina.com.cn (71.5.6.136) to spider2, and jxust.cn (218.87.136.5) and sim.jx.cn (218.87.136.5) both to spider3. In this allocation scheme, different domains pointing to the same IP cause no repetition: the crawling tasks of jxust.cn and sim.jx.cn are both completed by spider3. However, sina.com.cn corresponds to several IPs and is allocated to both spider1 and spider2, so the crawling tasks of spider1 and spider2 overlap. Sina is a large-scale website, so the loss resulting from this repeated crawling will be huge.

This comparison shows that the domain decomposition strategy is more reasonable, because it takes large websites into consideration. Therefore, in the crawler system, a general scheduler should decompose crawling tasks by domain name and assign web pages to different crawlers accordingly. A formal description of the scheduling is as follows.

First, suppose that n crawlers can work concurrently, and define a function domain that extracts the domain name of a URL, such as:

http://news.163.com.cn/20090824/08116145133.shtml

Domain (URL) =news.163.com

(1) For any URL, use the function domain to extract the domain name of the URL.

(2) Apply the MD5 signature function to the domain name: MD5(domain(URL)).

(3) Take the MD5 signature value modulo n: int spider_no = MD5(domain(URL)) % n.

(4) Allocate this URL to the crawler numbered spider_no.
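Steps (1) to (4) can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the function `domain` here simply returns the host part of the URL as parsed by the standard library, and the 128-bit MD5 digest is interpreted as a big-endian integer before the modulo is taken.

```python
import hashlib
from urllib.parse import urlparse


def domain(url: str) -> str:
    """Step (1): extract the host name of a URL (assumed definition of 'domain')."""
    return urlparse(url).hostname or ""


def spider_no(url: str, n: int) -> int:
    """Steps (2)-(3): MD5 signature of the domain, reduced modulo n."""
    digest = hashlib.md5(domain(url).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


# Step (4): every URL of the same domain goes to the same crawler.
n = 3  # hypothetical number of crawlers
for url in ("http://www.jxust.cn/index.html",
            "http://www.jxust.cn/news/1.html",
            "http://www.sim.jx.cn/index.html"):
    print(url, "-> spider", spider_no(url, n))
```

Because the assignment depends only on the domain name, no coordination between crawlers is needed to decide who owns a URL.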

A modulo operation divides a universal set into several equivalence classes. The union of the equivalence classes is equal to the universal set, and an element of one equivalence class never belongs to another equivalence class. Formally, the equivalence relation can be expressed as follows.

Let U be a universal set that is mapped to S1, S2, …, Sn by a certain equivalence relation. It satisfies the following two conditions:

(1) S1∪S2∪...∪Sn=U

(2) if a∈Si, b∈Sj and Si≠Sj, then a≠b

Generally, n is an integral power of 2 (for example 4, 8, 16, 32…), because the modulo can then be computed rapidly with a bitwise AND (&), i.e. int spider_no = MD5(domain(URL)) & (n-1). Note that & (n-1) is equivalent to the modulo only when n is an integral power of 2.
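The equivalence between the modulo and the bitwise AND can be checked directly. A minimal sketch follows; the function name `mod_pow2` is introduced here only for illustration.

```python
def mod_pow2(value: int, n: int) -> int:
    """Compute value % n with a bitwise AND; valid only when n is a power of two."""
    if n <= 0 or (n & (n - 1)) != 0:
        raise ValueError("n must be a positive integral power of 2")
    return value & (n - 1)


# The two computations agree for every power of two tried here.
for n in (4, 8, 16, 32):
    assert all(mod_pow2(v, n) == v % n for v in range(1000))
print(mod_pow2(1234567, 16), 1234567 % 16)
```

For a value of n that is not a power of two, such as 12, the mask n-1 does not select a contiguous range of low bits, which is why the sketch rejects it.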

D. Large-Scale Web Storage Structure Design

The World Wide Web keeps changing all the time, so a web page database must be able to delete the old version of a page once the page has been deleted, which may leave storage voids. An update can be understood as a deletion followed by an addition, with additions appended to the web database in order. Therefore, some disk space compaction techniques have to be adopted to reclaim the storage voids. Besides, updates and reads should be mutually exclusive to avoid synchronization errors. A good page storage structure can therefore bring excellent access performance.

Combining the log structure and the Hash structure to exploit the advantages of both is quite a good choice. For a new web page, the page's signature is calculated from its URL. Then, through a modulo computation, the web page is mapped to a unit of the Hash table, with each Hash table unit corresponding to the location of a log file. For example, newly added pages are mapped by the Hash function to Hash[1], and then to the log file Log1. To read an already stored web page at random by its URL, the URL is likewise mapped to a specific log file through the same Hash function calculation, and the B-tree index on that log file is then searched for the corresponding page. This gives random access performance equivalent to, or even slightly better than, that of plain log files, because the number of files to be searched is greatly reduced. Most notably, this approach can use batch writing, a great improvement over the pure Hash structure. A write queue is added to each log file, and the batch is written only when the queue has accumulated a certain number of pages, as shown in Figure 5.
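The Hash-Log idea can be illustrated with a small in-memory sketch. Several simplifications are assumed here and are not from the paper: Python lists stand in for log files, a dictionary per log stands in for the B-tree index, and the class name `HashLogStore` and its parameters are hypothetical.

```python
import hashlib


class HashLogStore:
    """In-memory sketch: hash table units, one append-only log and write queue each."""

    def __init__(self, buckets: int = 4, batch_size: int = 3):
        self.buckets = buckets
        self.batch_size = batch_size
        self.logs = [[] for _ in range(buckets)]     # stands in for log files
        self.indexes = [{} for _ in range(buckets)]  # dict stands in for the B-tree index
        self.queues = [[] for _ in range(buckets)]   # pending batch per log

    def _bucket(self, url: str) -> int:
        # Signature of the URL, then a modulo to pick the hash table unit.
        sig = int.from_bytes(hashlib.md5(url.encode("utf-8")).digest(), "big")
        return sig % self.buckets

    def add(self, url: str, page: str) -> None:
        b = self._bucket(url)
        self.queues[b].append((url, page))
        if len(self.queues[b]) >= self.batch_size:   # write only in batches
            self.flush(b)

    def flush(self, b: int) -> None:
        for url, page in self.queues[b]:
            self.indexes[b][url] = len(self.logs[b])  # newest version wins
            self.logs[b].append(page)
        self.queues[b].clear()

    def get(self, url: str):
        b = self._bucket(url)
        for queued_url, page in reversed(self.queues[b]):  # not yet written
            if queued_url == url:
                return page
        pos = self.indexes[b].get(url)
        return None if pos is None else self.logs[b][pos]
```

Note that an updated page is appended as a new log entry, so the old entry remains in the log as a void until some compaction step reclaims it, which matches the behaviour described at the start of this subsection.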


Figure 5. Batch writing-in of newly added pages

Figure 6. n node distributed cooperative grasping system performance decreases as time varies

A Hash table turns the uncertainty of where newly added web pages are inserted into certainty. Therefore, by adding an insertion queue, pages can be inserted into the target log files in batch mode. Through the Hash function decomposition, each log in the Hash-Log structure is far smaller than the log in the pure log structure, and at the same time much larger than a Hash bucket in the pure Hash structure.

Besides, it must be ensured that each log can be held in memory. So to determine the size of the Hash table in Hash-Log, it is necessary to consider the size of the actual physical memory and the number of web pages that need to be stored.

Table I gives a qualitative evaluation of three ways of storing web pages.

TABLE I.
QUALITATIVE EVALUATION OF THREE STORAGE WAYS OF WEB PAGES

| | Log structure | Structure based on Hash | Hash-Log |
|---|---|---|---|
| Ordered access | ++ | - | + |
| Random access | +- | ++ | + |
| Increase webpages | ++ | - | + |

To sum up, when there are few opportunities for random access, the log structure is the best way to store web pages. When there may be a great deal of random access and many new web pages must be added, Hash-Log is a more suitable way to store web pages. It effectively supports distributed web page storage: pages are spread across the storage nodes, which increases the reliability and stability of storage when multiple machines are used in a larger environment. The overall search effect is not greatly affected even if a storage node fails.

IV. SYSTEM EVALUATIONS

A. Operating System Environment

TABLE Ⅱ.
HARDWARE DEVICES

| Device type | Device purposes | Device configuration | Count |
|---|---|---|---|
| Lenovo M6000 | Server | Intel P4 3.2GHz / Memory 1G / HardDisk 160G / NIC 10M-100M | 1 |
| Lenovo T2900V | Client | Intel Celeron 2GHz / Memory 1G / HardDisk 160G / NIC 1000M | 3 |
| Lenovo 428E | Client | Intel P4 3.2GHz / Memory 512M / HardDisk 800G / NIC 10M-100M | 10 |

B. Performance Evaluation

The statistics show that crawling rates and the number of pages crawled differ considerably between sites, depending on the access speed of each site. Some sites also impose restrictions on crawlers, including speed restrictions and certain accessibility restrictions. With three nodes crawling cooperatively, the average rate can reach 7 pages per second, which is a very satisfactory result.

TABLE Ⅲ.
GRASPING CONDITIONS OF SOME FAMOUS WEBSITES

| Web Site | Web Size | Count of Webs | Crawl Time (Hour) | Average Rate (Pages/Second) | Crawl Nodes |
|---|---|---|---|---|---|
| 163.com | 10.5G | 180596 | 5 | 10.033 | 3 |
| sina.com.cn | 9.3G | 132769 | 5 | 7.376 | 3 |
| yahoo.com.cn | 8.7G | 150101 | 5 | 8.339 | 3 |
| qq.com | 8.2G | 142969 | 5 | 7.943 | 3 |
| sohu.com | 7.7G | 131094 | 5 | 7.283 | 3 |

C. Scalability Evaluation

A system with good scalability gains performance linearly as cost is added. It is also easy to scale down or expand.

The influence of the number of cooperative crawling nodes on the crawling result is examined below. Figure 6 shows the results of running systems of four different scales (number of cooperative crawling nodes = 1, 2, 4, 10) during the first 10 hours. The abscissa denotes the running time of the crawler system in hours, and the ordinate represents the cumulative number of web pages crawled.

Figure 6 shows that as the number of cooperative crawling nodes increases, the basic system performance increases linearly. Therefore, this distributed system has good scalability and stability.

D. Task Load Balance Evaluation

The load balance of the system is based on the cooperative crawling algorithm of the distributed web crawler, which uses a Hash function to allocate URLs dynamically among the nodes. Load balance cannot be judged from the number of URLs allocated to each node at a single point in the process. Instead, all phases of the whole cooperative crawling process should be analyzed (the whole process is divided into several phases in time). The experiment is carried out with 3 nodes cooperatively crawling 163.com. Table IV shows the URL distribution of each node over 5 hours of running the system.

TABLE Ⅳ.
THE VARIATION OF THE NUMBER OF URL ALLOCATION OF THE 3 NODES AS TIME VARIES

| Nodes | 10 Minutes | 30 Minutes | 1 Hour | 2 Hours | 5 Hours |
|---|---|---|---|---|---|
| Node 1 | 1893 | 5574 | 11753 | 22961 | 53742 |
| Node 2 | 1967 | 5782 | 12587 | 24323 | 57933 |
| Node 3 | 1952 | 5637 | 12127 | 22939 | 55161 |

Table IV shows that each node has crawled a roughly equal number of web pages. The load balance of the distributed web crawler has reached the expected basic objective.
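One way to quantify "basically equal" is to compare, at each checkpoint, the spread between the busiest and the least busy node with the mean allocation. The sketch below uses the node counts reported in Table IV; the metric itself (spread divided by mean) is an assumption introduced here, not a measure defined in the paper.

```python
# URL allocations per node (Node 1, Node 2, Node 3) at each checkpoint, from Table IV.
allocations = {
    "10 minutes": [1893, 1967, 1952],
    "30 minutes": [5574, 5782, 5637],
    "1 hour": [11753, 12587, 12127],
    "2 hours": [22961, 24323, 22939],
    "5 hours": [53742, 57933, 55161],
}


def imbalance(counts: list[int]) -> float:
    """(max - min) / mean: 0 means a perfectly even allocation."""
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean


for checkpoint, counts in allocations.items():
    print(f"{checkpoint}: {imbalance(counts):.3f}")
```

With these figures the spread stays below 8 percent of the mean at every checkpoint.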

REFERENCES 

[1] Li Xiaoming, Yan Hongfei, Wang Jimin, Search Engine: Principle, Technology and System. Beijing: Science Press, 2005.

[2] M. Najork, J. Wiener, “Breadth-first search crawling yields high-quality pages,” in 10th International World Wide Web Conference, 2001.

[3] Reka Albert, Hawoong Jeong, Albert-Laszlo Barabasi, “Diameter of the World-Wide Web,” Nature 401, pp. 130-131, 1999.

[4] Li Xiaoming, “Estimation of the Number of Static Web Pages in China,” PKU_CS_NET_TR2002006, 2002.

[5] A. Broder, R. Kumar, F. Maghoul, A. Tomkins, J. Wiener, “Graph structure in the web: experiments and models,” presented at Proceedings of the 9th World-Wide Web Conference, Amsterdam, 2000.

[6] A. Arasu, J. Cho, H. Garcia-Molina, “Searching the Web,” ACM Transactions on Internet Technology, pp. 42.

[7] Narayanan Shivakumar, Hector Garcia-Molina, “Finding near-replicas of documents on the web,” Web DB 1998, pp. 204-212.

[8] J. Cho, H. Garcia-Molina, “Estimating Frequency of Change,” ACM Transactions on Internet Technology, Vol. 3, 2003.

[9] A Standard for Robot Exclusion [EB/OL], http://www.robotstxt.org/wc/norobots.html

[10] J. Talim, Z. Liu, Ph. Nain, E. G. Coffman, “Controlling the robots of Web search engines,” Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, Cambridge, Massachusetts, United States, 2001.

[11] Junghoo Cho, Hector Garcia-Molina, “Parallel crawlers,” in Proceedings of the eleventh international conference on World Wide Web, Honolulu, Hawaii, USA, ACM Press, pp. 124-135, 2002.

[12] Paolo Boldi, Bruno Codenotti, Massimo Santini and Sebastiano Vigna, UbiCrawler: A Scalable Fully Distributed Web Crawler, 2003.

[13] Yan Hongfei, “Primary Exploration on Design, Realization and Application of Extensible Web Information Collection System,” Beijing University Doctoral Dissertation, 2002.


A Ranking Method of Retrieval Results Based on 

Web Comprehending 

Zhijuan Deng 


66162815@qq.com 

Shaojun Zhong 


infor2000@qq.com 

Abstract—This paper puts forward a method for calculating the query similarity of web page search results based on Web comprehending. According to the user's query input, the method uses Web comprehending technology to display the important web pages closest to the query on the first page of the result list. This makes users more satisfied with the search engine's response, pursuing recall while ensuring precision.

Index Terms—similarity, Web comprehending, search 

engine, search results 


As is well known, the scale of the World Wide Web is great. According to the analysis of Lawrence and Giles, the number of pages on the World Wide Web doubles every two years. As early as 1998, researchers generally believed that the scale of the World Wide Web was on the order of a billion pages. Extrapolating at this rate, the present World Wide Web has reached the order of ten billion pages [1]. A search engine should display a list of information related to the query strings that users input. Although the precision of a search strategy based on keyword matching is very high, it ignores many semantically related terms, which limits its ability to offer users effective information.

A topic-specific search engine is devoted to offering users more comprehensive and professional service related to its topic. It combines with Web comprehending to offer correlation search, and actively offers users highly professional and highly relevant search results. For a topic-specific search engine, if the query words that users input are common, the number of pages related to the input is great. If the captured pages are displayed without any ranking rules, many unimportant or weakly related pages will appear near the top. As some investigations show, when querying with a search engine, most users pay attention only to the first page of results and do not look at the next page. Consequently, if the first page is unimportant or weakly related, precision is affected and users are unsatisfied with the results of the search engine. So ranking the pages obtained by searching is very necessary.

doi:10.4304/jnw.6.12.1690-1696

However, the pages obtained through exact-match queries and the pages obtained through Web comprehending technology are different. To follow the principle of displaying important pages on the first page, a quantitative basis should be offered to distinguish these two kinds of pages when ranking. That is, the method of calculating the similarity between query words and documents should be adapted accordingly, starting from exact matching.

II. CALCULATION OF THE SIMILARITY BETWEEN WEB 

COMPREHENDING TECHNOLOGY AND SEARCH RESULTS 

A. Comprehending the Importance of Web Pages with 

PageRank Technology 

PageRank is a method for analyzing network links put forward in 1998 by Sergey Brin and Lawrence Page, then doctoral candidates at Stanford University. It evaluates all pages, assigns every page a value that measures its importance, and uses this value in the ranking of search results [2].

Specifically, PageRank assumes that a surfer browses for several steps by following links, then jumps to a random starting page and again browses by following links, so the value of a page is determined by how frequently it is visited during random surfing. The basic ideas of the PageRank algorithm are as follows: (1) if a page is referenced by many other pages, it is probably an important page; (2) if a page is not referenced many times but is referenced by an important page, it is probably also an important page; (3) the importance of a page is divided evenly and passed on to the pages that it references. Based on the link structure of the entire Web, PageRank calculates the importance of all pages, assuming that users can visit the entire network via the hyperlinks between pages. But the Web graph is usually not strongly connected, so PageRank applies the random surfing model: with probability 1 − d, a visitor jumps at random to another node of the Web graph, which is equivalent to adding a link between two pages that are not linked. The application of PageRank in Google has proved that integrating the PR values of the results, obtained by analyzing and comprehending Web page links, can greatly improve the precision of search results [3].

The value of PageRank is defined as follows.

Assume that pages T1…Tn point to page A (that is, T1…Tn reference page A). Parameter d is a damping coefficient set between 0 and 1. C(A) is defined as the number of links going out of page A. The PageRank value of page A is then given by (1).

$$PR(A) = (1 - d) + d \times \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)} \qquad (1)$$
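Formula (1) can be evaluated by simple fixed-point iteration. The sketch below is illustrative only: the damping value d = 0.85, the iteration count and the three-page link graph are assumptions chosen for the example, and the update is applied exactly in the form written in (1).

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Iterate PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) over all pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum over every page Ti that links to `page`; C(Ti) = len(targets).
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A and B both reference C, and C references A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page C, which is referenced by two pages, ends up ranked above A, and B, which nobody references, keeps only the baseline value 1 − d.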

PageRank values form a probability distribution over the entire set of web pages, so the sum of the PageRank values of all web pages is 1. In the formula, PR(A) is the PageRank value of page A; d is the damping factor,

0


similarity. 3) Rank the documents by their similarity to the given query word and return them to users. In this way, the vector space model feeds the comprehended Web document content back to the retrieval system to optimize users' queries.

C. Utilize Latent Semantic Analysis to Comprehend 

Web Documents 

Latent semantic indexing (LSI), also called latent semantic analysis (LSA), was put forward to improve the effect of the vector space model. The basis of latent semantic indexing is the feature item-text matrix. Singular value decomposition is applied to this matrix to obtain the latent semantic structure model.

The object of LSI is the relation between words in texts: some latent association is implied in the patterns of how terms are used in context. Statistical computation is therefore applied to a large body of text to find these latent associations. It needs no explicit semantic coding and relies only on the relation between objects and their contexts; it uses the latent semantics to represent words as well as texts, and finally achieves the objective of eliminating the correlation between words and simplifying text vectors. Because there is strong correlation between words, LSI uses this correlation to apply a statistical transform to the context usage patterns of the words in a text collection and so obtain a new semantic space [8].

Some infrequent usages of vocabulary should be removed from the main semantic structure, for example: misused words, unrelated words that happen to occur in the same document, and “noise” words that cannot represent the topic of the text, such as very high-frequency and very low-frequency words. Truncated singular value decomposition is used to reduce dimensions and thereby filter information and remove noise. LSI projects the high-dimensional representation of texts and vocabulary into the low-dimensional latent semantic space, which reduces the dimension of the problem; at the same time, the low-dimensional representation reveals the semantic relations between vocabulary and texts [9].

D. Singular Value Decomposition 

Latent semantic indexing mainly applies the Singular Value Decomposition (SVD) of a matrix. SVD is a common method in mathematical statistics, mainly used for unconstrained least-squares problems, matrix rank estimation, and canonical correlation analysis.

Definition of singular values: suppose A is an m×n real matrix; the arithmetic square roots of the non-zero eigenvalues of the n-order square matrix $A^T A$ are the singular values of matrix A.

Singular value decomposition theorem: suppose $A \in R^{m \times n}$ with rank r; then there exist an m-order orthogonal matrix U and an n-order orthogonal matrix V such that

$$U^T A V = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix},$$

and

$$A = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T$$

is the singular value decomposition of matrix A.

A special form of singular value decomposition is used in information retrieval, because the matrix to be decomposed is usually a high-order sparse matrix [10].

Without loss of generality, suppose the vocabulary-text matrix A is a sparse matrix with m rows and n columns, where m >> n and rank(A) = r. By the singular value decomposition theorem, the singular value decomposition of A is $A = T_0 S_0 D_0^T$.

The columns of T0 are mutually orthogonal and have length 1, that is, $T_0^T T_0 = I$; the column vectors of T0 are called the left singular vectors of matrix A. S0 is the diagonal matrix of the singular values of A, that is, $\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r \ge \lambda_{r+1} \ge \cdots = 0$, in which $\lambda_i$ is the i-th singular value of A.

The columns of D0 are mutually orthogonal and have length 1, that is, $D_0^T D_0 = I$; the column vectors of D0 are called the right singular vectors of matrix A.

Generally, in $A = T_0 S_0 D_0^T$, the matrices T0, S0 and D0 are all full-rank matrices, which together carry all the information of the original matrix A. The advantage of SVD lies in using smaller matrices for a best-fit approximation [11]. If the elements on the diagonal of S0 are ordered by size, the k largest singular values are kept and the others are set to 0, the resulting matrix, denoted Ak, is an approximation of the original matrix A whose rank is k. It can be proved that among all matrices of rank k, Ak is the closest to A in the Frobenius norm. After the zeros are introduced into S0, it can be simplified by deleting the corresponding rows and columns, giving a new diagonal matrix S; likewise, taking the first k columns of T0 and D0 gives the matrices T and D respectively. The rank-k approximation Ak of A can then be constructed:

$$A \approx A_k = T S D^T \qquad (4)$$

This is the optimal rank-k model in the least-squares sense, which can be used to estimate the necessary data.
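The truncation in (4) can be sketched with NumPy, which is assumed to be available. The 5×4 matrix below is a hypothetical vocabulary-text count matrix invented for the example, and the names `T`, `S` and `Dt` mirror the symbols T, S and the transpose of D in the text.

```python
import numpy as np


def rank_k_approximation(A: np.ndarray, k: int):
    """Return A_k = T S D^T built from the k largest singular values, plus all singular values."""
    T0, s0, D0t = np.linalg.svd(A, full_matrices=False)  # A = T0 diag(s0) D0^T
    T = T0[:, :k]            # first k columns of T0
    S = np.diag(s0[:k])      # k largest singular values
    Dt = D0t[:k, :]          # first k rows of D0^T, i.e. first k columns of D0
    return T @ S @ Dt, s0


# Hypothetical vocabulary-text matrix (rows: words, columns: texts).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]], dtype=float)

A_k, s = rank_k_approximation(A, k=2)
error = np.linalg.norm(A - A_k, "fro")
print("singular values:", np.round(s, 3))
print("Frobenius error of the rank-2 approximation:", round(float(error), 3))
```

The Frobenius error of the approximation equals the square root of the sum of the squared discarded singular values, which gives a direct way to see how much information a given k drops.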

The choice of the dimension k affects the efficiency of the semantic space model: too small a k loses useful information, while too large a k makes the computation expensive. In general, given $\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r \ge \lambda_{r+1} \ge \cdots = 0$, k is chosen to satisfy the contribution rate inequality

$$\sum_{i=1}^{k} \lambda_i \Big/ \sum_{i=1}^{r} \lambda_i \ge \theta \qquad (5)$$

where θ can be, for example, 40% or 50%.


Here θ is a threshold on the share of the original information that is retained. The contribution rate inequality is borrowed from the corresponding concept in factor analysis and measures how well the k-dimensional space represents the whole space.
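The selection rule in (5) amounts to taking the smallest k whose cumulative share of the singular values reaches θ. A minimal sketch follows; the function name `choose_k` and the sample singular values are hypothetical.

```python
def choose_k(singular_values: list[float], theta: float) -> int:
    """Smallest k with (sum of the k largest values) / (sum of all values) >= theta."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("at least one singular value must be positive")
    running = 0.0
    # Singular values are assumed to be sorted in descending order.
    for k, value in enumerate(singular_values, start=1):
        running += value
        if running / total >= theta:
            return k
    return len(singular_values)


# Hypothetical singular values, sorted in descending order.
values = [5.0, 3.0, 1.5, 0.5]
print(choose_k(values, 0.40), choose_k(values, 0.75))
```

Raising θ keeps more dimensions and hence more of the original information, at the price of more computation.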

Figure 2. Singular value decomposition diagram of 

vocabulary-text matrix 

For the approximation matrix Ak, the row vectors of T are called vocabulary vectors and the row vectors of D are called text vectors. Text retrieval and other text processing based on these vectors is what is called latent semantic indexing (LSI). Vocabulary vectors and text vectors can be projected into the same low-dimensional (k-dimensional) space, which is called the latent semantic space. Figure 3 gives an example of vocabulary and text in the latent semantic space.

Figure 3. Expression of vocabulary and text in latent semantic 

space. 

Through singular value decomposition and the selection of a rank-k approximation matrix, LSI effectively addresses the problems of synonymy and polysemy. Consider, for instance, the words “computer”, “computing machine”, “programming” and “home”: “computer” and “computing machine” are synonyms, “programming” is related to both of them, and “home” is totally irrelevant to the other three words.

In a keyword-based retrieval system, if “computer” does not appear directly in a text, then a query for “computer” covers neither the texts containing “computing machine” nor those containing “home”. However, users who query “computer” hope to find texts about “computing machine” as well, and possibly texts about “programming”, whose degree of association is lower than that of “computing machine”; they do not expect to find texts about “home”.

Through the latent semantic space obtained by singular value decomposition, latent semantic indexing can express the inner relations between these words well. In this space, the contexts of “computer”, “computing machine” and “programming” are consistent to some degree, so the distances between them are short, while all three lie farther from “home”; the semantic relations between words are thus brought to the foreground.

The same holds between vocabulary and text, and between text and text [12]. Generally speaking, a small value of k suffices: the resulting semantic space represents most of the information in the original matrix A, while information regarded as “noise” is removed. Besides, the rank-k approximation has far fewer terms than the original m×n high-dimensional sparse matrix. Reducing the matrix lowers the computational complexity, which helps to improve retrieval efficiency.

E. Calculation of Similarity Relation in Latent Semantic 

Indexing 

There are three important relations in the semantic space: between vocabulary and vocabulary, between text and text, and between vocabulary and text. Because the approximation matrix Ak of the original matrix A represents the most important and reliable latent semantic structure in A, and vocabularies and texts are all projected into the same space, the similarity for each of the three relations can be calculated conveniently from the matrices T, S and D [10].

1) Compare two vocabularies by forward multiplication.

$$A_k \times A_k^{T} = T \times S \times D^{T} \times D \times S \times T^{T} = T \times S^{2} \times T^{T} \tag{6}$$

Here $D^{T} \times D = I$, because D is orthonormal. The entry in row i, column j represents the similarity between vocabulary i and vocabulary j.

2) Compare two texts by backward multiplication.

$$A_k^{T} \times A_k = D \times S \times T^{T} \times T \times S \times D^{T} = D \times S^{2} \times D^{T} \tag{7}$$

In the above formula, $T^{T} \times T = I$, because T is orthonormal. The entry in row i, column j represents the similarity between text i and text j.

3) Compare vocabulary and text; this is the approximation matrix Ak of the original matrix A itself.

$$A_k = T \times S \times D^{T} \tag{8}$$

4) The similarity between users’ query request and texts.

In retrieval, a user’s query request can consist of vocabularies, texts, or any combination of both. The system first preprocesses the query, generates the query vector q from word frequency information, regards it as a “pseudo-text”, and represents it in the k-dimensional semantic space. With q as the original query vector (taken as a row vector), its representation in the k-dimensional semantic space is $q^{*} = q\,T\,S^{-1}$. The similarity between $q^{*}$ and the text vectors can then be calculated in the k-dimensional space. Three common formulas are as follows:

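a) Inner-product formula

$$Sim_1(q^{*}, d_j) = \sum_{i=1}^{k} d_{ji} \times q_i^{*} \tag{9}$$

b) Cosine formula

$$Sim_2(q^{*}, d_j) = \frac{\sum_{i=1}^{k} d_{ji} \times q_i^{*}}{\sqrt{\sum_{i=1}^{k} d_{ji}^{2} \times \sum_{i=1}^{k} (q_i^{*})^{2}}} \tag{10}$$

c) Pearson formula

$$Sim_3(q^{*}, d_j) = \frac{\sum_{i=1}^{k} (d_{ji} - \bar{d}_j)(q_i^{*} - \bar{q}^{*})}{\sqrt{\sum_{i=1}^{k} (d_{ji} - \bar{d}_j)^{2} \times \sum_{i=1}^{k} (q_i^{*} - \bar{q}^{*})^{2}}} \tag{11}$$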

In (9), (10) and (11), $q_i^{*}$ is the weight of the i-th component of the query vector, $d_{ji}$ is the weight of the i-th component of the j-th text vector, and k is the dimension of the semantic space. Finally, the texts are ranked according to similarity, and the text list is returned to the user in response to the query request.
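The three measures in (9), (10) and (11) can be sketched as plain Python functions. This is a minimal illustration, assuming both vectors are lists of equal length k that are already projected into the semantic space; the function names are hypothetical, and the Pearson form is written as the cosine of the mean-centred vectors.

```python
import math

def inner_product(q, d):
    """Formula (9): sum of component-wise products."""
    return sum(di * qi for di, qi in zip(d, q))

def cosine(q, d):
    """Formula (10): inner product divided by the product of the norms."""
    norm = math.sqrt(sum(x * x for x in d) * sum(x * x for x in q))
    return inner_product(q, d) / norm if norm else 0.0

def pearson(q, d):
    """Formula (11): cosine of the two vectors after subtracting their means."""
    q_mean = sum(q) / len(q)
    d_mean = sum(d) / len(d)
    return cosine([x - q_mean for x in q], [x - d_mean for x in d])

def rank_texts(q, texts, similarity=cosine):
    """Return text indices ordered from most to least similar to q."""
    return sorted(range(len(texts)),
                  key=lambda j: similarity(q, texts[j]),
                  reverse=True)
```

Returning 0.0 for a zero-norm vector is a design choice of this sketch; the paper does not say how such vectors are handled.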

Ⅲ. IMPROVED RANKING METHOD OF WEB PAGE 

RETRIEVAL RESULTS 

A. The Ranking Method of Search Results under Web 

Comprehending 

Quantifying the relevance of important web pages is the basis of ranking them. As shown above, the PageRank value of a web page reflects its importance, and computing it requires only the link relations within the web page set, according to (12).

$$PR = (1 - d) + d \times \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)} \tag{12}$$

Once the PageRank values of the web pages have been obtained, query similarity is calculated only for the subset of pages with high scores, which greatly reduces the number of vectors to be built and of vector similarities to be calculated.
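Formula (12) is usually evaluated by repeated substitution. The sketch below is one way to do so, assuming the standard reading in which $T_i$ are the pages linking to a page and $C(T_i)$ is the number of outgoing links of $T_i$; the data layout (a dictionary from page to its outgoing links, without duplicates), the damping value 0.85 and the fixed iteration count are assumptions of this example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate formula (12) on a small link graph.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict page -> PageRank value.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}              # initial guess
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum PR(T_i) / C(T_i) over every page T_i that links to `page`.
            incoming = sum(rank[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank

def top_pages(rank, n):
    """Keep only the n highest-scoring pages for the similarity step."""
    return sorted(rank, key=rank.get, reverse=True)[:n]
```

Only the pages returned by `top_pages` would then be passed on to the similarity calculation, which is the reduction in scale the paragraph above describes.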

Document d and query q are reduced to sets of vocabularies after word segmentation. Let $\Sigma = \{t_1, t_2, \ldots, t_N\}$ be a dictionary, where $t_i$ is a lexical item and N is the size of the dictionary, so

$$d = \{t_1^{m_1}, t_2^{m_2}, \ldots, t_N^{m_N}\}$$

$$q = \{t_1^{n_1}, t_2^{n_2}, \ldots, t_N^{n_N}\}$$

In the above formulas, $m_i$ and $n_i$ (i = 1, 2, …, N) represent the weights of the corresponding words. Because the dictionary is fixed, the vectors can be represented by their weight values alone.

$$d = \{m_1, m_2, \ldots, m_N\}$$

$$q = \{n_1, n_2, \ldots, n_N\}$$

The weights are calculated with the typical TF*IDF method, and the vector representation of a document is obtained by normalizing $m_i$:

$$d = (w_1, w_2, \ldots, w_N)$$

where

$$w_i = TF_i \times IDF_i = \frac{m_i}{\sum m_i} \times \lg\left(\frac{M}{k_i}\right) \tag{13}$$

In (13), $k_i$ is the number of documents in the document set that contain lexical item $t_i$, and M is the size of the document set. In this way, the weights of all feature items of a document are obtained. In the same way, query q can be converted into weights of feature items.
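The weighting in (13) can be illustrated with a short Python sketch. It assumes that $m_i$ is the raw count of term $t_i$ in the document and that lg denotes the base-10 logarithm; the function name and argument layout are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, dictionary, doc_freq, num_docs):
    """Weight vector (w_1, ..., w_N) of one document according to (13).

    doc_terms:  list of terms of the document after word segmentation.
    dictionary: ordered list of lexical items t_1 ... t_N.
    doc_freq:   dict term -> k_i, the number of documents containing the term.
    num_docs:   M, the size of the document set.
    """
    counts = Counter(doc_terms)
    total = sum(counts[t] for t in dictionary)        # sum of m_i
    weights = []
    for term in dictionary:
        k_i = doc_freq.get(term, 0)
        if total == 0 or k_i == 0 or counts[term] == 0:
            weights.append(0.0)                       # term absent or unseen
        else:
            tf = counts[term] / total                 # m_i / sum(m_i)
            idf = math.log10(num_docs / k_i)          # lg(M / k_i)
            weights.append(tf * idf)
    return weights
```

A term that occurs in every document receives weight 0, because lg(M / k_i) = 0 in that case.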

The similarity between a document and the query string ultimately decides the display order of web pages. The cosine formula for the similarity between users’ queries and texts is applied:

$$Sim(q, d) = \frac{\sum_{i=1}^{k} d_i \times q_i}{\sqrt{\sum_{i=1}^{k} d_i^{2} \times \sum_{i=1}^{k} q_i^{2}}} \tag{14}$$

The cosine of the angle between the document weight vector and the query weight vector is the similarity value of document d and query q, and the web page ranking is decided according to this value.

B. The Improvement of Similarity Calculation Formula 

A web page returned by a complete-match query is, after all, different from one returned by querying semantically relevant words. To follow the principle of presenting the first page prominently, a quantitative basis is needed when ranking web pages so that the two kinds of page can be distinguished. This means that the method for calculating the similarity between query words and texts should be changed correspondingly.

Assume that the query content has been preliminarily filtered at input. To respect the query strings entered by users, the traditional cosine formula is still used to calculate the similarity between query q and the texts obtained by complete matching.

For documents obtained through semantically related words, the similarity value should be reduced appropriately. The analysis above shows that the similarity between vocabularies can be calculated by forward multiplication of the matrix. After the product matrix is normalized, the vocabulary similarity θ (0 ≤ θ ≤ 1) is used as a reduction factor in calculating the similarity between the query vector and the documents obtained through semantically related terms. On this basis, this paper puts forward a new formula for the similarity between a Web document and the query vector. It is derived from the cosine formula and covers both the documents obtained from the original index terms and those obtained from semantically related terms. The formula is as follows.


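$$Sim(q, d) = \begin{cases} \dfrac{\sum_{i=1}^{k} (d_i \times q_i)}{\sqrt{\sum_{i=1}^{k} d_i^{2} \times \sum_{i=1}^{k} q_i^{2}}} & \text{(documents obtained from the original index terms)} \\[2ex] \theta \times \dfrac{\sum_{i=1}^{k} (d_i \times q_i)}{\sqrt{\sum_{i=1}^{k} d_i^{2} \times \sum_{i=1}^{k} q_i^{2}}} & \text{(documents obtained from semantically related terms)} \end{cases} \quad (0 \le \theta \le 1) \tag{15}$$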

In (15), θ is the threshold obtained in the singular value decomposition of the matrix during the calculation of latent semantic similarity; it satisfies the contribution rate inequality $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{r} \lambda_i \ge \theta$ and reflects the share of the original information retained. The contribution rate measures how well the k-dimensional subspace represents the entire space.
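A minimal Python sketch of (15) follows. It assumes that each candidate document carries a flag telling whether it was found through the original index terms or only through semantically related terms; that flag and the function names are assumptions of this illustration.

```python
import math

def cosine(q, d):
    """Traditional cosine formula (14)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d) * sum(x * x for x in q))
    return dot / norm if norm else 0.0

def improved_similarity(q, d, exact_match, theta):
    """Formula (15): damp the cosine value by theta for related-term documents."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    base = cosine(q, d)
    return base if exact_match else theta * base

def rank_documents(q, docs, theta):
    """docs: list of (doc_id, weight_vector, exact_match) triples."""
    scored = [(doc_id, improved_similarity(q, vec, exact, theta))
              for doc_id, vec, exact in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With θ < 1, a document found only through a related term ranks below an exactly matched document with the same cosine value, which is the intended distinction between the two kinds of page.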

Calculating the query similarity of search results with the improved formula, according to the user’s query input, takes the precision ratio into account while pursuing the recall ratio of retrieval. Ranking web pages by the similarity value from the above formula can make users more satisfied with the response of the search engine.

C. Experimental Results and Their Analysis 

The experiment used a Web document set on the topic “The 60th Anniversary of National Day”, with “National Day”, “military parade” and other query words. Two web page ranking algorithms were implemented for the retrieval system, one based on the traditional cosine formula and one on the improved similarity formula described above. Because common search engines display 10-20 results on the first page of the result list, the experiment tracked how users browsed the returned list and recorded the number of user clicks within the first 10 pages and the first 20 pages in order to analyze user satisfaction. The detailed data are shown in Table Ⅰ.

TABLE I.
COMPARISON OF USERS’ NUMBER OF CLICKS IN THE FIRST 10 PAGES / THE FIRST 20 PAGES

| Query Words | Traditional Cosine Formula | Improved Similarity Calculation Formula |
|---|---|---|
| National Day | 7/10 | 7/12 |
| Hoisting the Flag | 5/6 | 6/13 |
| Evening Party | 4/7 | 8/9 |
| Military Parade | 5/11 | 10/15 |

The table shows that, for the same query words, users’ satisfaction differs between the web page sets produced by the two similarity formulas. The precision ratio of the first 10 pages and the first 20 pages obtained with the improved similarity formula is higher. The pages in the list therefore meet users’ query requests better, and users are more satisfied with the page list. From the users’ point of view, the content returned on the first page of the list is closer to their query words, so the retrieval quality of the system is higher. Applying the improved similarity formula thus makes the query similarity values of pages more precise, which optimizes the result list and improves retrieval quality.

D. Conducting Retrieval with the Relevant Inverted Index File

The retrieval process consists of extracting query words from the user’s query string, matching them against the index word list of the inverted file, and generating the result set.

Concretely, after the user inputs a query string, the retrieval system first performs word segmentation on the Chinese characters in the string, removes stop words and punctuation, and extracts the query words. It then searches the inverted file provided by the indexing system and matches the query words. For each successfully matched index item, it reads the word frequency information and the inverted list of the index word, and records the numbers of the documents containing the index word together with the positions of the word. Next, depending on whether the user has selected the option on the retrieval interface to display semantically relevant documents, it decides whether to examine the relevant-word attribute. If the user chooses to display relevant results, it reads the pointer field of the semantically relevant words of the index word and obtains the information about each semantically relevant word through its in-list offset. After that, it calculates the PageRank score of the original web page corresponding to each document obtained, chooses the documents with higher scores, and calculates their query similarity. Finally, it prepares the web page list for display, with the display order decided by the similarity value.

Displaying the result list also includes showing the abstract and the snapshot of each web page; moreover, the query words contained in the web page title, abstract and snapshot should be highlighted. The positions to highlight can be taken from the position list of the index word stored in the inverted file.

The retrieval system receives the query string input by the user; the algorithm for searching the relevant inverted file and obtaining the search result is as follows:

Input: query string q

Output: webpage result list

Algorithm: Searcher

1. Initialize the webpage set Res=Φ;

2. Perform word segmentation on q and delete stop words to obtain the query vector expression q={t1,t2,…,tm};

3. For each ti∈q do

4. Initialize the set of result text numbers Rid=Φ;

5. Match term by term in the word list of the relevant inverted file to find index word ti;

6. Read the inverted list of index item ti and store the document numbers in Rid; record the word frequency and the occurrence positions of the item;

7. If the user has chosen the related-retrieval option && the attribute linked list of semantically relevant words of ti is non-empty, then

8. Read the in-list offset attribute of the semantically relevant words to find term t-ri;

9. Record the information about relevant item t-ri as in step 6;

10. End if

11. Search the webpage index file according to the document numbers in Rid and store the obtained webpages in Res;

12. Calculate the PageRank values of the webpages in the Rid set by formula (1);

13. Choose the webpages with larger PageRank values and calculate their query similarity by formula (15);

14. End for

15. Rank the webpages in Res by similarity to decide the order of the displayed list;

16. Display the webpage title, abstract and webpage snapshot in the result list;

17. Return the result list;
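The Searcher steps can be sketched in Python as follows. This is a simplified illustration, not the authors’ implementation: the in-memory index layout, the whitespace tokenizer standing in for Chinese word segmentation, the precomputed `pagerank` dictionary and the `similarity` callback are all assumptions, and the PageRank filter is applied once after the loop rather than per query term.

```python
def search(query, index, stop_words, pagerank, similarity,
           show_related=False, top_n=20):
    """Return (doc_id, score) pairs ordered for display.

    index:      term -> {"postings": {doc_id: [positions]}, "related": [terms]}
    pagerank:   doc_id -> precomputed PageRank value
    similarity: callable(terms, doc_id, exact_match) -> float, e.g. formula (15)
    """
    # Step 2: segment the query (here: split on spaces) and drop stop words.
    terms = [t for t in query.split() if t not in stop_words]
    candidates = {}                        # doc_id -> True if exactly matched
    for term in terms:                     # step 3
        entry = index.get(term)            # step 5: look the term up
        if entry is None:
            continue
        for doc_id in entry["postings"]:   # step 6: read the inverted list
            candidates[doc_id] = True
        if show_related:                   # steps 7-9: semantically related terms
            for related in entry.get("related", []):
                for doc_id in index.get(related, {}).get("postings", {}):
                    candidates.setdefault(doc_id, False)
    # Steps 12-13: keep the pages with the largest PageRank values.
    best = sorted(candidates, key=lambda doc: pagerank.get(doc, 0.0),
                  reverse=True)[:top_n]
    scored = [(doc, similarity(terms, doc, candidates[doc])) for doc in best]
    # Step 15: order the result list by similarity.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Titles, abstracts and snapshots (step 16) would be looked up from the returned document numbers; the stored position lists can drive the highlighting described above.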

REFERENCES 

[1] Xiaoming Li, Hongfei Run, Jiming Wang, Search 

Engine-Principle, Technology and System. Beijing: 

Science Press, 2005. 

[2] Page L, Brin S, Motwani R, The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries SIDL-WP, 1999.

[3] Haveliwala T H, “Topic-sensitive PageRank,” Proceedings of the 11th International World Wide Web Conference. Hawaii, pp. 517-526, 2002.

[4] Kleinberg J, “Authoritative sources in a hyperlinked environment,” Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco, California, pp. 668-677, 1998.


[5] Xing Wenpu, Ghorbani A, “Weighted PageRank 

Algorithm,” Communication Networks and Services 

Research, Proceedings of Second Annual Conference. pp. 

305-314, 2004. 

[6] Hai Liu, Yuanyuan Wang, Xueren Zhang, “Study of Text 

Retrieval Problems Based on Latent Semantic Space,” 

Information Science, vol. 5, pp. 748-753, 2007. 

[7] Yuchang Lu, Mingyu Lu, “The Analysis and Construction 

of Word Weight Function in Sector Space Method,” 

Journal of Computer Research and Development, vol. 10, 

pp. 1205-1210, 2002. 

[8] Jiang Lu, “Study of the Application of Latent Semantic 

Analysis in Text Information Retrieval,” Wuhan: 

Huazhong University of Science and Technology, vol. 4, 

pp. 21-22, 2005. 

[9] Todd A. Letsche, Michael W. Berry, “Large-Scale Information Retrieval with Latent Semantic Indexing,” Information Science, vol. 1, pp. 105-137, 1997.

[10] Nicholas Lester, Justin Zobel, Hugh Williams, “Efficient Online Index Maintenance for Contiguous Inverted Lists,” Information Processing and Management, vol. 4, pp. 916-933, 2006.

[11] Jiang Jiahui, Matrix Theoretical Basis. Dalian: Dalian 

University of Technology, pp. 65, 1995. 

[12] Sheng Jun, “Study on Markov Network Retrieval Model Based on Latent Semantics,” Nanchang: Dissertation from Jiangxi Normal University, pp. 5-13, 2006.

[13] Foltz P W, “The Measurement of Textual Coherence with Latent Semantic Analysis,” Discourse Processes, vol. 1, pp. 285-307, 1998.

Zhijuan Deng was born in Ganzhou, Jiangxi Province, China, in Nov. 1979. She is an instructor in the Faculty of Science, Jiangxi University of Science and Technology. Her research interests include Web information mining and software project management.

Shaojun Zhong was born in Guzhou, China, in Oct. 1979. He is now a Lecturer in the Faculty of Science, Jiangxi University of Science and Technology, China. His research interests include data mining, network technology, and intelligent computation.


An Encryption Scheme with Hidden Keyword 

Search for Outsourced Database 

Xiaoming Wang 

Jinan University/ Department of Computer Science, Guangzhou, 510632, China 

Email: wxm_gz@hotmail.com 

Guoxiang Yao and Zhen Zhang 

Jinan University/ Department of Computer Science, Guangzhou, 510632, China 

Abstract—An encryption scheme with hidden keyword search is proposed for outsourced databases. The scheme employs both a pseudorandom function and a polynomial function in order to reduce computation and storage overhead. It provides controlled searching, hidden searching and provable secrecy for encryption, and it also supports dynamic changes to the group of permitted users; adding or removing users is transparent to the other users, since they are not involved in the process. Moreover, no interaction is needed between database owner and server, server and user, or database owner and user when the decryption key is set up. Each user only needs to receive messages to set up a decryption key, and can then query over the encrypted data and decrypt it. The proposed scheme is therefore more efficient and more practical for outsourced databases.

Index Terms—outsourced database, hidden keyword search, added and revoked users


I. INTRODUCTION

The management of large databases is quite expensive, as it requires not only storage capacity but also skilled personnel. One solution to this problem is the outsourced database, in which data owners store their data with a third-party service provider (server) that is not trusted. The server provides services to the users of the database. The main problem in outsourced database systems is that sensitive data are stored on a third-party site that is not under the data owner’s direct control, so data privacy and security can be put at risk. To protect resources from being disclosed to the server and to outside attackers, and to realize access control on the server side, encryption is used to protect the sensitive data. By encrypting the data, the database owner ensures that no one except the permitted users can read it. Although this solution protects the data from outside attackers and from the server, the fundamental problem is that searching over encrypted data is very difficult, and it is hard to protect user privacy when performing queries over encrypted data.

Manuscript received Mar. 1, 2011; revised April 5, 2011; accepted April 12, 2011.

doi:10.4304/jnw.6.12.1697-1704

To resolve this problem, a solution is needed that lets the user search over the encrypted domain in such a way that the server does not learn any unauthorized information by performing the search. In 2000, Song et al. [1] first studied a secure keyword search scheme based on a symmetric cipher. In their scheme, a user stores encrypted data in a non-trusted database and later searches the data with a keyword encrypted under the user’s secret key. Their techniques provide provable secrecy for encryption, in the sense that the non-trusted server cannot learn anything about the plaintext given only the ciphertext. Their scheme is simple and fast. However, it applies only to the private-key setting, in which a user owns the data and wishes to upload it to a third-party database that the user does not trust; it cannot be used for practical applications such as an email routing system or an outsourced database [11]. In their scheme, only the user who owns the data can search the encrypted data. If another user is to be allowed to search for a word, either the encryption key must be disclosed to that user, or a list of the potential locations where the word might occur must be disclosed to the server. If the server is allowed to search for too many words, it may be able to use statistical techniques to start learning important information about the documents. One possible defense is to change the key periodically and re-encrypt the data under the new key. The user must then re-encrypt the data and transmit it to the server over a bandwidth-limited channel. If the owner does not have the resources stored locally, a further preliminary step is needed to re-acquire them from the service, decrypt them, encrypt them again and transmit them back to the server over the same channel; this involves a large performance overhead and becomes practically impossible.

In an outsourced database, the permitted users are allowed to search and read the data stored at the external server by the data owner. They wish to retrieve or search for data without revealing to the server which data it is. To meet these requirements, an encryption scheme with hidden keyword search for outsourced databases is proposed, based on Song et al.’s scheme. Unlike Song et al.’s scheme, which allows only the data owner to search the encrypted data, the proposed scheme allows a group of permitted users to search and read the data stored at the external server by the data owner. The proposed scheme employs both a polynomial function and a pseudorandom function in order to reduce computation and storage overhead. It provides controlled searching, hidden searching and provable secrecy for encryption, and it also supports dynamic changes to the group of permitted users; adding or removing users is transparent to the other users, since they are not involved in the process. Moreover, no interaction is needed between database owner and server, server and user, or database owner and user when the decryption key is set up and updated. Each user only needs to receive messages to set up a decryption key, and can then query over the encrypted data and decrypt it. The proposed scheme is therefore more efficient and more practical for outsourced databases.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 presents the encryption scheme with hidden keyword search for outsourced databases. Section 4 analyzes the security and properties of the proposed scheme. Finally, concluding remarks are given.

II. RELATED WORK 

Some existing schemes for encrypted outsourced databases [2-5] assume that the entire database is encrypted with a single key and that the users are granted this key. This assumption only protects the data on the server side, and the users have complete access to the database. In the real world, however, complete access to the encrypted outsourced data is not acceptable; it is desirable that users have only selective access to the encrypted data. Moreover, when the authorization policy is updated, these proposals require the resources to be re-encrypted and resent to the service. If the owner does not have the resources stored locally, a further preliminary step is needed to re-acquire them from the service over a bandwidth-limited channel and decrypt them, and a large number of new decryption keys must frequently be transmitted to all the authorized users. This involves a large performance overhead and becomes practically impossible for large databases accessed by a dynamic group of users. To resolve these problems, Vimercati et al. [6] proposed the over-encryption approach, which avoids the need to ship resources back to the owner for re-encryption when security requirements change. In their scheme, the resources are encrypted by the owner to provide initial protection and are encrypted again by the outsourced server to reflect policy modifications. One potential limitation of the over-encryption scheme is that it may require too many tokens to be published when the number of users is large [7]. In 2008, Liu et al. [7] proposed a new key-assignment approach based on secret sharing. In their scheme, resources are divided into different sets based on access control lists, and each set corresponds to a distinct encryption key. Users can use their corresponding key to derive the encryption key in order to access the resource.


However, we consider Liu et al.’s scheme insecure against collusion attacks. If two users share a resource, their scheme treats the two users as a subset and builds a linear equation in two variables from which the encryption key is derived. Points (x, y) on this line are chosen at random and assigned to the users as key pairs, and another point (xpub, ypub) on the line is published as the public token. Each user uses the public token (xpub, ypub) together with his own key pair to derive the decryption key. But a user can equally reconstruct the line from the public token (xpub, ypub) and his key pair, and can then compute key pairs for many unauthorized users, who can thus access the resource.

Database encryption prevents unauthorized users, including intruders breaking into a network and database administrators, from seeing sensitive data in databases. However, it is very hard to protect user privacy when performing queries over encrypted data. To resolve this problem, keyword search over encrypted data has received close attention in various environments such as encrypted web hard-systems, intelligent email routing and encrypted vendor systems. In 2000, Song et al. [1] studied a secure keyword search scheme based on a symmetric cipher and proposed a search technique for encrypted data. Their work deals with search problems between a user and a non-trusted server and gives some practical solutions. In 2004, Golle et al. [8] first proposed the notion of conjunctive keyword searchable encryption and presented a solution to the problem. They defined a security model for conjunctive keyword search over encrypted data and provided two secure constructions. In 2008, Wang et al. [9] gave the first keyword field-free conjunctive keyword search scheme, which answers the open problem posed by Golle et al. In their scheme, the target ciphertext includes a keyword set, and the user can generate a trapdoor that consists of a keyword set; subset keyword search means that if the keyword set of the target ciphertext includes the keyword set of the trapdoor, then the trapdoor and the ciphertext match. However, reference [10] points out that there is a mistake in the proof of Golle et al.’s scheme.

The schemes above were constructed in a symmetric-key setting. In this setting, a user encrypts his private data and stores it on a remote server; the user can later retrieve the private data matching a particular keyword from the remote storage. However, these systems cannot be used for practical applications such as an email routing system or an outsourced database [11].

In 2005, Boneh et al. [12] first proposed Public Key Encryption with Keyword Search (PEKS). With PEKS, a sender stores encrypted data on a server, and the receiver makes a trapdoor for a keyword and sends the trapdoor to the server. The server can then test whether or not the ciphertext and the trapdoor were made with the same keyword. If the keywords in the ciphertext and the trapdoor are the same, the server sends the ciphertext to the receiver. Byun et al. [13] showed that the PEKS scheme is insecure against off-line keyword-guessing attacks; that is, given a trapdoor, an attacker can learn which keyword was used to generate it. Since users usually query commonly used keywords with low entropy, keyword-guessing attacks are a meaningful threat. Rhee et al. [14] and Tang and Camenisch et al. [15] gave ways to cope with this attack. Later, many papers were published that extend PEKS. But PEKS only supports an efficient remote storage system for a designated receiver; it does not provide an efficient remote storage system for a number of users.

III. PROPOSED SCHEME 

A. System model 

The proposed scheme uses the DAS (database-as-a-service) model. The system includes three entities: data owner, server and user. This model is best suited to a one-to-many group in which there is a single database owner and a large number of users. The database owner is responsible for producing, distributing and updating encryption keys. The server is responsible for producing the query result on the encrypted data and sending the encrypted result to the user. The user decrypts the result from the server with the decryption key in order to get the plaintext result.

We assume that the data owner defines an access control policy to regulate access to the distributed resources. All the users of the outsourced database are divided into different groups according to their access privileges. Users with the same database access privilege are grouped together and can access the same part of the outsourced data. The outsourced database is protected with encryption. For the sake of simplicity, we assume that the encryption operations refer to a single group.

B. Setup 

Let p, q be distinct large primes with q|(p-1), g a generator in GF(p) of order q, F a pseudorandom function, f an additional pseudorandom function that is keyed independently of F, G a pseudorandom generator, and H a secure hash function. We write $f_\tau(x)$ for the result of applying f to input x with secret key τ. The database owner, the server and each user $u_i$ hold respectively the key pairs $(\varepsilon_o,\ y_o = g^{\varepsilon_o} \bmod p)$, $(\varepsilon_s,\ y_s = g^{\varepsilon_s} \bmod p)$ and $(\varepsilon_i,\ y_i = g^{\varepsilon_i} \bmod p)$. In the setup phase, the encryption key is established by the database owner and sent to each group user. Without loss of generality, assume that a group contains a set of privileged users U=(u1,u2,…un). The setup includes the following steps.

(1) The database owner chooses at random a polynomial

$$f(x) = a_0 + a_1 x + \cdots + a_{m-1} x^{m-1} \bmod q \tag{1}$$

where $k = f(0) = a_0$, the $a_i$ are the coefficients of f(x), and m (m>n) is a large positive integer.


(2) The database owner chooses a random integer $\alpha$ and a set of random integers $D_j = (d_1, d_2, \ldots, d_m)$, and computes, for each group user $u_i$,

$$k_i = f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l} \bmod q, \quad v = g^{\alpha} \bmod p, \tag{2}$$

$$\delta_i = y_i^{\varepsilon_o} \bmod p, \quad V_i = E_{\delta_i}(k_i), \quad y_i = g^{k_i} \bmod p, \tag{3}$$

$$z_i = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_i}{d_j - x_i}} \bmod p, \tag{4}$$

for $1 \le i \le n$, sends $V_i$ to each group user $u_i$, and then publishes $(v, z_i\ (i = 1, \ldots, n))$. Here $E_{\delta_i}\{\cdot\}$ denotes encryption under the key $\delta_i$ with a symmetric encryption algorithm such as AES, and $x_i$ is the identifier of group user $u_i$.

(3) On receiving $V_i$, the group user $u_i$ computes

$$\delta_i = y_o^{\varepsilon_i} \bmod p \tag{5}$$

and then decrypts $V_i$, thus obtaining $k_i$.
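The setup equations can be checked numerically on a toy instance. The following is a minimal sketch, not the authors' implementation: the parameters `p = 23`, `q = 11`, `g = 2`, the polynomial coefficients, the points `d`, and all function names are illustrative assumptions, and $m - 1$ public points $d_l$ are used, matching the product limits in equations (2) and (4). It shows that a user who holds $k_i$ and reads the public $(v, z_i)$ obtains $g^{k\alpha} \bmod p$.

```python
# Toy parameters only: real deployments need cryptographically large p and q.
p, q, g = 23, 11, 2          # q | (p - 1) and g has order q in GF(p)

def poly_eval(coeffs, x):
    """Evaluate f(x) = a_0 + a_1 x + ... + a_{m-1} x^{m-1} mod q."""
    return sum(a * pow(x, e, q) for e, a in enumerate(coeffs)) % q

def inv(a):
    """Multiplicative inverse in Z_q (three-argument pow, Python 3.8+)."""
    return pow(a % q, -1, q)

def user_share(coeffs, x_i, d):
    """k_i = f(x_i) * prod_l (-d_l)/(x_i - d_l) mod q, as in equation (2)."""
    k_i = poly_eval(coeffs, x_i)
    for d_l in d:
        k_i = k_i * (-d_l) * inv(x_i - d_l) % q
    return k_i

def public_value(coeffs, x_i, d, alpha):
    """z_i from equation (4): g raised to alpha times the Lagrange terms of the d_j."""
    exponent = 0
    for j, d_j in enumerate(d):
        term = poly_eval(coeffs, d_j) * (-x_i) * inv(d_j - x_i) % q
        for l, d_l in enumerate(d):
            if l != j:
                term = term * (-d_l) * inv(d_j - d_l) % q
        exponent = (exponent + term) % q
    return pow(g, alpha * exponent % q, p)

coeffs = [7, 3, 2]           # hypothetical f with m = 3, so k = f(0) = 7
d = [2, 4]                   # m - 1 interpolation points d_l (hypothetical)
alpha = 6                    # owner's random integer (hypothetical)
v = pow(g, alpha, p)

def recover_sigma(x_i):
    """What user u_i computes from k_i and the public (v, z_i): z_i * v^{k_i} mod p."""
    k_i = user_share(coeffs, x_i, d)
    z_i = public_value(coeffs, x_i, d, alpha)
    return z_i * pow(v, k_i, p) % p

sigma = pow(g, coeffs[0] * alpha % q, p)   # owner's value g^{k * alpha} mod p
```

In this sketch the identifiers $x_i$ must be nonzero and distinct from every $d_l$ modulo $q$; otherwise a denominator has no inverse and `inv` raises `ValueError`.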

C. Encryption 

When the database owner encrypts data $M$ that contains the sequence of words $w_1, w_2, \ldots, w_l$, he performs the following steps:

(1) The database owner computes

$$\sigma = g^{k\alpha} \bmod p, \quad X_i = H(w_i, \sigma), \quad c_i = f_\tau(X_i) \tag{6}$$

for $1 \le i \le l$, where $X_i$ is $n$ bits long, then generates a sequence of pseudorandom values $e_i$ using the pseudorandom generator $G$, where each $e_i$ is $n-m$ bits long. Finally, the database owner computes $F_{c_i}(e_i)$, appends it to $e_i$, and obtains the $n$-bit block $B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle$.

(2) The database owner computes

$$CT_i = X_i \oplus B_i, \quad C = M \oplus H(\sigma), \tag{7}$$

and sends $\{C, CT_i\}$ to the server.

D. Trapdoor 

When a group user $u_i$ needs to query the outsourced database for data containing the words $w_1, w_2, \ldots, w_l$, he generates a trapdoor as follows:

$$\sigma' = z_i v^{k_i} \bmod p, \quad X_i' = H(w_i, \sigma'), \quad c_i' = f_\tau(X_i'), \tag{8}$$

forms the trapdoor $T_i = \langle c_i', X_i' \rangle$, and sends $T_i$ to the server, where $i = 1, 2, \ldots, l$.

E. Test 

On receiving $T_i$, the server computes $B_i' = CT_i \oplus X_i'$ and splits $B_i'$ into two parts, $B_i' = \langle e_i' \,\|\, r \rangle$, where $e_i'$ denotes the first $n-m$ bits of $B_i'$ and $r$ denotes the last $m$ bits of $B_i'$. The server then computes $F_{c_i'}(e_i')$ and tests whether $F_{c_i'}(e_i') = r$. If the equality holds, then $T_i$ is correct, and the server sends $\{C, CT_i\}$ to group user $u_i$.

The group user $u_i$ computes

$$M = C \oplus H(\sigma') \tag{9}$$

and thus obtains the data $M$.
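The keyword ciphertext, trapdoor, and server-side test can be sketched as follows. This is an illustration under stated assumptions rather than the scheme's reference code: SHA-256 stands in for $H$, HMAC-SHA256 stands in for both pseudorandom functions $f$ and $F$, `os.urandom` replaces the generator $G$, lengths are counted in bytes instead of bits, and $\sigma$ is passed as an opaque byte string. All function names are hypothetical.

```python
import hashlib
import hmac
import os

N_BYTES, M_BYTES = 32, 8     # block length n and check length m, in bytes here

def H(word: bytes, sigma: bytes) -> bytes:
    """X_i = H(w_i, sigma), an n-byte value (SHA-256 stand-in)."""
    return hashlib.sha256(word + b"|" + sigma).digest()

def prf(key: bytes, data: bytes, out_len: int) -> bytes:
    """HMAC-SHA256 used as a stand-in for the pseudorandom functions f and F."""
    return hmac.new(key, data, hashlib.sha256).digest()[:out_len]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_word(word: bytes, sigma: bytes, tau: bytes) -> bytes:
    x = H(word, sigma)
    c = prf(tau, x, 32)                       # c_i = f_tau(X_i)
    e = os.urandom(N_BYTES - M_BYTES)         # e_i (random here, G in the paper)
    block = e + prf(c, e, M_BYTES)            # B_i = <e_i || F_{c_i}(e_i)>
    return xor(x, block)                      # CT_i = X_i xor B_i

def trapdoor(word: bytes, sigma: bytes, tau: bytes):
    x = H(word, sigma)
    return prf(tau, x, 32), x                 # T_i = <c_i', X_i'>

def server_test(ct: bytes, trap) -> bool:
    c, x = trap
    block = xor(ct, x)                        # B_i' = CT_i xor X_i'
    e, r = block[:-M_BYTES], block[-M_BYTES:]
    return hmac.compare_digest(prf(c, e, M_BYTES), r)
```

With an 8-byte check value, a trapdoor for a different keyword passes the test only with probability about $2^{-64}$, which is the role the $m$-bit tail plays in the Test step.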


F. Adding group users 

Adding a new group user $u_{new}$ to a group does not require updating the decryption key. To add $u_{new}$ to a group, the database owner first picks an unused identifier $x_{new}$ and computes

$$k_{new} = f(x_{new}) \prod_{l=1}^{m-1} \frac{-d_l}{x_{new} - d_l} \bmod q, \tag{10}$$

$$\delta_{new} = y_{new}^{\varepsilon_o} \bmod p, \quad V_{new} = E_{\delta_{new}}(k_{new}), \tag{11}$$

$$z_{new} = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_{new}}{d_j - x_{new}}} \bmod p \tag{12}$$

and sends $V_{new}$ to the new group user $u_{new}$ and publishes $z_{new}$. Upon receiving $V_{new}$, $u_{new}$ computes $\delta_{new} = y_o^{\varepsilon_{new}} \bmod p$, decrypts $V_{new}$ with $\delta_{new}$, and obtains the secret key $k_{new}$. The group member $u_{new}$ can then access the server, query the outsourced database, and decrypt the ciphertext received from the server with his secret key $k_{new}$.

G. Removing group users 

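When a group user $u_B$ is removed from a group, the encryption key $\sigma$ has to be changed to prevent the removed user from querying and reading the restricted data. In the following, a bar denotes an updated value, so $\bar{\sigma}$ is the new encryption key. The database owner and the server go through the following steps:

(1) The database owner chooses a random integer $\rho$ and computes, for $t = 1, \ldots, n$, $t \ne B$, and $i = 1, 2, \ldots, l$,

$$\bar{z}_t = g^{\rho \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_t}{d_j - x_t}} \bmod p, \tag{13}$$

$$\bar{v} = g^{\rho} \bmod p, \quad \bar{\sigma} = g^{k\rho} \bmod p, \tag{14}$$

$$\bar{X}_i = H(w_i, \bar{\sigma}), \quad Y_i = H(w_i, \sigma) \oplus H(w_i, \bar{\sigma}), \tag{15}$$

$$s = H(\sigma) \oplus H(\bar{\sigma}), \quad \delta_s = y_s^{\varepsilon_o} \bmod p, \tag{16}$$

$$V_s = E_{\delta_s}(Y_1 \,\|\, Y_2 \,\|\, \cdots \,\|\, Y_l \,\|\, s), \tag{17}$$

publishes $(\bar{z}_t\ (t = 1, 2, \ldots, n,\ t \ne B),\ \bar{v})$, and then sends $V_s$ to the server.

(2) The server first computes $\delta_s = y_o^{\varepsilon_s} \bmod p$ and decrypts $V_s$ to get $(Y_i, s)$, where $i = 1, 2, \ldots, l$, and then computes

$$\bar{C} = C \oplus s = M \oplus H(\bar{\sigma}), \tag{18}$$

$$\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \bar{\sigma}). \tag{19}$$

$\{\bar{C}, \overline{CT}_i\}$ is stored on the outsourced server.

A non-removed user $u_t$ can compute $\bar{\sigma} = \bar{z}_t \bar{v}^{k_t} \bmod p$, and can then generate a valid query trapdoor $T_i$ and recover the decryption key $\bar{\sigma}$, since $\bar{\sigma} = \bar{z}_t \bar{v}^{k_t} = g^{\rho k} \bmod p$. However, the removed user $u_B$ cannot get $\bar{\sigma}$, since $u_B$ cannot obtain $\bar{z}_B$ from the public information $\bar{z}_t$ ($t = 1, 2, \ldots, n$, $t \ne B$). If $u_B$ computes $\sigma'$ using his $k_B$ and the old $z_B$ or another user's $\bar{z}_t$, then $\sigma' \ne \bar{\sigma}$, since

$$\sigma' = \bar{z}_t \bar{v}^{k_B} \ne z_B \bar{v}^{k_B} \ne g^{k\rho} = \bar{\sigma} \bmod p. \tag{20}$$

Therefore, the removed user $u_B$ cannot recover the decryption key $\bar{\sigma}$ and is prevented from generating query trapdoors and obtaining data; he is thus removed from the group.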
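The server-side part of the removal procedure uses only XOR, which is why the data owner does not have to re-upload the database. A minimal sketch of equations (15) to (19) follows, writing the old key as `sigma_old` and the updated key as `sigma_new`. SHA-256 as a stand-in for $H$, the byte-string encoding of the keys, and all function names are assumptions made for illustration.

```python
import hashlib

def H(*parts: bytes) -> bytes:
    """SHA-256 stand-in for the hash H; each part is length-prefixed."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def owner_update_tokens(words, sigma_old: bytes, sigma_new: bytes):
    """Owner side: Y_i = H(w_i, old) xor H(w_i, new) and s = H(old) xor H(new)."""
    ys = [xor(H(w, sigma_old), H(w, sigma_new)) for w in words]
    s = xor(H(sigma_old), H(sigma_new))
    return ys, s

def server_rekey(c: bytes, cts, ys, s: bytes):
    """Server side: equations (18) and (19), using only XOR."""
    return xor(c, s), [xor(ct, y) for ct, y in zip(cts, ys)]
```

The server sees only the differences $Y_i$ and $s$, not either key, so in this sketch it can move the stored ciphertext from the old mask to the new one without being able to strip the mask itself.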

III. ANALYSIS 

A. Correctness 

Lemma 1. For a given ciphertext $\{C, CT_i\}$, if the database owner follows the correct encryption procedure, then any privileged group user can correctly generate a query trapdoor and decrypt the ciphertext to obtain the data $M$.

Proof: Because

$$\sigma' = z_i v^{k_i} = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_i}{d_j - x_i} \;+\; \alpha f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l}} = g^{\alpha k} = \sigma \bmod p,$$

$$X_i' = H(w_i, \sigma') = H(w_i, \sigma) = X_i, \quad c_i' = f_\tau(X_i') = c_i,$$

$$B_i' = \langle e_i' \,\|\, r \rangle = CT_i \oplus X_i' = X_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle,$$

we have $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct. Because $\sigma' = \sigma$,

$$M = C \oplus H(\sigma') = M \oplus H(\sigma) \oplus H(\sigma'). \qquad \square$$
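The step from the exponent to $\alpha k$ is Lagrange interpolation at $x = 0$. The following derivation is a sketch of that step, assuming the $m$ points $d_1, \ldots, d_{m-1}, x_i$ are pairwise distinct modulo $q$ and recalling that $f$ has degree at most $m-1$ with $f(0) = k$:

```latex
\begin{aligned}
f(0) &= \sum_{j=1}^{m-1} f(d_j)\,
        \frac{0 - x_i}{d_j - x_i}
        \prod_{\substack{l=1 \\ l \ne j}}^{m-1} \frac{0 - d_l}{d_j - d_l}
      \;+\; f(x_i) \prod_{l=1}^{m-1} \frac{0 - d_l}{x_i - d_l}
      \pmod q, \\[4pt]
z_i\, v^{k_i}
     &= g^{\alpha \sum_{j} f(d_j)\, \frac{-x_i}{d_j - x_i} \prod_{l \ne j} \frac{-d_l}{d_j - d_l}}
        \cdot g^{\alpha f(x_i) \prod_{l} \frac{-d_l}{x_i - d_l}}
      = g^{\alpha f(0)} = g^{\alpha k} \pmod p .
\end{aligned}
```

Reducing the exponent modulo $q$ is valid because $g$ has order $q$ in $GF(p)$.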

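Lemma 2. For a given ciphertext $\{C, CT_i\}$, if the database owner and the server follow the correct procedure for removing a group user, then equations (18) and (19) hold.

Proof: Writing $\bar{\sigma}$ for the updated key, equation (18) holds since

$$s = H(\sigma) \oplus H(\bar{\sigma}),$$

$$\bar{C} = C \oplus s = M \oplus H(\sigma) \oplus H(\sigma) \oplus H(\bar{\sigma}) = M \oplus H(\bar{\sigma}).$$

Equation (19) holds since

$$Y_i = H(w_i, \sigma) \oplus H(w_i, \bar{\sigma}),$$

$$\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \sigma) \oplus H(w_i, \sigma) \oplus H(w_i, \bar{\sigma}) = B_i \oplus H(w_i, \bar{\sigma}). \qquad \square$$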

Lemma 3. If $u_i$ is a non-removed user, then he can generate a valid query trapdoor and decrypt the ciphertext to obtain the data $M$.

Proof: Because $u_i$ holds

$$k_i = f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l} \bmod q,$$

$u_i$ can compute the correct updated key $\bar{\sigma}$:

$$\sigma' = \bar{z}_i \bar{v}^{k_i} = g^{\rho \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_i}{d_j - x_i} \;+\; \rho f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l}} = g^{\rho k} = \bar{\sigma} \bmod p.$$

He therefore computes

$$X_i' = H(w_i, \sigma') = H(w_i, \bar{\sigma}), \quad c_i' = f_\tau(X_i')$$

and generates the trapdoor $T_i = \langle c_i', X_i' \rangle$.

By equations (15) and (19), $\overline{CT}_i = B_i \oplus \bar{X}_i$, so

$$X_i' = H(w_i, \bar{\sigma}) = \bar{X}_i,$$

$$B_i' = \langle e_i' \,\|\, r \rangle = \overline{CT}_i \oplus X_i' = \bar{X}_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle.$$

Then $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct, and the data is recovered from $M = \bar{C} \oplus H(\bar{\sigma})$ since $\bar{C} = C \oplus s = M \oplus H(\bar{\sigma})$. $\square$

Lemma 4. If $u_B$ is a removed user, then he cannot generate a valid query trapdoor or recover the updated decryption key $\bar{\sigma}$.

Proof: Because

$$\sigma' = z_B \bar{v}^{k_B} = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_B}{d_j - x_B} \;+\; \rho f(x_B) \prod_{l=1}^{m-1} \frac{-d_l}{x_B - d_l}} \ne \bar{\sigma} \bmod p,$$

the removed user $u_B$ cannot compute the decryption key $\bar{\sigma}$ and is prevented from generating query trapdoors and obtaining data. $\square$

B. Security Proof 

The security of the proposed scheme is based on the security of the pseudorandom functions and on the computational Diffie-Hellman problem (CDHP). To show that the proposed scheme is secure, we first state a useful lemma. Due to space considerations, we omit the proof of the lemma and refer the reader to the full version of the paper [1].

Lemma 1 [1]: If $F$ is a $(t, l, e_F)$-secure pseudorandom function, $f$ is a $(t, l, e_f)$-secure pseudorandom function, $G$ is a $(t, e_G)$-secure pseudorandom generator, and the key material is chosen as described above, then the algorithm described above for generating the sequence is a $(t - \psi, e_H)$-secure pseudorandom generator, where $e_H = l\,e_F + e_f + e_G + l(l-1)/(2|X|)$ and $X = \{0,1\}^{n-m}$.

Definition 1: An encryption scheme with hidden keyword search for outsourced databases is semantically secure against chosen keyword attack if $F$ and $f$ are secure pseudorandom functions, $G$ is a secure pseudorandom generator, $H$ is a secure one-way hash function, and there exists no polynomial-time adversary with a non-negligible advantage in the following game:

(1) Setup: A challenger $C$ first generates the system parameters, the data owner's key pair, and the $n$ group users' key pairs in the same way as in the Setup phase (subsection B) of the proposed scheme. The challenger $C$ gives the system parameters and public keys to an adversary $A$.

Phase 1: The adversary $A$ issues the following kinds of queries adaptively:

(2) Encryption queries: $A$ produces a message $M$ and sends an encryption query for $M$ to $C$. $C$ gives $A$ the result $\{CT_i, C\}$ of encryption with input $(M, k, w_i)$.

(3) Trapdoor queries: The adversary $A$ makes trapdoor queries for any keyword of his choice to the challenger $C$. If the trapdoor is valid, $C$ responds with the result; otherwise, $C$ returns the symbol $\perp$.

(4) Challenge: The adversary $A$ produces two keywords $w_0$ and $w_1$. The challenger $C$ chooses a random bit $b \in \{0, 1\}$, computes a trapdoor $T_i$ with input $(z_i, k_i, v)$, and gives it to the adversary $A$ as a challenge.

Phase 2: The adversary $A$ issues new queries as in Phase 1. It is not allowed to make a trapdoor query for the target challenge $T_i$.

Guess: At the end of the game, $A$ outputs a bit $b'$. The adversary $A$ wins the game if $b' = b$. The advantage of $A$ is defined as $\mathrm{Adv}(A) = \Pr[b' = b] - 1/2$.

Theorem 1: The proposed encryption scheme with hidden keyword search is $(t, \varepsilon)$-secure against chosen keyword attacks if $F$ and $f$ are secure pseudorandom functions, $G$ is a secure pseudorandom generator, $H$ is a secure one-way hash function, and there exists no polynomial-time algorithm that solves the CDHP with $(t_1, \varepsilon_1)$, where $t$ denotes the running time and $\varepsilon$ the advantage with which the adversary $A$ succeeds.

Proof: Assume that there exists a $(t, \varepsilon)$-adversary $A$ that can break the encryption scheme with hidden keyword search in the game of Definition 1. In the following, we demonstrate how to use $A$ to construct a $(t_1, \varepsilon_1)$-algorithm $\eta_1$ that solves the one-way hash function with advantage $\varepsilon_1$. $\eta_1$ simulates the challenger $C$ and interacts with $A$ as follows:

Phase 1: The adversary $A$ issues the following kinds of queries adaptively:

(1) Setup: $\eta_1$ outputs the system parameters and the data owner's key pair, which are the same as those in Definition 1, and the group users' public keys ($y_i = g^{\vartheta_i} \bmod p$, $i = 1, 2, \ldots, n$, and $v = g^{\zeta} \bmod p$), where $\zeta$ and $\vartheta_i$ are random integers. After receiving the system parameters, the data owner's public key, and $z_i$, $A$ outputs the target $u_i \in U$ with public key $(v, z_i)$.

(2) Encryption queries: For an encryption query on a message $M$ chosen by $A$, $\eta_1$ first computes

$$\sigma = g^{k\alpha} \bmod p, \quad X_i = H(w_i, \sigma), \quad c_i = f_\tau(X_i)$$

for $1 \le i \le l$, where $X_i$ is $n$ bits long, then generates a sequence of pseudorandom values $e_i$ using the pseudorandom generator $G$, where each $e_i$ is $n-m$ bits long. Finally, it computes $F_{c_i}(e_i)$, appends it to $e_i$, and obtains the $n$-bit block $B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle$. It then computes

$$CT_i = X_i \oplus B_i, \quad C = M \oplus H(\sigma).$$

Finally, $\{C, CT_i\}$ is returned as the encryption result of this query.

(3) Trapdoor queries: For a trapdoor query for $w_i$, the algorithm $\eta_1$ first computes

$$\sigma' = z_i v^{k_i} \bmod p, \quad X_i' = H(w_i, \sigma'), \quad c_i' = f_\tau(X_i'),$$

and generates the trapdoor $T_i = \langle c_i', X_i' \rangle$.

By the setting of $CT_i$ above, we have

$$\sigma' = z_i v^{k_i} = g^{\alpha k} \bmod p = \sigma,$$

$$X_i' = H(w_i, \sigma') = H(w_i, \sigma) = X_i, \quad c_i' = f_\tau(X_i') = c_i,$$

$$B_i' = \langle e_i' \,\|\, r \rangle = CT_i \oplus X_i' = X_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle,$$

so $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct. Hence, $T_i$ is a valid trapdoor for $w_i$, and $\eta_1$ outputs $T_i$. Otherwise, it returns the symbol $\perp$.

(4) Challenge: The adversary $A$ produces two keywords $w_0$ and $w_1$. The challenger $C$ chooses a random bit $b \in \{0, 1\}$ and computes a trapdoor as follows:

$$\sigma' = z_i v^{\vartheta_i} \bmod p, \quad X_i' = H(w_i, \sigma'), \quad c_i' = f_\tau(X_i').$$

The trapdoor $T_i = \langle w, c_i', X_i' \rangle$ is returned as the result of this query.

Recall that $v = g^{\zeta} \bmod p$ and $y_i = g^{\vartheta_i} \bmod p$. If

$$\sigma' = z_i v^{\vartheta_i} = g^{\vartheta_i \zeta} \bmod p,$$

then $\sigma'$ is indeed a random trapdoor of $w_b$. If $\sigma'$ is a random integer, then the last element of $T_i$ is a random element, and therefore $\sigma'$ is independent of $b$.

Phase 2: The adversary $A$ issues new queries as in Phase 1. It is not allowed to make a trapdoor query for the target challenge $T_i$.

Analysis: If $\sigma' = g^{\vartheta_i \zeta} \bmod p$, the adversary $A$'s view in the simulated experiment is distributed identically to $A$'s view in the real experiment. Hence,

$$\Pr[\eta_1 = 1] = \Pr[b = b'].$$

On the other hand, when $\sigma'$ is uniformly distributed in $Z_p$, the adversary $A$ has no information about the value of $b$, and hence the probability that it outputs $b' = b$ is at most $1/2$. Therefore, the advantage of $\eta_1$ is

$$\mathrm{Adv}(\eta_1) = \varepsilon_1 \ge \Pr[b = b'] - 1/2 \ge \varepsilon. \qquad \square$$

C. Security Analysis 

(1) The proposed scheme provides data confidentiality, in the sense that the untrusted server cannot learn anything about the contents of the owner's outsourced data when given only the ciphertext, since server administrators do not know the encryption key $\sigma$. For the same reason, outsiders cannot read the owner's outsourced data. Authorized users, however, can access the outsourced data since they obtain the decryption key $\sigma$. An authorized user can access only the part that the owner allows him to see, not the whole database, because the outsourced database is divided among different groups $G_i$ and the data of different groups are encrypted under different keys. A user $u_i$ who is granted access to the resources of group $G_i$ by the database owner obtains only that group's decryption key, sent by the database owner, and cannot obtain the decryption keys of other groups. The proposed scheme thus assures that no one except the permitted users can search over the encrypted data and read it. Therefore, the proposed scheme satisfies data confidentiality.

(2) In the proposed scheme, a removed user can no longer search or access restricted data. When a user $u_B$ is removed from a group, the database owner updates the encryption key $\sigma$ to a new key $\bar{\sigma}$, and the server updates the encrypted data with the values $(Y_i, s)$ sent by the data owner, so that $\bar{C} = C \oplus s = M \oplus H(\bar{\sigma})$ (see equation (18)) and $\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \bar{\sigma})$ (see equation (19)). It is computationally infeasible for the revoked user $u_B$ to get any information about $\bar{\sigma}$. Therefore, the removed user $u_B$ cannot recover the decryption key $\bar{\sigma}$ and is prevented from accessing the restricted data.

(3) The proposed scheme can resist collusion attacks. Even if all users collude and give their secret shares $k_i$ to each other, they cannot reconstruct the polynomial $f(x)$, since they can obtain at most $n$ shares of the polynomial, which is fewer than the threshold $m$. Moreover, a user cannot use the public information together with his key pair to derive the decryption key, since the public information is $z_i$ and not $f(d_j)$.

(4) The proposed scheme achieves hidden searching. To search for the keyword $w_i$, a user must compute the trapdoor $T_i = \langle c_i, X_i \rangle$ and send it to the server, where

$$\sigma = z_i v^{k_i} \bmod p, \quad X_i = H(w_i, \sigma), \quad c_i = f_\tau(X_i).$$

The server searches for $w_i$ in the ciphertext according to $T_i$, and evidently does so without $w_i$ itself being revealed. Therefore, the proposed scheme allows a user to ask the server to search for the keyword $w_i$ without revealing $w_i$ to the server.

(5) The proposed scheme achieves controlled searching. Only a privileged group user can ask the server to search for the keyword $w_i$, since other users do not know $\sigma$ and cannot generate a valid trapdoor $T_i$.

D. Performance Analysis 

(1) The proposed scheme supports dynamic changes to the set of permitted group users, and the changes are transparent to the existing users, who are not involved in the process of adding or removing users. When a new user is granted access to a resource, that is, added to a group, there is no need to re-encrypt the resource or to update the decryption keys of the users already in the group. The new user's decryption key is encrypted and sent by the database owner to the new group user. Upon receiving the decryption key, the new group user can access the server, query the outsourced database, and decrypt the result received from the server. Therefore, the proposed scheme can grant new users access to a resource easily, efficiently, and quickly.

To remove a user from the resource, the data owner only needs to update the encryption key $\sigma$ to a new key $\bar{\sigma}$, and the server re-encrypts the resource under $\bar{\sigma}$ without $\bar{\sigma}$ being revealed to the server (see Section III). The secret keys of the users who retain access do not need to be updated, since these users can recover $\bar{\sigma}$ from the public information. Therefore, the proposed scheme removes a user from the group easily, efficiently, and quickly.

In many previous schemes, by contrast, when a group user is removed from the group, the database owner has to re-encrypt the data, transmit the encrypted data to the server, and transmit a large number of new decryption keys to all the authorized users. If a large encrypted database is frequently transmitted to the server over a limited channel and many new decryption keys are frequently transmitted to all the authorized users, the performance overhead is substantial and becomes impractical for large databases accessed by a dynamic group of users. The proposed scheme avoids having the database owner re-encrypt the data under a new key, transmit the re-encrypted data to the server, and transmit new decryption keys to all the authorized users. Therefore, the proposed scheme is efficient and practical for large databases accessed by a dynamic group of users.

(2) In the proposed scheme, the major computation in the system is shifted from the user to the database owner and can be done in the initialization phase. In terms of efficiency, the computation cost for each user to recover the secret key $\sigma$ is only one multiplication and one modular exponentiation. Because a symmetric encryption algorithm is used, the computational cost of trapdoor generation, encryption, and decryption is kept low; therefore, the efficiency of the proposed scheme is high.

The storage overhead includes only a key of constant size for each user, so the storage overhead of the scheme is very low. Moreover, the scheme does not require any interaction between database owner and server, server and user, or database owner and user when the decryption key is set up and updated.

V. CONCLUSION 

In this paper, we have presented an efficient and secure encryption scheme with hidden keyword search for outsourced databases. We also analyzed its security and performance and showed that the scheme is secure and practical for outsourced databases. Whenever the set of permitted group users changes, the data owner does not need to re-encrypt data, transmit the encrypted data to the server, or send a large number of new decryption keys to all the authorized users. Adding or removing a user is also simple, quick, and efficient. The proposed scheme can ensure the privacy and confidentiality of sensitive data against both inside and outside attackers.


This work was supported in part by a grant 61070164 

from the National Natural Science Foundation of China; 

by a grant 81510632010000022 from Natural Science 

Foundation of Guangdong Province, China; by grants 

2010B010600025 and 2010A032000002 from Science 

and Technology Planning Project of Guangdong Province, 

China. 

REFERENCES 

[1] D. Song, D. Wagner, and A. Perrig. Practical Techniques for Searching on Encrypted Data. In: IEEE Symposium on Research in Security and Privacy 2000, pp. 44–55.

[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order preserving encryption for numeric data. In Proc. of ACM SIGMOD 2004, Paris, France, June 2004.

[3] E. Damiani, S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati. Metadata management in outsourced encrypted databases. In Proc. of the 2nd VLDB Workshop on Secure Data Management (SDM'05), Trondheim, Norway, September 2005.

[4] R. Brinkman, J. Doumen, and W. Jonker. Using secret sharing for searching in encrypted data. In Proc. of the Secure Data Management Workshop, Toronto, Canada, August 2004.

[5] S. Paraboschi and P. Samarati. Modeling and assessing inference exposure in encrypted databases. ACM Transactions on Information and System Security, 8(1), pp. 119–152, February 2005.

[6] S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati. Over-encryption: Management of access control evolution on outsourced data. In VLDB, 2007.

[7] S. Liu, W. Li, and L. Y. Wang. Towards Efficient Over-Encryption in Outsourced Databases Using Secret Sharing. New Technologies, Mobility and Security, pp. 1–5, 2008.

[8] P. Golle, J. Staddon, and B. Waters. Secure conjunctive search over encrypted data. In: ACNS 2004, Lecture Notes in Computer Science, vol. 3089. Springer, 2004, pp. 31–45.

[9] P. Wang, H. Wang, and J. Pieprzyk. Keyword field-free conjunctive keyword searches on encrypted data and extension for dynamic groups. In: CANS 2008, Lecture Notes in Computer Science, vol. 5339. Springer, 2008, pp. 178–95.

[10] B. Zhang and F. Zhang. An efficient public key encryption with conjunctive-subset keywords search. Journal of Network and Computer Applications 34 (2011), pp. 262–267.

[11] Y. H. Hwang and P. J. Lee. Public Key Encryption with Conjunctive Keyword Search and Its Extension to a Multi-user System. Lecture Notes in Computer Science, 2007, Volume 4575/2007, pp. 2–22.

[12] J. W. Byun, H. S. Rhee, H. A. Park, and D. H. Lee. Off-line keyword guessing attacks on recent keyword search schemes over encrypted data. In: Proceedings of SDM'06. LNCS, vol. 4165, pp. 75–83.

[13] H. S. Rhee, J. H. Park, W. Susilo, and D. H. Lee. Trapdoor security in a searchable public-key encryption scheme with a designated tester. Journal of Systems and Software 2010, 83(5), pp. 763–71.

[14] Q. Tang. Revisit the concept of PEKS: problems and a possible solution. Technical report TR-CTIT-08-54, Centre for Telematics and Information Technology, University of Twente, Enschede. ISSN 1381-3625, 2008.

[15] J. Camenisch, M. Kohlweiss, A. Rial, and C. Sheedy. Blind and anonymous identity-based encryption and authorised private searches on public key encrypted data. In: PKC, Lecture Notes in Computer Science, vol. 433, 2009, pp. 196–214.

Xiaoming Wang received her Ph.D degree in Department of 

Mathematics from Nankai University in 2003. She is a professor 

of Department of Computer Science, Jinan University. Her 

research areas include database security, cryptography, network 

security, etc. 



A Method of Object-based De-duplication 

Fang Yan 

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China 

School of Information, BeiJing WuZi University, BeiJing, China 

Email: yanfang.joy@gmail.com 

YuAn Tan 

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China 

Email: victortan@yeah.net 

Abstract—Today, the world is increasingly awash in more 

and more unstructured data, not only because of the Internet, but also because data that used to be collected on paper or on media such as film, DVDs, and compact discs has moved online [1]. Most of this data is unstructured and in diverse formats such as e-mail, documents, graphics, images, and videos. For managing the complexity and scalability of unstructured data, object storage has a clear advantage. Object-based data de-duplication is currently the most advanced method and an effective solution for detecting duplicate data. It can detect common embedded data on the first backup across completely unrelated files, even when the physical block layout changes. However, almost all current research on data de-duplication ignores the content of different file types and has no knowledge of the backup data format. It has been shown that such methods cannot achieve optimal performance for compound files.

In our proposed system, we first extract objects from files; Object_IDs are then obtained by applying a hash function to the objects. The resulting Object_IDs are used as indexing keys in a B+ tree-like index structure; thus we avoid the need for a full object index, and the search time for duplicate objects is reduced to O(log n). We introduce a new concept, the duplicate object resolver. The object resolver mediates access to all the objects and is a central point for managing the metadata and indexes for all the objects. All objects are addressable by their IDs, which are globally unique. The resolver stores metadata in a triple format. This improved metadata management strategy allows us to set, add, and resolve object properties with high flexibility, and allows the same metadata to be reused among duplicate objects.

Index Terms—data de-duplication, object-based, backup, 

object index, metadata 

I. MOTIVATION 

Limited storage capacity is increasingly becoming the bottleneck of IT systems. There are two main reasons: first, the information revolution has led to far more data than in the past, with a flood of new data produced all the time; second, as computing and storage capacity increase, people tend to save all their data permanently, and physical capacity must be purchased for all allocated storage. Under this trend, more and more storage systems are under pressure to hold huge amounts of data, and the cost of storage has often risen to a shocking degree. To address these problems, data de-duplication technology is used to reduce the duplication of user data in daily backups, so that the volume of backup data is greatly reduced [2, 3].

doi:10.4304/jnw.6.12.1705-1712

Broadly speaking, there are three approaches to de-duplicating data: file-level data de-duplication, block-level data de-duplication, and object-level data de-duplication.

File-level de-duplication is the most basic form of de-duplication; it identifies identical files and stores them only once. Also known as Single Instance Storage, it is perhaps the easiest approach to implement. Its weak point is that if a file changes by even a single byte, the entire file needs to be stored again [4]. If a file is changed and saved under a different name, the entire file is also backed up again. This happens more often than one may think.
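The behavior described above can be sketched in a few lines. The class below is a hypothetical in-memory illustration of Single Instance Storage, assuming SHA-1 of the whole file content as the file identity; it is not taken from the paper.

```python
import hashlib

class SingleInstanceStore:
    """Stores each distinct file content once, keyed by its SHA-1 digest."""

    def __init__(self):
        self.blobs = {}      # digest -> file content
        self.names = {}      # file name -> digest

    def put(self, name: str, content: bytes) -> bool:
        """Store a file; return True if new content had to be written."""
        digest = hashlib.sha1(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.names[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]
```

Storing the same content under two names writes it once, while appending a single byte produces a different digest and forces a second full copy, which is the weakness noted above.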

Disk-based backup systems commonly use block-level data de-duplication, in which an identical block that occurs in different files is stored only once. Block-level de-duplication generally includes three steps: chunking, computing the hash, and finding and storing the unique chunk data. The backup file is partitioned into multiple data chunks, and duplicate chunks are identified by comparing their fingerprints, which are hash values computed by a hash function. If an identical data chunk is found, a pointer to the already stored chunk is inserted into the index node of the backup file; only non-duplicate data chunks are stored. The biggest differences among current block de-duplication implementations are the use of fixed-size versus variable-sized data chunks, and the use of sliding windows versus fixed offsets to define the address of a chunk. Fixed-size chunking partitions files into fixed-size data chunks whose size is always equal to the physical block size of the storage device, for example 8 KB or 16 KB. To tolerate shifted content, variable-sized chunking breaks a file into a sequence of chunks whose boundaries are determined by the local contents of the file, in contrast to fixed-size chunks [5]. The Basic Sliding Window Algorithm [6] is the prototypical variable-sized chunking algorithm.
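The three steps (chunking, fingerprinting, storing unique chunks) can be sketched for the fixed-size case as follows. This is a minimal illustration: the in-memory dictionary used as the chunk store, the SHA-1 fingerprint, and the function names are assumptions, and the variable-sized (sliding window) variant is not shown.

```python
import hashlib

CHUNK_SIZE = 8 * 1024        # fixed chunk size; 8 KB is one of the sizes mentioned above

def backup(data: bytes, chunk_store: dict, chunk_size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks and store only unseen ones.

    Returns the file recipe: the ordered list of chunk fingerprints.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint not in chunk_store:     # unique chunk: store it
            chunk_store[fingerprint] = chunk
        recipe.append(fingerprint)             # duplicate: keep only a pointer
    return recipe

def restore(recipe: list, chunk_store: dict) -> bytes:
    """Rebuild a file by following its recipe of fingerprints."""
    return b"".join(chunk_store[fp] for fp in recipe)
```

Because chunk boundaries sit at fixed offsets, inserting one byte at the start of a file shifts every later boundary, so most fingerprints change; this is the shifted-content problem that variable-sized chunking is meant to tolerate.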

The most useful area for file-level and block-level de-duplication is in backup workflows where exactly the same set of files is archived routinely and the change rate of the files is relatively low. In these workflows, the files are backed up regardless of whether they have changed, so it is highly likely that many blocks are common from one backup to the next. In general, these techniques work well for text-based or simple content and do not work well for compound file formats and workflows. Furthermore, in online versioning schemes such as snapshots, or in backup workflows where only the modified files are backed up, the likelihood of finding common blocks is very low. In such schemes, block de-duplication yields no benefit, and existing technologies for online archives (backups), snapshots, and mirroring become expensive.

This paper presents an object-based data de-duplication solution to these problems. In our proposed system, after file type detection, we first extract objects from files. Object_IDs are then obtained by applying a hash function according to the size and content of each object. The object resolver is a central point for managing all the metadata and indexes for all the objects. The advantage of object-based data de-duplication is that even if the physical layout of a file changes, which can happen with a simple save operation, the logical objects can still be detected and stored only once. Unlike file-level and block-level technologies, object-based de-duplication chunks the file into well-known logical objects such as images, paragraphs, worksheets, and slides.

II. SYSTEM ARCHITECTURE 

In many cases, because the same files or different versions of the information are used, the names and locations of the objects in compound files are the same. Alternatively, how the relevant documents were created is unknown, so we first parse the file before extracting objects. Accordingly, the system architecture is designed as shown in Figure 1.

[Figure 1. Object-based data de-duplication system structure. Components shown: input files, file parser, object extractor, duplicated object resolver, storage, file update log, metadata.]

The system includes the following major components: file parser, object extractor, duplicate object resolver, and storage. Input file formats may include .pdf, .ppt, .doc, .jpg, etc., depending on file type.

A. File Parser 

The system parses a file to determine whether it is compound or primitive and to determine its file type and attributes. It also determines the boundaries of the primitive objects within a compound file.

We divide files into two categories: compound objects and atomic objects. A compound object encapsulates a number of other objects; examples are ZIP files, PPT files, and Word documents. Compound objects are typically encoded representations of the union of their contained objects. There may be as many as 20 kinds of file extensions and more than 10 file encoding formats. Primitive objects are the most basic representations of discrete data structures, such as images and executable files.

B. Object Extractor 


• Step 1: extract objects

Atomic objects, such as JPEG images, CAD drawings and AVI clips, go directly to Step 2. Compound files differ in the document headers they use to identify the encoded sections and objects. Object extraction is recursive: layer after layer is uncovered until the lowest-level atomic objects are reached. Some compound files, such as PPT files, do not contain clearly delimited elements in the way HTML uses tags, so different types of compound documents require different extraction algorithms. In some cases, analyzing the file header is enough to determine which objects a file may contain and how they are encoded. For example, TIFF images have specific header information that describes the representation of the image and the compression algorithm that may have been used.

• Step 2: compute object fingerprints

Using a collision-resistant hash function such as SHA-1, each atomic object is assigned a globally unique 160-bit identifier called an object ID (Object Identifier). The fingerprint starts with a 32-bit field holding the size of the object. The size is cheap to obtain, and objects of different sizes are clearly not identical. The remaining 128 bits are computed by running a hash function over the contents of the object. The object ID is used not only for verification but also as a unique virtual address for requesting and locating an object: the underlying storage mechanism stores objects under their fingerprints and retrieves them by that name, so the scheme is independent of the actual physical storage.
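The fingerprint layout described in Step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `object_id` is hypothetical, and truncating a SHA-1 digest to 128 bits is an assumption, since the text does not say which 128-bit hash is applied to the content.

```python
import hashlib
import struct


def object_id(content: bytes) -> bytes:
    """Return a 20-byte (160-bit) ID: 32-bit size field + 128-bit content hash."""
    # 32-bit big-endian size field; objects of 4 GiB or more would not fit.
    size_field = struct.pack(">I", len(content))
    # Assumption: the 128-bit part is a SHA-1 digest truncated to 16 bytes.
    content_hash = hashlib.sha1(content).digest()[:16]
    return size_field + content_hash


example_id = object_id(b"an embedded image or paragraph")
print(example_id.hex())
```

Because the size occupies the leading bytes, two objects of different sizes can be told apart by comparing the first four bytes of their IDs alone.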


C. Duplicate object resolver 

The duplicate object resolver mediates access to all objects and is the central point for managing their metadata and indexes. The resolver knows the total set of objects, and every object is addressable by its ID, which is globally unique. The resolver is a singleton: it may be instantiated in multiple threads or processes, but all instances access the same underlying data storage and synchronization engine. The resolver provides the following services:

1) Metadata Services 

Metadata is an abstract concept that can exist independently of the data itself. Objects come in many variations, and each object type requires different parameters. Rather than having a different constructor for each object type, the resolver maintains consistency and flexibility by following a very simple pattern:

• We define metadata as a set of statements about objects, expressed as triples (Subject, Attribute, Value), where Subject is the object_ID the statement is about. An Attribute can be any kind of property or relationship, such as the size of an object, the number of the file from which the object was extracted, or a timestamp. A Value is the value of the attribute, which is either a textual value or another object_ID. All metadata reduce to this triple representation.

• Using this system, we can store arbitrary attributes about any object. The triples for an object list all of its attributes and their values. We call these "relations" or "facts". In the following, Obj denotes the global object domain.

• This flexibility allows duplicate objects to share the same metadata, allows different storage strategies for different types of objects, and allows third parties to extend the properties of an object type or to introduce new types to improve de-duplication efficiency.

• The duplicate object resolver constructs the object index tree from these facts and relations, and stores object metadata in triple format. Its services include setting, adding and resolving attributes for a given object_ID.
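The set, add and resolve services over (Subject, Attribute, Value) triples can be illustrated with a small in-memory sketch. The class and method names below are hypothetical; the paper does not specify an interface, and a real resolver would persist the triples rather than hold them in a dictionary.

```python
from collections import defaultdict


class TripleStore:
    """In-memory stand-in for the resolver's metadata service."""

    def __init__(self):
        # subject (object_ID) -> {attribute: value}
        self._facts = defaultdict(dict)

    def set_attribute(self, subject, attribute, value):
        """Set (or overwrite) a single-valued attribute."""
        self._facts[subject][attribute] = value

    def add_attribute(self, subject, attribute, value):
        """Append to a multi-valued attribute, e.g. the files containing the object."""
        self._facts[subject].setdefault(attribute, []).append(value)

    def resolve(self, subject, attribute, default=None):
        """Look up the value of an attribute for a given object_ID."""
        return self._facts.get(subject, {}).get(attribute, default)

    def triples(self, subject):
        """Return all facts about one object as (Subject, Attribute, Value) tuples."""
        return [(subject, a, v) for a, v in self._facts.get(subject, {}).items()]


store = TripleStore()
store.set_attribute("1010B34F", "type", "txt")
store.add_attribute("1010B34F", "filenum", 237)
store.add_attribute("1010B34F", "filenum", 169)
print(store.triples("1010B34F"))
```

Keeping every fact in the same triple shape is what lets new attributes or object types be added without changing the store itself.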

2) Object index and object de-duplication services 

First, it must be noted that two objects can be compared for identity only if they have the same encoding format; otherwise, only an approximate comparison is possible. Encoded files have this property: two documents that appear to contain similar or identical information may be represented by totally different bits on the storage medium. Most general compound file formats use different encoding schemes [7]. We therefore compare candidate duplicate objects according to their content encoding format.

Indexing plays an important role in the de-duplication process. In this work, the duplicate object resolver builds and searches a B+-tree-like structure for object indexing (see Section IV) to identify two or more duplicate atomic objects from one or more files.

III. OBJECT EXTRACTION GRANULARITY

Two large, similar objects may differ in only one byte of a large body of data, yet this single difference prevents de-duplication because the index is based on hash codes. The de-duplication granularity can therefore be chosen according to object type. We classify object content into text, images, audio, video and executable programs, and introduce an object size threshold that serves as the basis for object extraction.

The object size threshold is determined as follows:

A. Generate a sample file collection

We randomly select backup file sets from the backup system one or two times to form the sample file collection, which is placed in the storage pool.

B. Sample object classification

The system extracts and analyzes objects according to file type; sample objects of the same type are placed in the same collection.

C. Determine the range of candidate size thresholds 

Objects of different sizes are clearly not identical. Suppose there are n objects in the sample object collection. The distinct object sizes in the collection are represented by the set S:

\[ S = \{ s_1, s_2, \ldots, s_k \}, \quad k \le n, \quad s_i \ne s_{i+1}, \quad 1 \le i < k \tag{1} \]

Let $d_{\min} = \mathrm{MIN}(s_1, s_2, \ldots, s_k)$ denote the minimum object size in the sample object collection, and let $d_{\max} = \mathrm{MAX}(s_1, s_2, \ldots, s_k)$ denote the maximum object size in the sample object collection.

The range of candidate size thresholds is:

\[ D = [ d_1, d_2, \ldots, d_m ], \quad 1 \le m \le k \tag{2} \]

To be consistent with the minimum average block size of 256 B specified in the backup system, the candidate thresholds satisfy conditions (3) to (6):

\[ d_1 = d_{\min} \tag{3} \]

\[ d_2 = \min(d_{\min} + \varepsilon), \quad \varepsilon = 1, 2, 3, \ldots, \quad \text{if } (d_{\min} + \varepsilon) \bmod 256 = 0 \tag{4} \]

\[ d_{i+1} = d_i + 256, \quad 2 \le i \le m - 2 \tag{5} \]

\[ d_m = \min(d_{\max} + \varepsilon), \quad \varepsilon = 1, 2, 3, \ldots, \quad \text{if } (d_{\max} + \varepsilon) \bmod 256 = 0 \tag{6} \]
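Conditions (3) to (6) amount to starting at the smallest observed size and then stepping through multiples of 256 bytes until the first multiple above the largest size. The sketch below follows that reading; the function name `candidate_thresholds` and the `step` parameter are illustrative, and it does not enforce the bound on m stated in (2).

```python
def candidate_thresholds(sizes, step=256):
    """Candidate size thresholds following conditions (3) to (6)."""
    d_min, d_max = min(sizes), max(sizes)
    # Condition (4): smallest multiple of `step` strictly above d_min.
    first = (d_min // step + 1) * step
    # Condition (6): smallest multiple of `step` strictly above d_max.
    last = (d_max // step + 1) * step
    # Condition (3) gives d_1 = d_min; condition (5) steps by `step` in between.
    return [d_min] + list(range(first, last + 1, step))


# Object sizes in bytes, taken from the worked example later in the paper.
print(candidate_thresholds([245, 1010, 1067, 3035]))
```

For sizes between 245 and 3035 bytes, the list starts at 245, continues with 256, 512 and so on, and ends at 3072.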


D. Generate object size thresholds 

For each type of object in the sample collection, the system traverses the range of candidate thresholds. For each candidate threshold, any object larger than the threshold is divided into smaller objects of the threshold size. We then calculate the data compression ratio (DCR) produced by the candidate threshold:

\[ \mathrm{DCR} = \frac{\text{Initial Dedup\_ObjTS}}{\text{Dedup\_ObjTS}} \tag{7} \]

where Initial Dedup_ObjTS is the total amount of data after de-duplication based on the original object sizes, and Dedup_ObjTS is the total amount of data after de-duplication based on the candidate threshold value.

The candidate threshold that produces the maximum DCR is selected as the size threshold for that object type.
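The selection rule built on equation (7) can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: all function names are hypothetical, objects are split into consecutive pieces of exactly the threshold size (with a shorter final piece), and SHA-1 is used to detect identical pieces.

```python
import hashlib


def deduped_size(chunks):
    """Total bytes stored when identical chunks are kept only once."""
    unique = {}
    for chunk in chunks:
        unique[hashlib.sha1(chunk).digest()] = len(chunk)
    return sum(unique.values())


def split(obj, threshold):
    """Split an object larger than the threshold into threshold-sized pieces."""
    if len(obj) <= threshold:
        return [obj]
    return [obj[i:i + threshold] for i in range(0, len(obj), threshold)]


def dcr(objects, threshold):
    """Equation (7): Initial Dedup_ObjTS divided by Dedup_ObjTS."""
    initial = deduped_size(objects)
    after = deduped_size([p for o in objects for p in split(o, threshold)])
    return initial / after


def best_threshold(objects, candidates):
    """Candidate threshold with the maximum DCR."""
    return max(candidates, key=lambda d: dcr(objects, d))


# Two 768-byte objects that share their first 512 bytes.
objects = [b"A" * 512 + b"B" * 256, b"A" * 512 + b"C" * 256]
print(best_threshold(objects, [256, 512, 768]))
```

In this toy input the two objects are distinct as wholes, so nothing is saved without splitting; a 256-byte threshold exposes the repeated pieces and halves the stored data.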

E. Save threshold 

We establish a mapping between each object type and its size threshold, and save it in the object-type threshold library.

IV. OBJECT INDEX MECHANISM 

In a de-duplication system, data block comparison is the most frequent operation, because the most important task in de-duplication is to compare all data blocks to determine whether the data has already been stored. The traditional comparison method keeps a unique hash value for each block in a hash value database. However, the complexity of a query against such a database is generally of linear or logarithmic order, so the efficiency of block comparison gradually decreases as the data grows. In a large-scale de-duplication system this has a great impact and lowers the operating efficiency of the system. How to make comparison efficiency independent of the size of the backup data, and thereby improve the operating efficiency of large-scale backup systems, is therefore the main problem in a data de-duplication system [8, 10].

Our proposed object index mechanism for data de-duplication is based on the B+ tree index structure. Its search time is O(log n), which is more efficient than the O(n) of a full scan of the index. The duplicate object resolver constructs the index tree from the extracted object fingerprints and object information. Because a B+ tree is balanced, the sub-trees of every non-leaf node hold balanced numbers of nodes. Compared with binary search over a contiguous memory region, its advantage is that changing the tree (inserting and deleting nodes) does not require moving large segments of memory and usually incurs only a constant overhead.

The proposed indexing mechanism is shown in Figure 2, in which Object_IDn is the object identifier, "Object_IDn's MetaData" is the metadata of that object, and Objectn is the content of the object.

[Diagram: a B+-tree-like index whose nodes hold object identifiers (Object_ID5, Object_ID10, Object_ID20, Object_ID27, Object_ID30, Object_ID50, Object_ID64, Object_ID75, ...). Each leaf entry points to that object's metadata node, to an object relation node (for example "file123,file789,Object_ID5") and to the stored object content.]

Figure 2. The object index structure

The path of an object index contains the following types of nodes:


A. Object index node 

An object index node consists of object identifiers (Object_IDn). The objects in each node are ordered by size.

B. Object metadata node 

The metadata node holds the object identifier, object size, object type, object encoding format, the object's location in the document, and so on, stored in the form of triples. Object metadata can be stored in an external SQL server.

C. Object Relation node 

An object relation node describes the relationship between two objects. Relations are stored in a file format in which each line contains an object hash code and a filename. In practice, it is much more efficient to refer to a filename and its long directory path via a short index number into a separate table of filenames stored in a database [9].

When multiple file numbers refer to an identical object, each relation lists the number of the first file, which contains the object already stored, then the number of the second file, which contains the duplicate object, followed by the fingerprint of the identical object. The relation nodes thus implicitly include the file-to-file similarity pairs. In the future, we can use the well-known union-find algorithm to determine clusters of interconnected files and then compare the similarity of the files.
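The clustering step mentioned as future work can be sketched with a standard union-find structure over relation entries of the form (first file, second file, fingerprint). The function names and the tuple layout are assumptions made for illustration; only the two relation entries for files 237 and 169 echo the paper's example, and the third entry is invented.

```python
def find(parent, x):
    """Return the representative of x's cluster, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the clusters that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def file_clusters(relations):
    """Group file numbers that are linked by at least one shared object."""
    parent = {}
    for first_file, second_file, _fingerprint in relations:
        union(parent, first_file, second_file)
    clusters = {}
    for file_number in list(parent):
        clusters.setdefault(find(parent, file_number), set()).add(file_number)
    return sorted(sorted(c) for c in clusters.values())


relations = [
    (237, 169, "1010B34F2C127EBDF18526F6323F3E2D2E3D"),
    (237, 169, "3035805C4E32BF559232DDA4D1FBF161D068"),
    (400, 401, "hypothetical-fingerprint"),
]
print(file_clusters(relations))
```

Files that end up in the same cluster are the natural candidates for a more detailed similarity comparison.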

D. Object Content node 

Object Content node is used to store the contents of 

the object. 

V. OBJECT DE-DUPLICATION PROCESS

For each input file, whether a document such as .pdf, .doc, .ppt or .txt or an archive such as .zip, .rar or .tar, the system performs the following steps:

• Step 1: Accept the input file;

• Step 2: Analyze the file type;

• Step 3: Extract objects from the file and compute their Object_IDs;

• Step 4: Check whether duplicate objects exist by comparing object fingerprints, composed of object size and hash code, using the object indexing mechanism;

• Step 5: If the object is a duplicate, update the object relation node. Otherwise, insert the object index node and metadata, then store the new data.
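Steps 4 and 5 can be sketched as follows for objects that have already been extracted. This is a simplified illustration: the `Resolver` class and its fields are hypothetical, a Python dictionary stands in for the B+ tree index, and the fingerprint uses the assumed layout of a 32-bit size followed by a truncated SHA-1 digest.

```python
import hashlib
import struct


def fingerprint(obj: bytes) -> bytes:
    """32-bit size followed by 128 bits of SHA-1 (the truncation is an assumption)."""
    return struct.pack(">I", len(obj)) + hashlib.sha1(obj).digest()[:16]


class Resolver:
    def __init__(self):
        self.index = {}      # fingerprint -> metadata (stand-in for the index tree)
        self.relations = []  # (first file, duplicate file, fingerprint)
        self.storage = {}    # fingerprint -> object content

    def add_file(self, file_number, objects):
        """Apply steps 4 and 5 to one file; return the number of bytes newly stored."""
        stored = 0
        for obj in objects:
            fp = fingerprint(obj)
            meta = self.index.get(fp)
            if meta is not None:
                # Duplicate: only record the relation between the two files.
                self.relations.append((meta["filenum"], file_number, fp))
            else:
                # New object: insert index entry and metadata, then store the data.
                self.index[fp] = {"filenum": file_number, "size": len(obj)}
                self.storage[fp] = obj
                stored += len(obj)
        return stored


resolver = Resolver()
resolver.add_file(237, [b"a" * 245, b"b" * 1010, b"c" * 1067, b"d" * 3035])
new_bytes = resolver.add_file(169, [b"x" * 2569, b"d" * 3035, b"b" * 1010, b"y" * 1965])
print(new_bytes, len(resolver.relations))
```

The second call stores only the two objects that file 237 did not already contain and records two relation entries for the shared ones.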

[Diagram: File237 and File169, each shown with its extracted objects a to d (object size in bytes in brackets), the content hash of each object, and the duplicate objects shared by the two files.]

Figure 3. Object extraction diagram


Figures 3 to 5 illustrate the object de-duplication process. The example includes a PDF file (File237 in Figure 3) and a PPT file (File169 in Figure 3). The content boxed by a dashed line represents a unit that can be treated as an independent object. As shown, objects a, b, c and d are extracted from each file (the number in brackets is the object size in bytes), and a content hash is calculated for each object.

Assume that the system has already stored the objects of File237. The structure of the object index tree before the objects of File169 are inserted is shown in Figure 4.

Figure 4. Object index tree

File169 contains two duplicate objects. The object index tree after the insert operation is shown in Figure 5.

[Diagram: the object index tree after inserting File169. Index keys shown: 2459D32..., 1010B34F..., 106732B5..., 196532BC..., 25694E31..., 3035805C... The two duplicate objects carry the following nodes.

Object metadata node: Obj:ID = 1010B34F2C127EBDF18526F6323F3E2D2E3D; Obj:ID:filenum = 237; Obj:ID:type = txt; Obj:ID:stored = 5bdbf7bcd8a540cb9af0fd7e4d0e2c9e

Object relation node: 237,169,1010B34F2C127EBDF18526F6323F3E2D2E3D

Object metadata node: Obj:ID = 3035805C4E32BF559232DDA4D1FBF161D068; Obj:ID:filenum = 237; Obj:ID:type = image; Obj:ID:stored = 479ef7bce9n340cb9af0fd7e4d0e18a

Object relation node: 237,169,3035805C4E32BF559232DDA4D1FBF161D068]

Figure 5. Object index tree


The example above shows that object-based de-duplication can detect common embedded data across unrelated files, even when the physical block layout changes. Block-level de-duplication, by contrast, has no idea where a logical object begins and ends. As a result, the chunking process splits the images embedded in files, and because the images sit at different positions in each file, the duplicate data is not detected at all.
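The effect of a shifted embedded object on fixed-size blocks can be reproduced with a toy example. Everything below is illustrative: the "files" are byte strings built by hand, the block size is 16 bytes rather than 16 KB to keep the example small, and the object boundaries are supplied directly instead of being found by a file parser.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 16):
    """Set of SHA-1 digests of consecutive fixed-size blocks."""
    return {hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)}


image = bytes(range(64))            # stand-in for an embedded image
file_a = b"HEADER--" + image        # image starts at offset 8
file_b = b"LONGER-HEADER" + image   # the same image starts at offset 13

# Block-level view: the blocks are cut at different image offsets in each file.
shared_blocks = block_hashes(file_a) & block_hashes(file_b)

# Object-level view: the parser is assumed to know where the image begins.
object_a, object_b = file_a[8:], file_b[13:]
objects_match = hashlib.sha1(object_a).digest() == hashlib.sha1(object_b).digest()

print(len(shared_blocks), objects_match)
```

No block is shared between the two files, while the extracted objects hash to the same value, which is the behavior the paragraph above describes.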

VI. EVALUATION

This paper focuses on one evaluation aspect for data backup: the de-duplication ratio achieved by the proposed method. We chose two representative data sets. The first is a collection of compound files; a compound file often contains text, figures, and audio or video clips. The details of data set 1 are given in Table I, where "# of files" is the number of files. The second is a collection of source code, which is typically versioned; this data set consists of 450 versions, from 1.2.1 to 2.5.75, with a total size of 26 GB.

We use the two data sets and four full backups for our evaluation. We performed three kinds of de-duplication: file-level, block-level and object-level. SHA-1 is used as the hash algorithm; it generates a 160-bit fingerprint for each file, chunk or object. Block-level de-duplication uses fixed-size blocks, for which we chose 16 KB. The experimental results are shown in Figure 6 and Figure 7.

TABLE I.

BACKUP DATASET1

Backup  Type  Size (KB)   # of files
1st     PDF   4,113,862   6020
        PPT   335,006     562
2nd     PDF   1,113,862   1420
        PPT   34,019      108
3rd     PDF   5,002,635   6421
        PPT   310,006     511
4th     PDF   263,943     2501

We can draw a few conclusions from the results. The improvement differs between the data sets. Object-based de-duplication effectively improves the de-duplication ratio for data set 1, because it mainly benefits unstructured data sets. In our experiments, the improvement for data set 2 over block-level and file-level de-duplication is not obvious.

Note that our evaluation system is a research prototype rather than a production-quality storage de-duplication system. Hence, our experimental results should not be used for absolute comparison with other storage de-duplication systems. We will carry out more comprehensive experiments in future work, especially on data indexing and metadata management.

VII. CONCLUSION AND FUTURE WORK 

Existing file-based and block-based data de-duplication technology is well suited to text and simple content, but not to compound documents. This paper proposes an object-based de-duplication framework and an efficient object index mechanism that speeds up the search for duplicate objects. The framework can detect common embedded data in the first backup across completely unrelated files, even when the physical block layout changes. As a result, object-based de-duplication is more efficient than block-based de-duplication for compound files.

Future work includes: a) implementing the framework; b) improving processing speed by moving most computations to the graphics processing unit (GPU), which we expect will reduce the time spent on intensive computations such as object extraction and fingerprint computation.

[Plot: deduplication ratios (y-axis, 0% to 45%) for backups 1 to 4 (x-axis); series: fixed-block, whole file, object.]

Figure 6. De-duplication efficiency comparison of data set 1


[Plot: deduplication ratios (y-axis, 0% to 45%) for backups 1 to 4 (x-axis); series: fixed-block, whole file, object.]

Figure 7. De-duplication efficiency comparison of data set 2

REFERENCES

[1] Dell Product Group, Object Storage: A Fresh Approach to Long-Term File Storage, A Dell Technical White Paper.

[2] Tony A, Biggar H. Data De-Duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. The Enterprise Strategy Group Technical Report. 2007.

[3] Biggar H. Experiencing in Data De-Duplication: Improving Efficiency and Reducing Capacity Requirements. The Enterprise Strategy Group Technical Report. 2007.

[4] William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur, Single Instance Storage in Windows 2000, In Proceedings of the 4th USENIX Windows Systems Symposium, USENIX Association, Berkeley, CA, USA, 2000.

[5] An in-depth look at data deduplication methods, The Enterprise Strategy Group Technical Report, www.falconstor.com.

[6] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP'01), pages 174–187, Banff, Canada, October 2001.

[7] Goutham Rao, San Jose, Eric Brueggemann, Carter George, Object deduplication and application aware snapshots, patent application publication, US, 2010.

[8] Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the Data Domain deduplication file system. In: Proceedings of the 6th USENIX Conference on File and Storage Technologies. 2008.

[9] George Forman, Kave Eshghi, Stephane Chiocchetti, Finding Similar Files in Large Document Repositories. In the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago, USA, August 2005.

[10] R. Bayer and E. McCreight, "Organization and Maintenance of Large Ordered Indices", Acta Informatica, Volume 1, Springer, Berlin/Heidelberg, New York, 1972, pp. 173-189.

[11] S. Walter, T. Thiago, M. Carla and Jr. Wagner Meira, "A Scalable Parallel Deduplication Algorithm", 19th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, Brazil, 2007, pp. 79-86.

[12] W. You et al., "PRUN: Eliminating Information Redundancy for Large Scale Data Backup System", International Conference on Computational Sciences and Its Applications (ICCSA 2008), IEEE Computer Society, Italy, 2008.

[13] V. Henson and R. Henderson. Guidelines for Using Compare-by-Hash. Forthcoming, 2005. http://infohost.nmt.edu/~val/review/hash2.html

[14] Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse indexing: large scale, inline deduplication using sampling and locality. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies. 2009.

[15] Quinlan S, Dorward S. Venti: a new approach to archival storage. In Proceedings of the Conference on File and Storage Technologies. 2002, 89–101.

Fang YAN, born in 1980, is a Ph.D. candidate at Beijing Institute of Technology, Beijing, China. Her research interests include data de-duplication and network storage. She is a senior lecturer in the Department of Information, Beijing Wuzi University.

Yuan TAN, Beijing, China, was born in 1972 and holds a Ph.D. in computer science. His current research interests include information security and network storage. He is a professor and Ph.D. supervisor at Beijing Institute of Technology, Beijing, China, and a senior member of the China Computer Federation.


Analysis on E-consumers’ Purchasing Behavior 

Based on Data-driving Model 

Lijuan Huang 

Information Management College of Jiangxi University of Finance and Economics, Nanchang, 330013, China 

Email: huanglijuan66s@126.com 

Abstract—The Internet, with its vast sea of online purchasing data, makes research models of e-consumers' purchasing behavior very different from traditional ones. First, this paper proposes three kinds of research models of consumers' purchasing behavior and points out that the data-driving model is the best one for analyzing e-consumers' purchasing behavior on the Internet. Second, it adopts an improved SOFM neural network as the tool of the data-driving model to analyze e-consumers' purchasing behavior in Internet marketing in detail. Finally, experimental results demonstrate that the method offers better visualization, exactness and robustness. Because consumer purchasing behavior analysis based on the SOFM neural network is a comparatively novel method, the research results in this paper are offered for reference only.

Index Terms—Internet marketing, purchasing behavior, 

neural network, data-driving model 


Research on the characteristics of consumers' purchasing behavior dates back to eighteenth-century England. At that time, large numbers of farmers poured into the cities. These new urban residents showed faith in products that could demonstrate their social status, and their faith in and attitude toward these products drew attention to consumer behavior [1]. Modern research on consumer behavior originated and developed from a paper named Consumer Analysis, published by Guest in the Annual Review of Psychology in 1962 [2]. Afterwards, many celebrated scholars did active work on the characteristics of consumer behavior. For example, Engel, Kotler and Cliff Allen proposed the T-I-K model of consumer behavior in 1993; Solomon, Schiffman and Kanuk raised the U-S-E model in 1999; and J. Paul Peter and Jerry C. Olson presented the S-C-T model in 2000 [3-6]. However, all of this research follows either an experience-driving or a theory-driving research model. The author believes that, with the development of modern science and technology, and especially of neural networks, data mining, artificial intelligence and multi-disciplinary technology, research models of consumer behavior should also include a data-driving model. These three kinds of research models are described in Table I.


doi:10.4304/jnw.6.12.1713-1718 

TABLE I. 

RESEARCH MODEL OF CONSUMERS’ PURCHASING BEHAVIOR 

Method 1: Experience-driving model

The researcher communicates with consumers through speech, facial expression and other body language, and then analyzes consumers' purchasing behavior based on the researcher's own experience. However, in the virtual world of the Internet there is a large amount of data about e-consumers' purchasing behavior, and the researcher loses the chance to communicate with consumers face to face, so analysis based on the experience-driving model becomes ineffective.

Method 2: Theory-driving model

The research steps of the theory-driving model are shown in Fig. 1. In this research mode, the researcher first derives a theoretical model from purchasing behavior theories, then uses purchasing data to test and modify the model repeatedly, and finally uses the resulting model to deduce and analyze consumers' purchasing behavior. This research mode often yields unreliable analysis results because the purchasing behavior theories are imperfect or even wrong.

Method 3: Data-driving model

The research steps of the data-driving model are shown in Fig. 2. In this research mode, the researcher first selects an appropriate intelligent algorithm; a model is then drawn from purchasing data and modified repeatedly using these data; finally, the resulting model is used to deduce and analyze consumers' purchasing behavior. The data-driving model is based on real data rather than on personal experience or pure theory, and it realizes the scientific idea of letting the data speak for themselves. The resulting analysis of consumers' purchasing behavior is therefore more scientific, objective and fair.

Table I shows that it is difficult to use the experience-driving model to analyze the characteristics of online consumer purchasing behavior, whereas the theory-driving model or the data-driving model may be appropriate. As Table I, Fig. 1 and Fig. 2 indicate, the data-driving model is more objective, scientific and unbiased than the experience-driving or theory-driving model for analyzing e-consumers' purchasing behavior.


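[Diagram with boxes: Purchasing behavior theories; Model deducting (①); Data warehouse; Model; Model modifying (②); Drawing conclusion from the model (③); Feature of purchasing behavior.]

Figure 1. Analyzing consumers' purchasing behavior based on theory-driving model

[Diagram with boxes: Algorithm selecting (①), e.g. DM, ANN; Data warehouse; Model; Model modifying (②); Drawing conclusion from the model (③); Feature of purchasing behavior.]

Figure 2. Analyzing consumers' purchasing behavior based on data-driving model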

Therefore, the data-driving model is the most suitable for analyzing the characteristics of online consumers' purchasing behavior. Because all input data come from the consumers, it also fully reflects the idea that "the consumer is God". Because the Self-Organizing Feature Map Neural Network (SOFM NN) is a typical data-driving model, this paper takes the SOFM NN as the tool for analyzing e-consumers' purchasing behavior. The basic principles of the SOFM NN are described as follows.

II. BASIC PRINCIPLES OF THE SOFM NEURAL NETWORK

In 1981, the Finnish scholar Teuvo Kohonen first proposed the concept of the SOFM NN [7], which simulates the way the brain responds to different kinds of input signals (e.g. light or sound signals) and automatically sorts them into different zones of the brain layer [8]. When a large amount of consumer purchasing data is fed into an SOFM NN, the e-consumers are objectively and automatically clustered into groups according to the similarity of their purchasing data; that is, the differences between consumers in the same group are minimized and the differences between groups are maximized. Analyzing the distinct features of these consumer groups helps in designing targeted marketing strategies for promotion, service, price and so on. This avoids the risks of applying a uniform strategy to all consumers, spending heavily on unimportant consumers, or losing potential VIP consumers through an unscientific ranking of service.

A. Topology Structure of the SOFM NN 

The typical SOFM NN (see Fig. 3) forms a topological structure of the input signals on a one-dimensional or two-dimensional cellular array [8], so the SOFM NN is able to extract the features of the input signal patterns [9]. The SOFM NN commonly includes only a one-dimensional or two-dimensional array, but it can also be extended to a multi-dimensional cellular array [10-12]. To improve the stability and operating efficiency of the SOFM NN, we add a feedback loop to the traditional SOFM NN to obtain the improved SOFM NN (see Fig. 4).

[Diagram: input layer, competitive layer, victorious neuron.]

Figure 3. Topology structure of the traditional SOFM NN

[Diagram: input layer, competitive layer, victorious neuron, feedback loop.]

Figure 4. Topology structure of the improved SOFM NN

The improved SOFM NN is composed of the 

following four parts. 

• Cellular array for recognizing: This is mainly 

used for receiving the input signals and forming 

the “discrimination function” to recognize the 

input signals. 

• Mechanism for comparing and choosing: This is 

used for comparing these “discrimination 

functions” and making a decision to choose a 

processing unit with stronger functional output 

signals. 

• Local inter-connection and inter-action: This is used for stimulating both the chosen signal processing unit and its neighboring signal processing units.

• Self-adapting process: This is used for modifying 

the parameters of stimulated processing unit so 

that it can increase the output value of the given 

“discrimination function”. 

B. The SOFM NN’s Algorithm 

The SOFM NN algorithm is described as follows.

1) Initialization: choose the "nearby neuron" set $S_j(0)$ of output neuron $j$; the connection weight $w_{i,j}(0)$ between input neuron $i$ and output neuron $j$ is computed as in equation (1).

\[ w_{i,j}(0) = \frac{1}{n} \sum_{X \in S(k)} X_{\mathrm{PAM}} \tag{1} \]

2) Calculating the Euclidean distance: the distance $d_j(t)$ between the input sample and every output neuron $j$ is calculated as in equation (2).

\[ d_j(t) = \ln\left(\lVert X - w_j \rVert\right) = \ln\left(\sum_{i=1}^{n} \left[x_i(t) - w_{i,j}(t)\right]^2\right) \tag{2} \]

3) Defining a neighborhood function: the neighborhood function $S_j(t)$ is expressed in equation (3), where $S_j(t)$ decreases as time goes on.

$$ S_j(t) = S_j(0) \exp\left(-\frac{d_j(t)}{2\sigma^2}\right) \tag{3} $$

4) Working out the minimum distance: the minimum distance $\min(d_j)$ among the corresponding neurons is calculated by equation (4).

$$ \min(d_j) = \arg\min_j \sum_{i=1}^{n} [x_i(t) - w_{i,j}(t)]^2 \tag{4} $$

5) Setting the learning rate: the learning rate $\eta$ is computed by equation (5), where $\eta$ decreases to zero as time t goes on.

$$ \eta(t) = \eta(0) \exp\left(-\frac{t}{\tau}\right) \tag{5} $$

6) Modifying the weight value: the weight variation $\Delta w_{ij}(t)$ is computed by equation (6). When $\Delta w_{ij}(t)$ reduces to zero, the topology structure of the SOFM NN is most stable.

$$ \Delta w_{ij}(t) = \begin{cases} \eta(t)\,[x_i(t) - w_{ij}(t)], & X \in S(k) \\ 0, & X \notin S(k) \end{cases} \tag{6} $$

7) Offering new learning samples and repeating the learning process above with t ← t+1, until $\eta(t)$ decreases to 0 or becomes small enough; the network learning process then terminates.
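The learning loop in steps 1) to 7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the weights are initialized from randomly chosen samples rather than by equation (1), the neighborhood is a grid radius that shrinks with the same time constant as the learning rate, and the feedback loop of the improved SOFM NN is not modeled because the text does not specify it. The function names `find_winner` and `train_sofm` and the sample values are hypothetical.

```python
import math
import random


def find_winner(weights, x):
    """Return the grid position of the unit closest to x (equation (4))."""
    return min(
        weights,
        key=lambda u: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[u])),
    )


def train_sofm(samples, rows=2, cols=2, eta0=0.5, tau=50.0, steps=200, seed=0):
    """Train a small rectangular map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    # Assumption: weights start from randomly chosen samples, not equation (1).
    weights = {
        (r, c): list(rng.choice(samples)) for r in range(rows) for c in range(cols)
    }
    for t in range(steps):
        x = rng.choice(samples)
        winner = find_winner(weights, x)
        eta = eta0 * math.exp(-t / tau)          # learning rate, equation (5)
        radius = math.exp(-t / tau)              # assumed shrinking neighborhood
        for unit, w in weights.items():
            grid_dist = math.hypot(unit[0] - winner[0], unit[1] - winner[1])
            if grid_dist <= radius:              # unit lies in the neighborhood
                for i in range(len(w)):
                    w[i] += eta * (x[i] - w[i])  # weight update, equation (6)
    return weights


# Hypothetical normalized samples forming two loose groups.
samples = [[0.1, 0.2], [0.15, 0.1], [0.9, 0.8], [0.85, 0.95]]
weights = train_sofm(samples, rows=1, cols=2)
clusters = [find_winner(weights, s) for s in samples]
```

Each sample is assigned to the grid position of its winning unit, which plays the role of the cluster label.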

III. AN EXAMPLE OF ANALYZING E-CONSUMERS’ 

PURCHASING BEHAVIOR 

Because selling books is a typical form of E-business, this paper takes the consumers of a book business website as an example to analyze e-consumers’ purchasing behavior [13].

A. Main Clustering Variables 

Most of the customer data come from the online transaction records of a famous book website in China (dingdang.com) [1]. These data can be divided into two groups: customers’ attribute data and transaction data. Customers’ basic attribute data mainly include the customer’s name, gender, age, income, educational status, occupation, city, marital status, enrolment time, home address, hobby, etc. Transaction data mainly include shopping time, frequency of shopping, consumption of shopping, product name, price, way of paying (e.g. cash on delivery, cash on postage and credit card), latest shopping time, etc.

The main clustering variables of the SOFM neural network are listed in Table II, where the variables labeled (*) are the clustering variables.

TABLE II.
MAIN CLUSTERING VARIABLES OF THE SOFM NEURAL NETWORK

Total amount of purchase    Monthly income    Frequency of shopping    Latest time of shopping
x1 (*)                      x2 (*)            x3 (*)                   x4 (*)
Age                         Gender            Educational status       District
x5                          x6                x7                       x8

B. Sample Data of Consumers’ Behavior 

There are 5000 sample records; limited by the length of this paper, only part of the samples are listed in Table III, where the capitalized variables are standardized in the domain [0, 1].

TABLE III.
CONSUMING SAMPLE DATA FROM E-MARKET

Cust-ID    Total amount of purchase (X1)    Monthly income (X2)    Frequency of shopping (X3)    Latest shopping time (X4)
1001       0.9260                           0.9454                 0.9720                        0.9335
1002       0.7549                           0.6950                 0.7273                        0.6918
1003       0.8118                           0.8975                 0.7586                        0.7324
1004       0.7982                           0.6825                 0.8180                        0.7817
1005       0.6532                           0.5816                 0.6609                        0.5141
…          …                                …                      …                             …

Sample data can be normalized with a system function such as premnmx() or with a user-defined function. In this paper, we adopt the Min-Max standardization method, which maps the data into the domain [0, 1], as shown in equation (7).

$$ X(i) = \frac{x(i) - \min\{x(i)\}}{\max\{x(i)\} - \min\{x(i)\}} \tag{7} $$
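As a small illustration of equation (7), the following Python sketch normalizes one column of raw values into [0, 1]. The function name `min_max_normalize` and the raw values are hypothetical, and mapping a constant column to 0 is an assumption, since equation (7) is undefined when the maximum equals the minimum.

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1] following equation (7)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw monthly incomes for three customers.
raw_income = [10, 20, 30]
normalized_income = min_max_normalize(raw_income)
```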


C. Design of the SOFM Neural Network 

1) Topology Structure 

There are three kinds of topology structure: rectangular, hexagonal and random. The three corresponding functions, gridtop(), hextop() and randtop(), describe the different topology structures of the neuron area [14]. Here we take the 6*4 random topology structure (shown in Fig. 5).

Figure 5. 6*4 Random topology structure 

2) Main Programming Codes 

We first use the function newsom() to create a SOFM neural network; then we use the functions train() and sim() to train and simulate the newly created network in turn. Different numbers of training steps affect the efficiency of self-recognition differently. Here, we set the training steps to 1000, 3000, 5000 and 10000 and observe the clustering efficiency in each case. The main program code is as follows:

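    % X holds the normalized clustering variables, one column per sample
    net = newsom(minmax(X), [6, 4], 'randtop');   % 6*4 map with random topology
    a = [1000 3000 5000 10000];                   % training steps to compare
    yc = rands(1, 10);                            % placeholder, overwritten in the loop
    for i = 1:4
        net.trainParam.epochs = a(i);
        net = train(net, X);
        figure;
        w1 = net.IW{1,1};                         % weights of the competitive layer
        plotsom(w1, net.layers{1}.distances);     % plot the weight value structure
        y = sim(net, X);
        yc = vec2ind(y)                           % index of the winning neuron per sample
    end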

D. Analysis on the Result of Training and Computing 

1) Network’s Weight Value Structure

The performance of the SOFM neural network differs greatly with the number of training steps. In this paper, we take only four different numbers of training steps, namely 1000, 3000, 5000 and 10000; the corresponding weight value structures of the network are shown in Fig. 6, Fig. 7, Fig. 8 and Fig. 9 respectively.

Figure 6. Network’s weight value structure (training steps: 1000)

From the above four figures, we can easily see that the weight value structure of the network reaches a comparatively stable state when the number of training steps is 5000 or 10000.


2) Network’s Clustering Result

By training and simulating with the four different numbers of training steps, we also obtain the clustering result shown in Table IV, where only 20 sample records are listed for demonstration.

TABLE IV.
NETWORK’S CLUSTERING RESULT TABLE

Training steps: 1000, 3000, 5000, 10000

1 2 3 4 5
11 12 13 14 15
8 20 8 20 8
15 20 20 8 15
16 19 13 19 13
13 7 19 13 16
11 18 7 19 7
7 11 13 11 13
6 7 8 9 10
16 17 18 19 20
15 20 8 8 20
8 20 8 8 15
7 19 13 7 19
19 13 19 16 7
13 19 7 18 19
18 11 13 19 7

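Observing Table IV, we find the following rules:
• When the training steps are 1000, all the samples are divided into 1 group.
• When the training steps are 3000, all the samples are divided into 2 groups.
• When the training steps are 5000, all the samples are divided into 3 groups.
• When the training steps are 10000, all the samples are divided into 3 groups.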

From Fig. 10, we can also see that the error surface of a single neuron has a unique minimum, so the structure of the improved SOFM NN above is comparatively stable. This means that the customer clustering is robust.


Figure 10. Single Neuron’s Error 

3) Customers’ Recognition and the Corresponding 

Marketing Strategies 

According to the four weight value structure figures of the network (Figs. 6-9) and the clustering result table (Table IV), we can reach a further conclusion: when the number of training steps is 5000 or more, the samples are steadily clustered into 3 groups. Observing these 3 groups and analyzing the customers’ purchasing behavior, we find that each group has its own special features, as illustrated in Table V, where 3 differentiated marketing strategies are suggested for the special features of the 3 groups. Recognizing customers’ features and adopting differentiated marketing strategies can help to reach a win-win result between customers and the business website, increase the loyalty of customers (especially VIPs), and maximize the profit of e-marketing.

TABLE V.
ANALYSIS RESULT OF E-CONSUMERS’ PURCHASING BEHAVIOR

Cluster NO 1: Customers (5.71%), Consumption (0.13%)
• Features of consumers’ purchasing behavior: occasional customers. Most of them are teenagers from different districts of the nation; their total amount of purchase is low, with low income and low shopping frequency. Most of them have a low-level educational status.
• Marketing strategy: these customers deserve the normal service, such as accumulating points for discounts and receiving book information by e-mail, but no free reading of e-books on the Internet.

Cluster NO 2: Customers (74.71%), Consumption (17.86%)
• Features of consumers’ purchasing behavior: main customers. Most of them are youths from different districts of the nation; their total amount of purchase is higher, with middle-level income and higher shopping frequency. They have a middle-level educational status.
• Marketing strategy: these customers deserve the middle-class service, such as ordering individualized book information by e-mail, accumulating points for higher discounts, reading some e-books free on the Internet once the amount of purchase accumulates to a certain point, receiving free e-cards or e-flowers on their birthday, and so on.

Cluster NO 3: Customers (19.58%), Consumption (82.01%)
• Features of consumers’ purchasing behavior: mostly VIPs. They are young women who often come from highly developed districts or remote places; their total amount of purchase is the highest, with high income or low income and the highest shopping frequency. Most of the VIPs have a middle-level or high-level educational status.
• Marketing strategy: these customers deserve the top service, such as VIP service with free private cyberspace and the fastest green passage, downloading or reading some e-books free on the Internet, receiving the latest book catalogue both on paper and by e-mail, the biggest free cards and the best flowers on their birthday, the highest discount, and so on.

Table V is consistent with the Pareto 80/20 Principle: about 20% of all customers are VIPs (Cluster NO 3, 19.58%), and they contribute about 80% of the consumption (82.01%). The table also shows some interesting phenomena: VIPs are not necessarily customers with high income; most VIPs are young women rather than men; and VIP customers come not only from developed regions but also from less developed regions.
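The shares reported in Table V can be checked with a few lines of arithmetic. The sketch below only re-uses the percentages from the table; the ratio of consumption share to customer share is an illustrative measure introduced here, not one used in the paper.

```python
# (customer share %, consumption share %) for each cluster, taken from Table V.
clusters_table_v = {1: (5.71, 0.13), 2: (74.71, 17.86), 3: (19.58, 82.01)}

total_customers = sum(c for c, _ in clusters_table_v.values())
total_consumption = sum(s for _, s in clusters_table_v.values())

# Illustrative measure: consumption share divided by customer share.
# A value above 1 means the cluster spends more than its head count suggests.
intensity = {k: s / c for k, (c, s) in clusters_table_v.items()}
```

Both totals come to 100%, and only Cluster NO 3 has an intensity above 1 (about 4.2), which matches the reading that roughly one fifth of the customers account for roughly four fifths of the consumption.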

IV. CONCLUSION 

The famous economist Christopher pointed out that, in today’s unpredictable business competition, the market is no longer on the sellers’ side but on the buyers’ side [14]: “the customer is god”. Therefore, analyzing consumers’ purchasing behavior on the Internet accurately, and making scientific Internet marketing strategies for sales promotion accordingly, are key factors in assuring the profit of an E-business website. As for how to analyze e-consumers’ purchasing behavior, this paper proposes and compares three kinds of research models and points out that the data-driving model is the best one for this purpose. The SOFM NN is a typical data-driving model, so this paper improves the traditional SOFM NN and takes the improved one as a tool to analyze e-consumers’ purchasing behavior. Because the analysis of e-consumers’ purchasing behavior based on the SOFM neural network is a comparatively novel method, the results of this paper are for reference only.


ACKNOWLEDGMENT

The author thanks the anonymous reviewers for their valuable remarks and comments. This work is supported by the 2010 National Social Science Fund of China (Grant No. 10BGL028), the National Natural Science Fund of China (Grant No. 70861002), the China Postdoctoral Science Fund (Grant No. 200902535), the 2010 Science and Technology Project of the Education Department of Jiangxi Province (Grant No. GJJ10430), and the 2010 Social Science Planning Project of Jiangxi Province (Grant No. 10GL35).

REFERENCES 

[1] H.Rubost, “Consumer Behavior of Online Procurement 

and Book Supply Chain,” Service Operations, Logistics, 

and Informatics. May 2005. pp 49-66. 

[2] J.P.Peter, and J.C. Olson, “Consumer Behavior and 

Marketing Strategy,” McGraw-Hill Press, 2009. 

[3] Blanca Hernández, Julio Jiménez, and M. José, “Customer 

behavior in electronic commerce: The moderating effect of 

e-purchasing,” Journal of Business Research, Volume 63, 

Issues 9-10, September-October 2010, pp. 964-971. 

[4] H. J. Chang, L. P. Hung and C. L. Ho, “An anticipation 

model of potential customers’ purchasing behavior based 

on clustering analysis and association rules analysis,” 

Expert Systems with Applications, Vol.32, Issue 3, April 

2007, pp. 753-764. 

[5] P.W Engel, “A View Coming from Database Management 

of Consumer’s Behavior,” New York: Dryden Press, 2008 

[6] I. C. Yeh, C. H. Lien, T. M. Ting, Y. Y Wang and C. M. 

Tu, “Cosmetics purchasing behavior–An analysis using 

association reasoning neural networks,” Expert Systems 

with Applications, Vol.37, Issue 10, October 2010, pp. 

7219-7226. 

[7] Simon Haykin, Neural Networks: A Comprehensive Foundation, World publishing house, February 2004.

[8] D. J. Willshaw, “How Patterned Neural Connections Can Be Set Up by Self-organization,” Proc. Roy. Soc. London B, 1976, 194: 431-445.

[9] T. Kohonen, “Self-organized Formation of Topologically Correct Feature Maps,” Biological Cybernetics, 1982, 43(1): 59-69.

[10] FeiSi Science Research Center, Neural Network theory and 

Realization in Matlab 7, Beijing: Publishing House of 

Industry Electronics, May 2005, pp. 165-178. 

[11] Z. H. Yang and Y. Yan, “Research and Development of 

Self-organizing Maps Algorithm,” Computer Engineering, 

2006, 32 (16), pp. 201-228. 

[12] Mao Guojun, et al. Principle and Algorithm of Data 

Mining, Beijing: Tsinghua University Press, 2008. 

[13] Huang Lijuan. Yu Guoping. “Research on the Design for 

the National Unified E-marketing Platform of Chinese 

Book Supply Chain,” UESTC Press, 2006. 

[14] M.Christopher, “Logistics and Supply Chain 

Management,” London: Pitman Publishing House, 1992. 

Lijuan Huang was born in Jiangxi Province, China, in February 1971. She received her Ph.D. in Management Science and Engineering from Nanchang University. Her research interests include e-commerce and logistics and supply chain management.

She is a postdoctoral researcher at Jiangxi University of Finance and Economics.


Repair Method of Complex Network Based on 

Matthew Effect 

Minsheng Tan 

School of Computer Science and Technology, University of South China, Hengyang Hunan, 421001,China 

Email:tanminsheng65@163.com 

Qiang Cui, Lingfeng Zhu and Hui Zhao 

School of Computer Science and Technology, University of South China, Hengyang Hunan, 421001, China 

Email:{kiteblue@126.com, 407999562@qq.com, zhaohui.1006@yahoo.com.cn } 

Abstract: Repairing a complex network after a deliberate attack has become extraordinarily important. In this paper, a repair method for complex networks based on the Matthew Effect is proposed. A single-node selective attack algorithm and a multi-node cluster attack algorithm are given. For these two kinds of attack, a linear detection algorithm and a BA network generation algorithm are put forward to obtain the experimental data, and the corresponding repair experiments are carried out. Experimental results show that the repair rate of the method is more than 95% on both the sampling Internet and the BA network. For the repair rate of complex networks, the concept of stability and its mathematical description are introduced. Experiments show that a complex network can reach a steady topology state after some steps of attack and repair.

Index Terms—Complex Network, Repair Method, Power-law, 

Matthew Effect, Stability 


I. INTRODUCTION

Complex network repair has become a frontier topic in this field in recent years. Research on it is still scarce domestically and in its infancy abroad. Complex network repair has no uniform definition, and connectivity is the only measure used to judge whether a repair method is good or bad. Unified considerations are lacking, such as the cost of restoration, the stability of the network and its resistance to attack after repair [1-2].

Repair methods and attack methods are inseparable. Studying the attack efficiency, damage degree and attack principle of different attack strategies helps to find fast and efficient repair strategies. By repeatedly attacking and repairing the network, we can observe the changes in network topology, and the attack resistance and ease of repair of different types of network topology. Currently, the measure of repair quality reflects only the connectivity of the network topology, not the performance of run-time communication services, which is precisely one of the greatest concerns of users and managers [3-5]. Research on repair strategies for complex networks has important theoretical significance and application value. This paper proposes a new repair method for complex networks, based on the Matthew Effect and aimed at power-law networks.

Manuscript received Feb. 25, 2011; revised Mar. 5, 2011; accepted Apr. 2, 2011.
Project number: 60572137, 10JJ9025, 2009GK3036, 10C1185.
doi:10.4304/jnw.6.12.1719-1725

II. MATTHEW EFFECT 

The Matthew Effect is the phenomenon in which the good get better and the bad get worse, the many get more and the few get less; its name comes from a parable in the Gospel of Matthew in the Bible [6]. In 1968, the American history-of-science researcher Robert K. Merton proposed the term to summarize a social psychological phenomenon. Merton interpreted the “Matthew Effect” as follows: any individual, group or region that achieves success and progress in one respect (such as money, fame or status) gains a cumulative advantage and has more opportunities to achieve greater success and progress [7-9].

The growth of the real Internet is a true portrayal of the Matthew Effect in practice: when a new node is added to the network, it tends to connect to existing nodes with larger degrees [10-11].

The BA network model reflects the Matthew Effect well. Its generation process takes the following two characteristics into account [12]:

• Growth: the network keeps growing larger.
• Preferential attachment: new nodes tend to connect to “big” nodes, that is, nodes with a high degree.

Figure 1 shows the evolution process of BA network 

when m = m0 = 2. 

Figure 1 Formation process of BA network
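The two characteristics above, growth and preferential attachment, can be illustrated with a short Python sketch of a BA-style generator. This is a sketch under assumptions, not the generation code of this paper: the seed graph is a star of m nodes around one hub, nodes are numbered from 0, and preferential attachment is realized by drawing from a list in which each node appears once per incident edge. The function name `generate_ba_network` is hypothetical.

```python
import random


def generate_ba_network(n_nodes, m, seed=0):
    """Return a set of undirected edges (u, v) with u < v for a BA-style graph."""
    rng = random.Random(seed)
    edges = set()
    stubs = []  # each node appears once per incident edge (degree-weighted pool)
    # Seed graph (assumption): nodes 0..m-1 are all linked to node m.
    for i in range(m):
        edges.add((i, m))
        stubs += [i, m]
    # Growth: add one node at a time.
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            # Preferential attachment: high-degree nodes occur more often in stubs.
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.add((t, new))  # t < new always holds here
            stubs += [t, new]
    return edges


edges = generate_ba_network(100, 2)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

With m = 2 and 100 nodes, the graph has 2 seed edges plus 2 edges for each of the 97 added nodes.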


III. REPAIR METHOD OF COMPLEX NETWORK 

BASED ON MATTHEW EFFECT 

This paper studies how the Matthew Effect can be applied to the repair of scale-free complex networks. Two kinds of sustained attack are considered: the single-node selective attack (for example, deleting the network node with the maximum degree) and the multi-node cluster attack (for example, attacking 30% of the moderate-degree nodes at once). When one or more nodes are deleted preferentially at ratio r, reconnection takes place: the nodes that lost neighbors reconnect to other nodes to replace the lost ones, and the attacked node is reconnected to the network as a new node. Under a sustained linear preferential attack, this compensation dynamics leads to a power-law degree distribution with an index truncation (exponential cutoff) that depends on the rate of preferential deletion. Thus, even when the node with the maximum degree is attacked, the compensation protocol still protects the exponent of the power-law distribution. Even at a high rate of preferential attack, or when high-degree nodes are attacked, the network keeps a large connected component as long as each new node connects randomly to the network with m ≥ 2, so loss of connectivity is no longer the result of this sustained attack. The repair methods considered here change over time and are described as follows:

A. Repair Algorithm under Single-node Selective Attack 

For a given network topology, nodes are selected for attack according to their degree, which is counted first. Because attacks and repairs change the degrees of the network nodes, the degrees must be recounted in real time, so this repair changes over time. The steps of the recovery algorithm are as follows:

(1) Count the degree of every node in the network and store the result in d;
(2) Using the degrees from step (1), attack one node in the network (the network nodes are numbered from 1; if a(i, 1) == M(k) or a(i, 2) == M(k) is true for node i, the edge (i, j) connected to this node is deleted);
(3) Recount the degrees of the nodes in the network;
(4) Find the maximum degree in the network and the node that has it: p ← max(d(i, 1));
(5) Remove the node M;
(6) Recount the node degrees, the maximum degree and the nodes that have the maximum degree in the network;
(7) Generate the largest connected sub-graph f, recount the degrees of the nodes in the network and count the nodes in f;
(8) Repeat steps (1)-(7); the algorithm ends when the number of network nodes and the average degree become balanced;
(9) Calculate the repair rate.

Input: a connected network with N nodes and a certain number of edges.
Output: the connected network of N nodes and the repair rate s(r).

Here d denotes the matrix storing the node degrees, n the number of network nodes, m the number of edges in the network, (i, j) an edge of the network, a the network before each attack or repair, M(k) the node with degree k, where M ∈ (1 ... n), p the maximum node degree, and f the largest connected sub-graph of the repaired network.

In the attack and repair processes of both algorithms, the Matthew Effect is used for linear preferential removal and repair in the network. After each round, the degrees of the network nodes are recounted to ensure that nodes are attacked and repaired by linear preference. The time complexity is $O(n^3)$.
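One round of the single-node selective attack and repair can be sketched in Python as follows. The sketch makes several assumptions that the text leaves open: the network is stored as an adjacency dictionary, the attacked node is reattached to m nodes drawn in proportion to their current degree (falling back to a uniform choice when all remaining degrees are zero), and the repair rate is taken as the size of the largest connected sub-graph divided by the original node count. All function names are hypothetical.

```python
import random
from collections import deque


def largest_component_size(adj):
    """Size of the largest connected sub-graph, found by breadth-first search."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def attack_and_repair_once(adj, m=1, rng=None):
    """Cut all edges of the max-degree node, then reattach it to m nodes."""
    rng = rng or random.Random(0)
    victim = max(adj, key=lambda u: len(adj[u]))
    for v in list(adj[victim]):
        adj[v].discard(victim)
    adj[victim].clear()
    # Degree-weighted pool: each node appears once per remaining edge.
    pool = [u for u in adj if u != victim for _ in range(len(adj[u]))]
    candidates = pool or [u for u in adj if u != victim]
    for _ in range(m):  # a repeated target simply leaves one edge
        target = rng.choice(candidates)
        adj[victim].add(target)
        adj[target].add(victim)
    return victim


def repair_rate(adj, n_before):
    """Nodes in the largest connected sub-graph divided by the original count."""
    return largest_component_size(adj) / n_before


# Hypothetical star network: node 0 is the hub of four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
victim = attack_and_repair_once(star, m=1)
rate = repair_rate(star, 5)
```

In the star example the hub is attacked, the leaves become isolated, and the hub rejoins one leaf, so the largest connected sub-graph holds 2 of the 5 nodes.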

B. Repair Algorithm under Multi-node Cluster Attack 

According to their degree, network nodes are divided into central nodes, sub-central nodes, intermediate-degree nodes and small-degree nodes, and each attack hits a batch of nodes of one of these kinds. The specific algorithm is as follows:

(1) A new node is added to the network and linked by linear preferential attachment to m nodes of the network;
(2) According to their degree, the nodes are classified as central nodes, sub-central nodes, intermediate-degree nodes and small-degree nodes; n nodes w1, w2, w3, ..., wn are then selected from these classes at different rates r and deleted, in the following steps:
• Remove the nodes w1, w2, w3, ..., wn and all their edges; then each of w1, w2, w3, ..., wn is connected, as a new node, to m nodes of the network;
• Each node that was connected to w1, w2, w3, ..., wn has lost an edge, and a random edge is added to compensate;
(3) Repeat steps (1) and (2) until the numbers of network nodes and edges become balanced.

Input: a connected network of N nodes with a certain number of edges.
Output: the connected network of N nodes and the repair rate s(r).

In the repair method based on the Matthew Effect, a node is first added by linear preferential attachment to m nodes of the network. This ensures that most of the connection information is kept in the new nodes, which attach linearly and preferentially to network nodes so that the power law of the network is preserved. The attack and repair process is similar to a natural growth process, so the topology of the whole network changes strikingly after attack and reattachment.
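Step (2) of the cluster attack and its compensation can be sketched in Python as follows. The sketch assumes an adjacency dictionary, leaves the choice of victims (the degree class and the rate r) to the caller, adds one uniformly random compensating edge per lost edge, and reattaches each victim to m surviving nodes chosen uniformly rather than preferentially. The function name `cluster_attack_and_repair` is hypothetical.

```python
import random


def cluster_attack_and_repair(adj, victims, m=1, seed=0):
    """Delete a batch of nodes, compensate lost edges, and reattach the batch."""
    rng = random.Random(seed)
    victims = set(victims)
    survivors = [u for u in adj if u not in victims]
    lost = []  # one entry per edge that a surviving node has lost
    for w in victims:
        for v in adj[w]:
            if v not in victims:
                adj[v].discard(w)
                lost.append(v)
        adj[w] = set()
    # Compensation: every lost edge is replaced by a random new edge.
    for v in lost:
        choices = [u for u in survivors if u != v and u not in adj[v]]
        if choices:
            u = rng.choice(choices)
            adj[v].add(u)
            adj[u].add(v)
    # The attacked nodes rejoin the network as new nodes with m links each
    # (assumption: uniform choice among survivors).
    for w in victims:
        for u in rng.sample(survivors, min(m, len(survivors))):
            adj[w].add(u)
            adj[u].add(w)
    return adj


# Hypothetical ring of six nodes; nodes 0 and 3 are attacked together.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
cluster_attack_and_repair(ring, victims=[0, 3], m=1)
```

The node count is unchanged by construction, which mirrors the input and output description of the algorithm: the same N nodes, with edges rewired.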

The repair rate in the process is $s(r) = \frac{N_r}{N}$, where $N_r$ is the number of nodes in the network after repair and $N$ is the number of nodes before.

IV. TWO KEY ALGORITHMS GETTING 

EXPERIMENT DATA 

Because Internet topology strongly influences the Internet’s resistance to destruction, the NSF (National Science Foundation of America) funds the National Laboratory for Applied Network Research to measure and analyze Internet topology. The original measurement results include the AS-level Internet topology, which truly reflects the state of Internet connections. Considering both the authenticity of the network simulation and the constraints of the experimental hardware, the experiments that validate the repair methods in this paper are run on the MATLAB simulation platform. To make the simulation closer to the real Internet model, we use statistics measured from the real network. The experimental data are obtained as follows:

A. Algorithm of Sampling from Actual Network

    b = zeros(37447, 3);            % columns 1-2: edge endpoints, column 3: sampled flag
    for n = 1:37447
        b(n,1) = data(n,1);
        b(n,2) = data(n,2);
    end
    a = zeros(2400, 2);             % sampled edge list
    k = 1;
    a(1,1) = b(4513,1);             % start the detection from edge 4513
    a(1,2) = b(4513,2);
    for m = 1:50
        for i = m+1:37447
            for j = 1:2
                % take an unsampled edge that shares an endpoint with the last sampled edge
                if (b(i,j)==a(k,1) && b(i,3)==0) || (b(i,j)==a(k,2) && b(i,3)==0)
                    k = k + 1;
                    a(k,1) = b(i,1);
                    a(k,2) = b(i,2);
                    b(i,3) = 1;     % mark the edge as sampled
                end
            end
        end
    end

Input: a network with tens of thousands of nodes.
Output: a network with about one thousand nodes.

First, a matrix is imported from the measured data, and linear detection starting from one node yields a connected sub-graph, which is stored in another matrix. The node numbers of this sub-graph are not continuous, so its nodes must be renumbered from 1 to make the numbering continuous.

The actual data come from http://moat.nlanr.net/routing/rawda-ta (the network has 37,448 edges and 26589 nodes in total, of which 13010 nodes have degree zero; the maximum degree is 2637 and the average degree is 5.515576). The network is sampled by the detection method (after sampling, there are 2358 edges and 1028 nodes; the maximum degree is 191 and the average degree is 4.587549). Figure 2 shows that the sample obeys the degree distribution of the real network: nodes with degree between 1 and 5 account for about 80% of the network nodes, and nodes with large degree are few.
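The renumbering step and the reported average degrees can be illustrated with a short Python sketch. The function names and the three-edge example are hypothetical. The average degree is computed as twice the number of edges divided by the number of nodes; the figure of 5.515576 for the full data set is reproduced only if the 13010 zero-degree nodes are excluded from the node count, which is an inference from the numbers rather than something the text states.

```python
def renumber_edges(edges):
    """Map arbitrary node labels to 1, 2, 3, ... in order of first appearance."""
    ids = {}
    renumbered = []
    for u, v in edges:
        for node in (u, v):
            if node not in ids:
                ids[node] = len(ids) + 1
        renumbered.append((ids[u], ids[v]))
    return renumbered, ids


def average_degree(n_edges, n_nodes):
    """Every edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


# Hypothetical sampled edges with non-continuous node numbers.
sampled = [(4513, 7), (7, 90), (90, 4513)]
renumbered, ids = renumber_edges(sampled)

avg_sampled = average_degree(2358, 1028)       # sampled network
avg_full = average_degree(37448, 26589 - 13010)  # full data, zero-degree nodes excluded
```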

B. Generation of the BA Scale-free Network

The steps are as follows.

I) Initializing the network

    nodes ← zeros(N)
    cii ← zeros(1, N)
    t ← zeros(1, N)
    for i ← 1:m
        nodes(i, m+1) ← 1
        nodes(m+1, i) ← 1
        list(i) ← i
    end
    for i ← m+1:2*m
        list(i) ← m+1
    end

II) Adding nodes and edges to the network; at each step t, 2m entries are added to the auxiliary vector list.

    for n ← m+2:N
        t ← 2*m*(n-m-1)
        for i ← 1:m
            list(t+i) ← n
        end
        k ← 1
        while k0&p(k)


Figure 2 Internet degree distribution sampling (horizontal axis: degree, 0 to 200; vertical axis: 0 to 0.5)

Figure 3 BA network degree distribution

Figure 4 Sampling Internet degree and BA network degree distribution in Logarithmic Coordinates
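A degree distribution that follows a power law appears as a straight line in logarithmic coordinates, and the slope of that line gives the exponent. The sketch below estimates the exponent with an ordinary least-squares fit on the log-log points. The paper does not say how its power-law values k were obtained, so this fitting method, the function name `fit_power_law_exponent` and the synthetic counts are assumptions for illustration only.

```python
import math


def fit_power_law_exponent(degree_counts):
    """Estimate gamma in count(k) ~ k**(-gamma) by a log-log least-squares fit.

    degree_counts maps a degree k to the number of nodes with that degree.
    """
    points = [
        (math.log(k), math.log(c))
        for k, c in degree_counts.items()
        if k > 0 and c > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    return -slope


# Synthetic counts that follow an exact power law with exponent 3.
synthetic = {k: 1000.0 * k ** -3.0 for k in range(1, 20)}
gamma = fit_power_law_exponent(synthetic)
```

On measured data the tail of the distribution is noisy, so a plain least-squares fit of this kind should be read as a rough estimate.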

V. EXPERIMENT PROCESS AND RESULT 

ANALYSIS 

To validate the effectiveness of the repair methods in this paper, the experiments are run on the MATLAB simulation platform. To make the simulation closer to the real Internet model, we use statistics measured from the real network.

A. Repair Process under Single-node Selective Attack 

According to the algorithm of the previous section, the single-node selective attack and repair on the sampling Internet and the BA network proceed as follows:

(1) A new node is added to the sampling Internet and to the BA network respectively, and connected to the m (where m = 3) nodes with the maximum degree in each network;
(2) From the sampling Internet, select nodes at ratio r = 0.0125, 0.03, 0.2, 0.33, 0.5, 1: node 40 with degree 3, node 1019 with degree 6, node 632 with degree 14, node 1015 with degree 21, node 599 with degree 25 and node 457 with degree 191; each time, attack one of them and remove all edges connected to it. From the BA network, select nodes at ratio r = 0.0047, 0.04, 0.17, 0.33, 1, 1: node 114 with degree 3, node 85 with degree 7, node 194 with degree 10, node 127 with degree 15, node 13 with degree 24 and node 3 with degree 69; each time, attack one of them and remove all edges connected to it;
(3) The attacked nodes reconnect preferentially to m (m = 1) nodes, while each of them loses one edge;
(4) Repeat steps (1), (2) and (3) until the number of nodes in the network stays at 1027, the average degree of the sampling Internet stays at about 4.5 and the average degree of the BA network stays at about 3.8;
(5) At this point, calculate the connectivity rate and the power law of both networks and the sharp index truncation of the sampling Internet. The connectivity rate s(r) of the sampling Internet and of the BA network equals the total number of nodes in the network after the repair divided by 1028. The results are shown in Table I, Table II and Table III.

Table I RATE OF CONNECTIVITY OF THE SAMPLING NETWORK s (r) AND THE POWER-LAW k 

r 0.0125 0.03 0.2 0.33 0.5 1 

s(r) 1.0 0.999027 0.998054 0.997082 0.990272 0.955253 

dmax 192 192 192 192 192 122 

dave 4.585 4.578 4.555 4.520 4.525 4.354 

k 2.372 2.384 2.467 2.352 2.664 2.572 

Table II RATE OF CONNECTIVITY OF THE BA NETWORK s (r) AND THE POWER-LAW k 

r 0.0047 0.04 0.17 0.33 1 1 

s(r) 0.995136 0.995136 0.995136 0.996109 0.995136 0.995236 

dmax 55 55 55 55 54 55 

dave 3.858 3.848 3.836 3.81 3.767 3.767 

k 3 3 3 3 3 3 

Table III INDEX SHARP CUT OF THE SAMPLING INTERNET 

r 0.0 0.01 0.03 0.05 0.07 

Kc(r) 27 20 14 12 8 



B. Repair Process under Multi-node Cluster Attack 

According to the algorithm of the previous section, the multi-node cluster attack and repair on the sampling Internet and the BA network proceed as follows:

(1) A new node is added to the sampling Internet and to the BA network respectively, and connected to the m (where m = 3) nodes with the maximum degree;
(2) In the sampled Internet subnet, the following nodes are attacked: all central nodes (1% of the total number of nodes); 10% and 50% of the sub-central nodes (3% of the total); 10% and 50% of the intermediate-degree nodes; and 10% and 50% of the small-degree nodes (60% of the total). The same proportions of nodes are also tested in the BA model;
(3) The attacked nodes reconnect preferentially to m (m = 1) nodes in the network, while these m nodes lose edges;
(4) Repeat steps (1), (2) and (3) until the number of nodes in the network stays at 1027, the average degree of the sampling Internet stays at about 4.5 and the average degree of the BA network stays at about 3.8;
(5) At this point, calculate the connectivity rate and the power law of both networks and the sharp index truncation of the sampling Internet. The results are shown in Table IV, Table V and Table VI.

C. Analysis of Experimental Results 

From Table I and Table IV, the repair method achieves a very high repair rate on the sampling real Internet: even if the nodes with the maximum degree are attacked, or nodes are subjected to cluster attack, a simple repair keeps the network connectivity rate above 95%, and above 99% when nodes of ordinary degree are attacked. From Table II and Table V, the repair rate of the BA network is more than 99% for every r. Table I, Table II, Table IV and Table V again show that using the Matthew Effect to construct a BA network generates a topology very close to the real Internet. However, besides following the Matthew Effect, the Internet is also affected by other factors during its construction, and studying these factors would benefit research on the real Internet. Table I and Table II also show that the average degree of the network decreases as the ratio r increases, which indicates that the repair method tends to remove redundant edges from the network.

From Table I, Table II, Table IV and Table V, with this method a power-law network maintains both its power-law distribution and a high repair rate, and the original topological properties are also well maintained. Table III and Table VI show that the sharp index truncation appears in the sampling Internet at different selection ratios, which again illustrates that the real network, in addition to following the Matthew Effect, is affected by other factors during its construction.

Table IV RATE OF CONNECTIVITY OF THE SAMPLING NETWORK s(r) AND THE POWER-LAW k AFTER REPAIR

Node class: Central node, Next central node, Intermediate value node, Small scale value node

Node ratio r   50%       10%       50%       10%       50%       10%       50%
s(r)           1.0       0.999027  0.988054  0.997082  0.980272  0.9955    0.9936
dmax           167       178       192       192       192       192       192
dave           4.585     4.578     4.555     4.520     4.525     4.354     4.237
k              2.37      2.42      2.54      2.28      2.46      2.41      2.39

Table V RATE OF CONNECTIVITY OF THE BA NETWORK s(r) AND THE POWER-LAW k AFTER REPAIR

Node class: Central node, Next central node, Intermediate value node, Small scale value node

s(r)           0.998     0.999027  0.998054  0.997082  0.990272  0.9975    0.99
dmax           173       189       192       192       192       192       0.99
dave           4.585     4.578     4.555     4.520     4.525     4.354     4.37
k              3         3         3         3         3         3         3

Table VI INDEX SHARP CUT OF THE SAMPLING INTERNET AFTER REPAIR

Node class: Central node, Next central node, Intermediate value node, Small scale value node

k              14        9         12        21        23        19        17

VI. THE STABILITY OF COMPLEX NETWORK IN THE REPAIR PROCESS

Using the complex network repair algorithm based on the Matthew Effect, this paper studied and compared random networks, scale-free networks and small-world networks, and found that all of them can evolve into a state of equilibrium. We therefore introduce the stability S(t) to describe the extent to which the system is restored after repair and how easily the network can be repaired.

Current international and domestic studies on the destruction of complex networks are still limited to robustness, that is, the capacity of a complex network to withstand external damage. This is the first study on the characteristics of complex networks under both destruction and repair.

In general, the maximal connected sub-graph of the network tends to a stable size under a process of constant attack and repair, and the network is then said to have reached a steady state.

Consider N(t), the size of the sub-graph of the network, which changes over time and takes the values N(t0), N(t1), N(t2), N(t3), …, N(tn). The stability S(t) is then defined as:

$$S(t) = \frac{N(t_0)}{N(t_n)}, \quad n \to \infty \qquad (1)$$

That is, the stability S(t) is the ratio of the size of the network to the size of the largest connected sub-graph of the final network under constant attack and repair. Here N(tn) denotes the size of the largest connected sub-graph of the network and N(t0) denotes the size of the network. As the definition shows, S(t) is a step-wise increasing function with initial value 1, so its value is always greater than or equal to 1.
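As an illustration of definition (1), the following Python sketch computes the stability from a recorded sequence of sizes. The function names and the sample values are hypothetical, and the limit n→∞ is approximated by the last recorded step, which is an assumption of this sketch rather than part of the original experiment.

```python
def stability(sizes):
    """Approximate S(t) = N(t0) / N(tn) from a recorded size sequence.

    sizes[0] is taken as N(t0), the size of the network, and sizes[-1]
    as N(tn), the size of the largest connected sub-graph at the last
    recorded attack-and-repair step (standing in for n -> infinity).
    """
    if not sizes:
        raise ValueError("sizes must not be empty")
    if sizes[-1] <= 0:
        raise ValueError("final component size must be positive")
    return sizes[0] / sizes[-1]


def stability_series(sizes):
    """Return the ratio N(t0) / N(tk) for every recorded step k."""
    return [sizes[0] / n for n in sizes]
```

With a non-increasing size sequence, the series starts at 1 and never decreases, which matches the step-wise increasing behaviour described in the text.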

S(t) reflects, to some extent, the stability of the network topology. The greater S(t) is, the more easily the system reaches a steady state after repair; relative to the topology at other times, the topology at this time is easier to repair.

Figure 5 shows the evolution of the stability with the time step t for the sampled Internet of size N = 1028, with connection probability and repair probability Pr = PC = 0.02.

We find that S(t) grows very fast at the beginning; as the evolution proceeds, its growth gradually slows and it eventually reaches a balance value Sb. The gradual increase of S(t) means that the system becomes more balanced through a series of attacks and repairs and more easily achieves good restoration results.

One point is worth noting here: the stability S(t) finally reaches a balance value Sb. In equilibrium, the value of S(t) is the largest attained by the system. This implies that the system reaches a vulnerable state after thousands of evolution steps.

Figure 5. Stability S(t) versus time t, where t is the number of repairs and S(t) is the stability

VII. CONCLUSION 

Single-node selective attacks and multi-node cluster attacks are the most difficult attacks to deal with in complex networks. For these two attacks, this paper proposed a repair method for complex networks based on the Matthew Effect. Experimental results show that the repair rate of the proposed method under attack can reach 95% or more on both the sampled Internet and the BA network. Applying the idea of building the BA network to the repair of power-law networks not only yields a high repair rate but also optimizes the network topology. To quantify the level of complex network repair, we also proposed the concept of stability and described it mathematically. Experimental results show that a complex network can gradually evolve into a relatively stable state after several steps of attack and repair. In this state, the complex network is easily repaired.

The Matthew Effect increases the efficiency of information exchange in a network, but it also brings problems for network security. If network nodes of large degree value are attacked, the probability increases that some nodes in the network cannot connect with others. We will consider this issue in future research.

ACKNOWLEDGEMENT

This work was supported by Project 60572137 of the National Science Foundation, Project 10JJ9025 of the Hunan Natural Science Foundation, Project 2009GK3036 of the Hunan Science and Technology Plan and Project 10C1185 of the Hunan Province of Science Research.

REFERENCES

[1] Wang Xiao-fang, Li Xiang, Chen Guan-rong. Complex network theory and application[M]. Beijing: Tsinghua University Press, 2006:11-14.

[2] Carreras B A, Newman D E, Dobson I, et al. Evidence for self organized criticality in electric power system blackouts[C]. Thirty-fourth Hawaii International Conference on System Sciences. Maui, Hawaii, 2001:705-709.

[3] Wu Jun, Tan Yue-jin. Complex network anti-destroying ability estimation research[J]. System Engineering Journal, 2005(2):128-131.

[4] Chen Zhen-yi, Wang Xiao-fang. Congestion and control in scale-free network[J]. System Engineering Journal, 2005,20(1):132-138.

[5] Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationship of the Internet topology[J]. ACM SIGCOMM Computer Communication Review, 1999,29(4):251-262.

[6] Albert R, Barabási A L. Statistical mechanics of complex network[J]. Review of Modern Physics, 2002,74(1):47-97.

[7] Barthelemy M, Amaral L A N. Small-world networks: Evidence for a crossover picture[J]. Phys. Rev. Lett., 1999,82:5180-5184.

[8] Erdos P, Renyi A. On the evolution of random graph[J]. Publ. Math. Inst. Hung. Acad. Sci., 1960,5:17-60.

[9] Watts D J, Strogatz S H. Collective dynamics of small-world networks[J]. Nature, 1998,393(6684):440-442.

[10] Holme P, Kim B J, Yoon C N, et al. Attack vulnerability of complex networks[J]. Phys. Rev. E, 2002,65(5):056109.

[11] Xiao Zhong-zhe, Dong Zai-Wang. Improved GIB synchronization method for OFDM system[J]. IEEE Telecommunications, 2003,2(8):1417-1421.

[12] Criado R, Flores J, Hernández-Bermejo B, et al. Effective measurement of network vulnerability under random and intentional attacks[J]. Journal of Mathematical Modelling and Algorithms, 2005,4(3):307-316.

[13] Che Hong-an, Gu Ji-fa. Scale-free network and its system scientific significance[J]. System Engineering Theory and Practice, 2004(4):11-16.


Minsheng Tan, Hunan province, China, 

Birthday: Sep, 1965, master tutor, 

graduated from Dept. Computer Science, 

Wuhan University. His research interests 

include computer network and 

information security. 

He is a professor of School of 

Computer Science and Technology, 

University of South China. 

Prof. Tan is the member of ACM, senior member of China 

Computer Society, director of Hunan Computer Society, 

executive director of Hunan Computer Committee of Higher 

Education Institute, executive director of Hunan Computer Users 

Association. 


Qiang Cui, Shandong province, China, 

Birthday: Nov, 1981, is master. He 

graduated from School of Computer 

Science and Technology, University of 

South China. And the main research 

interest is complex network. 

Lingfeng Zhu, Hunan province, China, 

Birthday: June, 1984, is working toward 

master in computer science of University 

of South China. And the main research 

interests include computer network and 


Hui Zhao, Henan province, China, Birthday: Oct, 1986, is working toward a master's degree in computer science at University of South China. The main research interest is trusted networks.


Study and Design an Anycast Routing Protocol for 

Wireless Sensor Networks 

Demin Gao 

Nanjing University of Science and Technology Department of Computer Science and Engineering, Nanjing, China 

Email:gdmnj@163.com 

Huanyan Qian, Zheng Wang, Jiguang Chen 

Nanjing University of Science and Technology Department of Computer Science and Engineering Nanjing, China 

Email:ninanan@tom.com, wangzheng@163.com, chenjiguang@163.com 

Abstract—In wireless sensor networks, there is usually a sink 

which gathers data from the battery-powered sensor nodes. 

As sensor nodes around the sink consume their energy faster 

than the other nodes, several sinks have to be deployed to 

increase the network lifetime. Anycast is a mechanism in which the source node sends its data to the nearest sink node. This paper studies and designs an anycast service for wireless sensor networks in which several sinks are deployed. A novel tree-based anycast approach is proposed to minimize the path cost. The nodes form a tree with a sink node as the root, and the height of the tree integrates multiple metrics to calculate the path cost based on diverse selection criteria. This paper discusses and analyzes the model in depth, and the experimental data support its validity and efficiency. Computer simulation shows that the proposed scheme effectively reduces and balances the energy consumption among the nodes, so it significantly extends the network lifetime compared to the existing schemes.

Key words: Wireless sensor networks; Anycast; Routing 

protocol 


I. INTRODUCTION

Wireless sensor networks have attracted much attention in recent years due to their promising techniques and wide-ranging applications. This kind of network consists of a large number of low-cost, low-power, small-size, multifunction sensor nodes which can sense and process data and communicate with other nodes over a short distance. In many applications of wireless sensor networks, a sink node and numerous tiny sensor nodes are deployed randomly in the monitoring area. As the scale of the wireless sensor network increases, nodes close to the sink consume their energy faster than nodes farther away. When all the nodes around the sink have exhausted their energy, the sink node can no longer receive any data from the sensors or stay connected to the network. When this happens, the whole network is considered to be down. In addition, sensor nodes are often deployed in remote or dangerous areas in which servicing a node may be impossible. A solution to these problems is to deploy several sinks and let each tiny sensor node send its data to the nearest sink node. If the traffic is balanced among the sinks, the network lifetime can be significantly increased, since the energy consumption will be almost equal for all the nodes in the network.

doi:10.4304/jnw.6.12.1726-1733

Internet Protocol Version 6 (IPv6) defines an addressing scheme called "Anycast address", which is an identifier for a set of interfaces [1, 2]. A data packet destined for an Anycast address is routed to the "nearest" interface. Routing protocols can be roughly classified into unicast, broadcast, multicast, and anycast [3]. Nowadays Anycast technology is widely studied in wireless networks, and Anycast communication becomes quite important in a network with multiple sinks. Anycast can be an important paradigm for a wireless sensor network in terms of resource, robustness and efficiency for replicated service applications. Assuming that the sources and the sinks are uniformly distributed in the network, having each source send its data packets to the "nearest" sink around the area in which the events happen reduces the number of hops each packet travels, which saves energy, reduces the cost of routing table maintenance, and extends the network lifetime. This simple strategy is assumed to balance the energy consumption. When a sensor node produces data, it has to send it to any available sink. A sink selection strategy is to choose a sink for each source arbitrarily.

This paper addresses the sink discovery and routing problem in sensor networks. Generic routing protocols designed for wireless ad hoc networks fail in sensor networks primarily because they are designed for more powerful nodes with higher transmission range and power than sensors. In addition, the packet structure, routing table sizes, implemented code size and many other states that are maintained cannot be ported directly to tiny sensors. This paper describes a protocol implementing the anycast service. It constructs an anycast tree that is rooted at the sink and contains many sensor nodes as leaves. The objective is to select a minimum-cost path for every sensor node. The paper is organized as follows. Section II presents a number of existing Anycast solutions. Section III specifies the network model and energy model used. Section IV presents our anycast protocol. Section V contains experimental results. Conclusions are presented in Section VI.

II. RELATED WORKS 

The concept of anycast has been studied in multiple contexts, including network type, communications model, and purpose of usage. For example, anycast has been studied in depth in TCP/IP networks, where it is used for directing DNS queries to the closest root name server [4]. It is also used for server selection in distributed systems [5]. Anycast gained more attention when it was used to access gateways that interconnect IPv6 with IPv4 networks. Though Anycast was originally designed for Internet services, it has been applied to routing protocol design for wireless ad hoc and sensor networks. In mobile networking, some Anycast routing protocols have been obtained by extending existing routing protocols to support Anycast service.

In [6, 7], the AODV protocol is used to support Anycast service. AODV is an on-demand reactive routing protocol designed for ad hoc networks. When there are packets to transmit, the source node initiates the process of route establishment. It is suitable for mobile nodes. In addition, tree-based Anycast routing protocols [8, 9, 10] extend an existing protocol to build an Anycast tree, usually measuring distance in the tree by hop count, physical distance, or time interval. A query is transported along the most fitting Anycast tree. Routing and sink discovery protocols designed for ad hoc networks are not well suited to sensor networks.

Low-Energy Adaptive Clustering Hierarchy (LEACH) [11] is one of the representative clustering schemes. In LEACH, sensors are organized into clusters, and one node in each cluster acts as the cluster head, taking responsibility for collecting data, aggregating data, and finally transmitting data to the distant sink. The lifetime of heterogeneous wireless sensor networks can be increased in networks with more than one data sink when access to the sinks is provided by an Anycast protocol [12]. Such a network consists of two types of devices: resource-rich devices (information sinks) and resource-constrained devices (sensors generating new data) [13]. A similar concept for improving the energy efficiency of WSNs has been proposed in the HAR [14] protocol. All the above anycast solutions differ from the one in this paper. In each of them, the set of attributes used as the anycast address is not a singleton.

Usually, a node sends data to the nearest sink rather than to a specific one, which is different from TCP/IP networks and ad hoc networks. Another type of anycast found in the WSN environment is anycasting to a region. Solutions such as SPEED [15] and HLR [16] assume a situation where it is sufficient to deliver a packet to any node in a specified area. Algorithms for region-targeted anycast rely on the strong spatial correlation of the attributes used for addressing, which is not the case in this paper.

Considering the characteristics of wireless sensor networks, and in order to improve the performance of Anycast routing, this paper puts forward an Anycast tree routing algorithm for wireless sensor networks. Some protocols are simplified to suit wireless sensor network applications. The algorithm establishes an Anycast tree for each sink node, and each sensor node joins the Anycast tree nearest to it. Applications require minimizing certain cost metric(s), such as energy consumption, to optimize performance. Thus, applications require multiple metrics for path cost calculation to guarantee performance. Based on the multiple-metric path cost specified by the application requirement, the path with the minimum cost value is selected as the best route. This algorithm can balance the network load, extend the lifetime of the whole network, and improve the performance of Anycast routing.

III. SYSTEM MODEL AND PATH SELECTION 

We first discuss the topology model, energy model and path selection metrics used in the proposed routing scheme.

A. Topology Model 

Consider a static wireless network modelled as an undirected graph G = (V, A), where V is the set of sensor nodes and sink nodes and A is the set of links. G consists of a finite nonempty vertex set V and an edge set A of pairs of distinct vertices of V. A graph is simple if it has no loops and no two of its links join the same pair of vertices. An acyclic graph is one that contains no cycles. A tree is a connected acyclic graph, and a leaf is a vertex of degree 1. A sink tree is a tree with a sink node as the root and sensor nodes as the leaves. Two nodes i and j are connected by a link if they can transmit a packet to each other with a transmission power less than the maximum transmission power at each node. Thus all links are assumed to be bi-directional. This assumption is not necessary for the convergence of the distributed algorithms; however, it makes the presentation clearer. The set of nodes connected to node i by links is denoted as N_i. We assume that the network graph is connected, i.e., there always exists a path between any pair of nodes i and j in V.
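The topology model can be sketched in Python as follows. This is a minimal illustration, assuming that two nodes are linked exactly when their Euclidean distance does not exceed a common maximum range; the names `neighbor_sets`, `is_connected`, and `max_range` are hypothetical and not taken from the paper.

```python
import math
from collections import deque


def neighbor_sets(positions, max_range):
    """Build the neighbor set N_i of every node.

    positions maps a node name to its (x, y) coordinates. A link is
    assumed (for this sketch) whenever the distance is within max_range,
    so every link is bi-directional.
    """
    nodes = list(positions)
    nbrs = {u: set() for u in nodes}
    for idx, u in enumerate(nodes):
        for v in nodes[idx + 1:]:
            if math.dist(positions[u], positions[v]) <= max_range:
                nbrs[u].add(v)
                nbrs[v].add(u)
    return nbrs


def is_connected(nbrs):
    """Check by breadth-first search that a path exists between all nodes."""
    if not nbrs:
        return True
    start = next(iter(nbrs))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nbrs)
```

The connectivity check corresponds to the assumption that a path exists between any pair of nodes in V.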

A wireless sensor network containing a number of sensor nodes and multiple sinks, distributed randomly in a given region, is considered. These sensor nodes transmit the information they have collected to a sink node. We make the following assumptions about the sensor nodes and the underlying network model:

• All sensor nodes start with the same initial energy. The sink nodes have no energy constraint.

• Every node is aware of its own location. A sensor node can compute the approximate distance to the source based on the received location information.

• The transmitting power of a sensor node is controllable, which means that it can be adjusted according to the transmitting distance.

• Sink and sensor nodes are static. All nodes are homogeneous and have the same capabilities. Each node is assigned a unique identifier (ID), except the sink nodes, which form an anycast group sharing one ID.

These hypotheses are reasonable given the development of wireless hardware technology and low-power computing technology.

B. Energy Model 

The power consumption of a sensor node consists of four parts: sensing and generating data, idling, receiving, and transmitting. The power $e_g$ for generating one bit of data is assumed to be the same for all nodes. The idle power consumed by a node, denoted by $e_s$, is assumed to be the same for all nodes and independent of traffic. For power consumption in receiving and transmitting, the first order radio model of [17-19] is adopted. Specifically, a node needs $\varepsilon_{elec} = 50\,\mathrm{nJ}$ for running the circuitry and $\varepsilon_{amp} = 100\,\mathrm{pJ/bit/m^2}$ for the transmitting amplifier. Therefore, the power consumption for receiving one bit of data is given by $e_r = \varepsilon_{elec}$. The power consumption for transmitting one bit of data to a neighbor node j is given by $e_{ij} = \varepsilon_{elec} + \varepsilon_{amp} \cdot d_{ij}^{n}$, where $d_{ij}$ is the distance between nodes i and j and n is the path loss exponent, which typically ranges between 2 and 4 for free-space and short-to-medium-range radio communication. Let $E_i$ denote the initial battery energy of node i and $w_i$ denote the power consumption for one bit of data:

$$w_i = e_s + e_g + e_r + e_{ij} \qquad (1)$$

where the first term is the idling power consumption, the second term is the power for sensing, the third term is the power consumption for receiving, and the last term is the power consumption for transmitting.
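A short Python sketch of the energy model may help make equation (1) concrete. The constants follow the values quoted above, with the circuitry cost interpreted per bit (an assumption of this sketch); the function names and the idle and sensing values used in the usage note are hypothetical.

```python
E_ELEC = 50e-9    # circuitry energy, joules (interpreted here as per bit)
E_AMP = 100e-12   # amplifier energy, joules per bit per square metre


def rx_energy_per_bit():
    """e_r: energy for receiving one bit."""
    return E_ELEC


def tx_energy_per_bit(distance, path_loss_exponent=2):
    """e_ij: energy for sending one bit over `distance` metres."""
    return E_ELEC + E_AMP * distance ** path_loss_exponent


def per_bit_cost(e_s, e_g, distance, path_loss_exponent=2):
    """w_i = e_s + e_g + e_r + e_ij, following equation (1)."""
    return (e_s + e_g + rx_energy_per_bit()
            + tx_energy_per_bit(distance, path_loss_exponent))
```

For example, `tx_energy_per_bit(10.0)` adds the 50 nJ circuitry cost to an amplifier cost of 100 pJ × 10², so the transmit term grows quickly with distance when the path loss exponent is large.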

C. Path Selection 

A simple linear combination of different routing metrics is used to determine the path cost, as shown in the following equation:

$$\varphi = \varphi' + \sum_{i \in V} \alpha_i \cdot metric_i \qquad (2)$$

where $\varphi'$ is the accumulated cost of the previous nodes along the path, $metric_i$ is a value scaled to (0, 1), and $\alpha_i$ is the weight factor (or coefficient) for $metric_i$ in calculating the cost. Based on the application requirement, these weight factors can be flexibly varied to change the importance of the cost metrics during route discovery. Our protocol adopts four path cost metrics: hop count, energy cost, data delay, and remaining energy. Therefore, the path cost equation becomes:

$$\varphi = \varphi' + \alpha_1 \cdot hop_i + \alpha_2 \cdot w_i + \alpha_3 \cdot delay + \alpha_4 \cdot E_i \qquad (3)$$

Here, $hop_i = 1$ is the hop count, the energy cost $w_i$ denotes the normalized energy cost for the link from the previous hop to the current node, the data delay denotes the time for transmitting the data from the node to the next one, and $E_i$ denotes the surplus energy. Different applications can define their requirements by choosing different sets of weight factors. For example, an application might only want to consider energy consumption, in which case (α1, α2, α3, α4) = (0, 1, 0, 0). In order to demonstrate how different requirements and path cost metrics guide route discovery and resource consumption, simulations with three different network deployments are conducted. This model is used as the strategy for choosing the next node.
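The weighted path cost can be illustrated with a small Python sketch. The function name `path_cost` and the sample metric values are hypothetical; the metrics are assumed to be already scaled to (0, 1) and ordered as hop count, energy cost, delay, and remaining energy.

```python
def path_cost(prev_cost, metrics, weights):
    """Accumulate the path cost: prev_cost + sum(alpha_k * metric_k).

    metrics and weights are equal-length sequences; in this sketch they
    are ordered as (hop, energy cost, delay, remaining energy).
    """
    if len(metrics) != len(weights):
        raise ValueError("metrics and weights must have the same length")
    return prev_cost + sum(a * m for a, m in zip(weights, metrics))


# Energy-only application: (alpha1, alpha2, alpha3, alpha4) = (0, 1, 0, 0).
ENERGY_ONLY = (0, 1, 0, 0)
```

With `ENERGY_ONLY`, only the second metric contributes, so `path_cost(0.5, (1, 0.3, 0.2, 0.8), ENERGY_ONLY)` adds just the normalized energy cost 0.3 to the accumulated cost 0.5.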

D. Problem Definition

The core of an anycast routing protocol for wireless sensor networks is to select the "nearest" sink as the destination. The problem of optimal sink selection can be formulated as follows. Consider n sources {s1, s2, …, sn} and a group of k sink nodes, where 1 ≤ k ≤ n. The problem is to assign the n sources to the k sink nodes so that the total path cost of the network is minimized. It can be formulated as a 0-1 integer programming problem:

$$\min \sum_{i=1}^{n} \sum_{j=1}^{k} \varphi_{ij} \lambda_{ij} \qquad (4)$$

subject to

$$\sum_{j=1}^{k} \lambda_{ij} = 1 \quad (1 \le i \le n) \qquad (5)$$

$$\lambda_{ij} \in \{0, 1\} \quad (1 \le i \le n,\ 1 \le j \le k) \qquad (6)$$

where $\varphi_{ij}$ is the path cost of the best route between node i and node j, and $\lambda_{ij}$ is a binary variable used for sink selection: if the best sink node chosen for node i is node j, then $\lambda_{ij} = 1$; otherwise $\lambda_{ij} = 0$. Constraint (5) states that node i transmits all its packets to exactly one sink.
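Because constraints (5) and (6) couple the variables of one source only, the program (4) to (6) separates by source: each source simply takes the sink with the smallest path cost. The Python sketch below illustrates this observation; the function name and the cost matrix are hypothetical, and the costs are assumed to be known in advance, which the distributed protocol does not require.

```python
def assign_sinks(cost):
    """Solve (4)-(6) for a given cost matrix.

    cost[i][j] is the path cost between source i and sink j. Returns the
    chosen sink index for every source and the total path cost. Ties are
    resolved in favour of the lowest sink index.
    """
    assignment = [min(range(len(row)), key=row.__getitem__) for row in cost]
    total = sum(row[j] for row, j in zip(cost, assignment))
    return assignment, total
```

For three sources and two sinks with costs `[[3, 1], [2, 5], [4, 4]]`, the sketch assigns the first source to sink 1 and the other two to sink 0.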

IV. ANYCAST ROUTING PROTOCOL

The proposed anycast routing scheme employs a tree-based approach to distribute the energy load evenly among the sensors in the network and thus minimize data transfer time. An objective of our protocol is to establish a connection between sensor nodes and sinks belonging to an anycast group based on multiple path selection metrics. The selected sink can then forward packets to the destination in the core network.

A. Packet Format 

Four types of control packets are designed for our protocol, as explained in this section. The Hello packet (HELLO) is a special type of packet generated only by the sink nodes and broadcast periodically to all sensor nodes, for the benefit of sensor nodes that do not have any valid route to any member of the anycast group in their routing tables. The traditional Route Request (RREQ), Route Reply (RREP), and Route Error (RERR) packets are stripped of fields unnecessary for a WSN, such as the reserved fields, flags for multicast, prefix field, and lifetime field. In addition, a small HELLO packet is added for sink advertisement. A Hello message is transmitted periodically to advertise the presence of a sink. The transmission range of a mobile platform will cover all sensor nodes more than one hop away. Thus, there is no need for sensor nodes to retransmit the Hello message. Nodes receive the Hello packet and cache the information.

The Route Request packet (RREQ) is generated to initiate route discovery. The RREQ differs from the corresponding packet in TCP/IP-based protocols such as AODV. The major differences are as follows: instead of a unicast address, the packet carries the anycast group ID as the destination address, and two more fields are added for adapting to application requirements and using multiple metrics as path cost. In our protocol, the RREQ comes in two forms, CRQ (Child Request) and PRQ (Parent Request). CRQ is used to discover a child node and PRQ to discover a parent node.

The data packet format in our protocol is defined as follows:

(Type, Anycast group ID, Path cost φ, Next node's ID, Node's address)

If Type = 1, the packet is a CRQ. If Type = 2, the packet is a PRQ. If Anycast group ID = 0, the packet comes from a sink; otherwise it comes from a sensor node. If the Next node's ID is empty, the packet comes from a sensor node that has not yet discovered a route to a sink. A node does not need to remember its child nodes, because it does not transmit messages to them. The sink transmission range can cover all sensor nodes. Node's address denotes the address of the node, such as the node ID and position.
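The five-field packet format described above can be modelled with a small Python data class. This is only a sketch: the class and field names are hypothetical, the Node's address is assumed to consist of a node ID plus a two-dimensional position, and an empty Next node's ID is represented by `None`.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class PacketType(IntEnum):
    CRQ = 1  # Child Request
    PRQ = 2  # Parent Request


@dataclass
class RequestPacket:
    type: PacketType
    group_id: int                  # Anycast group ID
    path_cost: float               # accumulated path cost (phi)
    next_node_id: Optional[int]    # None models an empty field
    node_id: int                   # part of the Node's address
    position: Tuple[float, float]  # part of the Node's address

    def from_sink(self) -> bool:
        """Anycast group ID 0 marks a packet that comes from a sink."""
        return self.group_id == 0

    def has_route(self) -> bool:
        """An empty Next node's ID means no route to a sink is known yet."""
        return self.next_node_id is not None
```

A CRQ sent directly by a sink would then carry `path_cost=0.0` and `next_node_id=0`, while a PRQ from a node that has not joined a tree would leave `next_node_id` as `None`.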

The Route Reply packet (RREP) is generated by sinks or sensors in response to the corresponding RREQ packets. The destination anycast group ID represents the anycast group to which the destination node belongs. The accumulative path cost is the cost accumulated along the path from the destination node to the source node. The Route Error packet (RERR) is the same as that of the AODV protocol.

B. Establishing an Anycast Tree

The sensor nodes are distributed randomly in the monitoring area. There are multiple sink nodes and n sensor nodes. The anycast group, which contains all sink nodes, is assigned an identifier (ID = 0). Every sink node can construct an anycast tree with itself as the root, and sensor nodes obtain anycast services from the anycast tree. The protocol starts with the creation of a number of spanning trees. In this model, a sensor node that wants to become an Anycast member must first join an anycast tree. A sensor node joins an anycast tree through the following process:

1) Every sink node broadcasts a CRQ query to its neighbor nodes within a small range. The CRQ contains the location information of the sending sink node, the ID of the Anycast group, and the path cost φ from a sink to the node that sent the CRQ. If the CRQ comes directly from a sink node, the value of φ is zero and the Next node's ID is zero.

2) A neighbor node that receives the CRQ and has not joined any anycast tree accepts the CRQ and checks whether it comes from a tree node by checking the ID that identifies the anycast group. If it does, the node appends the CRQ to its father node table and records the father node's relevant parameters, including the location information, the path cost φ, and the anycast ID it is requested to join. If the ID in the CRQ does not belong to any anycast tree node, the node discards the CRQ.

3) After receiving a CRQ, the node sets a timer whose interval may be decided by the current network status. The node may receive more than one CRQ during this interval. After the timer expires, the node compares the path costs φ in the received CRQs, selects the neighbor node with the minimum path cost φ as its father node, records the information on its father node, and returns a RREP to it. If several received CRQs share the same minimum path cost φ, the node selects one of those neighbor nodes as its father node at random.

4) After the father node receives the joining message, it returns an ACK message to the child node. Due to the characteristics of the algorithm, each node only needs to retain the information of its father node; the father node does not need to record the relevant information of the child node. This is different from TCP/IP, which records the child's IP address. The child node replaces the Next node's ID in the CRQ with the ID of its father, recalculates the path cost φ from the sink to itself, and replaces the path cost φ in the CRQ with the new φ. At the same time, it updates the relevant parameters (position, etc.) and broadcasts the CRQ to the next hop, until all nodes have joined an anycast tree, as shown in Fig. 1.

Figure 1. An anycast tree is established from all sensors to a sink
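The outcome of the tree establishment steps can be approximated with a centralized Python sketch. It is not the distributed, timer-driven procedure itself: it assumes non-negative link costs and idealized timers, so that every node ends up with the father that gives the minimum path cost to some sink, which is what a multi-source shortest-path search computes. The function name, the `link_cost` callback, and the example topology are hypothetical.

```python
import heapq


def build_anycast_trees(neighbors, link_cost, sinks):
    """Return (father, cost) maps for all nodes reachable from a sink.

    neighbors maps a node name to its neighbor names, link_cost(u, v)
    gives the non-negative cost added when v joins through u, and sinks
    lists the tree roots. Each node stores only its father, as in the
    protocol description.
    """
    cost = {s: 0.0 for s in sinks}
    father = {s: None for s in sinks}
    heap = [(0.0, s) for s in sinks]
    heapq.heapify(heap)
    settled = set()
    while heap:
        c, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        for v in neighbors[u]:
            new_cost = c + link_cost(u, v)
            # Accept the offer (a CRQ in protocol terms) only if it is cheaper.
            if v not in cost or new_cost < cost[v]:
                cost[v] = new_cost
                father[v] = u
                heapq.heappush(heap, (new_cost, v))
    return father, cost
```

In a chain `s1 - a - b - s2` with an extra node `c` attached to `b` and unit link costs, node `a` joins the tree of `s1`, while `b` and `c` join the tree of `s2`.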

C. A New Node Joins the Anycast Tree

A sensor node that wants to share the anycast service must join an anycast tree. To join an Anycast tree, the node broadcasts a joining message PRQ to its neighbor nodes. The PRQ contains the location information of the node. A neighbor node that receives the PRQ and has already joined an anycast tree accepts the PRQ and returns a CRQ. The CRQ contains the location information of the neighbor node, the ID of the Anycast group, and the path cost φ from the sink node to this neighbor node.

The node that sent the joining message accepts the CRQ and appends it to its father node table together with the node's relevant parameters, including the location information and the path cost φ from the sink node to this node. The node then sets a timer and waits for more CRQs during the interval. After the timer expires, the node compares the path costs φ of the CRQs in the father table, selects the node with the minimum path cost φ as its father node, recalculates the path cost φ from the sink to itself, replaces the path cost φ in the CRQ with the new φ, and returns a RREP to its father node. If more than one received CRQ has the minimum path cost φ, the node selects one of those neighbor nodes as its father node at random. The father node receives the RREP and sends an ACK message to the child node. The new node has then joined an anycast tree successfully.
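The father selection performed when the timer expires can be sketched in Python as follows. The function name and the representation of the father node table as a list of (neighbor ID, path cost) pairs are assumptions made for illustration.

```python
import random


def select_father(father_table, rng=random):
    """Pick the neighbor with the minimum path cost from the father table.

    father_table is a list of (neighbor_id, path_cost) pairs collected
    from CRQs while the timer was running. Ties on the minimum cost are
    broken at random. Returns None if no CRQ was received.
    """
    if not father_table:
        return None
    best = min(cost for _, cost in father_table)
    candidates = [nid for nid, cost in father_table if cost == best]
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible in simulations.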

D. Node Leave or Failure

Because the power of a sensor node is constrained, the energy of some nodes will be exhausted and those nodes become invalid. There are three cases when a node fails.

1) The failed node is a leaf of the anycast tree and has no child. When the father node receives no information from the node within a preset time interval, the node is considered to have failed. This case is simple: as shown in Fig. 2, if v5 fails, v4 does not need to do anything or revise any of its own information.

2) The failed node is an intermediate node and has a child node. In Fig. 2, v2 is an intermediate node. If v2 fails, v4 and v5 are disconnected from v1, and the data that v4 and v5 have collected cannot be transmitted to the sink node. In this case, v4 broadcasts a joining message PRQ to its neighbor nodes, such as v3 and v6. The process is the same as that of a new node joining an anycast tree, described in Section C above. In Fig. 2, v4 receives CRQs from v3 and v6 and selects v3, the node with the minimum path cost φ, as its father node because φ3 < φ6.

3) The failed node is the sink node, which is the root of the anycast tree. The data collected by the tree's nodes can no longer be transmitted to the sink node, and the anycast tree becomes invalid. All nodes then restart the tree creation process described in Section B above.
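The three failure cases can be summarized in a short sketch. The function name, the `parent` map, and the returned action labels are hypothetical and chosen only for illustration; failure detection by timeout is assumed to have happened already.

```python
def handle_failure(failed, parent, sinks):
    """Return (action, affected_nodes) for a failed node.

    parent maps every non-sink node to its father node (assumed layout);
    sinks is the set of sink node IDs.
    """
    if failed in sinks:
        # Case 3: the root failed, so every node restarts tree creation.
        return "rebuild_tree", sorted(parent)
    children = [node for node, father in parent.items() if father == failed]
    if not children:
        # Case 1: a leaf failed; the father node has nothing to repair.
        return "none", []
    # Case 2: each orphaned child broadcasts a PRQ and rejoins a tree.
    return "rejoin", sorted(children)
```

For a tree in which v5 hangs below v4 and v4 below v2, a failure of v2 yields a rejoin request for v4 only; v5 stays attached to v4 and follows it to the new father node.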

Figure 2. When node v2 fails, node v4 is disconnected from the sink node s1 and must rebuild its connection through node v3.

E. Anycast Tree

After tree construction is complete, every node has joined an anycast tree, and several anycast trees coexist, as shown in Fig. 3. In this phase every node sends its collected data to its parent node. Every parent node receives data from its children, fuses the data with its own, and forwards the result to its own parent along the anycast tree. When the data from all member nodes in the anycast tree have been received, the sink node applies data fusion to the received data and then sends the fused data to the Internet or to other devices. A sensor node does not know in advance which sink its data will reach, but the data is certain to be delivered to some sink.
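The bottom-up fusion along one tree can be illustrated with a small recursive sketch. The fusion operator is not specified in the text, so a running sum and count (enough to compute an average at the sink) is assumed here; `fuse_up` and the dictionary layout are hypothetical.

```python
def fuse_up(children, readings, node):
    """Return (sum, count) of all readings in the subtree rooted at node.

    children maps a node to the list of its child nodes; readings maps a
    sensor node to its own measurement. Sinks have no reading of their own.
    """
    total = readings.get(node, 0.0)
    count = 1 if node in readings else 0
    for child in children.get(node, []):
        child_total, child_count = fuse_up(children, readings, child)
        total += child_total   # fuse the child's aggregate with our own
        count += child_count
    return total, count
```

Each parent forwards a single fused pair instead of every raw reading, which is the reduction in traffic that the tree structure is meant to provide.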

A notable feature of the proposed anycast routing protocol is that several trees are constructed instead of one, which allows more distributed operation among the nodes. Building the trees according to path cost strengthens this effect, which results in more balanced energy consumption and data delay among the nodes and a longer network lifetime.

Data collected by sensor nodes may contain redundant information because of spatiotemporal correlation. It is therefore desirable to aggregate the data at the sink to remove the redundancy. However, correlated data may be transmitted to different sinks. If the sinks forward this redundant information to the Internet or other devices, the frequent communication is vulnerable to wiretapping and causes serious transmission interference. Our protocol therefore takes data correlation into account: the data received by every sink is aggregated. All sinks form a tree, and one sink is selected at random as the root sink. The root sink gathers the data from all sinks and aggregates them. An example is shown in Fig. 3.

[Figure 3 legend: source node, middle node, sink node (S1 to S4), root sink, link, pseudo link]

Figure 3. Multiple anycast trees are established to cover all sensor nodes, and one sink is selected as the root sink.

V. SIMULATIONS

In this section the performance of the anycast routing protocol is evaluated by computer simulation and compared with other schemes, namely AODV [6] and LEACH [11]. We assume 100 nodes, comprising 5 sink nodes and 95 sensor nodes, distributed randomly in a 100×100 region. The simulation parameters are given in Table 1. Every node's transmission power is adjustable, and each node adjusts its transmission power according to actual need when communicating with other nodes. Any two nodes within transmission range of each other can communicate directly.

Fig. 4 shows the network topologies produced by the different schemes. The topology of LEACH is shown in Fig. 4(a). Every sensor node is one hop from its sink, and sensor nodes do not relay for one another: data collected by member nodes are transmitted directly to the cluster head. Sensor nodes therefore consume more energy because of the many long-distance transmissions. The topology of the proposed anycast routing protocol is shown in Fig. 4(b). Every sensor node joins an anycast tree according to the path cost, and data is transmitted along the tree from the sensor leaves to the root sink. The proposed scheme displays a more balanced and distributed network pattern.

TABLE 1. THE PARAMETERS USED IN THE SIMULATION

Parameter                 Value           Parameter               Value
Size of target area       100×100         Data packet size        512 byte
Number of sink nodes      5               Metadata packet size    25 byte
Number of sensor nodes    95              Maximum radius, R       20 m
Initial energy            10 J            α1                      1
ε_elec                    50 nJ/bit       α2                      1
ε_amp                     50 nJ/bit/m²    α3                      1
e_s                       100 nJ/s        α4                      1

[Figure 4 panels: (a) LEACH; (b) Anycast routing protocol]

Figure 4. The network topology with different protocols.

For a network flow f, let f_ij denote the rate of information flow from node i to node j. The energy spent by node i to transmit a unit of information directly to node j is e_ij. Then the lifetime of node i under flow f_ij is given by

\[ T_i = \frac{E_i}{w_i \cdot f_{ij}} = \frac{E_i}{(e_s + e_g + e_r + e_{ij}) \cdot f_{ij}} \]
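As a worked illustration of the lifetime formula, the sketch below evaluates T_i for given per-unit energy costs. The function name and the numeric values in the usage note are hypothetical; only the formula itself comes from the text.

```python
def node_lifetime(energy, unit_costs, flow_rate):
    """Lifetime T_i = E_i / (w_i * f_ij), with w_i the sum of per-unit costs.

    energy      -- available energy E_i of node i, in joules
    unit_costs  -- per-unit energy terms (e_s, e_g, e_r, e_ij), in joules
    flow_rate   -- information flow rate f_ij, in units per second
    """
    w = sum(unit_costs)
    if w <= 0 or flow_rate <= 0:
        raise ValueError("costs and flow rate must be positive")
    return energy / (w * flow_rate)
```

For example, with E_i = 10 J, a total per-unit cost of 0.5 µJ, and a flow of 1000 units per second, the node lasts about 20,000 s; doubling the flow rate halves the lifetime.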

Fig. 5 shows the network energy consumption and lifetime as the number of sensor nodes varies, with 2, 3, 4, 5, and 6 sinks deployed respectively. As the network size increases, the total energy consumption rate of the network rises and the network lifetime gradually falls. As the number of sinks increases, the total energy consumption rate falls and the network lifetime grows, although each newly added sink prolongs the lifetime by less than the previous one. More nodes shorten the distance between nodes, increase data correlation, and lower the required transmission power, and the routing algorithm effectively balances the data traffic load across nodes, which benefits the network lifetime. Adding new sinks also reduces the distance between nodes, so the network lifetime is prolonged. An additional sink, however, only affects the routes near it, so its impact on the network becomes smaller and the gain in network lifetime diminishes as more sinks are added.

[Figure 5 panels: (a) energy consumption (nJ); (b) lifetime of the network (s)]

Figure 5. Network energy consumption and lifetime as the number of sensor nodes varies, with 2, 3, 4, 5, and 6 sinks deployed respectively.


We define the delay time as the interval between the transmission of a packet by the source and the reception of the same packet by the sink. As Fig. 6(a) shows, the delay of AODV is the longest. When sensor nodes have collected packets that need to be transmitted to a sink node, they must first initiate route establishment, and this time counts toward the delay. AODV also tries to create a route to a single sink, and thus wastes more time than the other two methods. Our protocol and LEACH are proactive routing protocols: the route is established before packets are transmitted, so packets can be sent to a sink node in the shortest time. Our protocol is slightly better than LEACH. In LEACH, data are transmitted directly from the member nodes to the cluster head, so many long-distance transmissions are required within a cluster, and their number increases as the network grows. In our protocol, by contrast, the node with the minimum path cost is selected as the father node, so our routes are better than those of the other two protocols. In particular, when the transmission rate is high, our protocol performs about 5% better than LEACH.

Fig. 6(b) compares energy consumption over time. As time passes, more and more packets are transmitted to the sink nodes and the energy consumption increases. Compared with AODV and LEACH, our protocol consumes less energy, and its advantage grows over time. As expected, our protocol has the best performance. AODV transmits more packets than the others because it sends a route request and rebuilds the route every time newly collected packets need to be sent to a sink node, and it therefore consumes more energy. The gap between AODV and the other two protocols widens as time passes. Both LEACH and our protocol perform better than AODV, and our protocol is slightly better than LEACH. The main reason is that the communication radius in LEACH may be very large, whereas our protocol considers multiple path cost metrics and selects the node with the minimum path cost as the father node, which minimizes energy consumption and reduces data delay. As discussed previously, anycast can exploit the wireless advantage when there is more than one sink node.


This paper proposes an anycast routing protocol for wireless sensor networks, based on an anycast tree scheme, that transfers data energy-efficiently and reduces the average delay time. A tree is formed for each sink node, and every node sends its collected data to its parent node along the anycast tree. The architecture of the anycast tree is determined by the path cost from the nodes to the sink. Some protocols are simplified to suit wireless sensor network applications. A data packet can be sent to the nearest sink node along the anycast tree. Multiple metrics guide route discovery and sink selection. The node with the minimum path cost is selected as the father node and forwards the packet, which minimizes the energy required for communication between the nodes and the sinks. Simulation results show that the proposed scheme reduces the delay time and balances the energy consumption among the nodes, and thus significantly extends the network lifetime compared with existing schemes.

[Figure 6 panels: (a) delay versus packet transfer rate; (b) energy consumption over time]

Figure 6. The comparison of delay and energy consumption.

REFERENCES 

[1] Weber S, Cheng L. A survey of Anycast in IPv6 networks. IEEE Communications Magazine, 2004, 42(1): 127-132.

[2] Doi S, Ata S, Kitamura H. Protocol design for Anycast communication in IPv6 network. Proceedings of 2003 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM'03). New York, USA: IEEE Press, 2003: 470-473.

[3] Jia W, Zhou W, Kaiser J. Efficient algorithm for mobile multicast using anycast group. IEEE Proc. Communications, 2001, 148(1): 14-18.

[4] Abley J. Hierarchical Anycast for Global Service Distribution, 2003.

[5] Freedman M J, Lakshminarayanan K, Mazieres D. Oasis: Anycast for any service. Proceedings of the 3rd Symposium on Networked Systems Design and Implementation, San Jose, CA, May 2006.

[6] Subramanian Swaminathan, Jinye Huo, Fang Liu. An Anycast Routing Protocol for Ad-Hoc Networks. http://www.cs.ucsb.edu/ebelding/, 2003-03.

[7] Jianxin Wang, Yuan Zheng, Weijia Jia. A-DSR: A Based-DSR Anycast Protocol for IPv6 Flow in Mobile Ad Hoc Networks. IEEE Proc. VTC2003, 2003.

[8] Thepvilojanapong N, Tobe Y, Sezaki K. HAR: hierarchy-based anycast routing protocol for wireless sensor networks. Proceedings of Symposium on Applications and the Internet Workshops, 2005: 204-212.

[9] WANG Xiao-nan et al. Routing protocol for wireless sensor networks based on Anycast. Application Research of Computers, 2009, 7(7): 2695-2697.

[10] Michal Koziuk, Jaroslaw Domaszewicz. Tree-based anycast for wireless sensor/actuator networks. Lecture Notes in Computer Science, Proceedings of the 9th International Conference on Distributed Computing and Networking, Kolkata, India, 2008.

[11] W. R. Heinzelman, A. Chandrakasan, H. Balakrishnan. Energy-Efficient Communication Protocol for Wireless Micro-sensor Networks. Proceedings of the Hawaii International Conference on System Science, Maui, Hawaii, 2000.

[12] Hu W, Bulusu N, Jha S. A communication paradigm for hybrid sensor/actuator networks, 2004.

[13] Hu W, Chou C T, Jha S, Bulusu N. Deploying long-lived and cost-effective hybrid sensor networks, 2004.

[14] Thepvilojanapong N, Tobe Y, Sezaki K. HAR: Hierarchy-based anycast routing protocol for wireless sensor networks. SAINT 2005: Proceedings of the 2005 Symposium on Applications and the Internet, pp. 204-212. IEEE Computer Society, Los Alamitos, 2005.

[15] He T, Stankovic J A, Lu C, Abdelzaher T F. A spatiotemporal communication protocol for wireless sensor networks. IEEE Transactions on Parallel and Distributed Systems, 16: 995-1006, 2005.

[16] Bian F, Govindan R, Schenker S, Li X. Using hierarchical location names for scalable routing and rendezvous in wireless sensor networks. SenSys 2004: Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, pp. 305-306. ACM Press, New York, 2004.

[17] W. R. Heinzelman, A. Chandrakasan, H. Balakrishnan. Energy-Efficient Communication Protocol for Wireless Micro-sensor Networks. Proceedings of the Hawaii International Conference on System Science, Maui, Hawaii, 2000.

[18] S. Lindsey, C. S. Raghavendra. PEGASIS: Power-Efficient gathering in sensor information systems. Proc. of the IEEE Aerospace Conf., Canada, March 2002, pp. 1-6.

[19] S. S. Satapathy, N. Sarma. TREEPSI: tree based energy efficient protocol for sensor information. Wireless and Optical Communications Networks 2006 IFIP International Conference, April 2006.

Demin Gao was born in Shandong Province, China, in September 1980. He received the M.S. degree in computer application technology from Jingdezhen Ceramic Institute, Jingdezhen, Jiangxi, China, in 2008. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols and data aggregation in wireless sensor networks.

Huanyan Qian was born in Jiangsu Province, China, in October 1950. He is currently a professor in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His current research interests include sensor networks, mobile communication, and wireless communication networks.

Zheng Wang was born in Jiangsu Province, China, in September 1980. He received the M.S. degree in computer application technology from Nanjing University of Science and Technology, Nanjing, Jiangsu, China, in 2007. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks.

Jiguang Chen was born in Henan Province, China, in February 1982. He received the M.S. degree in Education from Henan Normal University, Xinxiang, Henan, China, in 2008. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks.


Management Model Research of Low-power 

Wireless Sensor Network 

LinGe Wang 

Ningbo Dahongying University college of software, Ningbo, 315175, China 

Email:Wanglingew@163.com 

YueDou Qi 

Ningbo Dahongying University college of software, Ningbo, 315175, China 

Email:yuedouqi@sohu.com 

Abstract—Nowadays most of the wireless sensor network 

management models have a short lifetime because nodes transfer management information to one another, which consumes energy too quickly. This paper presents a new model based on mobile agents for cluster management in wireless sensor networks. The model addresses a shortcoming of current wireless sensor network management architectures: they fail to consider that the information reports of each node consume a great deal of energy and thereby reduce the network lifetime. The mobile agent-based wireless sensor network management model inherits the merits of the traditional models and gives full consideration to the energy characteristics of the nodes. Analysis shows that the proposed model has advantages over traditional models in energy saving, data integration, topology control, and so on.

Index Terms—wireless sensor network, mobile Agent, Low 

energy consumption 


With the rapid development and increasing sophistication of communication, embedded computing, and sensor technology, sensor networks composed of thousands of micro-sensors, each capable of sensing, computing, and communication, have attracted great attention. Such a network integrates sensor, embedded computing, networking, and wireless communication technologies into a new information acquisition and processing technology that is widely used in national defense and military applications, environmental monitoring, traffic management, medicine, and health care.

Agent technology developed from artificial intelligence. An agent system is a loosely coordinated system that represents the trend of distributed software development and is more flexible and intelligent. Agent software, as a new software technology, has made considerable progress and is used in many areas such as Internet information retrieval, information collection, e-commerce, data mining, integrated manufacturing, and so on. The nodes of a wireless sensor network usually run on batteries, whose energy is limited, and the communication, computation, and storage abilities of the nodes are limited as well, which raises challenges for the hardware and software design of wireless sensor networks.

Apr. 20 2011.

doi:10.4304/jnw.6.12.1734-1739

II. WIRELESS SENSOR NETWORK ARCHITECTURE 

The composition of a wireless sensor network system is shown in Figure 1. A large number of sensor nodes are randomly distributed in the monitoring area. These nodes form a network in a self-organizing way. Each node not only collects data but also performs routing. The collected data is delivered by multi-hop transmission to the focal point and passed on to the Internet. The task manager node manages, classifies, and processes the information, and the results are finally presented to users.

Figure 1. Wireless Sensor Network 

In earlier communication schemes for wireless sensor networks, network nodes collect raw data and send it directly to a central node, which carries out the signal processing tasks. This centralized approach wastes a great deal of bandwidth. At the same time, because the nodes near the center forward a large amount of information, their energy is soon depleted.

A sensor network is an integrated monitoring, control, and wireless communication network system with a much larger number of nodes (thousands) and a denser node distribution. Because of environmental influences and energy depletion, nodes fail more easily, and environmental interference and node failures can easily change the network topology. Typically, most sensor nodes are stationary. In addition, the energy, processing, storage, and communication capacities of a sensor node are very limited, so resource transfer, power management, computation, network topology discovery, and so on must be considered together in a wireless sensor network, and energy consumption in particular. On the one hand, the energy consumption of sensor nodes should be minimized. On the other hand, when a node's energy is depleted, the network should be able to discover a new topology, isolate the dead node, and generate a new path to complete data collection and processing. This points to two important elements of wireless sensor networks: topology discovery and reduced energy consumption. Taking reduced energy consumption as a precondition, this paper explores three aspects of wireless sensor networks in order to achieve route discovery while ensuring network reliability:

The clustering model based on mobile agents, namely, how to solve the clustering problem.

In the cluster model, how to choose the cluster head node.

In the cluster model, once the cluster head node is identified, how to solve the routing problem of how the mobile agent moves between nodes within the same cluster.

III. ANALYSIS OF EXISTING MANAGEMENT MODEL 

A. MANNA 

Advantages:

Groundbreaking research that provides a complete network management system specific to sensor networks. Based on the SNMP management model, it summarizes the sensor network management architecture, including the organization, functions, and information.

Disadvantages:

Although some simulations were carried out, no detailed implementation scheme was given, and the research is at an initial stage.

B. MIADSN

Advantages:

The entire sensor network is divided into several subsystems with different functions, which favors modularity. To conserve bandwidth, it introduces mobile agent technology and statistical methods from fuzzy theory, so that data collection is achieved with only a few sensor nodes.

Disadvantages:

It mainly considers the negative effect of data fusion; management is discussed little, and specific methods are still under study.

C. Other existing methods rely on broadcast traffic

Advantages:

Low-level nodes are organized and managed by high-level nodes. When a mobile agent carries out a task, it is assigned a certain strategy, and these strategies control how the mobile agent visits wireless sensor nodes to achieve data collection.

Disadvantages:

Because management information must be exchanged between nodes, node energy consumption is excessive and the network lifetime is reduced. Other issues arise because the energy of the nodes is not considered.

D. Comparison of common methods

Because of the characteristics of wireless sensor networks, the network management models of CMIP, SNMP, and ANMP are no longer suited to wireless sensor network management, so researchers have put forward new network management solutions. For example, Linnyer B. Ruiz et al. described a management framework for wireless sensor networks in three dimensions (management levels, management functions, and management functional areas) and designed MANNA, a wireless sensor network architecture through which a wireless sensor network is configured and managed. Wang Feng et al. proposed a distributed sensor network management model based on mobile intelligent agents (Mobile Intelligent Agent-based DSN, called MIADSN). It uses a new model: data is retained locally and data fusion is performed remotely. A cluster-based hierarchical self-management model for wireless sensor networks has also been proposed, in which low-level nodes are organized and managed by high-level nodes; because management information must be exchanged between nodes, node energy consumption is excessive and the network lifetime is reduced. A policy-based mobile agent management framework for wireless sensor networks has also been proposed. In this management structure, according to management needs, a mobile agent is assigned a certain strategy for each task it carries out, and these strategies control how the mobile agent visits wireless sensor nodes to achieve data collection. These approaches have other problems as well, because they do not consider the energy of the nodes.

Based on the above analysis, the most popular model for sensor networks is the clustered-topology network model with multiple mobile agents. In this model, reducing energy consumption in a wireless sensor network becomes a matter of how to form clusters, how to select cluster heads, and how to route within and between clusters. According to the literature, the topology currently used is shown in Figure 2.


To reduce the energy consumption of nodes in wireless sensor networks, this paper proposes a wireless sensor network management model based on clustering and multiple mobile agents. Within this model, it also improves the clustering algorithm and the agent routing algorithm. Simulation results show that the proposed network management structure and algorithms reduce the power consumption of sensor nodes.

Figure 2. Clustering structure 

IV. WIRELESS SENSOR NETWORK MANAGEMENT MODEL 

A. Problems of traditional Wireless Sensor Network 

In a traditional wireless sensor network, each sensor node collects data and transfers it to the designated destination, the sink node. In this mode, the power of the wireless sensor network is mainly consumed by the sensors' data transfer. Power is the most important resource of a wireless sensor network. The power consumed by communication among the network nodes is much larger than that consumed by computation and sensing, and it is concentrated in the sending, receiving, and idle states. In a traditional wireless sensor network, large amounts of power are consumed in data transfer, which results in the rapid death of sensor nodes. Such a network is suitable for monitoring small amounts of data, not for large-scale, long-term deployment.

Because of these shortcomings of traditional wireless sensor networks, cluster-based management is currently popular. The whole wireless sensor network is divided into a number of regions, and a suitable node, called the cluster head, is chosen in each region. The cluster head performs basic processing and then transfers the data to the sink node. To a certain extent, this method reduces the energy consumption of sensor nodes.


B. Wireless sensor network management model based on the LEACH method

Suppose that N sensor nodes are randomly and uniformly distributed in a two-dimensional square area A, and that the sensor network has the following properties:

The sensor nodes and the sink node are stationary; the sink node is unique and located far from the network area.

The sensor nodes have the same initial energy, which cannot be replenished.

The sink node has enough energy.

A sensor node can calculate its distance to the cluster nodes and to the sink node from the strength of a transmitted test signal.

Nodes consume the same amount of energy transmitting in all directions.

In wireless sensor networks, differences in the energy of signal transmission affect the performance of routing protocols. The energy model includes a receiver model and a transmitter model.

The assumed model is shown in Figure 3. It considers the energy consumed by the transmit electronics, the energy consumed by the power amplifier, and the energy consumed by the receive electronics in receiving signals.

Figure 3. Energy consumption model for sensor networks 

According to the energy model, when the transmission distance is d and the data length is L bits, the energy consumption of the transmit electronics is:

The energy consumption of the receive electronics is:

\[ E_{receive} = E_{Rx\text{-}elec}(L) = L \times E_{elec} \]
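A minimal sketch of such an energy model follows, assuming the first-order radio model commonly used with LEACH, in which the transmitter spends E_elec per bit plus an amplifier term proportional to d². The amplifier term, the function names, and both numeric constants are assumptions made for illustration and are not values taken from this paper; only the receive expression L × E_elec appears in the text.

```python
E_ELEC = 50e-9     # J/bit, assumed energy of the transmit/receive electronics
EPS_AMP = 100e-12  # J/bit/m^2, assumed energy of the power amplifier


def tx_energy(bits, distance_m):
    """Assumed transmit cost: electronics plus a d^2 amplifier term."""
    return bits * E_ELEC + bits * EPS_AMP * distance_m ** 2


def rx_energy(bits):
    """Receive cost: L x E_elec, as in the receive equation of the text."""
    return bits * E_ELEC
```

Under these assumed constants, sending 1000 bits over 10 m costs about 60 µJ while receiving them costs about 50 µJ, so the amplifier term dominates only at longer distances.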

C. LEACH protocol shortcomings 

The LEACH protocol is only applicable to homogeneous networks; it does not give good results in heterogeneous networks.

Each node acts as a cluster head according to a probability, without considering the residual energy or the location of the node. In each cycle, every node must act as a cluster head.

When data is transferred from the cluster heads to the base station, every cluster head sends directly to the base station.


V. CLUSTER HEAD ELECTION 

Researchers have carried out extensive simulation and analysis on how to determine the optimal probability of cluster head election. They take the optimal probability to be a function of the spatial density of nodes uniformly distributed in the monitoring area. The best clustering achieves an optimal energy distribution in the network and minimizes the total energy consumption.

Suppose that nodes are uniformly distributed in a square area of side length 2a according to a Poisson distribution with parameter λ. N is a random variable for the number of sensor nodes, with E[N] = λA. Let p be the probability of cluster head election, so that np is the resulting number of clusters. Assuming that the base station is located at the center of the square area, the average distance from a cluster head node to the base station is:

\[ E[D_i \mid N = n] = \int_A \sqrt{x_i^2 + y_i^2} \left( \frac{1}{4a^2} \right) dA = 0.765a \]
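The constant 0.765 can be checked numerically. The sketch below compares the known closed form for the mean distance from the center of a square of side 2a to a uniformly distributed point, a(√2 + ln(1 + √2))/3, with a seeded Monte Carlo estimate. The function names, sample count, and seed are illustrative choices.

```python
import math
import random


def mean_distance_closed_form(a):
    """Mean distance from the centre of a square of side 2a to a uniform point."""
    return a * (math.sqrt(2) + math.asinh(1)) / 3


def mean_distance_monte_carlo(a, samples=100_000, seed=0):
    """Estimate the same mean by sampling points uniformly in [-a, a]^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.uniform(-a, a)
        y = rng.uniform(-a, a)
        total += math.hypot(x, y)
    return total / samples
```

Both values agree with 0.765a to within sampling error, which supports the stated average distance for a base station at the center of the square.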

D_i is a random variable denoting the distance from the cluster head located at (x_i, y_i) to the base station. The network has np cluster heads, and the positions of the cluster heads are independent of one another, so the total length from all the cluster heads to the sink node is 0.765npa.

The cluster heads follow a Poisson point process PP0 with intensity λ_1 (λ_1 = pλ), and the cluster member nodes follow a Poisson point process PP1 with intensity λ_0 (λ_0 = (1 − p)λ). We define e_1 as the energy consumed by the member nodes within one cluster to transfer data to the cluster head. Then:

\[ E[e_1 \mid N = n] = \frac{E[L \mid N = n]}{r} \]

e_2 is the total energy consumed by all the ordinary nodes in the network to transmit data to their respective cluster heads. Then:

\[ E[e_2 \mid N = n] = np \, E[e_1 \mid N = n] \]

e_3 is the energy consumed by the cluster head nodes to transfer data to the sink node. Then:

\[ E[e_3 \mid N = n] = \frac{0.765\,npa}{r} \]

e is the energy consumption of the entire network. Then:

\[ E[e \mid N = n] = E[e_2 \mid N = n] + E[e_3 \mid N = n] = \frac{np(1-p)}{2r\,p^{3/2}\lambda^{1/2}} + \frac{0.765\,npa}{r} \]

Since E[N] = λA, taking the expectation over N gives:

\[ E[e] = E\big[E[e \mid N = n]\big] = E[N] \left[ \frac{1-p}{2r\sqrt{p\lambda}} + \frac{0.765\,pa}{r} \right] = \lambda A \left[ \frac{1-p}{2r\sqrt{p\lambda}} + \frac{0.765\,pa}{r} \right] \]

When the above formula attains its minimum, the system obtains the optimal value of p, which determines the probability of cluster head election:

\[ p = \left[ \frac{1}{3e} + \frac{2^{1/3}}{3e \left( 2 + 27e^2 + 3\sqrt{3}\,e\sqrt{27e^2 + 4} \right)^{1/3}} + \frac{\left( 2 + 27e^2 + 3\sqrt{3}\,e\sqrt{27e^2 + 4} \right)^{1/3}}{3e \cdot 2^{1/3}} \right]^2 \]

In Theorem: e = 3. 

06 λ , This time, p is the 

probability of the optimal cluster head election and 

Minimum energy consumption in the whole 

network.Then the p into the formula can Calculate the 

distance threshold T(n) for each round.Finally, the 

number of the optimal number of cluster heads will be 

obtained for each round.. 

With the operation of the network, the network energy 

change, P value also changes, the number of cluster heads 

in the network also with the dynamic changes. 
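As a rough numerical check of the closed form above, note that setting the derivative of E[e] with respect to s = √p to zero gives the cubic e·s³ − s² − 1 = 0 with e = 3.06a√λ. The sketch below evaluates the closed form and verifies that it satisfies this cubic. The function names, the default r = 1, the region area 4a² and the sample values of λ and a are illustrative assumptions, not values from the paper.

```python
import math


def expected_energy(p, lam, a, r=1.0):
    """E[e] = lam*A*[(1-p)/(2r*sqrt(p*lam)) + 0.765*p*a/r], assuming A = 4a^2."""
    area = 4 * a * a
    return lam * area * ((1 - p) / (2 * r * math.sqrt(p * lam)) + 0.765 * p * a / r)


def optimal_p(lam, a):
    """Closed-form p that minimises expected_energy (formula in the text)."""
    e = 3.06 * a * math.sqrt(lam)  # the constant called e in the text
    w = (2 + 27 * e**2 + 3 * math.sqrt(3) * e * math.sqrt(27 * e**2 + 4)) ** (1 / 3)
    s = 1 / (3 * e) + 2 ** (1 / 3) / (3 * e * w) + w / (3 * e * 2 ** (1 / 3))
    return s * s


# Hypothetical sample parameters, chosen only to exercise the formula.
lam, a = 0.01, 50.0
p_opt = optimal_p(lam, a)
print(f"optimal p = {p_opt:.4f}")
```

Because the constant factor λA and the range r do not depend on p, they do not affect the location of the minimum; only a and λ do.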

VI. MOBILE AGENT 

A mobile agent combines distributed technology with artificial intelligence; simply put, it is an intelligent agent with mobility. The main idea is to transfer the code of the calculation module to each node, complete the calculation on that node, and return the processing results to the objective sink, which reduces the energy that sensor nodes spend on transmitting data.

Through the discussion above, we can establish the cluster groups and, at the same time, select the cluster head nodes. By using a mobile agent, we can transfer data information at each cluster head node and eventually return the results to the objective sink node.

Assume that the migration of the mobile agent only improves the accuracy of the data information, that the agent size S_MA is fixed, and that the energy consumption of idle monitoring is not considered, i.e. E_idle = b_4 = 0. The energy a node consumes to transmit the agent can then be defined as

$$E_{tx}(d) = (b_1 + b_2 d^{a}) \cdot S_{MA},$$

and the energy consumed to receive it can be defined as

$$E_{rx} = b_3 \cdot S_{MA}.$$

Here b_1, b_2, b_3 and b_4 are constants related to the wireless transceiver of the sensor node; d is the transmission distance between the nodes; a (2 ≤ a ≤ 4) is the attenuation factor of the signal transmission path used to measure energy consumption. The migration cost of the agent moving from v(i) to v(j) is:

$$a_{ij} = \begin{cases} (b_1 + b_2 d_{ij}^{a}) \cdot S_{MA} + b_3 \cdot S_{MA}, & d_{ij} \le R_{max} \\ \infty, & d_{ij} > R_{max} \end{cases}$$

where d_ij is the distance between v(i) and v(j). Node v(j) can be reached when d_ij is no greater than R_max; otherwise it cannot be visited. The process by which a node perceives the target is treated as the process of sensing the target signal, so the information gain of v(j) is

$$SE(j) = \begin{cases} b_5\, d_{jo}^{-a'}, & d_{jo} \le D_{max} \\ 0, & d_{jo} > D_{max} \end{cases}$$

where d_jo is the distance between v(j) and the target, and a' is the attenuation factor of the signal arriving at the destination. When d_jo is no more than D_max, the node can perceive the target, and the information gain is inversely proportional to d_jo^{a'}.
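The two piecewise definitions above can be sketched as small functions. This is a minimal illustration only: the function and parameter names are hypothetical, and the numeric constants in the example calls are made up, since the paper does not give values for b_1, b_2, b_3, b_5, S_MA, R_max or D_max.

```python
import math


def migration_cost(d_ij, s_ma, b1, b2, b3, a, r_max):
    """Cost of moving the agent between two nodes d_ij apart.

    Transmit energy (b1 + b2*d^a)*S_MA plus receive energy b3*S_MA,
    or infinity when the next node is out of radio range.
    """
    if d_ij > r_max:
        return math.inf
    return (b1 + b2 * d_ij ** a) * s_ma + b3 * s_ma


def information_gain(d_jo, b5, a_prime, d_max):
    """Information gain of a node at distance d_jo (> 0) from the target."""
    if d_jo > d_max:
        return 0.0
    return b5 * d_jo ** (-a_prime)


# Illustrative values only (not taken from the paper).
print(migration_cost(2.0, s_ma=10, b1=1.0, b2=0.5, b3=1.0, a=2, r_max=5))
print(information_gain(2.0, b5=8.0, a_prime=2, d_max=10))
```

A route planner for the agent could use these two quantities as edge cost and node reward respectively; how they are traded off is not specified in this section.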

VII. SIMULATION AND RESULTS 

The cluster head nodes are calculated with the improved algorithm based on the LEACH algorithm, and on this basis the routing optimization algorithm of the mobile agent is realized. The simulation is carried out with NS-2. The sensor nodes are randomly distributed in an area of 1000 × 1000, and the number of nodes varies from 10 to 1000. The energy consumption without the optimization algorithm is compared with that of the improved mobile agent algorithm; the simulation results are shown in Figure 4.

Figure 4. Comparison of energy consumption

Random application: ten test scenarios are used for each node scale, each scenario is tested 10 times, and the average T is taken. The simulation results are shown in Table I.

TABLE I. RESULTS OF PERFORMANCE COMPARISON

Random   Not optimization algorithm      Improved agent method
scene    Energy         Information      Energy         Information
         consumption                     consumption
1        545612         1078.33          216532         948.76
2        432125         1468.41          154334         1399.78
3        409563         960.12           126697         842.64
4        896521         321.46           301682         355.47
5        502184         1823.68          192360         1253.14
6        57620          1123.17          221302         987.17
7        496172         989.88           182780         1132.54
8        457890         1572.91          155076         1475.21
9        870475         309.24           270817         333.69
10       429761         1254.71          131721         982.06
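To read the table more easily, the relative energy saving of each scenario can be computed as 1 − (improved / not optimized). The short sketch below does this with the energy values exactly as printed in Table I; the variable and function names are illustrative, and the units are not stated in the source.

```python
# (scenario, energy without optimization, energy with improved agent method),
# copied from Table I as printed.
ROWS = [
    (1, 545612, 216532), (2, 432125, 154334), (3, 409563, 126697),
    (4, 896521, 301682), (5, 502184, 192360), (6, 57620, 221302),
    (7, 496172, 182780), (8, 457890, 155076), (9, 870475, 270817),
    (10, 429761, 131721),
]


def relative_saving(before, after):
    """Fraction of energy saved by the improved method (negative = worse)."""
    return 1 - after / before


savings = {scene: relative_saving(b, a) for scene, b, a in ROWS}
for scene, value in savings.items():
    print(f"scenario {scene:2d}: {value:+.1%}")
```

With the printed figures, every scenario except scenario 6 shows a saving of more than 60%; the scenario 6 entry for the non-optimized algorithm is an order of magnitude smaller than the others and yields a negative saving.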

VIII. CONCLUSION 

Building on the traditional sensor network algorithm, the generation of cluster head nodes is optimized through energy calculation and used by the improved routing algorithm. This paper first analyzes the advantages and disadvantages of the existing routing protocols. The classical LEACH (Low Energy Adaptive Clustering Hierarchy) protocol for hierarchical sensor networks is analyzed and discussed in detail, and a new energy-efficient routing protocol for WSN is then presented: the Multi-Hierarchical Algorithm based on Clustering (MHAC). Simulation results show that MHAC can balance the energy load and prolong the lifetime of the WSN.

The prominent feature of wireless sensor networks is that energy is limited. Current algorithms fail to pay attention to the energy of the nodes: when performing topology discovery, they consume much energy and cannot guarantee the connectivity of the network. The mobile agent-based topology discovery algorithm improves on the traditional algorithm in the energy aspect. The simulation of the algorithm shows that it saves energy compared with the traditional algorithm while retaining some features of the traditional algorithm and providing even more connectivity.


This work was supported by Project Y200804680 of 

the Research planning issues. 

REFERENCES 

[1] Shaojun Yang, Haoshan Shi, Rui Huang.Spatial-Temporal 

Information Integration Framework Based on Mobile- 

Agent in Wireless Sensor Networks.In Proc.of 16th 

International Conference on Computer 

Communication(ICCC2004), 2004, beijing, China;1096- 

1100, (ISIP:000228632800198) 

[2] Li N, Hou J C, Sha L. Design and analysis of an MST-based topology control algorithm[A]. In: Proceedings 12th Joint Conf on IEEE Computer and Communications Societies (INFOCOM 2003)[C]. San Francisco, 2003. 1702-1712

[3] I. S. Jacobs and C. P. Bean, “Fine particles, thin films and 

exchange anisotropy, ” in Magnetism, vol. III, G. T. Rado 

and H. Suhl, Eds. New York: Academic, 1963, pp. 271– 

350. 

[4] Tynan R, Marsh D, O'Knae D, O'Hare GMP. Agents for Wireless Sensor Network Power Management[A]. In: Proceedings of the 2005 International Conference on Parallel Processing Workshops[C]. June 2005. 413-418

[5] Chen Min, Kwon T, Choi Y. Data Dissemination based on Mobile Agent in Wireless Sensor Networks[A]. In: Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05)[C]. IEEE Computer Society, Sydney, Australia, November 2005. 2-3

[6] Kui Wu, Yong Gao, Fulu Li, Yang Xiao. Lightweight 

Deployment-Aware Scheduling for Wireless Sensor 

Networks[J]. Mobile Networks and Applications, 2005, 

10(6) 

[7] Zhang wenjuan, Zhu Xiangbin, Mobile Agent-based 

Clustering Data Fusion in WSN[J], Computer & Digital 

Engineering, March 2010. 

[8] Wang Jietai, Yang Shaojun, Yu Haixun, Application of 

Mobile Agent in Wireless Sensor Networks, Computer 

Engineering, March 2008 

[9] Xiao Qing, Jiao Jian, Application for artificial bee 

algorithm in migration of mobile agent[J], Application 

Research of Computers, June.2010. 

[10] Li Ming, Fan Gaojuan, An EIW-DSR Route Algorithm 

Based on the Energy Integrated Weight in Ad Hoc 

Networks[J], Computer Engineering & Science, November 

2010 

[11] Fdd Zhang Sheng, He Qingquan, Improved ant colony 

algorithm to solve mobile agent in wireless sensor 

networks[J], Application Research of Computers, 

November 2010. 

[12] LinGe Wang, Management Model Research of Wireless Sensor Network Based on Mobile Agent, Intelligent Computation Technology and Automation (ICICTA), 2010, Shenzhen, China

[13] FanGaoJuan, Management Model Research of Wireless 

Sensor Network Based on Mobile Agent, 2007, 5 



LinGe Wang, ZheJiang Province, China. Born in January 1979. He holds a Master's degree in computer technology and graduated from Fudan University. His research interests include network engineering, information security, and wireless sensor networks. He is a senior lecturer in the Department of Network, College of Software, Ningbo Dahongying University.

YueDou Qi, ZheJiang Province, China. Born in February 1964. He holds a Bachelor's degree in computer application and graduated from QiQiHaEr University. His research interests include data mining, complex networks, and business intelligence. He is a professor in the Department of Network, College of Software, Ningbo Dahongying University.


Covert Flow Graph Approach to Identifying 

Covert Channels 

XiangMei Song 

School of Computer Science and Telecommunication Engineering, 

Jiangsu University, Zhenjiang, 210013, China 

Email: jlsxm@ujs.edu.cn 

ShiGuang Ju 

School of Computer Science and Telecommunication Engineering, 

Jiangsu University, Zhenjiang, 210013, China 

Email: jushig@ujs.edu.cn 

Abstract—In this paper, the approach for identifying covert 

channels using a graph structure called Covert Flow Graph 

is introduced. Firstly, the construction of Covert Flow 

Graph which can offer information flows of the system for 

covert channel detection is proposed, and the search and 

judge algorithm used to identify covert channels in Covert 

Flow Graph is given. Secondly, an example file system 

analysis using Covert Flow Graph approach is provided, 

and the analysis result is compared with that of Shared 

Resource Matrix and Covert Flow Tree method. Finally, the 

comparison between Covert Flow Graph approach and 

other two methods is discussed. Different from previous 

methods, Covert Flow Graph approach provides a deep 

insight for system’s information flows, and gives an effective 

algorithm for covert channel identification. 

Index Terms—multilevel security, covert channels, 

information flows, covert flow graph, shared resource 

matrix, covert flow trees 


I. INTRODUCTION

Multilevel secure computer systems are used to protect

hierarchic information by enforcing both mandatory and 

discretionary access controls. They can restrict the flow 

of information through legitimate communication 

channels[1]. However, covert channels are usually 

beyond the scope of the security model. Covert channels 

usually signal information through system facilities not 

intended for data transfer. That is, the sending process 

alters some system attributes, and the receiving process 

monitors the alteration[2]. In order to decrease the threat 

of covert channels, several covert channel analysis 

techniques have been proposed and utilized in the past 

thirty years. Among these techniques are the Shared 

Resource Matrix methodology (SRM)[3], the Noninterference 

approach[4], the Information Flow 

analysis[5], the Covert Flow Trees methodology (CFT)[6], the Backward Tracking approach[7], and others.

Apr. 20, 2011.

project number: 60773049, 61003288, 20093227110005, BK2010192, 07JDG014, 08KJD520015

doi:10.4304/jnw.6.12.1740-1746

The SRM method is one of the most successful 

approaches for covert channel identification. The method 

starts from identifying shared resources. A shared 

resource is any object or collection of objects that may be 

referenced or modified by more than one subject. All 

identified shared resources are enumerated by a matrix 

structure, and then each resource is carefully examined to 

determine whether it can be used to transfer information 

from one subject to another covertly. The usage of the 

matrix structure makes the SRM method simple and 

intuitive. However, the shared resource matrix offers no help in constructing covert communication scenarios, and the amount of manual analysis work is enormous. Much research has been done to improve the SRM method. The CFT method is essentially a transformation of the SRM method. Furthermore, McHugh[8] made three extensions to the matrix structure, and Shen and Qing[9-10] optimized the SRM method.

The CFT method uses a tree structure instead of the shared resource matrix. Thanks to its tree structure, the CFT method can record flow paths and helps to construct covert communication scenarios. However, covert flow trees are usually huge, and their construction can fall into an infinite loop. To resolve this problem, a constraint parameter named REPEAT has to be introduced, which may cause some potential covert channels to be missed.

This paper presents a graph data structure that models system information flow from one shared resource attribute to another. This data structure is referred to as the Covert Flow Graph (CFG). Constructing a covert flow graph is easy, and the graph can include almost all information flows in a system. Searching for information flow paths yields operation sequences that support the analysis work of detecting covert channels. To demonstrate this technique, an example file system is analyzed, and the result is compared with those of other covert channel analysis methods.


II. THE COVERT FLOW GRAPH APPROACH 

The goal of the Covert Flow Graph approach is to identify operation sequences that support potential communication channels exploited by two users. The Covert Flow Graph is a directed graph. Its nodes describe the information flows from one or more shared resource attributes to another attribute within an operation, while its directed edges denote the dependency relationships between two operations that share the same attribute and generate information flows. Section II-A presents the graph notation and semantics used in the construction of the Covert Flow Graph. Section II-B explains how to construct a covert flow graph. Section II-C discusses the reason for pruning the Covert Flow Graph. Section II-D introduces an algorithm for searching information flow paths and judging potential covert channels.

A. Graph notation and semantics 

Let SA denote the collection of all shared resources (or shared attributes) in the system and OP denote the collection of all primitive operations. In particular, the set SA contains an attribute named output, whose value is returned by primitive operations. For op_i ∈ OP (1 ≤ i ≤ n, where n is the number of primitive operations), let SAR_i, SAM_i, SAO_i ⊂ SA, where SAR_i contains all attributes referenced by op_i, SAM_i contains all attributes modified by op_i, and SAO_i contains all attributes output by op_i.

The Covert Flow Graph is a directed graph. Fig. 1 shows the three kinds of nodes in Covert Flow Graphs. The triple <SAR_i, op_i, v> in Fig. 1(a), where v ∈ SAM_i, indicates that v is modified by op_i according to the values of the shared attributes in SAR_i. Another triple <v, op_i, output> in Fig. 1(b), where v ∈ SAO_i, indicates that v is returned by op_i. The OUTPUT node in Fig. 1(c) is the finish node, which appears only once in a covert flow graph. Its use will be discussed later.

Figure 1. Nodes used in Covert Flow Graphs 

Definition 1. Covert Flow Graph (CFG): CFG = <SA, SAO, OP, AR, OUTPUT, S, E>. SA is the set of all shared attributes. SAO ⊂ SA is the set of all attributes returned by primitive operations. OP is the set of all primitive operations. AR = {SAR_i | i = 1, ..., n}. S = S_1 ∪ S_2 ∪ S_3 is the node set, with S_1 ⊆ AR × OP × SA, S_2 ⊆ SAO × OP × {output} and S_3 = {OUTPUT}. E = E_1 ∪ E_2 ∪ E_3 is the edge set, with

$$E_1 = \{\langle s_i, s_j\rangle \mid s_i, s_j \in S_1 \wedge s_i = \langle SAR_i, op_i, v\rangle \wedge s_j = \langle SAR_j, op_j, v_j\rangle \wedge v \in SAR_j\},$$

$$E_2 = \{\langle s_i, s_j\rangle \mid s_i \in S_1 \wedge s_j \in S_2 \wedge s_i = \langle SAR_i, op_i, v\rangle \wedge s_j = \langle v, op_j, output\rangle \wedge v \in SAO_j\},$$

$$E_3 = \{\langle s_i, s_j\rangle \mid s_i \in S_2 \wedge s_j \in S_3\}.$$

A directed edge in E_1 connecting two nodes describes the dependency relationship between two operations, such as op_i and op_j: a shared attribute v is modified by op_i and then referenced by op_j. A directed edge in E_2 indicates that a shared attribute v is modified by one operation op_i and that its value is then returned by another operation op_j. A node of the form <v, op_i, output> must be connected to the finish node by a directed edge in E_3, which means that the value of the attribute v is returned by op_i.

B. Construction of Covert Flow Graph 

As in the CFT method, a reference list, a modify list and a return list are created for each primitive operation, and these lists are then used as input to construct the Covert Flow Graph. The information for creating the three lists can be obtained from the system's description, formal specification or implementation code, so Covert Flow Graphs are applicable to any phase of the system life cycle.

Constructing a covert flow graph has two main steps. The first step is to construct the nodes. For every v_ij ∈ SAM_i (1 ≤ j ≤ m_i, where m_i is the number of attributes modified by op_i), generate the triple <SAR_i, op_i, v_ij>; and for every v_ik ∈ SAO_i (1 ≤ k ≤ o_i, where o_i is the number of attributes whose values are returned by op_i), generate the triple <v_ik, op_i, output>. In addition, the finish node OUTPUT is generated. The second step is to generate the directed edges among the nodes. For all op_i, op_j ∈ OP (1 ≤ i, j ≤ n), if there is an attribute v that is modified by op_i and referenced or returned by op_j, then op_j is dependent on op_i with respect to v. In other words, for every two distinct operations op_i and op_j, if the first step has constructed the node <SAR_i, op_i, v_1> and either the node <SAR_j, op_j, v_2> (with v_1 ∈ SAR_j) or the node <v_1, op_j, output>, then a directed edge is generated from the former node to the latter one. Finally, every node <v, op_j, output> is connected to the finish node.

Fig. 2 illustrates an example graph. The operation lists used to build this example are the same as those used in [6] and are given in Table I.

According to the method above, for the operation op_4 in Table I, SAR_4 is {A} and SAM_4 is {B, C}, so the triples <{A}, op_4, B> and <{A}, op_4, C> are generated; SAO_4 is {A}, so <A, op_4, output> is created. The three nodes marked in gray in Fig. 2 are constructed in this way. The bold edge from <{D}, op_1, A> to <A, op_4, output> is generated because A is modified by op_1 and referenced by op_4. Since the value of A is returned by op_4, the node <A, op_4, output> is connected to the finish node.

Figure 2. Example CFG representing information flows

TABLE I. EXAMPLE OPERATION LISTS

                 Operation 1   Operation 2   Operation 3   Operation 4
Reference List   D             A             B             A
Modify List      A, B          B             Null          B, C
Return List      Null          Null          B             A

Attributes: A, B, C, D
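The two construction steps can be sketched in a few lines, using the operation lists of the four-operation example (reference D/A/B/A, modify {A, B}/{B}/none/{B, C}, return none/none/B/A). This is a minimal sketch under stated assumptions: nodes are represented as plain tuples, the names `OPS` and `build_cfg` are hypothetical, and edges are only generated between distinct operations, following the construction text rather than the more general set definition.

```python
OUTPUT = "OUTPUT"  # the single finish node

# op name -> (reference list, modify list, return list), from the example lists.
OPS = {
    "op1": ({"D"}, {"A", "B"}, set()),
    "op2": ({"A"}, {"B"}, set()),
    "op3": ({"B"}, set(), {"B"}),
    "op4": ({"A"}, {"B", "C"}, {"A"}),
}


def build_cfg(ops):
    """Return (nodes, edges) of the covert flow graph for the given lists."""
    # Step 1: one <SAR_i, op_i, v> node per modified attribute,
    # one <v, op_i, output> node per returned attribute.
    s1 = [(frozenset(ref), op, v)
          for op, (ref, mod, _) in ops.items() for v in sorted(mod)]
    s2 = [(v, op, "output")
          for op, (_, _, ret) in ops.items() for v in sorted(ret)]

    # Step 2: dependency edges between distinct operations.
    edges = set()
    for si in s1:
        _, op_i, v = si
        for sj in s1:  # v modified by op_i, then referenced by op_j
            if sj[1] != op_i and v in sj[0]:
                edges.add((si, sj))
        for sj in s2:  # v modified by op_i, then returned by op_j
            if sj[1] != op_i and sj[0] == v:
                edges.add((si, sj))
    for sj in s2:  # every return node leads to the finish node
        edges.add((sj, OUTPUT))
    return s1 + s2 + [OUTPUT], edges


nodes, edges = build_cfg(OPS)
print(len(nodes), "nodes,", len(edges), "edges")
```

The edge from `(frozenset({"D"}), "op1", "A")` to `("A", "op4", "output")` in the result corresponds to the bold edge described for the example graph.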

C. Pruning of Covert Flow Graph 

Once a covert flow graph has been constructed, it can be pruned before the analysis. Pruning is a two-step process. First, remove every node that has incoming edges but no outgoing edges, together with the edges connected to it; the finish node is the exception. The reason is that only those paths that end with the finish node in the covert flow graph may be potential covert storage channels. Second, identify and remove the starting nodes whose paths cannot occur in practice. A system often contains pairs of operations in which one operation, named the post-executed operation, must be executed after the other operation, named the pre-executed operation, and the consecutive executions of the two operations nullify each other's effects, as with Lock_File and Unlock_File¹. Almost no operation sequence starts with a post-executed operation under real running conditions. When analyzing the primitive operations of a system, such pairs of consecutive operations should be identified. In this step, if a starting node represents a post-executed operation, then remove the node and the edges leaving it.
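The two pruning steps can be sketched on a graph given as a node set and a set of (source, destination) edges. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the paper: dead-end removal is repeated until no dead end remains, and a node without any edges is treated as a dead end as well.

```python
def prune_dead_ends(nodes, edges, finish="OUTPUT"):
    """Step 1: drop nodes (other than the finish node) with no outgoing edge."""
    nodes, edges = set(nodes), set(edges)
    while True:
        with_out = {src for src, _ in edges}
        dead = {n for n in nodes if n != finish and n not in with_out}
        if not dead:
            return nodes, edges
        nodes -= dead
        edges = {(s, d) for s, d in edges if s not in dead and d not in dead}


def drop_post_executed_starts(nodes, edges, op_of, post_executed):
    """Step 2: drop starting nodes whose operation is a post-executed one."""
    nodes, edges = set(nodes), set(edges)
    with_in = {dst for _, dst in edges}
    starts = {n for n in nodes
              if n not in with_in and op_of(n) in post_executed}
    return nodes - starts, {(s, d) for s, d in edges if s not in starts}


# Toy graph: c and d cannot reach OUTPUT and are pruned.
toy_nodes = {"a", "b", "c", "d", "OUTPUT"}
toy_edges = {("a", "b"), ("b", "OUTPUT"), ("a", "c"), ("c", "d")}
pruned_nodes, pruned_edges = prune_dead_ends(toy_nodes, toy_edges)
print(sorted(pruned_nodes), sorted(pruned_edges))
```

In the second function, `op_of` maps a node to its operation name; for triple-shaped nodes it would simply pick the middle component.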

D. Search for information flow paths and identification 

of covert channels 

1 The operations Lock_File and Unlock_File come from the file system used in [3], which is described in the following section.

A pruned covert flow graph includes all information flow paths in the system, but not all of them are covert storage channels. The next task is to search for information flow paths and identify covert storage channels. According to the minimum criteria that a covert storage channel must satisfy [3], exploiting a covert channel to communicate between two users has three characteristics:

(1) The sending process (or user) must be able to 

modify some shared attribute’s value. 

(2) The receiving process must be able to detect the 

attribute change. Namely, the attribute’s value 

should be returned by the operation invoked by 

receiving process. 

(3) The security class of the sending process must be 

dominant or incomparable to that of the receiving 

process. 

These characteristics constrain the operation sequences of covert channels. Because the operation names are included in the nodes of covert flow graphs, the operation sequences can be acquired while searching for information flow paths in the graphs. From characteristics (1)-(3) above, the following rules for covert channel identification based on Covert Flow Graphs are obtained.

Regulation 1. If the start operation in an operation sequence for covert communication has a dependency, the operation sequence will not build a covert channel.

In a system, whether the reading or writing action of some operation can execute may be decided by the execution results of another operation. The former operation is then dependent on the latter one. According to characteristic (1), the sending process must be able to modify the shared attribute's value independently, so a dependent operation of this kind cannot be exploited by the sending process.

Regulation 2. If an operation sequence can build a covert channel, its corresponding information flow path must end with the finish node in the covert flow graph.

According to characteristic (2), the receiver finally has to invoke an operation that outputs the attribute's value. Because every node representing a returned attribute value must be connected to the finish node, Regulation 2 is valid.

Regulation 3. If an operation sequence can build a covert channel, the user authority needed by the start operation must not be dominated by that of the end operation in the operation sequence.

If Regulation 3 is not satisfied, the covert communication is a legal channel between the sender and the receiver, because the sender's security class is lower than or equal to the receiver's.

According to Regulation 2, only those paths that end with the finish node in the covert flow graph may be potential covert channels. The search method for information flow paths therefore consists of the following steps. First, obtain the converse digraph of the covert flow graph, named CFG^-1. In CFG^-1, the finish node is the only node without incoming edges. Second, use depth-first search to find all the paths that begin with the finish node. While searching, determine whether each path can be exploited as a covert channel; the judgment is based on Regulations 2 and 3. To avoid endless cycles while searching, a directed edge may appear only once in a path. The search and judge algorithm is as follows:

Algorithm 1: the search and judge algorithm

Procedure PathSearching(CFG^-1)
Input: CFG^-1
Output: PATH // information flow paths
Begin
    initial stack1 // stack1 is used to backtrack
    initial PATH
    T := Φ // T is the set of directed edges
    push(stack1, OUTPUT) // push the OUTPUT node into the stack
    while stack1 != NULL do
        pop(stack1, v)
        push(stack1, NULL) // NULL represents the null node
        flag := FALSE
        while (∃ v→w ∈ CFG^-1) ∧ (v→w ∉ T) do
            push(stack1, w)
            T := T ∪ {v→w}
            flag := TRUE
        od
        if flag = TRUE then
            push(stack2, (v, w))
        else
            JudgeCovertChannel()
            while v = NULL do
                pop(stack2, (v, w))
                T := T − {v→w}
            od
            push(stack1, v)
        fi
    od
End

Procedure JudgeCovertChannel()
Output: CC // potential covert channels
Begin
    i := 0
    j := 0
    while PATH != NULL do
        pop(PATH, (v, w))
        Array[i++] := w
        if the operation in w is independent
            then lab[j++] := i
    od
    v := Array[i-1]
    for k := 0 to j-1 do
        w := Array[lab[k]]
        if sl(w) ≮ sl(v) then // sl(x) represents the node's security class
            CC ← Array[lab[k]..i-1]
            output CC
        fi
    od
End


In Algorithm 1, the procedure PathSearching performs a depth-first search for paths in CFG^-1. To find all paths, a stack named stack1 is needed for backtracking; while backtracking, the number of steps is determined by the number of NULL nodes popped from stack1. stack2 is another stack, used to record the nodes in a path. During the search, whenever a directed edge is found, the pair of nodes corresponding to the edge is pushed onto stack2. Furthermore, a set T records the edges already used, so that every edge appears only once in a path and endless cycles are avoided. Once a path has been found, the procedure JudgeCovertChannel is invoked to judge whether covert channels exist in the path.
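The search-and-judge idea can be illustrated with a simplified recursive variant. This is not the two-stack procedure of Algorithm 1: it is a sketch that enumerates paths from the finish node in the reversed graph, using each directed edge at most once per path, and then filters sub-sequences by independence of the start operation and by security class. Nodes are reduced to operation names, security classes are modelled as integers (so incomparable classes are not represented), and all names and sample data are hypothetical.

```python
def edge_simple_paths(rev_adj, start="OUTPUT"):
    """Yield maximal paths from `start`, using each edge at most once per path."""
    def dfs(node, used, path):
        extended = False
        for nxt in rev_adj.get(node, ()):
            edge = (node, nxt)
            if edge not in used:
                extended = True
                yield from dfs(nxt, used | {edge}, path + [nxt])
        if not extended:
            yield path

    yield from dfs(start, frozenset(), [start])


def judge(path, independent, level):
    """Return candidate covert sequences (sender first, receiver last)."""
    ops = path[1:]  # drop OUTPUT; ops[0] is the receiver's returning operation
    if not ops:
        return []
    receiver = ops[0]
    channels = []
    for k, sender in enumerate(ops):
        # The start operation must be independent, and its security class
        # must not be lower than or equal to the receiver's.
        if independent[sender] and not level[sender] <= level[receiver]:
            channels.append(ops[k::-1])
    return channels


# Toy reversed graph: OUTPUT <- View <- Read <- Open (edges point backwards).
rev = {"OUTPUT": ["View"], "View": ["Read"], "Read": ["Open"], "Open": []}
independent = {"View": True, "Read": False, "Open": True}
level = {"View": 0, "Read": 0, "Open": 1}

for p in edge_simple_paths(rev):
    print(p, "->", judge(p, independent, level))
```

Any sequence reported this way is only a candidate; as in the paper, a concrete covert scenario still has to be constructed to confirm that it is a real channel.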

III. EXAMPLE FILE SYSTEM ANALYSES USING COVERT 

FLOW GRAPH 

This section presents the results of an example covert channel analysis using the CFG approach described in Section II. A brief description of the example system, taken from [6], is included to give an overview of the basic functions of the primitive operations and attributes. For a more detailed description of the system the reader is referred to [3], the paper from which the specification was taken. The operation description lists used in the construction of the covert flow graph are also taken from [6].

A. A brief description of the file system example 

The attributes of the system include six file attributes, three process attributes, and one attribute associated with the global state of the system: Current_Process. Current_Process contains the ID of the process currently running on the CPU. The file attributes of the system are File_ID, Locked, Locked_By, Value, Security_Class, and In_Use. The three process attributes are Process_ID, Access_Rights and Buffer. The operations are discussed in more detail in the following paragraphs.

The Write_File operation is used by a process to 

change the contents of a file. The file is locked by the 

current process. The value of the file is modified to 

contain the contents of the current process's buffer. 

The Read_File operation is used by a process to interrogate the contents of a file. If the current process is included in the in-use set for the specified file, the value of the file is copied to the current process's buffer.

The Lock_File operation is used by a process to 

modify the contents of file. A process must lock a file 

before modifying it and must unlock the file after the 

modification is complete. If the current process has write 

access for the specified file, if the file specified is 

unlocked, and if its in-use set is empty, then the file is 

locked, and its locked by attribute is set to the id of the 

current process. 

The Unlock_File operation makes a file accessible 

when a process is done modifying its contents. If the 

specified file's locked by attribute is the current process, 

the file is unlocked.


The Open_File operation is used by a process to initiate retrieval of the contents of a file. This primitive guarantees that no other process is modifying the contents of the file being interrogated. If the current process has read access for the specified file and the file is not locked, the current process's id is added to the in-use set for this file.

The Close_File operation is used when a process has 

completed interrogation of a file and wants to release it so 

that it can be modified. If the current process's id is an 

element of the in-use set for the specified file, then it is 

removed from that set. 

The File_Locked operation is used by a process to determine whether a file is locked. If the current process has write access for the specified file, then, if the file is locked, a value of true is returned. If the file is unlocked, the value false is returned. If the current process lacks write access for the specified file, the result is undefined.

The File_Opened operation is used by a process to 

determine whether a file is open for reading. If the current 

process has write access for the specified file, then, if the 

file's in-use set is nonempty, a value of true is returned. If 

it is empty the value false is returned. If the current 

process does not have write access for the specified file, 

the result is undefined. 

The View_Buf operation is introduced to explicitly 

state how a process is allowed to view its buffer attribute. 

The lists constructed from the operation descriptions 

are as in Table Ⅱ. 

TABLE II. FILE SYSTEM OPERATION DESCRIPTION LISTS

Operation     Reference List                                  Modify List         Return List
Write_File    Buffer, Current_Process, Locked_By, Locked      Value               Null
Read_File     Value, Current_Process, In_Use                  Buffer              Null
View_Buf      Buffer                                          Null                Buffer
Lock_File     Access_Rights, Locked, In_Use, Security_Class   Locked, Locked_By   Null
Unlock_File   Locked, Locked_By, Current_Process              Locked              Null
Open_File     Security_Class, Locked                          In_Use              Null
Close_File    In_Use                                          In_Use              Null
File_Opened   Security_Class, In_Use                          Null                In_Use
File_Locked   Security_Class, Locked                          Null                Locked

B. Example covert flow graph and scenario list for file 

system example 

Fig. 3 is the CFG constructed for the file system example. To make the analysis easier, the two nodes marked in dark grey are considered first.


The triple describes the information flows by executing the 

operation Close_File. The node is the only one that connects to the marked 

node, which describes the information flows by executing 

the operation Open_File, shown in Fig. 3. Because Open_File and Close_File are a pair of consecutive operations, they nullify each other's effects when both appear in an operation sequence, and can therefore be removed from the sequence. In the CFG,

the dotted edge from to 

should 

be deleted . This results in the node has no indegree. Because 

Close_File is post-executed operation, the node 

and 

edges from it can also be deleted in the CFG. 

Similarly, the Lock_File and Unlock_File are the pair 

of consecutive operations. But they can nullify each 

other’s effects only on the Locked attribute. So the dotted 

edge from 

to should be deleted, however the 

edge from to should not be 

deleted. 

In Fig. 3, the path drawn with bold black lines is one of the information flow paths found by Algorithm 1. Each subpath of this path that leads to the finish node may be a potential covert channel. Because Read_File, Write_File and Unlock_File depend on other operations, and the security class of Lock_File does not dominate that of View_Buf, the subpaths that start with these four operations cannot be used as covert channels. Only the subpaths starting with Open_File can be exploited. The corresponding operation sequences are shown in Fig. 4.

Figure 3. The covert flow graph of example system 

Figure 4. Potential covert communication sequences starting with 

Open_File


The two sequences in Fig 4 need further analysis, by constructing covert scenarios, to determine whether they are covert channels. The result is that only sequence (2) is a covert channel. In this covert channel, the user with the high security class can choose whether to invoke Open_File, while another user with a low security class can infer the former user’s action by invoking a series of operations.

Fig 5 shows the covert communication sequences found in the example system by the Covert Flow Graph method. The method finds the six covert channels that were reported using CFT in reference [6], shown as sequences (a)-(f) in Fig 5. In addition, sequence (g) presents a new covert channel which was not found by CFT. The corresponding covert scenario can be constructed as follows: the sender affects the receiver’s observation through whether or not it invokes Open_File. If the sender invokes Open_File to open a file, then the receiver cannot lock the same file. The following operation Open_File invoked by the receiver will be successful and File_Opened will return TRUE to the receiver. Otherwise, the receiver will get FALSE from File_Opened. Therefore, the receiver can detect whether the sender has opened the given file.

Table Ⅲ enumerates the covert channel analysis results for the above file system with the Shared Resource Matrix, Covert Flow Tree and Covert Flow Graph methods. With the SRM method, only the exploited shared resource attribute can be detected, whereas both the CFT and CFG approaches provide detailed covert communication sequences.

Figure 5. Potential covert communication sequences existing in 

example system 

TABLE III. CORRESPONDENCE BETWEEN CHANNEL ANALYSIS TECHNIQUES FOR THE FILE SYSTEM EXAMPLE

SRM | CFT | CFG
Covert channel using File_Locked to sense changes in Locked | Covert communication sequences A | Covert communication sequences A
Covert channel using Lock_File to sense changes in Locked | Covert communication sequences B | Covert communication sequences B
Covert channel using Lock_File to sense changes in In_Use | Covert communication sequences C, D, E | Covert communication sequences C, D, E, G
Covert channel using File_Opened to sense changes in In_Use | Covert communication sequences F | Covert communication sequences F

VI. COMPARISON AMONG SRM, CFT AND CFG 

The Shared Resource Matrix approach has worked well since it was introduced. Its major problem is that it cannot provide the operation sequences that help the analysis of covert channels. In contrast, the CFT approach can present the operation sequences in a tree structure. Compared with CFT, the CFG has two advantages:

(1) The CFG can provide almost complete information flows of a system in one graph, while the CFT has to construct a tree for every shared attribute that can be modified by operations. The tree structures are usually quite large. For example, the CFT representing the information flow via attribute In_Use for the file system example uses 136 nodes, whereas the CFG in Fig.3 uses only 11 nodes for all attributes that can be exploited for covert communication.

(2) The CFT construction algorithm depends on a parameter, called REPEAT, which bounds the construction of CFTs that would otherwise have infinite tree paths. The parameter defines the number of times any attribute may be repeated in an inference path, thus providing the analyst with a way to avoid CPU or memory exhaustion by controlling the depth of the CFT paths. However, an unsuitable value of REPEAT may result in missed covert channels. For example, when REPEAT is set to 0, scenarios D and E would not be discovered. The CFG avoids this problem.

Notwithstanding these advantages, CFG encounters 

problems similar to the CFT approach. In the CFG, 

pseudo communication paths still exist. One way to 

reduce pseudo communication paths is to consider a finer 

relationship between the referenced and modified 

attributes and to consider conditional modifies and 

references, on which our research group and others are 

working. 

V. CONCLUSION AND FUTURE WORK 

This paper introduces a technique for detecting covert channels. The approach uses covert flow graphs, which present the information flow paths and operation sequences. An algorithm for searching information flow paths and judging potential covert channels is introduced. To illustrate the approach, an example file system is analyzed and the result is compared to a previous channel analysis of the same system using the CFT approach. Compared with the SRM method, the Covert Flow Graph approach can provide operation sequences. At the same time, the Covert Flow Graph approach avoids the difficult problem that the CFT method has encountered.

In future work, other example systems should be analyzed with the Covert Flow Graph approach. The emphasis will be put on an automated tool for the construction of covert flow graphs.


This work was supported by the National Natural 

Science Foundation of China (Grant Nos. 60773049, 

61003288), the Ph.D. Programs Foundation of Ministry 

of Education of China (Grant Nos. 20093227110005), the 

Natural Science Foundation of Jiangsu Province (Grant


Nos. BK2010192), the People with Ability Foundation of 

Jiangsu University(Grant Nos.07JDG014), the 

Fundamental Research Project of the Natural Science in 

Colleges of Jiangsu Province (Grant Nos. 

08KJD520015). 

REFERENCES 

[1] D. E. Bell, L. J. LaPadula, “Secure Computer Systems: Unified Exposition and Multics Interpretation,” Mitre Corp., Bedford, MA, Tech. Rep. ESD_TR_75_306, 1975.

[2] R. A. Kemmerer, P. A. Porras, “Covert Flow Trees: a 

Visual Approach to Analyzing Covert Storage Channels,” 

IEEE Transactions on Software Engineering, vol.17, no. 

11, pp. 1166 – 1185, Nov. 1991. 

[3] R. A. Kemmerer, “Shared Resource Matrix Methodology: 

an Approach to Identifying Storage and Timing Channels,” 

ACM Transactions on Computer Systems, vol. 1, no. 3, pp. 

256-277, Aug. 1983. 

[4] J. Goguen, J. Meseguer, “Security Policies and Security 

Models.,” In: Proc. 1982 Symposium on Security and 

Privacy, pp. 11-20, IEEE Press, New York (1982). 

[5] D. E. Denning, “A Lattice Model of Secure Information 

Flow,” Communications of the ACM, vol. 19, no. 5, pp. 

236-243, May 1976. 

[6] P. A. Porras, R. A. Kemmerer, “Covert Flow Tree Analysis 

Approach to Covert Storage Channel Identification.,” 

Comput. Sci. Dept., Univ. California. Santa Barbara, Tech. 

Rep. No. TRCS 90-26, Dec 1990. 

[7] S.H. Qing, J.F. Zhu, “Covert Channel Analysis on ANSHENG Secure Operating System,” Journal of Software, vol. 15, no. 9, pp. 1385-1392, 2004.

[8] J. McHugh, “Handbook for the Computer Security 

Certification of Trusted Systems - Covert Channel 

Analysis.” Technical Report, Naval Research Laboratory, 

Feb 1996. 

[9] J.J. Shen, S.H. Qing, Q.N. Shen, L.P. Li, “Covert Channel 

Identification Founded on Information Flow Analysis,” 

Lecture Notes in Computer Science, Vol. 3802, pp. 381- 

387, 2005. 

[10] J.J. Shen, S.H. Qing, Q.N. Shen, L.P. Li, “Optimization of 

covert channel identification, ” In: Proceeding of the Third 

IEEE International Security in storage workshop 

(SISW’05), 13 Dec 2005. 

[11] J. Zeng, S.G. Ju, X.M. Song, “Construct Information Flow 

Graph Based on PDG,” Computer Science and 

Computational Technology, Vol. 1, pp. 756-759, 20-22 

Dec. 2008. 

[12] Y.J. Wang, J.Z. WU, H.T. Zeng, L.P. DING, X.F. LIAO, 

“Covert Channel Research,” Journal of Software, Vol. 21, 

No. 9, pp.2262-2288, Sep 2010. 


XiangMei Song, JiLin Province, China. Birthdate: Nov. 1979. She is a Computer Science doctoral student in the School of Computer Science and Telecommunication Engineering, Jiangsu University.

She is a senior lecturer in the Dept. of Information Security, School of Computer Science and Telecommunication Engineering, Jiangsu University.

ShiGuang Ju, JiangSu Province, China. Birthdate: May 1955. He holds a Ph.D. in Computer Science from the National Polytechnic Institute (Mexico). His research interests are information security and databases.

He is a professor and Ph.D. supervisor in the School of Computer Science and Telecommunication Engineering, Jiangsu University.


A Novel HAVE Message of Peer-to-peer 

Protocol in BitTorrent Systems 

Jianyong Li 

School of Computer & Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, China 

Email: lijianyong@zzuli.edu.cn 

Jianchun Li, Daoying Huang and Qiang Wei 

School of Computer & Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, China 

Email: lijianchun@zzuli.edu.cn, dyhuang@zzuli.edu.cn, weiqiang200456@163.com 

Abstract—In BitTorrent systems, there are eleven types of 

messages for data communication between peers, among which the HAVE, REQUEST and PIECE messages account for most of the traffic in terms of quantity and flow. In order to improve the efficiency of network transmission and decrease the management costs of file delivery, this paper investigates the mechanism of the HAVE message in BitTorrent systems and proposes a novel MultiHAVE message scheme, which aggregates several HAVE messages under the control of a suitably set timer. Experiment results show that in an environment of high bandwidth and consistent peers, with the assistance of the timer, the flow ratio of MultiHAVE message to HAVE message can be reduced to 11%, so the MultiHAVE message can decrease the flow of messages and prevent the HAVE message storm efficiently. Furthermore, the MultiHAVE message can adapt itself to BT systems with various bandwidths. If the actions of network peers are inconsistent, it degenerates to the original HAVE message and preserves the high performance of BitTorrent systems.

Index Terms—BitTorrent, protocol, Peer-to-peer networks, 

MultiHAVE message, performance analysis 

I. INTRODUCTION


BitTorrent (BT) is a Peer-to-peer (P2P) protocol designed to distribute and replicate data quickly, efficiently and fairly [1-2]. Its technological principle is similar to that of other P2P downloading software. In a BT system, each peer is a client as well as a server, so the more peers download a file, the faster the download becomes. Numerous practical results have verified the flexibility, efficiency and reliability of BT systems [3].

However, the wide use of BT systems may result in message storms and decrease the communication efficiency. Recent studies showed that the proportion of P2P traffic on the backbone links has increased from 10% to 80% [4-6], and that BitTorrent traffic increased from 26% to 52% of the total P2P traffic during the first half of 2004 and even amounted to 60% in 2005, according to the report of CacheLogic [4]. Due to the extensive use of BT systems and the congestion of local networks, many ISPs began to restrict the use of BT systems. However, some of the original file-distributing services based on central servers need the support of BT systems.

doi:10.4304/jnw.6.12.1747-1753

In order to improve the performance of BT systems, much research has been carried out to modify the existing BitTorrent mechanisms. Qureshi [7] suggested the use of proximity in the BitTorrent overlay network, so that peers that are close by in the real world are also close by in the overlay network. Bindal et al. [8] proposed a new algorithm based on biased neighbor selection for the cross-ISP problem. In [9], Yamazaki et al. put forward the so-called Cost-Aware BitTorrent strategies to reduce ISP costs. To improve the piece exchange mechanism, Garbacki et al. [10] proposed a protocol named 2Fast, which extended the bartering model of BitTorrent, and Garbacki et al. [11] extended it further by proposing a novel mechanism in which incentives are built around bandwidth rather than content. Noting that a free-rider is a node that downloads pieces from other peers but does not upload any pieces to others, Sirivianos et al. [12] presented a new free-riding technique named the large view exploit and suggested a modification to the BitTorrent tracker and clients to address the problem.

In this paper, we investigate the performance enhancement of BitTorrent systems by reducing the management costs. In BitTorrent systems, there are eleven types of messages for data communication between peers, and the management costs depend mainly on the HAVE, REQUEST and PIECE messages. In some specific applications, management costs even reach 23% [13]. A HAVE message is sent once the peer has received an entire piece and verified it against the corresponding hash value in the torrent file. The purpose of the message is to inform all the connected peers so that they can update the downloaded-piece information that was announced by the BITFIELD message in the HANDSHAKE stage. In a BT system, the number of peers that the tracker returns can reach up to 50 because numerous peers join the system. Correspondingly, the number and flow of HAVE messages increase quickly and may result in a HAVE message storm.

In fact, sending HAVE messages to all peers at a high frequency does not improve other peers’ downloading rate. Reducing the frequency of HAVE messages can therefore not only relieve the burden on peers of receiving and sending HAVE messages but also reduce the network bandwidth costs. With this in mind, we propose in this paper a novel HAVE message mechanism, the MultiHAVE message, to improve the efficiency of network transmission and decrease the management costs of file delivery. The proposed MultiHAVE message is composed of several HAVE messages aggregated under the control of a suitably set timer. The regular sending scheme of the MultiHAVE message is analyzed as well. In order to show the effectiveness of the proposed mechanism, we compare the performance of the MultiHAVE message and the HAVE message. Experiment results show that in an environment of high bandwidth and consistent peers, the flow ratio of MultiHAVE message to HAVE message is reduced to 11%. The proposed MultiHAVE message can thus effectively decrease the amount of HAVE messages and reduce the management costs of the BT system.

The rest of this paper is organized as follows. In Section II, we propose the structure of the MultiHAVE message and describe its regular sending scheme. In Section III, we compare the performance of the MultiHAVE message and the HAVE message. Experiment results are given in Section IV to verify the efficiency of the proposed scheme. Section V summarizes the paper and draws the conclusion.

II. STRUCTURE OF MULTIHAVE MESSAGE AND REGULAR 

SENDING SCHEME OF MULTIHAVE MESSAGE 

A. Structure of MultiHAVE Message 

The purpose of the HAVE message is to inform all the connected peers so that they can update the downloaded-piece information that was announced by the BITFIELD message in the HANDSHAKE stage. Sending HAVE messages to all peers at a high frequency cannot improve other peers’ downloading rate. In this subsection we propose a new HAVE message mechanism, which aggregates several HAVE messages under the control of a suitably set timer.

Noting that, apart from the HANDSHAKE and KEEP ALIVE messages, the other 9 messages have a 4B message prefix and a 1B message ID, the structure of the MultiHAVE message can be formulated as follows:

(1) 4B message prefix. The message prefix gives the size in bytes of the message ID plus the payload of the MultiHAVE message. Its value is n × 4 + 1, where n is the number of piece indices in the payload.

(2) 1B message ID. The largest message ID in the current BT system is 8. Here the value is declared as 9.

(3) Payload. The length is n × 4 B, where n is the number of pieces. Each 4B represents the index of a piece.

The comparison between MultiHAVE message and 

HAVE message is shown in TABLE I. 
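
As an illustration of this layout, the following Python sketch packs and unpacks a MultiHAVE message. It is a minimal sketch rather than the authors' implementation: the function names are hypothetical, and big-endian (network) byte order is assumed because the BitTorrent wire protocol uses it for its integers, although the text above does not state it.

```python
import struct

HAVE_ID = 4        # message ID of the standard HAVE message
MULTIHAVE_ID = 9   # message ID proposed in the text for MultiHAVE


def encode_have(piece_index):
    """Standard HAVE: 4B prefix (value 5), 1B ID, 4B piece index."""
    return struct.pack(">IBI", 5, HAVE_ID, piece_index)


def encode_multihave(piece_indices):
    """MultiHAVE: 4B prefix (n*4 + 1), 1B ID, then n 4B piece indices."""
    n = len(piece_indices)
    return struct.pack(f">IB{n}I", 4 * n + 1, MULTIHAVE_ID, *piece_indices)


def decode_multihave(message):
    """Return the list of piece indices carried by a MultiHAVE message."""
    length, msg_id = struct.unpack_from(">IB", message, 0)
    if msg_id != MULTIHAVE_ID:
        raise ValueError("not a MultiHAVE message")
    n = (length - 1) // 4
    return list(struct.unpack_from(f">{n}I", message, 5))


if __name__ == "__main__":
    msg = encode_multihave([3, 17, 42])
    print(len(msg), decode_multihave(msg))
```

With a single piece index the MultiHAVE message has the same length as a HAVE message (9 bytes before TCP/IP headers) and differs only in the ID byte.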

B. Regular Sending Scheme of MultiHAVE Message 

The purpose of sending HAVE message is to notify other peers of the local peer’s downloaded piece state. It could also update the downloaded piece information which was notified by the BITFIELD message in HANDSHAKE stage.

TABLE I.
THE COMPARISON OF MULTIHAVE MESSAGE AND HAVE MESSAGE

Message name | Length prefix | Message ID | Payload
HAVE | 0005 | 4 | Integer/4B
MultiHAVE | Payload + 1 | 9 | Variable length

Conventionally, when a peer receives a piece, a HAVE message is sent to tell all the connected peers that it has the piece. As the number of connected peers increases, the number of HAVE messages grows as fast as $O(n^2)$, where n is the number of connected peers. In particular, in a high-bandwidth network environment, a high-speed peer may receive hundreds of MB of data in a choke conversion cycle (10s) or an optimistic unchoking cycle (30s). With a typical piece size of 256KB, the data-receiving peer will send 400 or 1200 HAVE messages to each of its connected peers in 10s or 30s. If the default number of connections is 50, the peer would send a total of 20000 or 60000 HAVE messages, 2000 per second on average. Obviously, in this circumstance, a serious HAVE message storm will appear at the receiving peer. The above calculation counts only the HAVE messages that one receiving peer sends. In fact, every peer behaves similarly because the relationship between them is symmetrical. If each peer sends and receives equal amounts of data in a period, the entire bandwidth is shared by uploading and downloading. The data that each peer receives is then reduced by half, and the frequency at which a peer sends HAVE messages is consequently reduced to 1000 per second. It should be noted that, due to the symmetry of peer actions (called peer action consistency), in this period each peer also receives a total of 1000 HAVE messages per second from the other 50 peers, which amounts to 51000 HAVE messages per second among 51 peers. Clearly, such high-density HAVE message transmission will seriously affect the performance of the entire network.
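
The arithmetic in this paragraph can be reproduced with a short Python sketch. The helper name is hypothetical, and the 100MB and 300MB volumes are assumptions chosen to match the 400 and 1200 pieces quoted above (400 × 256KB = 100MB).

```python
PIECE_SIZE = 256 * 1024   # typical piece size from the text, in bytes
CONNECTIONS = 50          # default number of connected peers from the text
MB = 1024 * 1024


def have_rate(received_bytes, seconds, connections=CONNECTIONS, piece_size=PIECE_SIZE):
    """Return (pieces received, HAVE messages sent, HAVE messages per second)."""
    pieces = received_bytes // piece_size
    total = pieces * connections   # one HAVE per piece to every connected peer
    return pieces, total, total / seconds


if __name__ == "__main__":
    print(have_rate(100 * MB, 10))   # choke cycle: 100MB assumed in 10s
    print(have_rate(300 * MB, 30))   # optimistic unchoke cycle: 300MB assumed in 30s
    # With bandwidth split evenly between upload and download, each peer
    # sends half as many HAVE messages; 51 peers together then send:
    per_peer = have_rate(50 * MB, 10)[2]
    print(per_peer, 51 * per_peer)
```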

When a small number of low-bandwidth peers and a large number of high-bandwidth peers coexist in a BT system, the high-bandwidth peers may send a mass of HAVE messages in a period. Low-bandwidth peers must receive and handle every one of these HAVE messages. The large number of HAVE messages will occupy their valuable bandwidth and block the PIECE messages, which carry the real data. In serious cases, the low-bandwidth peers may not download any data for a long time. In other words, in a network that large numbers of high-bandwidth peers are constantly joining, the low-bandwidth peers are likely to be hit by a HAVE message storm.

In order to avoid forming a new MultiHAVE message storm, the frequency of sending MultiHAVE messages should be taken into consideration when deciding the payload of the MultiHAVE message. In practice, it can be managed by a timer. When the timer times out, the peer aggregates all the HAVE messages produced by the newly received pieces, composes them into one MultiHAVE message and sends it. At the same time the timer starts the next round of timing.

Unlike the 10s choking algorithm cycle and the 30s optimistic unchoking cycle, the MultiHAVE cycle should not be long. If a long MultiHAVE regular cycle (such as 30s) is chosen, two connected high-bandwidth peers may send NOT INTERESTED messages and choke each other, because they may not learn of each other’s new pieces in time within the 10s cycle. A low-bandwidth peer (56k modem), with a piece size of 256KB, cannot obtain a complete piece within one cycle; once a complete piece has been obtained, the timer times out and a MultiHAVE message is sent. Because a piece takes longer to download than the interval, the MultiHAVE messages that the low-bandwidth peer sends always carry only one piece index as payload. In that case the MultiHAVE message reduces to the original HAVE message and does not affect the peer’s downloading performance.

In summary, the principles for choosing the timer value are as follows:

(1) It should prevent a new MultiHAVE message storm;

(2) It must not exceed the choking algorithm cycle.

Based on the above two principles, an interval of 5s (less than 10s) is chosen for the MultiHAVE scheme.
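
The timer-driven aggregation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' client code: the class and method names are hypothetical, time is passed in explicitly instead of using a real timer so the behaviour is easy to follow, and sending nothing when no piece arrived during an interval is an assumption of this sketch.

```python
import struct

MULTIHAVE_ID = 9  # message ID proposed in the text


class MultiHaveAggregator:
    """Collects verified piece indices and emits one MultiHAVE per interval."""

    def __init__(self, interval=5.0, start=0.0):
        self.interval = interval            # 5s, as chosen in the text
        self.deadline = start + interval
        self.pending = []

    def on_piece_verified(self, piece_index):
        """Queue a piece index instead of sending a HAVE immediately."""
        self.pending.append(piece_index)

    def tick(self, now):
        """Return a MultiHAVE message if the timer expired, else None."""
        if now < self.deadline:
            return None
        self.deadline = now + self.interval   # start the next timing round
        if not self.pending:
            return None                       # assumption: nothing to announce
        indices, self.pending = self.pending, []
        n = len(indices)
        return struct.pack(f">IB{n}I", 4 * n + 1, MULTIHAVE_ID, *indices)


if __name__ == "__main__":
    agg = MultiHaveAggregator()
    agg.on_piece_verified(7)
    agg.on_piece_verified(8)
    print(agg.tick(4.9))        # timer not yet expired
    print(agg.tick(5.0).hex())  # one message carrying both indices
```

A low-bandwidth peer that completes at most one piece per interval would emit messages with a single index, which matches the degeneration to the original HAVE message described above.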

III. PERFORMANCE COMPARISON OF MULTIHAVE 

MESSAGE AND HAVE MESSAGE 

In this section, we calculate the frequency and flow of the MultiHAVE and HAVE messages and compare their performance.

First, let n be the number of peers connected with peer A. The frequency of MultiHAVE messages sent by peer A is

$$f_{MS}(n) = \frac{n}{T_I}, \tag{1}$$

where $T_I$ is the timer interval of the MultiHAVE message. The frequency of MultiHAVE messages received by peer A can be formulated as

$$f_{MR}(n) = \sum_{i=1}^{n} F_{M_i}, \tag{2}$$

where $F_{M_i}$ is the frequency at which the i-th peer connected with peer A sends MultiHAVE messages to peer A. The flow of MultiHAVE messages sent by peer A is

$$fl_{MS}(n) = \frac{4 B_d n}{S_p} + \frac{n}{T_I}(O_P + 5), \tag{3}$$

with $B_d$ being the download bandwidth of peer A, $S_p$ being the size of the piece, and $O_P$ being the size of the TCP/IP header, which is 40B.

The flow of MultiHAVE message received by peer A is

$$fl_{MR}(n) = \sum_{i=1}^{n} FL_{M_i}, \tag{4}$$

where $FL_{M_i}$ is the flow of the i-th peer connecting with the peer A and sending MultiHAVE message to peer A.

Similarly, the frequency of HAVE messages sent by peer A is

$$f_{HS}(n) = \frac{B_d n}{S_p}. \tag{5}$$

The frequency of HAVE messages received by peer A is

$$f_{HR}(n) = \sum_{i=1}^{n} F_{H_i}, \tag{6}$$

where $F_{H_i}$ is the frequency at which the i-th peer connected with peer A sends HAVE messages to peer A.

The flow of HAVE message sent by peer A is

$$fl_{HS}(n) = f_{HS}(n) \times (4 + O_P + 5). \tag{7}$$

The flow of HAVE message received by peer A can be presented as

$$fl_{HR}(n) = \sum_{i=1}^{n} FL_{H_i}, \tag{8}$$

where $FL_{H_i}$ is the flow of the i-th peer connecting with the peer A and sending HAVE message to it.

Supposing that the peer actions be consistent, we have

$$f_{MS}(n) = f_{MR}(n) = \frac{n}{T_I}, \tag{9}$$

$$fl_{MS}(n) = fl_{MR}(n) = \frac{4 B_d n}{S_p} + \frac{n}{T_I}(O_P + 5), \tag{10}$$

$$f_{HS}(n) = f_{HR}(n) = \frac{B_d n}{S_p}, \tag{11}$$

$$fl_{HS}(n) = fl_{HR}(n) = f_{HS}(n) \times (4 + O_P + 5). \tag{12}$$

The ratio of the frequency of sending HAVE message to the frequency of sending MultiHAVE message is as follows:

$$\frac{f_{HS}(n)}{f_{MS}(n)} = \frac{B_d n / S_p}{n / T_I} = \frac{B_d T_I}{S_p}. \tag{13}$$

The ratio of the flow of sending HAVE message to the flow of sending MultiHAVE message is as follows:

$$\frac{fl_{HS}(n)}{fl_{MS}(n)} = \frac{f_{HS}(n) \times (4 + O_P + 5)}{\frac{n}{T_I}\left(\frac{4 T_I B_d}{S_p} + O_P + 5\right)} = \frac{\frac{B_d n}{S_p} \times (O_P + 9)}{\frac{n}{T_I}\left(\frac{4 T_I B_d}{S_p} + O_P + 5\right)} = \frac{T_I B_d (O_P + 9)}{4 T_I B_d + S_p (O_P + 5)}. \tag{14}$$

According to (13) and (14), if the peers’ actions are consistent, the improvement of the MultiHAVE message over the HAVE message depends on the download bandwidth $B_d$, the piece size $S_p$ and the timer interval $T_I$ of the MultiHAVE message. Once these parameters are fixed, the ratio of the frequency of sending HAVE messages to the frequency of sending MultiHAVE messages is constant, that is, independent of n. The ratio of the flow of sending HAVE messages to the flow of sending MultiHAVE messages is constant, too.

For example, suppose each peer has a maximum upload and download speed of 5MB/s, the piece size is 256KB, the timer interval is 5s and the downloaded files are big enough. Further suppose that the peers’ actions are consistent and ignore the seed peers. Then the frequency of sending and receiving MultiHAVE and HAVE messages and the flow of MultiHAVE and HAVE messages are as shown in Figure 1 and Figure 2, respectively.

As can be seen from Figure 1 and Figure 2, when the download bandwidth $B_d$ = 5MB/s, the piece size $S_p$ = 256KB and the MultiHAVE timer interval $T_I$ = 5s, the ratio of the frequency of sending HAVE messages to that of sending MultiHAVE messages is 100, and the ratio of the flow of sending HAVE messages to that of sending MultiHAVE messages is 11.01. Both the frequency and the flow of MultiHAVE messages are much lower than those of HAVE messages. The improvement in frequency is mainly due to the MultiHAVE message sending the payloads of HAVE messages in aggregate, and the improvement in flow comes from avoiding the 40B TCP/IP header overhead that is repeated for every HAVE message.
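
The two ratios quoted above can be checked numerically. The sketch below evaluates the closed forms of the frequency ratio (13) and the flow ratio (14), together with the per-message flows (3) and (7); the function names are hypothetical, and 1MB = 1024KB is assumed for the unit conversion.

```python
KB = 1024
MB = 1024 * KB
O_P = 40  # TCP/IP header size in bytes, from the text


def frequency_ratio(b_d, s_p, t_i):
    """Equation (13): f_HS / f_MS = B_d * T_I / S_p."""
    return b_d * t_i / s_p


def flow_ratio(b_d, s_p, t_i, o_p=O_P):
    """Equation (14): T_I*B_d*(O_P+9) / (4*T_I*B_d + S_p*(O_P+5))."""
    return t_i * b_d * (o_p + 9) / (4 * t_i * b_d + s_p * (o_p + 5))


def flow_have(n, b_d, s_p, o_p=O_P):
    """Equation (7): HAVE flow sent by a peer with n connections."""
    return (b_d * n / s_p) * (4 + o_p + 5)


def flow_multihave(n, b_d, s_p, t_i, o_p=O_P):
    """Equation (3): MultiHAVE flow sent by a peer with n connections."""
    return 4 * b_d * n / s_p + (n / t_i) * (o_p + 5)


if __name__ == "__main__":
    b_d, s_p, t_i = 5 * MB, 256 * KB, 5
    print(frequency_ratio(b_d, s_p, t_i))        # 100.0
    print(round(flow_ratio(b_d, s_p, t_i), 2))   # 11.01
    # The ratio of the two flows does not depend on n:
    for n in (6, 15):
        print(n, round(flow_have(n, b_d, s_p) / flow_multihave(n, b_d, s_p, t_i), 2))
```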

It needs to be pointed out that these conclusions are based on the assumption that the peers are high-bandwidth peers and that their actions are consistent. In an actual network environment, peers often have different bandwidths, that is, high-bandwidth peers and low-bandwidth peers coexist, and the time at which each peer joins the BT system also differs. In such cases, the peers lose coherence and show diversification. For high-bandwidth peers, because each peer joins the BT system at a different time, the download may not run at full rate. This reduces the payload of the MultiHAVE message, so the ratio of the frequency of sending HAVE messages to that of MultiHAVE messages drops considerably, and the ratio of the flow drops as well. When the two ratios are reduced to 1, the MultiHAVE message reverts to the HAVE message. In addition, for low-bandwidth peers, the time needed to download a piece is often longer than the timer interval of the MultiHAVE message, so the MultiHAVE message also reverts to the HAVE message. But whatever the circumstances, the HAVE message storm in the BT system will be prevented. In fact, with the continuous improvement of the network environment, more and more peers will have high bandwidth, so the MultiHAVE message scheme will play an increasingly effective role.

Figure 1. Frequency of sending and receiving (FSR) of MultiHAVE and HAVE message (vertical axis: FSR in times/s, logarithmic scale; horizontal axis: n from 6 to 15; series: MultiHAVE message, HAVE message)

Figure 2. Flow of MultiHAVE and HAVE message (vertical axis: Flow in B/s, logarithmic scale; horizontal axis: n from 6 to 15; series: MultiHAVE message, HAVE message)

IV. EXPERIMENT 

In this section, experiments are carried out to illustrate the effectiveness of the proposed MultiHAVE message scheme. For the experiment parameters we refer to the first BitTorrent client developed by Bram Cohen, the inventor of the protocol [2]. The main parameters and their default values are as follows:

(1) The maximum upload rate, no limitation;


(2) The minimum number of peers in the peer set before requesting more peers from the tracker, default to be 20;

(3) The maximum number of connections the local 

peer can initiate, default to be 50; 

(4) The maximum number of peers in the peer set, 

default to be 80; 

(5) The number of peers in the active peer set 

including the optimistic unchokes, default to be 4; 

(6) The block size, set to be 2MB; 

(7) The number of pieces downloaded before 

switching from random to rarest first piece selection, 

default to be 4. 

In addition, the downloading file size is 2.15GB, the 

Torrent file is 43.1KB and the downloading file is 

divided into 2205 pieces. 

The experimental evaluation of the BitTorrent protocol is very complex, and each experiment is not reproducible, as it heavily depends on the behavior of peers, the number of seeds and leechers in the torrent, and the subset of peers randomly returned by the tracker. However, by choosing a large variety of peers and designing the experiment process deliberately, we can identify the fundamental behaviors of the BitTorrent protocol.

During the experiment, ten kinds of messages are exchanged between the peers of the BT system. All the messages are carried over TCP. The size of each message is given including the TCP/IP header overhead of 40B. The details of each message are shown in TABLE II.

TABLE II.
THE COMPARISON OF MULTIHAVE MESSAGE AND HAVE MESSAGE

Message name | Message size/B | Function
HANDSHAKE | 108 | Initiate a connection
CHOKE | 45 | Choke the remote peer
UNCHOKE | 45 | Unchoke the remote peer
INTERESTED | 45 | Interested in the remote peer
NOT INTERESTED | 45 | Not interested in the remote peer
HAVE | 49 | Announce to each remote peer when the local peer has received a new piece
BITFIELD | ⌈Number_piece / 8⌉ + 45 | Notify the remote peer of the pieces the local peer already has
REQUEST | 47 | Request data from the remote peer
PIECE | Length_piece + 53 | Send data to the remote peer
CANCEL | 47 | Cancel request message
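
The sizes in TABLE II follow from the message layout plus the 40B TCP/IP header, and the same accounting gives the on-the-wire size of a MultiHAVE message. The Python sketch below is illustrative only; the function names are hypothetical, and the MultiHAVE size is derived from the structure given in Section II rather than taken from the table.

```python
import math

TCP_IP_HEADER = 40  # bytes, as stated in the text


def have_size():
    """4B prefix + 1B ID + 4B piece index + TCP/IP header."""
    return 4 + 1 + 4 + TCP_IP_HEADER


def bitfield_size(num_pieces):
    """One bit per piece, rounded up to whole bytes, plus 45B as in TABLE II."""
    return math.ceil(num_pieces / 8) + 45


def multihave_size(n):
    """4B prefix + 1B ID + n 4B piece indices + TCP/IP header."""
    return 4 + 1 + 4 * n + TCP_IP_HEADER


if __name__ == "__main__":
    print(have_size())                            # matches the HAVE row
    print(bitfield_size(2205))                    # the experiment file has 2205 pieces
    print(100 * have_size(), multihave_size(100)) # 100 HAVEs vs one MultiHAVE
```

Announcing 100 pieces costs 4900B as separate HAVE messages but 445B as a single MultiHAVE message, while a MultiHAVE message carrying one index is exactly as large as a HAVE message.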

The BT systems adopted in the experiment consist of 1 seed and 5 downloaders, 1 seed and 10 downloaders, and 1 seed and 20 downloaders, each run with the classical HAVE message and with the MultiHAVE message proposed in this paper. Experiment results are shown in Figure 3~Figure 5, where the “Before Extension” and “After Extension” columns describe the message flow of the BT system with the conventional HAVE message and with the MultiHAVE message, respectively.

Figure 3. Bytes per Type of Messages in 1 seed and 5 downloader (vertical axis: bytes, logarithmic scale from 1E+00 to 1E+10; horizontal axis: message types HS, C, UC, I, NI, H, BF, R, P, CA; series: After Extension, Before Extension)

It can be seen that in each case, although the flow of NOT INTERESTED, BITFIELD and other messages changes little in the BT systems with the proposed MultiHAVE message, the flow of HAVE messages is reduced by approximately 89%. Furthermore, the flow of BITFIELD messages is reduced by about half compared with that of the BT systems with the original HAVE message. So the proposed MultiHAVE message scheme can reduce the total message amount in the BT systems and hence decrease the management costs.

It should be pointed out that the above experiments are carried out in BT systems with high bandwidth and consistent peers, and that timers are used to assemble the MultiHAVE messages. In a real network environment, the peers often have various bandwidths, that is, high-bandwidth peers and low-bandwidth peers coexist in the system. Furthermore, we cannot demand that all the peers in the network join the BT system at the same time. In practice, they join the system stochastically. In such cases, the peers lose coherence. For high-bandwidth peers, because each peer joins the BT system at a different time, the download may not run at full rate. This reduces the payload of the MultiHAVE message, so the ratio of the frequency of sending HAVE messages to that of MultiHAVE messages drops considerably, and the ratio of the flow drops as well. When the two ratios are reduced to 1, the MultiHAVE message degenerates to the original HAVE message. Furthermore, for low-bandwidth peers, the time needed to download each piece is often longer than the timer interval of the MultiHAVE message, so the MultiHAVE message also degenerates to the HAVE message. So whatever the circumstances are, the HAVE message storm in the BT system will be considerably mitigated.

In fact, with the continuous improvement of the network environment, more and more peers will have high bandwidth, so the MultiHAVE message scheme can work effectively to prevent the HAVE message storm in BT systems.

V. CONCLUSION AND FURTHER WORK

In this paper we propose a novel HAVE message scheme, the MultiHAVE message, to prevent the possible message storm in BT systems. A MultiHAVE message aggregates several HAVE messages under the control of a suitably set timer. By adjusting the timer interval, we can change the size of the MultiHAVE message. We compare the performance of the proposed MultiHAVE message and the conventional HAVE message to illustrate the effectiveness of the MultiHAVE message. Experiments on BT systems with high-bandwidth, consistent peers show that the proposed MultiHAVE message scheme can significantly reduce the flow of HAVE messages, thus reducing the management costs in the BT system and effectively preventing the HAVE message storm. When the actions of network peers are diverse, or for low-bandwidth peers, the MultiHAVE message degenerates to the original HAVE message, thus maintaining the high performance of the BT system.

Further work still needs to be carried out. For instance, when a BT client that is compatible with the MultiHAVE message communicates with a BT client that is not, how to make them interoperate intelligently is an unsolved problem.



ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grant 60974005, the Specialized Research Fund for the Doctoral Program of Higher Education under Grant 20094101120008, the Natural Science Foundation of Henan Province under Grant 092300410201, the Zhengzhou Science and Technology Research Program under Grant 0910SGYN12301-6, and the Science Fund for Distinguished Young Scholars of Henan Province under Grant 0612000600. The authors would like to thank Dr. Yanhong Liu for her invaluable suggestions.

REFERENCES 

[1] “Bittorrent,” http://www.bittorrent.com/. 

[2] B. Cohen, “Incentives build robustness in BitTorrent,” in 

First Workshop on Economics of Peer-to-peer Systems, 

Berkeley, USA, June 2003. 

[3] R. L. Xia and J. K. Muppala, “A survey of BitTorrent 

performance,” IEEE Communications Surveys & Tutorials, 

2010, vol. 12, no 2, pp. 140-158. 

[4] Andrew Parker. “The True Picture of Peer-to-Peer 

Filesharing”. http://www.cachelogic.com/research/slide9. 

php, May 2005. 

[5] T. Karagiannis, A. Broido, M. Faloutsos, and K. C. Claffy. 

“Transport Layer Identification of P2P Traffic”. In 

Proceedings of ACM IMC, Taormina, Sicily, Italy, 

October 2004. 

[6] T. Karagiannis, A. Broido, N. Brownlee, and K. C. Claffy. “Is P2P Dying or Just Hiding?”. In Proceedings of IEEE GLOBECOM, Dallas, Texas, USA, Nov. 29 - Dec. 3, 2004.

[7] A. Qureshi, “Exploring proximity based peer selection in a BitTorrent-like protocol,” MIT 6.824 student project, 2004.

[8] R. Bindal, P. Cao, W. Chan, J. Medved, G. Suwala, T. 

Bates, and A. Zhang, “Improving traffic locality in 

BitTorrent via biased neighbor selection,” in ICDCS ’06: 

Proc. 26th IEEE International Conference on Distributed 

Computing Systems. Washington, DC, USA: IEEE 

Computer Society, 2006, p. 66. 

[9] S. Yamazaki, H. Tode, and K. Murakami, “CAT: A cost-aware BitTorrent,” in 32nd IEEE Conference on Local Computer Networks (LCN 2007), Oct 2007, pp. 226–227.

[10] P. Garbacki, A. Iosup, D. Epema, and M. van Steen, “2fast: 

Collaborative downloads in p2p networks,” in P2P ’06: 

Proc. Sixth IEEE International Conference on Peer-to-Peer 

Computing. Washington, DC, USA: IEEE Computer 

Society, 2006, pp. 23–30. 

[11] P. Garbacki, D. Epema, and M. van Steen, “An amortized 

tit-for-tat protocol for exchanging bandwidth instead of 

content in p2p networks,” Self-Adaptive and Self- 

Organizing Systems, 2007. SASO ’07. First International 

Conference on, pp. 119–128, July 2007. 

[12] M. Sirivianos, J. Park, R. Chen, and X. Yang, “Freeriding 

in BitTorrent networks with the large view exploit,” in 

IPTPS’07, 2007. 

[13] Arnaud Legout, Guillaume Urvoy-Keller, and Pietro 

Michiardi. “Understanding BitTorrent: An Experimental 

Perspective”. Technical Report, INRIA, Sophia Antipolis, 

July 2005.


Jianyong Li received his master's degree from the Department of Computer, Huazhong University of Science and Technology in 2001. He is currently an associate professor with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interests cover peer-to-peer networks and network control.


Jianchun Li received his master's degree from the Department of Computer, Zhengzhou University in 2005. He is currently a lecturer with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interests cover computer networks and distributed computing systems.

Daoying Huang received his Ph.D. degree from the PLA Information Engineering University in 2001. Since 2006, he has been a professor with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interests cover computer networks and distributed computational systems. He is the corresponding author of this paper.


Qiang Wei is currently a master's candidate with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interests cover computer networks.


Image-based Position Estimation and Adaptive 

Modulation Coding in Vehicular Communication 

Hao Yang 1 , Qingmin Meng 1,2 , Xiong Gu 1 , and Baoyu Zheng 1 

1 School of Geography and Biological Information, 

Key Lab of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education) 

Nanjing University of Posts and Telecommunications, Nanjing, 210003, China 

2 National Mobile Communications Research Lab, Southeast University, Nanjing, 210096, China 

Email: {yanghao, mengqm, zby}@njupt.edu.cn, guxiong108@gmail.com 

Abstract—Vehicle position estimation is a key technology for 

Inter-Vehicle Communications, and template matching can be used to obtain vehicle position information. In this paper, a simplified form of template matching, namely area-based template matching, is considered. A vehicular communication system designed for wireless data applications is proposed, in which a camera is fixed in a vehicle that serves as a base station. By comparing the outline area of the vehicle image with reference templates, the base station can estimate the position of the vehicle. The reference templates can be pre-calculated from a set of field experiment data. Based on supervised learning, we develop an image-based vehicle position estimation method and evaluate its effect on an adaptive modulation and coding scheme. Computer simulation results show that, in a wireless fading channel with the OFDM physical model, the studied adaptive modulation and coding (AMC) scheme that takes the position estimate into account achieves greater throughput than a fixed modulation and coding scheme.

Index Terms—Inter-Vehicle Communications, supervised 

learning, template matching, OFDM, adaptive modulation 

and coding 


Manuscript received March 1, 2011; revised April 10, 2011; accepted April 20, 2011.
Project number: 2010ZX03003-003-02, 60972039, 61001077, 20090451239.
doi:10.4304/jnw.6.12.1754-1759

I. INTRODUCTION

In recent years, Inter-Vehicle Communications (IVC) has become a focus of both research and application. It is emerging as a key part of Intelligent Transportation Systems (ITS), enabling short-distance wideband wireless communication without expensive infrastructure. IVC has attracted research attention from both academia and industry, notably in the US, the EU, and Japan [1]. According to [2], IVC applications can be broadly divided into two categories: Safety Applications, which mainly address traffic safety, and User Applications, which mainly provide value-added services such as meeting passengers’ needs for business, entertainment and information functions in the car. In other words, IVC can support road traffic applications ranging from traffic safety to pleasant driving. In [3], IVC is simplified into a three-layer model consisting of the physical layer, the data link layer and the application layer. Reference [4] gives the specification of Dedicated Short Range Communications (DSRC), a type of high-speed mobile broadband. Recently, many automobile manufacturers have come to regard DSRC as a vehicle communication platform, called the DSRC Vehicle Ad Hoc Network (VANET). In particular, IEEE 802.11 adds Wireless Access in Vehicular Environments (WAVE) [5] to form IEEE 802.11p, which is very closely related to the IEEE 802.11a standard [6].

Orthogonal Frequency Division Multiplexing (OFDM) 

is a multiplexing technique that divides a channel with a 

higher data rate into multiple orthogonal sub-channels 

with a lower data rate. OFDM has been adopted in 

several wireless standards such as digital audio 

broadcasting (DAB), digital video broadcasting (DVB-T), 

the IEEE 802.11a local area network (LAN) standard 

and the IEEE 802.16a metropolitan area network (MAN) 

standard [7]. OFDM is also being pursued for the above-mentioned DSRC for roadside-to-vehicle communications.

The contribution of this paper is an image-based IVC design. To improve the performance of AMC in OFDM transmission, we use supervised learning to estimate the position of the vehicle.

The remainder of the paper is organized as follows: Section II introduces relevant research on image processing; Section III describes the system model and vehicle position estimation; Section IV gives the signal model and the AMC selection; Section V presents the simulation and results; and Section VI concludes the paper.

II. RELATED WORK OF IMAGE PROCESSING 

Digital image processing refers to handling digital 

images or video frames by means of a digital computer. 

The results of digital image processing are generally 

images or a set of characteristics and parameters related 

to the images [8]. 

Image processing techniques can be used to measure 

distance. In [9], Lu et al. proposed a novel measuring 

system using a scan-counter method via a CCD camera. 

The system can be used to measure the distance between


a CCD camera and an object. Set on either side of a CCD 

camera, two laser projectors in the system produced two 

parallel rays that projected two bright spots on the object 

and the CCD. The interval between the two bright spots 

in the video image was calculated. As there is a linear 

relationship between the actual distance and the interval 

of the two bright spots, the actual distance from the CCD 

camera to the object can be obtained from a simple 

formula. Later, Hsu et al. [10] brought forward a new 

method for calculating the distance. The proposed 

scheme counted pixel number variation of reference 

points in the images to acquire the displacement of the 

camera movement along the photographing direction. 

In [11], Chang et al. proposed a method to use images 

to measure the relative distance between vehicles. The 

procedures of the method were divided into two parts. 

First, the location of the license plate in the image was 

found by several image processing techniques. Second, 

the image size of the plate was obtained by the region 

growing technique, then the relative distance was 

computed by using the geometric relation. 
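The geometric relation used in [11] is not spelled out here. One common choice, assumed purely for illustration, is the pinhole-camera similar-triangle relation: an object of real width W that spans w pixels in the image lies at a distance of about f·W/w, where f is the focal length expressed in pixels. The function name and the numbers below are hypothetical.

```python
def distance_from_width(focal_length_px, real_width_m, image_width_px):
    """Pinhole-camera estimate: distance = f * W / w (an assumed model)."""
    if image_width_px <= 0:
        raise ValueError("image width must be positive")
    return focal_length_px * real_width_m / image_width_px


# Hypothetical values: f = 1000 px, plate width 0.44 m, plate spans 22 px.
d = distance_from_width(1000.0, 0.44, 22.0)
print(round(d, 2))   # 20.0
```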

In [12], Lü et al. put forward an efficient method for measuring the area of a live plant leaf. The proposed method consists of four steps. First, geometric distortions in the image are corrected using a mapping function. Then, the image is segmented by thresholding to obtain the leaf region. Next, the leaf contour is extracted and the contour region is filled. Finally, the leaf area is calculated by counting pixels.
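The thresholding and pixel-counting steps can be sketched as below. This is a simplified illustration rather than the method of [12]: distortion correction and contour filling are omitted, the image is a hand-made list of gray levels, and the function name and threshold value are assumptions.

```python
def region_area(gray, threshold):
    """Count the pixels whose gray level exceeds the threshold."""
    return sum(1 for row in gray for value in row if value > threshold)


# Tiny hypothetical 4x4 gray-level image with a bright 2x2 object.
image = [
    [10, 12, 11, 10],
    [10, 200, 210, 11],
    [12, 220, 205, 10],
    [11, 10, 12, 10],
]
area = region_area(image, threshold=128)
print(area)   # 4
```

The same pixel count, applied to the segmented outline of a vehicle, is the kind of area feature used later in this paper.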

The size of an object in an image can be obtained from the result of contour extraction, and many papers have focused on this topic. The active contour model, known as “snakes”, is a framework for delineating an object outline from a noisy image [13]. Snakes have been used successfully in segmenting, matching and tracking a target of interest. In [14], Dubuisson proposed a method for extracting the contour of a moving object. The method is based on the fusion of a motion segmentation technique, which uses image subtraction, with color segmentation based on the split-and-merge paradigm and with edge information. The edge information can be obtained using the Canny edge detector. The object matching was also applied to an intelligent vehicle/highway system.

III. SYSTEM MODEL AND VEHICLE POSITION 

ESTIMATION 

The IVC scenario considered in this paper is shown in Fig. 1. The three vehicles form a linear ad hoc network topology, and each vehicle is regarded as a communication node. Each vehicle is assumed to be equipped with a complete communication system consisting of three main components: the wireless transceiver, the microcomputer and the camera. The main function of each part is as follows:

1) The Wireless Transceiver: It sends and receives information between vehicles over short distances.

2) The Microcomputer: On the one hand, it receives image information from the camera through a specific interface, processes it, and displays the results on the screen; on the other hand, it communicates with the wireless transceiver through a specific interface.

3) The Camera: It is the main sensing component of the system and captures the surrounding environment. In this paper, it is used to capture snapshots of the vehicle in order to track its position.

Figure 1. The scene of the IVC 

A. Assumptions of the model 

Before introducing the specific design, we make the following simplifying assumptions about the system.

1) To make it easier for the camera to capture snapshots of the vehicle, we assume that the vehicles travel in a queue, that is, in a straight line.

2) Since the driver already has a good view ahead of the vehicle, we assume that the camera is installed at the tail of the vehicle and captures snapshots only of the following vehicle.

3) Since the communication and control processes between a vehicle and its preceding vehicle are similar to those with its following vehicle, we consider only communication between a vehicle and its following vehicle. Hereinafter, a vehicle that transmits data is called an active vehicle; otherwise it is called an inactive vehicle.

4) We assume that the camera has an ordinary optical lens with a fixed focal length.

5) We assume that all vehicles are of the same type and size.

B. Vehicle Position Estimation 

Because the position of the vehicle changes rapidly in IVC, exact matching of the vehicle is difficult to achieve. This paper therefore performs a fast matching based on the contour area of the vehicle. The active vehicle selects the appropriate modulation and coding scheme for OFDM transmission, assuming ideal OFDM channel estimation. The key part of the process is to determine the distance between vehicles from snapshots of the vehicle, for which we use machine learning algorithms. Machine learning is generally divided into supervised learning, unsupervised learning and reinforcement learning [15]. For our scenario, supervised learning is adopted. In this method, a training set is given, and a learning algorithm is used to identify the relationship between input and output, yielding a function h, called a hypothesis. When a new input x is given, we obtain the predicted output y through the function h. The process is shown in Fig. 2. Supervised learning comprises two important classes of problems, regression and classification, which differ in whether the predicted output is continuous or discrete. If the predicted output is continuous, the problem is a regression problem; otherwise it is a classification problem. As the output in this paper is continuous, we consider regression.

Figure 2. The supervised learning process 

In this paper we consider an n-dimensional linear regression, in which the relationship between the input features x and the predicted output y is linear, as given in equation (1):

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x \tag{1}$$

where θ and x are both vectors, and we set $x_0 = 1$. To determine the value of θ, we first introduce the cost function,

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{i}) - y^{i}\right)^2 = \frac{1}{2}(X\theta - Y)^T (X\theta - Y) \tag{2}$$

which measures the difference between the predicted output and the actual output. Here $x^{i} = [x_0^{i}, x_1^{i}, \ldots, x_n^{i}]^T$ is the i-th input feature vector, $y^{i}$ is the corresponding output, and m is the number of training examples. $X = [x^{1}, x^{2}, \ldots, x^{m}]^T$ is the matrix of all input features, and $Y = [y^{1}, y^{2}, \ldots, y^{m}]^T$ is the vector of all corresponding outputs.

After defining the cost function, all that remains is to choose θ so as to minimize J(θ). The direct approach is to take the derivative with respect to each $\theta_i$, as given in formula (3):

$$\nabla_\theta J(\theta) = \nabla_\theta\left\{\frac{1}{2}(X\theta - Y)^T (X\theta - Y)\right\} = X^T X\theta - X^T Y \tag{3}$$

in which $\nabla_\theta J(\theta)$ denotes the gradient of J(θ) with respect to θ in matrix form. Setting the gradient to zero gives the following standard expression (the normal equation):

$$X^T X\theta = X^T Y \tag{4}$$

Solving the above equation gives the θ that minimizes J(θ). The final expression is

$$\theta = (X^T X)^{-1} X^T Y \tag{5}$$
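Equations (4) and (5) can be illustrated with a small, dependency-free sketch. The helper names `solve_linear` and `normal_equation` and the toy data are assumptions made for this example. The system in (4) is solved by Gaussian elimination instead of forming the inverse in (5) explicitly, which gives the same θ when $X^T X$ is invertible and is usually the better-behaved choice numerically.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def normal_equation(X, Y):
    """Return theta solving X^T X theta = X^T Y, i.e. equation (4)."""
    cols = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(cols)]
           for i in range(cols)]
    xty = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(cols)]
    return solve_linear(xtx, xty)


# Toy training set generated from y = 1 + 2x; the first column is x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, Y)
print([round(t, 6) for t in theta])   # [1.0, 2.0]
```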

IV. SIGNAL MODEL AND ADAPTIVE MODULATION AND 

CODING SELECTION 

A. Signal model 

Following [16-19], we consider a wireless channel with path loss, shadow fading, small-scale fading and additive background noise. The channel impulse response for the small-scale fading can be modelled as

$$h_{ij}(t) = \sum_{l=0}^{L_p-1} \alpha_{l,ij}\,\delta(t - \tau_l) \tag{6}$$

where $\alpha_{l,ij}$ represents the discrete time-domain channel coefficient, an independent and identically distributed (i.i.d.) complex Gaussian variable, $L_p$ denotes the number of paths in the frequency-selective fading channel, and $\tau_l$ denotes the delay of path l.

The transmission parameters of OFDM are: total subcarrier number N, subchannel number K, subcarrier spacing B (kHz) and channel spacing W (MHz). The signal between transmit node i and receive node j can be represented as

$$r_{ij}(t) = \sqrt{P_t\, d_{ij}^{-\alpha}\, \beta_i}\, \sum_{l=0}^{L_p-1} \alpha_{l,ij}\, s_i(t - \tau_l) + n_{ij}(t) \tag{7}$$

In equation (7), $P_t$ is the transmission power and $d_{ij}$ is the distance between nodes i and j. α denotes the path loss exponent and $\beta_i$ is the log-normal shadowing term, i.e., $10\log_{10}\beta_i \sim N(0, \sigma_{dB}^2)$. $s_i(t)$ is the signal transmitted from node i and $n_{ij}(t)$ denotes additive white Gaussian noise with zero mean and power spectral density $N_0$.

In the studied OFDM transmission scheme, different Quadrature Amplitude Modulation (QAM) schemes are used depending on the spacing between the two communicating vehicles. A cyclic prefix is used in the OFDM signal as a guard interval, whose length must exceed the maximum excess delay in order to mitigate the Intersymbol Interference (ISI) caused by multipath propagation. The cyclic prefix is added after the IFFT at the transmitter and removed at the receiver to recover the original signal.
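Adding and removing the cyclic prefix can be shown in a few lines. In this sketch the list `symbol` stands for the time-domain samples produced by the IFFT; the function names and the prefix length of 2 are illustrative assumptions, and no channel is applied between transmitter and receiver.

```python
def add_cyclic_prefix(symbol, cp_len):
    """Prepend the last cp_len samples of the IFFT output to the symbol."""
    if not 0 < cp_len <= len(symbol):
        raise ValueError("cp_len must be between 1 and len(symbol)")
    return symbol[-cp_len:] + symbol


def remove_cyclic_prefix(received, cp_len):
    """Drop the guard interval at the receiver."""
    return received[cp_len:]


symbol = [1, 2, 3, 4, 5, 6, 7, 8]          # stand-in for IFFT output samples
tx = add_cyclic_prefix(symbol, 2)
print(tx)                                   # [7, 8, 1, 2, 3, 4, 5, 6, 7, 8]
print(remove_cyclic_prefix(tx, 2) == symbol)   # True
```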

The average frequency response of subchannel k in an OFDM receiver is $H_{ij}(k)$. For simplicity, we drop the subchannel index and define the gain of the subchannel as $G_i = d_{ij}^{-\alpha}\,\beta_i\,|H_{ij}|^2$; the signal-to-noise ratio (SNR) is therefore

$$\gamma_i = \frac{P_i \cdot G_i}{N_0 \cdot B} \tag{8}$$
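The dependence of the SNR on vehicle spacing can be made concrete with a short sketch of the gain definition and equation (8). All numerical values below (path loss exponent 3, unit shadowing and fading gains, the power and noise levels) are hypothetical and chosen only to make the arithmetic easy to follow; `d` may be read as a normalized distance.

```python
import math


def subchannel_gain(d, alpha, beta, h_abs_sq):
    """G = d^(-alpha) * beta * |H|^2, the subchannel gain defined above."""
    return d ** (-alpha) * beta * h_abs_sq


def snr(p, gain, n0, b):
    """Equation (8): gamma = P * G / (N0 * B)."""
    return p * gain / (n0 * b)


def to_db(x):
    return 10.0 * math.log10(x)


# Hypothetical values: doubling the distance with alpha = 3.
g_near = subchannel_gain(d=1.0, alpha=3.0, beta=1.0, h_abs_sq=1.0)
g_far = subchannel_gain(d=2.0, alpha=3.0, beta=1.0, h_abs_sq=1.0)
gamma_near = snr(p=1.0, gain=g_near, n0=1e-9, b=312.5e3)
gamma_far = snr(p=1.0, gain=g_far, n0=1e-9, b=312.5e3)
loss_db = to_db(gamma_near) - to_db(gamma_far)
print(round(loss_db, 2))   # 9.03
```

With these assumed values, doubling the spacing costs a factor of 8 in SNR (about 9 dB), which is why a distance estimate is informative for choosing the modulation and coding scheme.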

B. Selection of adaptive modulation and coding scheme 

The selection principle of AMC is to choose the scheme that maximizes the throughput of the vehicle transmission in OFDM. Considering M-ary quadrature amplitude modulation (M-QAM), the modulation level and coding rate of node i are $M_i$ and $C_i$, respectively. As shown in [20], practical modulation and coding schemes (MCS) incur an SNR loss when the bit error rate $p_b$ is taken into account. We therefore consider the rate formula

$$b_i = \log_2(1 + \phi\gamma_i), \quad \phi = -1.5 / \ln(5 p_b) \tag{9}$$

Let L be the packet length, $R_f$ the throughput of node i, $R_m$ the data rate and $P_e$ the packet error rate (PER). Then we have:

$$R_f = R_m (1 - P_e) \tag{10}$$

$$R_m = N \cdot B \cdot C_i \cdot \log_2 M_i \tag{11}$$

$$P_e = 1 - (1 - p_b)^L \tag{12}$$
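The selection rule in equations (10) to (12) can be sketched as follows, using the values of N, B, L and the five MCS given in Section V. One step here is an assumption rather than something stated in the paper: the bit error rate of each scheme is obtained by inverting equation (9) with $b_i = \log_2 M_i$, which gives $p_b = 0.2\exp(-1.5\gamma/(M-1))$. The coding rate therefore enters only through $R_m$ and coding gain is ignored, so in this sketch a rate-1/2 code never beats the rate-3/4 code of the same constellation. Function and variable names are hypothetical.

```python
import math

N = 52            # number of subcarriers (Section V)
B = 312.5e3       # subcarrier spacing in Hz (Section V)
L = 8000          # packet length in bits (Section V)

# (name, modulation level M, coding rate C)
MCS_TABLE = [
    ("QPSK-1/2", 4, 0.5),
    ("QPSK-3/4", 4, 0.75),
    ("16QAM-1/2", 16, 0.5),
    ("16QAM-3/4", 16, 0.75),
    ("64QAM-3/4", 64, 0.75),
]


def bit_error_rate(gamma, m):
    # Assumption: invert eq. (9) with b = log2(M); coding gain is ignored.
    return 0.2 * math.exp(-1.5 * gamma / (m - 1))


def throughput(gamma, m, c):
    p_b = bit_error_rate(gamma, m)
    p_e = 1.0 - (1.0 - p_b) ** L          # eq. (12)
    r_m = N * B * c * math.log2(m)        # eq. (11)
    return r_m * (1.0 - p_e)              # eq. (10)


def select_mcs(gamma):
    """Return the MCS entry with the largest throughput at linear SNR gamma."""
    return max(MCS_TABLE, key=lambda mcs: throughput(gamma, mcs[1], mcs[2]))


for snr_db in (20, 45):
    gamma = 10 ** (snr_db / 10)
    name, m, c = select_mcs(gamma)
    print(snr_db, name, round(throughput(gamma, m, c) / 1e6, 2), "Mbit/s")
```

Under these assumptions the sketch picks 16QAM-3/4 at 20 dB and 64QAM-3/4 at 45 dB, where the data rate reaches 52 × 312.5 kHz × 0.75 × 6 = 73.125 Mbit/s.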

V. SIMULATION AND RESULTS 

The training set used in the simulation was obtained from field measurements. First, we fix a vehicle V1 (the base station) and drive another vehicle V2 in a straight line towards V1. Then we use the camera installed in V1 to capture snapshots of V2 and measure the distance d between them. Finally, to keep the input one-dimensional, we take the area of the vehicle in the image as the single input feature and the corresponding distance between the vehicles as the training-set output. The vehicle is a Peugeot 307, the camera uses the PAL format, the lens focal length is 12 millimeters and the image resolution is 720×576 pixels. When the vehicles are too far apart, the vehicle appears too small in the image for the camera to capture; when they are too close, the vehicle is too large and occupies the whole image. Taking both into consideration, we choose distances ranging from 15 meters to 70 meters. The measured data are shown in Table I. As observed from Table I, when the vehicle spacing is small, the outline of V2 is larger; as the spacing increases, the contour area gradually becomes smaller. After obtaining the training set, curve fitting can be performed using the linear regression method described above.

In plane curve fitting, n points can in general be fitted exactly by a polynomial of order n-1. However, even though the fitted curve passes through all the points, we cannot say that it gives the best prediction. In the studied process, the prediction is the vehicle spacing for a given outline area of the vehicle. Two major issues in curve fitting are over-fitting and under-fitting. In general, under-fitting occurs when the order is lower than that of the actual model, and most of the data are then poorly fitted, as shown in Fig. 3. Over-fitting occurs when the order is higher than that of the actual model, and the curve then follows the training data very closely, as shown in Fig. 4. The selection of the order therefore plays a decisive role in curve fitting. We employ a 3rd-order fit, and the fitting result for the area is shown in Fig. 5.

TABLE I.
THE TRAINING SET OF LINEAR REGRESSION

Distance (m)    Area (pixels)
15              37395
16              30820
18              25314
20              19725
25              13150
30              9780
35              6903
40              4931
45              4068
50              2958
55              2588
60              2301
70              1725
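The fitting of the Table I measurements can be sketched as follows, treating the powers of the area as the input features of the linear regression. Two choices here are assumptions made for the example: the areas are divided by 10^4 so that the normal equations stay well conditioned, and the system is solved by plain Gaussian elimination. The printed sums of squared errors only show that a 3rd-order fit follows the training data at least as closely as a 2nd-order one; they say nothing about prediction quality on new data. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def polyfit(xs, ys, order):
    """Least-squares polynomial fit via the normal equation X^T X theta = X^T Y."""
    X = [[x ** k for k in range(order + 1)] for x in xs]
    cols = order + 1
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(cols)]
           for i in range(cols)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(cols)]
    return solve_linear(xtx, xty)


def predict(theta, x):
    return sum(t * x ** k for k, t in enumerate(theta))


def sse(theta, xs, ys):
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


# Training set from Table I.
distance = [15, 16, 18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70]
area = [37395, 30820, 25314, 19725, 13150, 9780, 6903,
        4931, 4068, 2958, 2588, 2301, 1725]
scaled = [a / 1e4 for a in area]   # assumed scaling, for conditioning only

results = {}
for order in (2, 3):
    theta = polyfit(scaled, distance, order)
    results[order] = sse(theta, scaled, distance)
    print(order, round(results[order], 2))
```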

Figure 3. The under-fitting for the training set with a 2nd-order fit (horizontal axis: area of the vehicle in the image (pixels); vertical axis: distance between two vehicles (m))


Figure 4. The over-fitting for the training set with a 7th-order fit


The system simulation parameters are partially based on IEEE 802.11a. The number of subcarriers, the number of subchannels, the subcarrier spacing and the channel spacing are N = 52, K = 4, B = 312.5 kHz and W = 20 MHz, respectively. The carrier frequency ranges from 5.850 GHz to 5.925 GHz. A quasi-static six-path fading channel model is considered, with a Rician coefficient of 4 and a log-normal shadowing standard deviation of 8 dB. The power gains of the taps are [0.8084, 0.462, 0.253, 0.259, 0.0447, 0.01], and the tap delays, in units of T = 1/W, are [0, 2, 4, 6, 9, 13]. In the simplified path-loss model [19], a reference distance $d_0 = 15$ m is defined, and the corresponding normalized distance is $d/d_0$. To keep the results simple, five MCS are considered: QPSK-1/2, QPSK-3/4, 16QAM-1/2, 16QAM-3/4 and 64QAM-3/4. We assume that all subcarriers are treated equally and use the same QAM modulation scheme within a fading block. With a packet length of 1000 bytes, we have L = 8000 bits. Using Eq. (10), we calculate the objective function and select the MCS that maximizes it. The throughput comparison for different SNR values is shown in Fig. 6. One curve shows the nearly constant performance of the fixed QPSK-1/2 mode; the other shows the performance of the AMC mode that takes the position information into account. With the proposed AMC, the system throughput is markedly improved.


VI. CONCLUSION

Position estimation has a significant effect on the choice of transmission parameters in wireless vehicular communications, yet little research has been done in this field. This paper presents a preliminary design, namely vehicle-location-aware OFDM transmission. Using supervised learning, the base station first performs identification and area matching and then predicts the spacing between the two communicating vehicles. The spacing information can then be used to select the modulation and coding scheme. As a result, the throughput of the vehicular communication system can be significantly improved.


ACKNOWLEDGMENT

The authors wish to thank the National Mobile Communications Research Laboratory, Southeast University. This work was supported by the National Science and Technology Major Project (2010ZX03003-003-02), the National Natural Science Foundation of China (60972039 and 61001077) and the national postdoctoral research funding (20090451239).



Figure 5. Third-order fitting for the training set

Figure 6. The throughput comparison between fixed MCS and adaptive MCS (horizontal axis: SNR (dB); vertical axis: throughput (bit/s))

REFERENCES 

[1] J. Luo, and J. P. Hubaux, “A Survey of Inter-Vehicle 

Communication,” Tech. Rep, 2004. 

[2] M. Rudack, M. Meincke, K. Jobmann, and M. Lott, “On 

traffic dynamical aspects inter vehicle communication 

(IVC),” In Proc. of the 57th IEEE Semiannual Vehicular 

Technology Conference (VTC’03 Spring), 2003. 

[3] Ugur Keskin, “In-Vehicle Communication Networks: A 

Literature Survey,” July 28, 2009. 

[4] ASTM International. ASTM E2213-03 Standard Specification for Telecommunications and Exchange Between Roadside and Vehicle Systems - 5GHz Band Dedicated Short Range Communications (DSRC) Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 2003.

[5] IEEE 1609 - Family of Standards for Wireless Access in 

Vehicular Environments (WAVE), U.S. Department of 

Transportation, January 9, 2006. 

[6] IEEE Standard 802.11a-1999, Part 11: Wireless LAN 

Medium Access Control (MAC) and Physical Layer (PHY) 

specifications: High-speed Physical Layer in the 5 GHz 

Band. 

[7] IEEE Standard IEEE 802.16a, for Local and Metropolitan 

Area Networks Part 16, Air Interface for Fixed Broadband 

Wireless Access Systems: 

http://grouper.ieee.org/groups/802/16/.


[8] Rafael C. Gonzalez and Richard E. Woods, “Digital image 

processing,” Second Edition, Beijing, Publishing House of 

Electronics Industry, September, 2007. 

[9] Ming-Chih Lu, Wei-Yen Wang and Chun-Yen Chu, 

“Image-Based Distance and Area Measuring Systems,” 

IEEE Sensors Journal, Vol. 6, No.2, April 2006, pp495- 

503. 

[10] Chen-Chien Hsu, Ming-Chih Lu, Wei-Yen Wang and Yin- 

Yu Lu, “Distance measurement based on pixel variation of 

CCD images,” ISA Transactions, Vol. 48, No. 4, October 

2009, pp389-395. 

[11] Tang-Hsien Chang, Chun-hung Lin, Chih-sheng Hsu, and 

Yao-jan Wu, “A Vision-Based Vehicle Behavior 

Monitoring and Warning System,” In Proc. of Intelligent 

Transportation Systems, 2003. 

[12] Chaohui Lü, Hui Ren, Yibin Zhang, and Yinhua Shen, 

“Leaf Area Measurement Based on Image Processing,” In 

2010 International Conference on Measuring 

Technology and Mechatronics Automation. 

[13] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active 

contour models,” Internat. J. Comput. Vision 1 (1987) 

pp321–331. 

[14] Marie-Pierre Dubuisson and Jain. A. K, “Object Contour 

Extraction using Color and Motion,” in Computer Vision 

and Pattern Recognition, Proceedings CVPR '93, IEEE 

Computer Society Conference, 1993. 

[15] CS 229: Machine Learning. http://www.stanford.edu/class/ 

cs229/. Autumn 2010. 

[16] R. C. Daniels, C. Caramanis, and R. W. Heath, “A Supervised Learning Approach to Adaptation in Practical MIMO-OFDM Wireless Systems,” in Global Telecommunications Conference, New Orleans, LA, Nov. 2008, pp1-5.

[17] Qingmin Meng, Xiong Gu, Feng Tian, Baoyu Zheng, “ k- 

NN Based MCS Selection in Distributed OFDM Wireless 

Networks,” In 2011 international conference on 

Automation, Communication, Architectonics, and 

Materials (ACAM2011), to be published, June 18-19, 

Wuhan, China. 

[18] T. S. Rappaport, Wireless Communications: Principles and 

Practice, 2nd ed. NJ: Prentice-Hall, 2001. 

[19] Andrea Goldsmith. Wireless Communications. Cambridge 

University Press, 2005. 

[20] Koji Yamamoto, “Tradeoff between Area Spectral Efficiency and End-to-End Throughput in Rate-Adaptive Multihop Radio Networks,” IEICE Trans. Commun., Vol. E88-B, No.9, 2005.


Hao Yang Jiangsu Province, China. 

Birthdate: November, 1969. He is Signal 

and Information Processing Ph.D., 

graduated from the School of Information 

Science and Engineering, Southeast 


image processing. 

He is a senior lecturer of the School of 

Geography and Biological Information, 

Nanjing University of Posts and Telecommunications. 

Qingmin Meng Jiangsu Province, China. 

Birthdate: September, 1965. He received 

Ph.D. degree in radio engineering from 

Southeast University, Nanjing, China, in 

2007. Then he joined the Faculty of 

School of Telecommunications and 

Information Engineering, Nanjing 

University of Posts and 

Telecommunications. His current research 

interests include multihop relaying in next 

generation broadband wireless communication, the application 

of machine learning to resource allocation in cognitive radio 

networks and vehicular opportunistic communication. 

Xiong Gu Hubei Province, China. 

Birthdate: May, 1988. He is working for 

master degree in School of 

Telecommunications and Information 

Engineering, Nanjing University of Posts 

and Telecommunications. His current 

research interests include machine 

learning and its application in resource 

allocation in cognitive radio networks 

and vehicular opportunistic communication.


A Request Distribution Algorithm for Web 

Server Cluster 

Wei Zhang 

School of Computer Science and Engineering, Beihang University, Beijing, 100191, China 

State Key Laboratory of Rail Traffic Control and Safety,Beijing Jiaotong University,Beijing 100044, China 

zhangwqh@cse.buaa.edu.cn 

Huan Wang, Binbin Yu, Wei Xu, Mingfa Zhu, Limin Xiao, Li Ruan 

School of Computer Science and Engineering, Beihang University, Beijing, 100191, China 

{zhumf, xiaolm,ruanli}@buaa.edu.cn 

Abstract—With the explosive growth of web-based application workloads, web server clusters face a challenge in meeting request response times. Request distribution among the servers of a web server cluster is the key to addressing this challenge, especially under heavy workloads. In this paper, we propose a new request distribution algorithm named llac (least load active cache) for the load balancing switch in a web server cluster. The goal of llac is to improve the cache hit rate and reduce response time. Packets are parsed at the IP level, and back-end servers are notified to cache hot files using link change technology, which requires neither changing URL information nor modifying the service program. This avoids the switching overhead between user mode and kernel mode. The load balancing switch directly creates a connection with the selected server, avoiding the overhead of connection migration. The policy estimates the current composite load of each server and selects the server with the least load to serve the request. It also improves the resource utilization of web servers. Experimental results show that llac achieves better performance for web applications than wrr (weighted round robin), a popular request distribution algorithm.

Index Terms—Web Cluster, Request Distribution, LLAC 


doi:10.4304/jnw.6.12.1760-1766

I. INTRODUCTION

The enormous growth of the Internet industry has made web-based applications popular and demanding programs. Users increasingly rely on the web for daily activities such as electronic commerce, online banking, stock trading, reservations and product merchandising. The performance of a web server system therefore plays an important role in the success of many Internet-related companies. A single server machine can handle only a limited number of requests and cannot scale with demand. A better way to cope with growing processing demands on web servers is to add more hardware resources rather than to replace one server with a faster one [1]. More and more web sites use a web cluster, composed of a front-end request dispatching server, also called a load balancing switch, and several back-end servers handling requests. By distributing client requests to separate servers for load balancing or load sharing, web clusters have proved to be a better solution than an overloaded single server. Because of various technical issues in managing a web server cluster, the request distribution algorithms implemented in the load balancing switch are particularly important to the performance of cluster web servers [2]. The ratio of peak load to light load for Internet applications is usually on the order of 300% [19]. As J.C. Mongul noted [3], web sites collapse mostly because of accesses to popular, hot events. In one famous example, the normally well-provisioned Amazon.com site suffered forty minutes of downtime due to an overload during the popular holiday season in November 2000. Popular web sites often face the challenge of handling a huge number of requests in a short time.

This paper addresses the problem of request distribution so that a web server cluster can serve its peak workload demand. We use client-side and server-side information together to select a server, and we avoid both the switching overhead between user mode and kernel mode and the overhead of connection migration. We present a new request distribution algorithm with the following contributions. First, we design a combined load model based on a collection of typical load information. This model is used with online measurements of load information to estimate the processing capacity of web servers. It gives us reliable load descriptors for web servers, which are used in the decision making process of the request distribution algorithm. Second, to speed up access to popular or hot files, our approach resorts to active caching. Packets are captured and analyzed at the IP level using the netfilter mechanism, which avoids the switching overhead between user mode and kernel mode and the overhead of connection migration. The active caching technology modifies neither the URL nor the server program; it relies on link change technology to place hot files in a memory file system. Finally, we propose and implement a novel request distribution algorithm that works on the basis of the composite load and the file access frequency. We call this algorithm llac (least load and active caching) for short.


The rest of the paper is organized as follows. Section II discusses related work. Section III proposes the request distribution architecture and algorithm and discusses each module separately. Section IV describes the experimental results of a web cluster prototype using llac. Section V concludes the paper.

II. RELATED WORKS 

Numerous dispatching algorithms have been proposed for web server clusters. They can be classified as layer-4 and layer-7 algorithms.

A layer-4 algorithm considers only server-side information and does not use client-side information. In this approach, clients create connections directly with the selected server. Such algorithms are easy to implement, but they cannot match server resources to the content of a client's request. They include the random policy [4], the round-robin policy [4], the weighted round-robin (wrr) policy [5], the least connection policy [6], the fast response time policy [6], and so on. Random and round-robin are easy to implement, but they do not consider server capacity, which can easily lead to load imbalance. Wrr associates each server node in the cluster with a weight proportional to the server's capacity. The initial weight is set by the administrator and is therefore subject to human factors. Least connection does not consider that requests may have different response times and different resource demands. Fast response time is influenced by the network environment, so it cannot evaluate the performance of a web server effectively.
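To make the contrast between two of these layer-4 policies concrete, the following is a minimal sketch of weighted round-robin and least connection. The `Server` class, its fields and the expanded-list form of wrr are illustrative assumptions, not the IPVS implementation.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Server:
    name: str
    weight: int = 1        # assumed: integer weight set by the administrator
    connections: int = 0   # assumed: number of currently open connections


def weighted_round_robin(servers):
    """Return an endless iterator that visits servers in proportion to weight."""
    # Simple variant: repeat each server `weight` times, then cycle forever.
    expanded = [s for s in servers for _ in range(s.weight)]
    return cycle(expanded)


def least_connection(servers):
    """Pick the server that currently has the fewest open connections."""
    return min(servers, key=lambda s: s.connections)
```

Neither function looks at the requested URL or at measured resource usage, which is the limitation of layer-4 policies noted above.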

A layer-7 algorithm considers not only server information but also client user-level information, such as session identifiers, type of URL, cookies and so on. However, clients must first create a TCP connection with the load balancing switch so that the switch can analyze the client's information. This involves two copies of each packet between user space and kernel space. Because clients first establish connections with the load balancing switch, the connections must then be migrated to the selected server. Migrating connections is very time-consuming and consumes large amounts of system resources. Layer-7 algorithms can consider more information when deciding which server should respond to a request, and they make good use of server resources, in particular cache resources. However, they require connection migration and packet copies between user space and kernel space, which causes a certain loss of performance. Examples of layer-7 algorithms include LARD (locality-aware request distribution) [7], WARD (workload-aware request distribution) [8], CAP (client-aware policy) [9] and so on. LARD is a well-known dispatching policy that aims to improve the cache hit rate of web servers. Under LARD, the load balancing switch dispatches requests for the same web object to the same back-end web server. However, LARD may lead to load imbalance because web pages differ in popularity. WARD is a static partitioning policy that assigns dedicated servers to specific groups of requests. Although this policy is useful from the system management point of view and achieves a higher cache hit rate, it suffers from poor server utilization, because resources that are not utilized cannot be shared among all of the clients. The main goal of CAP is to improve load sharing in web clusters that provide multiple types of services. The load balancing switch classifies client requests into four classes: normal, CPU bound, disk bound, and CPU and disk bound. However, requests of the same type may consume different amounts of resources.
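The load imbalance attributed to locality-based dispatching can be illustrated with a deliberately simplified sketch. It only pins each URL to the back-end that first served it; it is not the full LARD algorithm of [7], and the function name and data layout are assumptions made for this illustration.

```python
from collections import Counter


def locality_dispatch(requests, servers):
    """Send every request for a URL to the server that first served that URL.

    Returns the URL-to-server assignment and the per-server request counts.
    """
    assignment = {}
    served = Counter({s: 0 for s in servers})
    for url in requests:
        if url not in assignment:
            # First request for this URL: choose the server that has
            # handled the fewest requests so far.
            assignment[url] = min(servers, key=lambda s: served[s])
        served[assignment[url]] += 1
    return assignment, served
```

With a skewed workload such as eight requests for one hot page and one request each for two other pages, the server that owns the hot page handles most of the traffic, which is the imbalance described above.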

Considering the shortcomings of the above methods, we propose llac. Using the netfilter mechanism to intercept packets and analyze the URL information at the IP level, the load balancing switch notifies the back-end servers to cache hot files and selects the least-loaded server to process each request. We use both client-side and server-side information, and we avoid the overhead of switching between user mode and kernel mode and of migrating TCP connections. For hot files, we use caching to improve response time.

III. LLAC FOR WEB CLUSTER 

Most web site crashes occur during bursts of visits to hot content. Increasing the access speed of hot files therefore eases the pressure on a site to a large extent. The main objectives in designing the algorithm are to minimize the average response time of popular or hot requests (by proactively caching hot files and reading them from a memory file system, which reduces disk accesses), to make good use of server resources in the cluster, and to increase the utilization and throughput of the cluster web servers. To this end, we use a linear model to compute each server's composite load, which helps decide which server should serve a request. The llac module uses information from both clients and servers to distribute requests. The load balancing switch parses packets at the IP layer using the netfilter mechanism, records access frequency, and informs web servers to cache popular or hot files. Fig. 1 shows the system components we designed.
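The overall decision flow described in this overview can be sketched in user-space Python as follows. This is only an illustration: the real module runs in the kernel alongside IPVS, and the class name, method names and the fixed `hot_threshold` are assumptions, since the paper does not state how a file is declared hot.

```python
from collections import Counter


class LlacDispatcher:
    """Toy model of the switch: least-load selection plus hot-file detection."""

    def __init__(self, servers, hot_threshold=3):
        self.load = {s: 0.0 for s in servers}   # latest composite load per server
        self.hits = Counter()                   # access count per URL
        self.hot_threshold = hot_threshold      # assumed cut-off for "hot"
        self.hot = set()                        # URLs already announced as hot

    def report_load(self, server, composite_load):
        """Record a composite load value pushed by a back-end server."""
        self.load[server] = composite_load

    def dispatch(self, url):
        """Return (chosen server, whether back-ends should now cache url)."""
        self.hits[url] += 1
        notify = False
        if self.hits[url] >= self.hot_threshold and url not in self.hot:
            self.hot.add(url)
            notify = True
        server = min(self.load, key=self.load.get)
        return server, notify
```

A caller would feed `report_load` from the load reports of the web servers and act on the `notify` flag by telling the back-ends to cache the file.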

A. Load Collection 

The load collection module is responsible for tracking the processor, network and memory usage of the web server. We gather resource utilization traces by running a set of microbenchmarks. The full list of metrics is shown in TABLE I. These statistics can all be gathered easily in Linux with the sysstat monitoring package [10]. We focus on this set of resource measurements because they can be gathered with low overhead and are representative for estimating the performance of the web server [11]. Since these traces must also be gathered from the live application, it is crucial that a lightweight monitoring system be used. To improve performance, we create one thread per load indicator so that the indicators are collected in parallel. The load collection module tracks the usage of each resource over a measurement interval and reports these statistics to the load calculation module at the end of each interval.
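As an illustration of interval-based collection, the sketch below derives the three CPU metrics of TABLE I from two snapshots of the aggregate `cpu` line of `/proc/stat`. Reading `/proc/stat` directly is an assumption made to keep the example self-contained; the paper itself uses the sysstat package, and the function names are hypothetical.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    user, nice, system, idle, iowait = values[:5]
    return {
        "user": user + nice,        # time spent in user space
        "kernel": system,           # time spent in kernel space
        "iowait": iowait,           # time spent waiting for I/O
        "total": sum(values[:8]),   # user..steal; later guest fields are skipped
    }


def cpu_utilization(prev_line, curr_line):
    """Return user, kernel and iowait percentages over one interval."""
    prev, curr = parse_cpu_line(prev_line), parse_cpu_line(curr_line)
    total = curr["total"] - prev["total"]
    if total <= 0:
        raise ValueError("snapshots must be taken in increasing time order")
    return {
        key: 100.0 * (curr[key] - prev[key]) / total
        for key in ("user", "kernel", "iowait")
    }
```

On a live system the two lines would be read from `/proc/stat` at the start and end of the measurement interval.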

B. Load Calculation 

This section describes how we create a model that characterizes the relationship between a set of resource utilization metrics, gathered from an application running on the web server, and the composite load. The model is a linear equation. Using values from the load collection module, we calculate the total load as a linear combination of the different metrics.

$$T = \alpha_0 + \alpha_1 U_1 + \alpha_2 U_2 + \cdots + \alpha_9 U_9 \qquad (1)$$

where

• $U_i$ is the value of the $i$-th metric collected for a benchmark executed on the web server;

• the set of coefficients $\alpha_0, \alpha_1, \ldots, \alpha_n$ is the model that describes the relationship between the total load and the resource utilization metrics. Unfortunately, finding a good set of parameters is a rather empirical job, with very little support from theory. The main objective is to tune the parameters to achieve good system performance, without asking too many questions about why it works well. Often, it is just a matter of "let's try this approach and see what happens";

• $T$ is the total load of the web server.

Figure 1. System Architecture

TABLE I. RESOURCE UTILIZATION METRICS

| CPU | Memory | Network | Disk |
|---|---|---|---|
| User Space % | Memory Used % | Rx packets/sec | Read req/sec |
| Kernel Space % | Swapper Used % | Tx packets/sec | Write req/sec |
| IO Wait % | | Rx bytes/sec | Read blocks/sec |
| | | Tx bytes/sec | Write blocks/sec |

The load calculation module passes its results to the load management module located in the load balancing switch. Each web server actively pushes its composite load to the switch. Compared with polling by the switch, pushing further reduces the burden on the load balancing switch. In addition, transmitting the data by UDP unicast reduces the burden on network bandwidth.
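A minimal sketch of the load calculation and push steps is given below. It evaluates the linear model of equation (1) and packs the result into a small UDP datagram. The coefficient values, the 12-byte report layout and all function names are assumptions for illustration; the paper does not specify a wire format.

```python
import socket
import struct

# Assumed report layout: server id (uint32) + composite load (float64),
# both in network byte order.
REPORT = struct.Struct("!Id")


def composite_load(alpha, metrics):
    """Evaluate T = alpha[0] + sum(alpha[i] * U_i) for the collected metrics."""
    if len(alpha) != len(metrics) + 1:
        raise ValueError("need one coefficient per metric plus the intercept")
    return alpha[0] + sum(a * u for a, u in zip(alpha[1:], metrics))


def encode_report(server_id, load):
    """Pack one load report into a datagram payload."""
    return REPORT.pack(server_id, load)


def decode_report(payload):
    """Unpack a datagram payload into (server_id, load)."""
    return REPORT.unpack(payload)


def push_report(server_id, load, switch_address):
    """Push one report to the switch by UDP unicast (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_report(server_id, load), switch_address)
```

Because the server pushes a fixed-size datagram at the end of each interval, the switch never has to poll, which matches the motivation given above for the push design.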

C. Load Management 

The load management module runs in a listening state. When it receives a server's composite load information, it creates a child thread and passes the current load value to the llac module in kernel space using the ipvsadm management tool. The frequency statistics module is registered on the NF_IP_LOCAL_IN hook of Netfilter [12] in order to parse request packets and count access frequencies. The priority of the frequency statistics function must be higher than that of IPVS; otherwise the request is forwarded at this point and never reaches the frequency statistics function. We use hash lists to speed up access and lookup. The entries are linked together through the generic list head pointer inside the structure. We sort the files in descending order of access frequency and let sorted_list point to the sorted list (Fig. 2), so that hot file information is easy to obtain.
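The frequency statistics structure can be approximated in user space as a hash table of counters plus a view sorted by descending frequency. The sketch below is an analogy only: the kernel module uses C hash lists and list heads, and the class and method names here are hypothetical.

```python
from collections import Counter


class FrequencyTable:
    """Hash-based access counters with a frequency-sorted view."""

    def __init__(self):
        self._hits = Counter()   # hash table: file path -> access count

    def record(self, path):
        """Count one access to the given file."""
        self._hits[path] += 1

    def sorted_list(self):
        """Return (path, count) pairs, most frequently accessed first."""
        # Ties are broken by path so that the order is deterministic.
        return sorted(self._hits.items(), key=lambda item: (-item[1], item[0]))

    def hottest(self, n):
        """Return the n most frequently accessed paths."""
        return [path for path, _ in self.sorted_list()[:n]]
```

The hash lookup keeps per-packet counting cheap, while the sorted view is what the switch would consult when deciding which files to announce as hot.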

D. Cache Replacement 

We use a memory file system, carved out of main memory, as the hot file cache. Our novelty lies in using link change technology: the file's location on disk is replaced by a symbolic link that points to the copy of the file cached in the memory file system. This brings several benefits. The URL information does not need to be modified, and the service program requires no modification to access the cached file from memory. Whether a file resides on disk or in the memory file system is transparent to users and service programs.
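The link change idea, together with the ratio-based replacement rule that this subsection goes on to describe, can be sketched as follows. The `.orig` backup file, the helper names and the assumption that cached file names do not collide are illustrative choices, not details taken from the paper; in practice `cache_dir` would be a mount point of a memory file system such as tmpfs.

```python
import os
import shutil


def cache_file(disk_path, cache_dir):
    """Copy a hot file into cache_dir and replace disk_path with a symlink."""
    cached = os.path.join(cache_dir, os.path.basename(disk_path))
    shutil.copy2(disk_path, cached)
    os.rename(disk_path, disk_path + ".orig")   # assumed: keep the disk copy
    os.symlink(cached, disk_path)               # same path now resolves to memory
    return cached


def uncache_file(disk_path):
    """Undo cache_file: drop the symlink and the cached copy, restore the file."""
    cached = os.readlink(disk_path)
    os.remove(disk_path)
    os.rename(disk_path + ".orig", disk_path)
    os.remove(cached)


def pick_victim(entries):
    """Choose the cached file with the smallest access-frequency/size ratio.

    entries maps path -> (access_count, size_in_bytes), with size > 0.
    """
    return min(entries, key=lambda p: entries[p][0] / entries[p][1])
```

Because the web server keeps opening the same path, neither the URL nor the service program changes; only the target of the path moves between disk and memory.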

Because RAM is limited, when the memory file system does not have enough space for a file that needs to be cached, cached files must be evicted according to a replacement policy. We use the following cache replacement strategy. We sort the files by the ratio of access frequency to file size. When a replacement is needed, files with a small ratio are evicted first. In this way, files with low access frequency and large size are replaced first. To a certain extent, this improves both the cache hit rate and the cache utilization.

Figure 2. File Frequency Statistics

IV. EXPERIMENTAL RESULTS

To analyze the proposed dispatching algorithm, we implement it on a web server cluster. The experimental testbed uses the hardware and software configurations described below.

A. Hardware Setup

The web server cluster consists of a load balancing switch node connected to the web server nodes. All nodes are connected through a high-speed gigabit LAN switch. The distributed architecture of the cluster is hidden from the clients behind a unique virtual IP address. The hardware environment is shown in TABLE II.

TABLE II. HARDWARE ENVIRONMENT

| Node | CPU | Memory (GB) | HD | NIC |
|---|---|---|---|---|
| Front-end | Intel(R) E5345 2.33GHz 8cores | DDR2 4 | 60GB | 80003ES2LAN |
| Back-ends (1-4) | Intel(R) E5345 2.33GHz 8cores | DDR2 4 | 60GB | 80003ES2LAN |

B. Software Setup

TABLE III shows the experimental software environment. All machines in the cluster run Linux kernel 2.6.18, and the load balancing switch uses IPVS for request dispatching. We use Apache 2.0.40 as the web server for HTTP service, with HTTP/1.1 connections. In addition, all clients have WebBench [13] installed for issuing requests.

TABLE III. SOFTWARE ENVIRONMENT

| Node | OS | Kernel | IPVS | Web server | Benchmarks |
|---|---|---|---|---|---|
| LB switch | Red Hat Linux5.0 | 2.6.18 | 1.0.4 | - | - |
| Web server | Red Hat Linux5.0 | 2.6.18 | - | Apache 2.0.40 | - |
| Client | Red Hat Linux5.0 | 2.6.18 | - | - | WebBench 5.0 |

WebBench is performance testing software for web servers that includes both a controller and clients. The controller directs the clients to issue requests, records and summarizes the experimental data, and then outputs the experimental results. In addition, WebBench can control the mix of request types sent by the clients through a programmable workload.

We perform all experiments to analyze the system performance under different ratios of request types (e.g. different localities of hot web pages). We also create a workload generator to produce a synthetic workload for various ratios of request types. The performance metrics we use are requests per second (req/s), megabits per second (Mbps) and the number of successful requests, all of which are summarized and reported by WebBench.

C. Experimental evaluation 

In this section, we present the performance evaluation of the proposed llac request distribution algorithm. In this test, WebBench is used and the hot web pages are built from the requested web pages of the default workload. Furthermore, we set the click-through rate (CTR) to 20%, 40%, 60% and 80%, and vary the percentage of hot web pages among the requested web pages over 10%, 20%, 30% and 40%. We compare the experimental results with those of wrr. We also compare llac with using only ll (least load) or only ac (active caching). Figs. 3, 4 and 5 show that llac outperforms wrr, ll and ac.

The llac policy performs better than wrr and ll because it uses a frequency-based mechanism to achieve high cache hit rates on the servers. It performs better than ac because it takes each server's current load into account, assessing the load and selecting an appropriate node to respond to the request.

The experimental results demonstrate that when the web server cluster is under heavy load, the llac policy handles more requests and shows better performance.

Figure 3. Number of requests per minute

Figure 4. Data transfer per second

Figure 5. Number of successful requests

V. SUMMARY 

This paper presents a new request distribution algorithm for web server clusters, called llac. The research focuses on reducing the access time of hot files and on using the resources of web servers more efficiently. Our experimental results show that the proposed llac policy achieves better performance than wrr, ll and ac under heavy load. The policy reduces response time especially for hot files, because hot files are retrieved from memory. The node with the least load is selected to serve each request, which improves resource utilization. In future work we plan to experiment with more benchmarks to further verify the effectiveness of llac.


ACKNOWLEDGMENT

This study is sponsored by the fund of the State Key

Laboratory of Software Development Environment under 

Grant No. SKLSDE-2009ZX-01, the National Natural 

Science Foundation of China under Grant No. 61003015, 

the Doctoral Fund of Ministry of Education of China 

under Grant No. 20101102110018 and the Fundamental 

Research for the Central Universities under Grant No. 

YWF-10-02-058.


REFERENCES 

[1] A. Chandra, P. Pradhan, R. Tewari, S. Sahu, P. Shenoy. 

"An observation-based approach towards self-managing 

web servers", Computer Communications, 2006, pp1174- 

1188. 

[2] V. Cardellini, E. Casalicchio, M. Colajanni, S. Tucci, 

"Mechanisms for quality of service in web clusters", 

Computer Networks, vol.37, No.6, 2001, pp761-771. 

[3] M.E. Crovella, A. Bestavros. "Self-Similarity in World 

Wide Web Traffic: Evidence and Possible Causes", 

IEEE/ACM Transactions on Networking, vol.5, No.6, 

1997, pp835-846. 

[4] V. Cardellini, E. Casalicchio, M. Colajanni, and P.S. Yu. 

"The State of the Art in Locally Distributed Web-Server 

Systems", ACM Computing Surveys, vol.34, No.2, 2002, 

pp 263-311. 

[5] M. Andreolini, E. Casalicchio. "A cluster-based web 

system providing differentiated and guaranteed services", 

Cluster Computing , vol.7, No.1, 2004, pp7-19. 

[6] E. Choi. "Performance test and analysis for an adaptive 

load balancing mechanism on distributed server cluster 

systems", Future Generation Computer Systems, No.20, 

2004, pp 237-247. 

[7] V.S. Pai, M. Aron, G. Banga. "Locality-Aware Request Distribution in Cluster-based Network Servers", ACM SIGOPS Operating Systems Review, USA:ACM, 1998, pp205-216.

[8] L. Cherkasova, M. Karlsson. "Scalable Web Server 

Cluster Design with Workload-Aware Request 

Distribution Strategy WARD", Advanced Issues of E- 

Commerce and Web-Based Information Systems, 

Washington:IEEE Computer Society, 2001, pp212-221. 

[9] E. Casalicchio, M. Colajanni. "A client-aware dispatching 

algorithm for Web clusters providing multiple services", 

The International World Wide Web Conference 

Committee (IW3C2), 2001, pp535-544. 

[10] Sysstat-7.0.4. http://perso.orange.fr/sebastien.godard/ 

[11] M. Andreolini, S. Casolari, M. Colajanni. "Models and framework for supporting runtime decisions in Web-based systems", ACM Transactions on the Web (TWEB), vol.2, No.3, 2008, pp1-43.

[12] CHRISTIAN BENVENUTI. Understanding LINUX 

NETWORK INTERNALS. 2006. 

[13] http://linux.softpedia.com/get/System/Benchmarks/Webbench-1378.shtml 

[14] M.L. Chiang, Y.C. Lin, L.F. Guo. "Design and implementation of an efficient web cluster with content-based request distribution and file caching", Journal of Systems and Software, vol.81, No.11, 2008, pp 2044-2058.

[15] S. Sharifian, S.A. Motamedi, M.K. Akbari. "A content-based load balancing algorithm with admission control for cluster web servers", Future Generation Computer Systems, vol.24, No.8, 2008, pp775-787.

[16] M.L. Chiang, C.H. Wu, Y.J. Liao, Y.F. Chen. "New 

Content-aware Request Distribution Policies in Web 

Clusters Providing Multiple Services", Proceedings of the 

2009 ACM symposium on Applied Computing, 

USA:ACM, 2009, pp79-83. 

[17] Z.Y. Xu, J.Z. Han, L. Bhuyan. "Scalable and 

Decentralized Content-Aware Dispatching in Web 

Clusters", IEEE International Performance, Computing, 

and Communications, Washington:IEEE Computer 

Society, 2007, pp202-209. 

[18] Y.K. Chang. "Fully Pre-Splicing TCP for Web Switches", 

Proceedings of the First International Conference on 


Innovative Computing, Information and Control, 

Washington:IEEE Computer Society , 2006, pp737-740. 

[19] S. Chase , D.C. Anderson. "Managing energy and server 

resources in hosting centers", In Proc. of the eighteenth 

ACM symposium on Operating systems principles, 2001, 

pp103-116. 

[20] Tarek F. Abdelzaher, Kang G. Shin, and Nina Bhatti. 

"Performance Guarantees for Web Server End-Systems: A 

Control-Theoretical Approach", IEEE Transactions on 

Parallel and Distributed Systems, June 2001. 

[21] Yasushi Saito, Brian N. Bershad, and Henry M. Levy. "An 

approximation-based load-balancing algorithm with 

admission control for cluster web servers with dynamic 

workloads", Journal of Supercomputing, vol.53, No.3, 

2010, pp 440-463. 

Wei Zhang, Hebei Province, China. Birthdate: December 1983. She is a PhD candidate in the Department of Computer Science and Technology at Beihang University. She received her master's degree in 2008. Her research interests include virtualization, load balancing and cloud computing.

Huan Wang, Hunan Province, China. Birthdate: October 1986. Huan Wang holds a master's degree in Computer Science and Engineering from the Department of Computer Science, Beihang University. Research interests include operating systems, load balancing, parallel computing and massive data processing.

Binbin Yu was born in July 1987 and is studying for a master's degree in the Computer Science College at Beihang University, concentrating mainly on load balancing and cloud computing.

Wei Xu, Fujian Province, China. Birthdate: November 1986. Wei Xu holds a master's degree in Computer Science and Technology from the Department of Computer Science and Engineering, Beihang University. Research interests include virtualization, operating systems, high performance computing and cloud computing.

Mingfa Zhu was born in 1945. Ph.D., Professor, Senior Member of the China Computer Federation. His main research areas are computer architecture, computer system software, high performance computing, virtualization and cloud computing.

Limin Xiao was born in 1970. Ph.D., Professor, Senior Member of the China Computer Federation. His main research areas are computer architecture, computer system software, high performance computing, virtualization and cloud computing.

Li Ruan was born in 1978. Ph.D., Lecturer, Member of the China Computer Federation. Her main research areas are computer architecture, computer system software, high performance computing, virtualization and cloud computing.

Aims and Scope. 

Call for Papers and Special Issues 

Journal of Networks (JNW, ISSN 1796-2056) is a scholarly peer-reviewed international scientific journal published monthly, focusing on theories, methods, and applications in networks. It provides a high profile, leading edge forum for academic researchers, industrial professionals, engineers, consultants, managers, educators and policy makers working in the field to contribute and disseminate innovative new work on networks.

The Journal of Networks reflects the multidisciplinary nature of communications networks. It is committed to the timely publication of high-quality papers that advance the state-of-the-art and practical applications of communication networks. Both theoretical research contributions (presenting new techniques, concepts, or analyses) and applied contributions (reporting on experiences and experiments with actual systems) and tutorial expositions of permanent reference value are published. The topics covered by this journal include, but are not limited to, the following:

• Network Technologies, Services and Applications, Network Operations and Management, Network Architecture and Design 

• Next Generation Networks, Next Generation Mobile Networks 

• Communication Protocols and Theory, Signal Processing for Communications, Formal Methods in Communication Protocols 

• Multimedia Communications, Communications QoS 

• Information, Communications and Network Security, Reliability and Performance Modeling 

• Network Access, Error Recovery, Routing, Congestion, and Flow Control 

• BAN, PAN, LAN, MAN, WAN, Internet, Network Interconnections, Broadband and Very High Rate Networks, 

• Wireless Communications & Networking, Bluetooth, IrDA, RFID, WLAN, WMAX, 3G, Wireless Ad Hoc and Sensor Networks 

• Data Networks and Telephone Networks, Optical Systems and Networks, Satellite and Space Communications 

Special Issue Guidelines 

Special issues feature specifically aimed and targeted topics of interest contributed by authors responding to a particular Call for Papers or by 

invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are of interest to the Journal. 

Preference will be given to proposals that cover some unique aspect of the technology and ones that include subjects that are timely and useful to the 

readers of the Journal. A Special Issue is typically made of 10 to 15 papers, with each paper 8 to 12 pages of length. 

The following information should be included as part of the proposal: 

• Proposed title for the Special Issue 

• Description of the topic area to be focused upon and justification 

• Review process for the selection and rejection of papers. 

• Name, contact, position, affiliation, and biography of the Guest Editor(s) 

• List of potential reviewers 

• Potential authors to the issue 

• Tentative time-table for the call for papers and reviews 

If a proposal is accepted, the guest editor will be responsible for: 

• Preparing the “Call for Papers” to be included on the Journal’s Web site. 

• Distribution of the Call for Papers broadly to various mailing lists and sites. 

• Getting submissions, arranging review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Instructions for Authors.

• Providing us the completed and approved final versions of the papers formatted in the Journal’s style, together with all authors’ contact 

information. 

• Writing a one- or two-page introductory editorial to be published in the Special Issue. 

Special Issue for a Conference/Workshop 

A special issue for a Conference/Workshop is usually released in association with the committee members of the Conference/Workshop like 

general chairs and/or program chairs who are appointed as the Guest Editors of the Special Issue. Special Issue for a Conference/Workshop is 

typically made of 10 to 15 papers, with each paper 8 to 12 pages of length. 

Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop: 

• Selecting a Title for the Special Issue, e.g. “Special Issue: Selected Best Papers of XYZ Conference”. 

• Sending us a formal “Letter of Intent” for the Special Issue. 

• Creating a “Call for Papers” for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees. 

Information about the Journal and Academy Publisher can be included in the Call for Papers. 

• Establishing criteria for paper selection/rejections. The papers can be nominated based on multiple criteria, e.g. rank in review process plus 

the evaluation from the Session Chairs and the feedback from the Conference attendees. 

• Selecting and inviting submissions, arranging review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.

• Providing us the completed and approved final versions of the papers formatted in the Journal’s style, together with all authors’ contact 

information. 

• Writing a one- or two-page introductory editorial to be published in the Special Issue. 

More information is available on the web site at http://www.academypublisher.com/jnw/.

