Journal of Networks - Academy Publisher
Journal of Networks<br />
ISSN 1796-2056<br />
Volume 6, Number 12, December 2011<br />
Contents<br />
REGULAR PAPERS<br />
Botnet Detection Architecture Based on Heterogeneous Multi-sensor Information Fusion (p. 1655)<br />
HaiLong Wang, Jie Hou, and ZhengHu Gong<br />
XOEM plus OWL-based STEP Product Information Uniform Description and Implementation (p. 1662)<br />
Chengfeng Jian and Haizhong Meng<br />
Design of Greenhouse Control System Based on Wireless Sensor Networks and AVR Microcontroller (p. 1668)<br />
Yongxian Song, Chenglong Gong, Yuan Feng, Juanli Ma, and Xianjin Zhang<br />
Simulation of Networked Control System based on Smith Compensator and Single Neuron Incomplete Differential Forward PID (p. 1675)<br />
Haitao Zhang and Zhen Li<br />
A Web Crawler System Design Based on Distributed Technology (p. 1682)<br />
Shaojun Zhong and Zhijuan Deng<br />
A Ranking Method of Retrieval Results Based on Web Comprehending (p. 1690)<br />
Zhijuan Deng and Shaojun Zhong<br />
An Encryption Scheme with Hidden Keyword Search for Outsourced Database (p. 1697)<br />
Xiaoming Wang, Guoxiang Yao, and Zhen Zhang<br />
A Method of Object-based De-duplication (p. 1705)<br />
Fang Yan and YuAn Tan<br />
Analysis on E-consumers’ Purchasing Behavior Based on Data-driving Model (p. 1713)<br />
Lijuan Huang<br />
Repair Method of Complex Network Based on Matthew Effect (p. 1719)<br />
Minsheng Tan, Qiang Cui, Lingfeng Zhu, and Hui Zhao<br />
Study and Design an Anycast Routing Protocol for Wireless Sensor Networks (p. 1726)<br />
Demin Gao, Huanyan Qian, Zheng Wang, and Jiguang Chen<br />
Management Model Research of Low-power Wireless Sensor Network (p. 1734)<br />
LinGe Wang and YueDou Qi<br />
Covert Flow Graph Approach to Identifying Covert Channels (p. 1740)<br />
XiangMei Song and ShiGuang Ju<br />
A Novel HAVE Message of Peer-to-peer Protocol in BitTorrent Systems (p. 1747)<br />
Jianyong Li, Jianchun Li, Daoying Huang, and Qiang Wei<br />
Image-based Position Estimation and Adaptive Modulation Coding in Vehicular Communication (p. 1754)<br />
Hao Yang, Qingmin Meng, Xiong Gu, and Baoyu Zheng<br />
A Request Distribution Algorithm for Web Server Cluster (p. 1760)<br />
Wei Zhang, Huan Wang, Binbin Yu, Wei Xu, Mingfa Zhu, Limin Xiao, and Li Ruan<br />
JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1655<br />
Botnet Detection Architecture Based on<br />
Heterogeneous Multi-sensor Information Fusion<br />
HaiLong Wang and Jie Hou<br />
National University of Defense Technology, Changsha, 410073, China<br />
Email: {hlwang1981, jhou1983}@gmail.com<br />
ZhengHu Gong<br />
National University of Defense Technology, Changsha, 410073, China<br />
Email: gzh@nudt.edu.cn<br />
Abstract—As technology has been developed rapidly, botnet<br />
threats to the global cyber community are also increasing,<br />
and botnet detection has recently become a major research<br />
topic in the field of network security. Most current<br />
detection approaches rely on evidence from a single<br />
information source, which cannot capture all the traces of a<br />
botnet and therefore rarely achieves high accuracy. In this<br />
paper, a novel botnet detection architecture based on<br />
heterogeneous multi-sensor information fusion is proposed.<br />
The architecture is designed to integrate information at<br />
three fusion levels: data, feature, and decision. As the core<br />
component, a feature extraction module is designed in detail.<br />
An extended algorithm of the Dempster-Shafer (D-S) theory is<br />
proved and adopted for decision fusion. Furthermore, a<br />
representative case is provided to illustrate that the<br />
detection architecture can effectively fuse complicated<br />
information from various sensors and thus achieve better<br />
detection results.<br />
Index Terms—botnet, botnet detection, network security,<br />
information fusion, D-S theory<br />
I. INTRODUCTION<br />
Internet threats have recently transformed from highly<br />
visible, disruptive attacks to stealthy attacks carried out for<br />
profit, and botnets are at the center of this change [1].<br />
Botnets have been the workhorses of many disastrous<br />
attacks, such as information theft [2], distributed denial of<br />
service (DDoS) [3], and spam [4]. These threats can disable<br />
infrastructure and cause financial damage, which poses a<br />
severe challenge to global network security. Hence, to<br />
detect botnet attacks effectively, we need a correct and<br />
comprehensive understanding of them. In particular, we must<br />
fuse all the gathered information related to botnet activities<br />
from heterogeneous multi-source sensors, and then carry out<br />
further analysis for decision-making. Information fusion is<br />
therefore a necessary component of botnet detection [5].<br />
Manuscript received January 1, 2011; revised June 1, 2011; accepted<br />
July 1, 2011.<br />
© 2011 ACADEMY PUBLISHER<br />
doi:10.4304/jnw.6.12.1655-1661<br />
A botnet is a network of computers on which software<br />
called a ‘bot’ has been installed automatically, without user<br />
intervention, and which is remotely controlled via a<br />
command and control channel for malicious purposes [6].<br />
Botnet activities share the following characteristics.<br />
First, they have more action phases and forms than<br />
traditional malware attacks. The activity cycle of a botnet<br />
attack usually consists of four stages: propagation,<br />
infection, communication, and attack [7]. Even within the<br />
same stage, different botnet attacks can take various forms,<br />
such as propagating through system vulnerabilities or email.<br />
Second, botnet activities are wide-ranging, spanning private<br />
hosts, local area networks, and the backbone [8]. Third,<br />
botnet activities are hidden. Because the resulting network<br />
traffic is small, bots can upgrade themselves without being<br />
exposed for a long time [9]. These three characteristics make<br />
botnet detection very difficult. However, it is well known<br />
that a botnet leaves traces as it acts over such a wide<br />
range [10]. Many types of information sources can be<br />
retrieved, such as network packets, network flows, system<br />
logs, alerts from anti-virus software or intrusion detection<br />
systems, and the analysis results of botnet detection tools.<br />
Although this information can be used to identify the traces<br />
of a botnet, it is usually large-scale, uncertain, and<br />
redundant.<br />
Despite the importance of information fusion for botnet<br />
detection, most existing work does not focus on it. To the<br />
best of our knowledge, existing botnet detection schemes can<br />
discover bots to some extent, but they do not make full use<br />
of the varied information related to botnet activities and<br />
cannot capture the entire picture of a botnet infiltration.<br />
In recent years, multi-sensor information fusion has<br />
developed rapidly and has been applied in many sophisticated<br />
application areas, especially network security [11]. To<br />
integrate complicated information from heterogeneous,<br />
multi-source sensors efficiently, we propose a botnet<br />
detection architecture based on information fusion<br />
techniques. In the architecture, we design a novel feature<br />
extraction module and adopt an extended decision fusion<br />
algorithm, which enables detection to achieve three-level<br />
fusion of data, feature, and decision.<br />
The remainder of the paper is organized as follows.<br />
Section II discusses background technologies and related<br />
work. Section III presents the botnet detection architecture<br />
based on heterogeneous multi-sensor information fusion.<br />
Section IV introduces the fusion algorithm used in the<br />
architecture and gives its proof. Section V illustrates<br />
botnet detection with a scenario. Finally, Section VI<br />
concludes the paper.<br />
II. RELATED WORK<br />
A botnet is a newer type of attack that developed from, and<br />
combines, network worms, Trojans, backdoor tools, and other<br />
traditional forms of malicious code [12]. Compared with<br />
these traditional attacks, the major difference is that a<br />
botnet has a one-to-many control relationship between the<br />
attacker and the bots [13]. This feature makes botnets<br />
stealthier, more flexible, and more efficient than other<br />
malicious programs.<br />
As botnets have evolved, so have the techniques for<br />
detecting them. Many diverse schemes for botnet detection<br />
have been proposed, such as honeypots or honeynets for<br />
capture and analysis [14], correlation analysis of malicious<br />
behaviors [15], detection approaches for different C&C<br />
mechanisms (e.g., IRC, HTTP, DNS, or P2P) [16-19], and<br />
identifying bots from DDoS and spam [20, 21]. However, these<br />
techniques mainly focus on network traffic and obtain<br />
evidence of botnet activities indirectly. For example,<br />
evidence of a bot upgrade is obtained by identifying the<br />
upgrade binaries in the traffic, rather than directly from<br />
the code server that logs the download event. Reliance on a<br />
single information source and indirect evidence causes three<br />
problems for botnet detection. First, it usually produces<br />
false positives and false negatives. Second, it lengthens the<br />
detection cycle, because multiple rounds of observation are<br />
generally required to reach a correct result. Third, because<br />
information collection is inadequate, it is very difficult to<br />
become aware of new botnets or botnet variants. Therefore,<br />
detection architectures that can integrate heterogeneous<br />
multi-sensor information deserve more attention.<br />
Robert et al. [22] design a multi-layered architecture<br />
for detecting a wide range of existing and new botnets. The<br />
architecture can integrate many techniques to detect botnets<br />
using the information gathered from all the available<br />
network information sources: network traffic data, system<br />
process information, and file system information.<br />
Napoleon et al. [23] introduce a risk-aware network-centric<br />
management framework to detect and prevent targeted botnet<br />
attacks as well as propagation attempts within the network.<br />
The framework systematically collects network traffic and<br />
software vulnerabilities, comprehensively analyzes and<br />
discovers the characteristics and unique behaviors of bots,<br />
and dynamically determines the associated risks and generates<br />
corresponding detection rules. Zhang et al. [24] develop a<br />
top-down analytical framework as a basis for critical<br />
evaluation of existing countermeasures. The framework<br />
correlates and integrates the observations and reports of<br />
anti-botnet tools at different layers, i.e., Internet,<br />
intranet, and host, to obtain a complete snapshot of the<br />
botnet. Alireza et al. [25] propose an architecture called<br />
“Visual Threat Monitor” that combines data mining and<br />
visualization to enhance botnet traffic detection. Its<br />
processing pipeline consists of correlation, statistical<br />
analysis, clustering, aggregation, and visualization.<br />
Building on the studies [15, 26, 27], Gu et al. [28] present<br />
a general detection framework to realize more accurate botnet<br />
detection over a local area network. It analyzes traffic and<br />
network flows and correlates them with multiple alerts or<br />
events from an intrusion detection system.<br />
The aforementioned detection architectures have several<br />
problems with respect to information fusion. First, the types<br />
of information source are incomplete, and there is no proper<br />
method for dividing the information sources according to<br />
botnet activities, which causes highly redundant information<br />
and poorly targeted collection. Second, although they adopt<br />
some correlation analysis methods, the aforementioned schemes<br />
lack a powerful algorithm to fuse large-scale information<br />
from different sources and obtain the correlation between<br />
attackers and their botnets. Third, most existing frameworks<br />
either have no independent feature extraction module or<br />
provide only overly simple feature extraction.<br />
III. ARCHITECTURE OVERVIEW<br />
Information fusion is an information processing method<br />
that uses information from multiple sensors, together with<br />
related information from associated databases, to achieve<br />
improved accuracy and more specific inferences than could be<br />
achieved with a single sensor alone [29]. Network security is<br />
a recent application area of information fusion, and the<br />
applications so far are mainly about improving IDSs [30].<br />
Information fusion processes are often categorized as data,<br />
feature, or decision level fusion, depending on the<br />
processing stage at which fusion takes place [31]. Data level<br />
fusion combines several sources of raw information to produce<br />
new information that is expected to be more informative and<br />
synthetic than the inputs. Feature level fusion combines<br />
various features into a feature map that may be used by<br />
further processing. Decision level fusion combines decisions<br />
coming from several sources of expert knowledge. Following<br />
these fusion processes, we present the botnet detection<br />
architecture based on heterogeneous multi-sensor information<br />
fusion, which consists of the parts shown in Figure 1.<br />
A. Information Collection<br />
This part adopts a role-based collaborative information<br />
collection model, which is our recent work in [32]. It<br />
includes the software and hardware of the information<br />
collection system; its main task is to collect all the<br />
information related to botnet activities from heterogeneous<br />
multi-source sensors. The information can be gathered from<br />
computers in the network, from network security equipment<br />
such as firewalls and intrusion detection systems (IDS), and<br />
from network equipment such as routers and switches. The<br />
function of this part is implemented by the information<br />
collection agent.<br />
Figure 1. Botnet detection architecture based on heterogeneous multi-sensor information fusion.<br />
B. Pre-processing<br />
To increase the effectiveness and efficiency of the<br />
subsequent information fusion, pre-processing is needed. The<br />
pre-processing module is composed of classification and<br />
refinement. Classification divides the information sources<br />
into original information, indirect information, and direct<br />
information. Original information is the real record of<br />
network and system behaviors without any security analysis,<br />
such as packet payloads and system process information.<br />
Indirect information is the alarm information from general<br />
security software, such as anti-virus software, firewalls,<br />
and honeypots. Indirect information, always combined with<br />
original information, can serve as indirect evidence for<br />
botnet detection. Direct information is the analysis result<br />
of dedicated botnet detection tools (e.g., BotHunter [15]),<br />
which can serve as direct evidence for botnet detection.<br />
Refinement filters out unwanted information, detects<br />
suspicious information based on preset rules, unifies the<br />
representation, and stores the result in the information<br />
database.<br />
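The classification and refinement steps can be illustrated with a minimal Python sketch.<br />
Everything named here is an assumption made for illustration: the sensor type keys, the<br />
record fields (`source`, `host`, `detail`, `score`), and the suspicion rule are hypothetical<br />
and are not taken from the architecture itself.<br />

```python
from enum import Enum


class InfoClass(Enum):
    ORIGINAL = "original"   # raw records without any security analysis
    INDIRECT = "indirect"   # alarms from general security software
    DIRECT = "direct"       # results of dedicated botnet detection tools


# Hypothetical mapping from sensor type to information class.
SOURCE_CLASS = {
    "packet_payload": InfoClass.ORIGINAL,
    "system_process": InfoClass.ORIGINAL,
    "antivirus": InfoClass.INDIRECT,
    "firewall": InfoClass.INDIRECT,
    "honeypot": InfoClass.INDIRECT,
    "bothunter": InfoClass.DIRECT,
}


def preprocess(records, is_suspicious):
    """Classify records, drop unwanted ones, and unify their representation."""
    database = {cls: [] for cls in InfoClass}
    for record in records:
        cls = SOURCE_CLASS.get(record.get("source"))
        # Filter out unknown sources and records that fail the preset rule.
        if cls is None or not is_suspicious(record):
            continue
        unified = {
            "source": record["source"],
            "host": record.get("host"),
            "detail": record.get("detail", ""),
        }
        database[cls].append(unified)
    return database


# Illustrative input; the values are invented.
records = [
    {"source": "packet_payload", "host": "10.0.0.5", "detail": "IRC JOIN", "score": 0.7},
    {"source": "antivirus", "host": "10.0.0.5", "detail": "trojan alert", "score": 0.9},
    {"source": "bothunter", "host": "10.0.0.5", "detail": "bot profile", "score": 0.8},
    {"source": "printer_log", "host": "10.0.0.9", "detail": "paper jam", "score": 0.0},
    {"source": "firewall", "host": "10.0.0.7", "detail": "routine allow", "score": 0.1},
]
database = preprocess(records, is_suspicious=lambda r: r["score"] >= 0.5)
```

In this sketch the `database` dictionary stands in for the information database, grouped<br />
by the three classes so that later fusion steps can treat each class differently.<br />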
C. Feature Extraction<br />
Feature extraction is the core module of the<br />
architecture. Existing detection techniques use the<br />
following two extraction modes:<br />
• Utilizing the botnet samples captured by a<br />
honeypot or honeynet (including bots, message<br />
contents, etc.). Because the sample data is<br />
relatively pure, the extracted data features can be<br />
directly adopted as the essential features<br />
(signatures or patterns) of a botnet.<br />
• Utilizing general information (such as flow data,<br />
logs, etc.) and indirect information. The main<br />
process is: first, try to discover data features;<br />
then, compare them with the results found by a<br />
proven botnet detection system; finally, verify<br />
whether the data features belong to the essential<br />
features of botnets.<br />
Figure 2. Feature extraction.<br />
Our feature extraction covers both of the above modes. As<br />
shown in Figure 2, the feature extraction module consists of<br />
four parts: attribute selection, data feature analysis,<br />
validation, and scheme management. Data feature analysis<br />
integrates data mining methods, such as statistical data<br />
analysis, pattern recognition, artificial neural networks,<br />
and support vector machines. Its goal is to provide a<br />
mechanism for identifying new features in the data sets<br />
produced by attribute selection. Features derived from<br />
general and indirect information must be verified before<br />
being stored in the feature database as signatures or<br />
patterns. Meanwhile, based on the extracted features, scheme<br />
management gives feedback to the data feature analysis and<br />
attribute selection modules for dynamic optimization. In<br />
addition, this part divides the analytic results into four<br />
main categories, according to the stages of botnet<br />
activities, as the inputs of the fusion analysis:<br />
propagation, infection, communication, and attack.<br />
D. Fusion Analysis<br />
Fusion analysis is also key to the architecture and is<br />
the main module for producing high-level, qualified alerts.<br />
This part gives the final conclusion for decision-making,<br />
reflecting the real situation of the botnet activities. The<br />
detailed process is described in Section IV.<br />
E. Database<br />
The information database stores the results produced by<br />
the pre-processing module. The results are divided into<br />
three categories, which is useful for the subsequent fusion.<br />
The feature database stores signatures and patterns from<br />
feature extraction. The knowledge database includes the<br />
vulnerability database, security policies, client<br />
configuration records, etc., which provide strong data<br />
support for decision-making.<br />
F. Control and Collaboration Mechanism<br />
The control mechanism is used to react to offending<br />
events taking place on or within the detection system.<br />
Depending on the result of analysis and synthesis, this part<br />
applies response measures to the main modules. Some response<br />
work can be completed automatically by the control mechanism<br />
through adjusting system parameters. The collaboration<br />
mechanism provides communication and function calls among<br />
detection systems or with other security products.<br />
IV. METHOD OF FUSION ANALYSIS<br />
Decision fusion algorithms in the fusion analysis face<br />
three critical requirements:<br />
• Flexibility. The algorithm should require no prior<br />
probability or conditional probability. Since<br />
botnet behaviors are often random and uncertain,<br />
it is difficult to obtain prior knowledge.<br />
• Compatibility. The algorithm should effectively<br />
integrate heterogeneous multi-sensor information;<br />
in particular, the decision should become more<br />
accurate as evidence accumulates.<br />
• Scalability. The algorithm should easily fuse new<br />
evidence from emerging sensors without changing<br />
its framework.<br />
Among the main algorithms, the Dempster-Shafer (D-S) theory<br />
meets these requirements. The D-S theory is a mathematical<br />
theory of evidence, introduced in the 1960s by Arthur<br />
Dempster [33] and developed in the 1970s by Glenn Shafer<br />
[34]. It is viewed as a mechanism for reasoning under<br />
epistemic (knowledge) uncertainty. We first give a brief<br />
introduction to the D-S theory [35]. Then we introduce the<br />
extended D-S theory proposed in [36], which our architecture<br />
uses to fuse the results of feature extraction, and we give<br />
a proof of the extended theory, which was not proved in<br />
[36].<br />
D-S Theory<br />
The frame of discernment (Θ) is a finite set of mutually<br />
exclusive propositions and hypotheses about some problem<br />
domain. The basic probability assignment (bpa) is stated in<br />
[34] as: “If Θ is a frame of discernment, then a function<br />
$m: 2^{\Theta} \rightarrow [0, 1]$ is called a basic<br />
probability assignment whenever<br />
$$m(\emptyset) = 0 \tag{1}$$<br />
$$\sum_{A \subset \Theta} m(A) = 1 \tag{2}$$<br />
The mass value of A (m(A)) is also called A’s basic<br />
probability number, and it is understood to be the<br />
measure of the belief that is committed exactly to A.”<br />
The belief function (Bel) sums the mass values of all<br />
subsets of A:<br />
$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B) \tag{3}$$<br />
The plausibility function (Pl) takes into account all the<br />
elements related to A (either supported by evidence or<br />
unknown):<br />
$$\mathrm{Pl}(A) = 1 - \mathrm{Bel}(\neg A) \tag{4}$$<br />
For the subset A, Bel(A) and Pl(A) represent the lower and<br />
upper belief bounds, respectively, and the interval<br />
[Bel(A), Pl(A)] represents the belief range.<br />
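As a small worked example of (3) and (4), the following Python sketch computes Bel and Pl<br />
for one hypothesis. The two-element frame {bot, benign} and the mass values are invented<br />
for illustration; representing subsets as `frozenset` objects is a choice made here, not<br />
something prescribed by the theory.<br />

```python
def belief(mass, a):
    """Bel(A): sum of m(B) over all focal sets B that are subsets of A, as in (3)."""
    return sum(m for b, m in mass.items() if b <= a)


def plausibility(mass, a, frame):
    """Pl(A) = 1 - Bel(not A), as in (4); 'not A' is the complement within the frame."""
    return 1.0 - belief(mass, frame - a)


# Illustrative frame of discernment and basic probability assignment.
frame = frozenset({"bot", "benign"})
mass = {
    frozenset({"bot"}): 0.6,     # evidence committed exactly to "bot"
    frozenset({"benign"}): 0.1,  # evidence committed exactly to "benign"
    frame: 0.3,                  # uncommitted mass (ignorance)
}

a = frozenset({"bot"})
bel = belief(mass, a)              # 0.6
pl = plausibility(mass, a, frame)  # 1 - Bel({benign}) = 1 - 0.1 = 0.9
```

Here the belief range for the hypothesis "bot" is [0.6, 0.9]: the width of the interval<br />
equals the mass left uncommitted on the whole frame.<br />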
Dempster’s rule of combination merges two mass functions<br />
$m_1$ and $m_2$ into a combined mass function $m_{12}$:<br />
$$m_{12}(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{\sum_{B \cap C \neq \emptyset} m_1(B)\, m_2(C)} \tag{5}$$<br />
Dempster’s rule of combination can be used to combine the<br />
mass values of all features from each individual sensor to<br />
obtain the overall summary mass values for each sensor.<br />
These summary values from all sensors are then combined to<br />
give the summary mass values for the system. Initially, the<br />
bpas are used to assign mass values to the appropriate<br />
hypotheses. Then the resulting mass values are used to<br />
calculate the belief in each hypothesis. Finally, all beliefs<br />
are combined with Dempster’s rule of combination to obtain<br />
the overall belief in each hypothesis, as shown in (5).<br />
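The combination rule in (5) can be sketched in a few lines of Python. This is a minimal<br />
illustration, assuming mass functions are dictionaries keyed by `frozenset` subsets of the<br />
frame; the function name `dempster_combine` and the sensor mass values are hypothetical.<br />

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule, as in (5)."""
    combined = {}
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b & c
            if a:  # keep only pairs with a non-empty intersection
                combined[a] = combined.get(a, 0.0) + mass_b * mass_c
    total = sum(combined.values())  # denominator of (5)
    if total == 0:
        raise ValueError("the two sources are in total conflict")
    return {a: value / total for a, value in combined.items()}


BOT = frozenset({"bot"})
BENIGN = frozenset({"benign"})
THETA = BOT | BENIGN  # the whole frame of discernment

# Illustrative mass functions from two sensors (invented values).
m1 = {BOT: 0.6, THETA: 0.4}
m2 = {BOT: 0.7, BENIGN: 0.1, THETA: 0.2}

m12 = dempster_combine(m1, m2)
# Non-empty products: 0.42 + 0.12 + 0.28 (bot), 0.04 (benign), 0.08 (frame);
# the conflicting product 0.6 * 0.1 = 0.06 is discarded, so the denominator is 0.94.
```

With these inputs the combined mass on "bot" is 0.82 / 0.94, which is higher than either<br />
sensor assigned on its own, and the combined masses again sum to 1.<br />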
Extended D-S Theory<br />
As shown in (5), Dempster’s rule of combination gives equal<br />
trust to the evidence provided by different sensors. In<br />
practice this is not the case. Observations show that, for<br />
the same type of sensor, local sensors provide more credible<br />
information than remote ones; the same sensor installed at<br />
different locations in the network has different detection<br />
capacity; and different types of sensors may have different<br />
detection capability and accuracy for the same type of<br />
attack. The evidence provided therefore differs greatly in<br />
importance and reliability. To solve these problems, the<br />
extended D-S theory assigns different weights to different<br />
sensors, which means that different sensors are given<br />
different degrees of trust. As shown in (6), a weighted<br />
exponent (index) method is used, and the evidence after<br />
combination should still satisfy the basic property of the<br />
bpa, that is to say, (2).<br />
$$m_{12}(A) = \frac{\sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \tag{6}$$<br />
In (6), $m_1$ and $m_2$ are mass functions over Θ, and<br />
$m_{12}(\emptyset) = 0$. We just need to prove<br />
$\sum_{A \subset \Theta} m_{12}(A) = 1$, which is shown in (7).<br />
So $m_{12}$ is also a mass function, and the evidence after<br />
combination by (6) indeed satisfies the basic property of the<br />
bpa.<br />
$$\begin{aligned}
\sum_{A \subset \Theta} m_{12}(A)
&= m_{12}(\emptyset) + \sum_{A \subset \Theta,\, A \neq \emptyset} m_{12}(A) \\
&= \sum_{A \subset \Theta,\, A \neq \emptyset} \frac{\sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \\
&= \frac{\sum_{A \subset \Theta,\, A \neq \emptyset} \sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \\
&= \frac{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} = 1
\end{aligned} \tag{7}$$<br />
In the extended D-S theory, the weights can be obtained<br />
from training samples based on maximum entropy or minimum<br />
mean square error, or set to empirical values obtained from<br />
several tests.<br />
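The weighted rule in (6) and the normalization property proved in (7) can be checked<br />
numerically with a short Python sketch. The function name `weighted_combine`, the mass<br />
values, and the weights 0.8 and 0.5 are all assumptions chosen for illustration; the text<br />
does not fix a range for the weights, and this sketch assumes strictly positive masses.<br />

```python
def weighted_combine(m1, m2, w1=1.0, w2=1.0):
    """Combine two mass functions with the weighted rule in (6).

    Each mass is raised to its sensor's weight before multiplication;
    with w1 = w2 = 1 this reduces to Dempster's rule in (5).
    """
    combined = {}
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b & c
            if a:  # pairs with an empty intersection are excluded
                term = (mass_b ** w1) * (mass_c ** w2)
                combined[a] = combined.get(a, 0.0) + term
    total = sum(combined.values())  # denominator of (6)
    if total == 0:
        raise ValueError("the two sources are in total conflict")
    return {a: value / total for a, value in combined.items()}


BOT = frozenset({"bot"})
BENIGN = frozenset({"benign"})
THETA = BOT | BENIGN

# Illustrative mass functions and weights (invented values).
m1 = {BOT: 0.6, THETA: 0.4}
m2 = {BOT: 0.7, BENIGN: 0.1, THETA: 0.2}

unweighted = weighted_combine(m1, m2)             # same as Dempster's rule
weighted = weighted_combine(m1, m2, w1=0.8, w2=0.5)
```

Whatever weights are chosen, the result sums to 1 and assigns no mass to the empty set,<br />
which is the property established in (7).<br />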
Weight Assignment<br />
In addition to the situation of the sensors, our research<br />
shows that weight assignment should take the stages of<br />
botnet activities into account. Features extracted from the<br />
communication and attack stages are more credible than those<br />
from the propagation and infection stages. Moreover, other<br />
factors, such as vulnerabilities, might also affect the<br />
weight assignment.<br />
V. SCENARIO<br />
Figure 3 shows a typical environment for botnet attacks.<br />
The environment contains local area networks and a backbone<br />
network, involving three application servers (EMAIL, WWW,<br />
and DNS), one management server, three firewalls, an<br />
attacker, a honeynet, several zombies, etc. The attacker<br />
sends commands to the zombies through the command and<br />
control channel. According to the commands, the bots on the<br />
zombies carry out actions such as propagation, information<br />
theft, DDoS attacks, and spam. The thin one-way arrows in<br />
Figure 3 show the process of command and control<br />
communication. In this typical environment, BotHunters are<br />
deployed for the two local area networks; spam filters are<br />
used on the EMAIL server; servers and hosts are equipped<br />
with terminal monitors and log analysis tools; network<br />
traffic monitors collect flow and traffic information; and<br />
vulnerability scanning systems collect vulnerability,<br />
configuration, and port information for servers and hosts.<br />
All the sensors send their information through the<br />
collection agents to the management server for fusion<br />
analysis. The management server then gives the final results<br />
to the administrator. The thick one-way arrows show this<br />
process.<br />
Figure 3. Illustration of botnet detection.<br />
To show how the botnet detection architecture works, an<br />
example of sending spam is provided in Figure 3. The<br />
attacker discovers the zombies online and commands the bots<br />
on the zombies to send spam to victim hosts A and B. On the<br />
one hand, the BotHunters can detect the bots in the network<br />
by monitoring the traffic. On the other hand, other sensors<br />
can also capture every stage of the spam botnet’s<br />
activities. The log analysis tools on the DNS server can<br />
discover suspicious hosts, because spam bots usually perform<br />
DNSBL lookups on the DNS server to determine whether they<br />
are blacklisted [37]. The terminal and traffic monitors can<br />
retrieve direct evidence of anomalies from the<br />
communications. In short, all the sensors send the<br />
suspicious information to the management server. The<br />
management server then carries out pre-processing, feature<br />
extraction, and fusion analysis to integrate and analyze the<br />
received information, so that the administrator can fully<br />
grasp the evidence of the interactions between the attacker<br />
and the zombies within a short time. This fusion process can<br />
also identify the zombies [38] as well as the position of<br />
the attacker [39]. If the administrator only monitored the<br />
traffic or only checked email records to identify the spam<br />
activities, it might take more time and cause more<br />
false-positive alarms. In theory, the results of fusion<br />
analysis are more accurate than those from the BotHunters<br />
alone.<br />
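To make the accumulation of evidence in this scenario concrete, the sketch below fuses<br />
three sensors in sequence with Dempster’s rule. The sensor names mirror the scenario, but<br />
the mass values are invented for illustration and are not measurements; `dempster_combine`<br />
is a hypothetical helper name, and the unweighted rule is used for simplicity.<br />

```python
from functools import reduce


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule of combination."""
    combined = {}
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b & c
            if a:  # discard conflicting pairs (empty intersection)
                combined[a] = combined.get(a, 0.0) + mass_b * mass_c
    total = sum(combined.values())
    if total == 0:
        raise ValueError("the two sources are in total conflict")
    return {a: value / total for a, value in combined.items()}


BOT = frozenset({"bot"})
BENIGN = frozenset({"benign"})
THETA = BOT | BENIGN

# Hypothetical evidence about one suspicious host from three sensors.
sensor_masses = {
    "bothunter": {BOT: 0.6, THETA: 0.4},
    "dns_log_analysis": {BOT: 0.5, THETA: 0.5},
    "spam_filter": {BOT: 0.7, BENIGN: 0.1, THETA: 0.2},
}

# Fold the sensors together one at a time.
fused = reduce(dempster_combine, sensor_masses.values())
best_single = max(m[BOT] for m in sensor_masses.values())
```

With these illustrative inputs, the fused mass on "bot" exceeds the strongest single<br />
sensor, which is the qualitative behavior the scenario describes. Adding a fourth sensor<br />
only requires another entry in `sensor_masses`.<br />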
VI. CONCLUSIONS<br />
In this paper, we have introduced a botnet detection<br />
architecture based on heterogeneous multi-sensor<br />
information fusion. We described the functionality and<br />
features of each component in the architecture, highlighting<br />
the feature extraction module. In addition, we introduced an<br />
extended algorithm of the D-S theory and gave its proof.<br />
Finally, we elaborated a representative case of detecting a<br />
spam botnet to demonstrate the feasibility of our<br />
architecture.<br />
In future work, we will implement the prototype, deploy it<br />
in a real network, and then evaluate the accuracy of the<br />
fusion algorithm in comparison with existing detection<br />
methods. Our ultimate goal is to develop a practical botnet<br />
detection system, following the architecture proposed in<br />
this paper, that integrates multiple information fusion<br />
techniques and eventually provides identification,<br />
evaluation, and prediction for botnets.<br />
ACKNOWLEDGMENT<br />
The authors would like to thank Tao Li for his helpful<br />
comments for improving this paper. This material is<br />
based upon work supported in part by the National<br />
Natural Science Foundation <strong>of</strong> China under Grant<br />
No.61070200 and No.61003303, the National Science<br />
and Technology Support Program <strong>of</strong> China under Grant<br />
No.2008BAH37B03, the National High-Tech Research<br />
and Development Plan <strong>of</strong> China under Grant<br />
No.2009AA01Z432, and the National Grand<br />
Fundamental Research 973 Program <strong>of</strong> China under<br />
Grant No.2009CB320503.<br />
REFERENCES<br />
[1] K. Singh, A. Srivastava, J. Giffin, and W. Lee, “Evaluating<br />
Email’s Feasibility for Botnet Command and Control,”<br />
Proc. 38th Annual IEEE/IFIP International Conference on<br />
Dependable Systems and <strong>Networks</strong> (DSN 2008), USA,<br />
2008, pp. 376-385.<br />
[2] K. Bohn, “Teen questioned in computer hacking probe,”<br />
CNN [Online], 2004, Available:<br />
http://www.cnn.com/2007/TECH/11/29/fbi.botnets/index.html.<br />
[3] J. Davis, “Hackers take down the most wired country in<br />
Europe,” WIRED MAGAZINE: ISSUE 15.09 [Online],<br />
2007, Available:<br />
http://www.wired.com/politics/security/magazine/15-09/ff_estonia.<br />
[4] T. Holz, M. Steiner, and F. Dahl, “Measurements and<br />
mitigation <strong>of</strong> peer-to-peer-based botnets: A case study on<br />
storm worm,” Proc. 1st USENIX Workshop on Large-<br />
Scale Exploits and Emergent Threats (LEET’08), 2008.<br />
[5] H. Wang and Z. Gong, “Collaboration-based botnet<br />
detection architecture,” Proc. 2nd International Conference<br />
on Intelligent Computation Technology and Automation,<br />
Zhangjiajie, China, 2009.<br />
[6] Zhaosheng Zhu, Guohan Lu, and Yan Chen, “Botnet<br />
Research Survey”, Proc. 32nd Annual IEEE International<br />
Computer S<strong>of</strong>tware and Applications Conference, Finland,<br />
2008.<br />
[7] J. Govil, “Examining the criminology <strong>of</strong> bot zoo,” Proc.<br />
6th International Conference on Information,<br />
Communications and Signal Processing, Singapore, 2007.<br />
© 2011 ACADEMY PUBLISHER<br />
[8] J. Govil, “Criminology <strong>of</strong> botnets and their detection and<br />
defense methods,” Proc. 2007 IEEE International<br />
Conference on Electro/Information Technology (EIT’07),<br />
2007.<br />
[9] D. Geer, “Malicious bots threaten network security,” IEEE<br />
Computer, vol. 38, no. 1, pp. 18-20, 2005.<br />
[10] M. Rajab, J. Zarfoss, and F. Monrose, “A multi-faceted<br />
approach to understanding the Botnet phenomenon,” Proc.<br />
ACM SIGCOMM/USENIX Internet Measurement<br />
Conference (IMC’06), Brazil, 2006.<br />
[11] G. Giorgio, R. Fabio, and S. Carlo, “Information fusion in<br />
computer security,” Information Fusion, vol. 10, no. 4, pp.<br />
272-273, 2009.<br />
[12] J. Zhuge, X. Han, Y. Zhou, Z. Ye, and W. Zou, “Research<br />
and Development <strong>of</strong> Botnets,” <strong>Journal</strong> <strong>of</strong> S<strong>of</strong>tware, vol. 19,<br />
no. 3, pp. 702-715, 2008.<br />
[13] J. Zhuge, X. Han, Z. Ye, and W. Zou, “Discover and track<br />
botnets,” Proc. Chinese Symposium on Network and<br />
Information Security (NetSec), Beijing, 2005.<br />
[14] J. Cheng, J. Yin, Y. Liu, and J. Zhong, “Advances in the<br />
Honeypot and Honeynet Technologies,” <strong>Journal</strong> <strong>of</strong><br />
Computer Research and Development, vol. 45, no. 1, pp.<br />
375-378, 2008.<br />
[15] G. Gu, P. Porras, V. Yegneswaran, M. Fong, and W. Lee,<br />
“BotHunter: Detecting malware infection through IDS-driven<br />
dialog correlation,” Proc. 16th USENIX Security<br />
Symposium (Security’07), 2007.<br />
[16] J. R. Binkley and S. Singh, “An algorithm for anomaly-based<br />
Botnet detection,” Proc. USENIX SRUTI’06, 2006,<br />
pp. 43–48.<br />
[17] J. Lee, H. Jeong, J. Park, M. Kim, and B. Noh, “The<br />
activity analysis <strong>of</strong> malicious http-based botnets using<br />
degree <strong>of</strong> periodic repeatability,” Proc. 2008 International<br />
Conference on Security Technology, 2008, pp. 83-86.<br />
[18] H. Choi, H. Lee, and H. Lee, “Botnet detection by<br />
monitoring group activities in DNS traffic,” Proc. 7th IEEE<br />
International Conference on Computer and Information<br />
Technology, Aizu-Wakamatsu City, Japan, 2007.<br />
[19] S. Matthew and I. Igor, Detection <strong>of</strong> peer-to-peer botnets,<br />
Masters Thesis, University <strong>of</strong> Amsterdam, 2008.<br />
[20] F. Freiling, T. Holz, and G. Wicherski, “Botnet Tracking:<br />
Exploring a Root-cause Methodology to Prevent Denial <strong>of</strong><br />
Service Attacks,” Proc. 10th European Symposium on<br />
Research in Computer Security (ESORICS’05), 2005.<br />
[21] Z. Duan, P. Chen, F. Sanchez, Y. Dong, M. Stephenson,<br />
and J. Barker, “Detecting Spam Zombies by Monitoring<br />
Outgoing Messages, ” Proc. IEEE INFOCOM’09<br />
Conference, Brazil, 2009.<br />
[22] E. Robert, C. Adele, and B. Pranab, “A Multi-Layered<br />
Approach to Botnet Detection,” Proc. 2008 International<br />
Conference on Security and Management (SAM’08),<br />
USA, 2008.<br />
[23] N. Paxton, G.J. Ahn, and B. Chu, “Towards practical<br />
framework for collecting and analyzing network-centric<br />
attacks,” Proc. IEEE International Conference on<br />
Information Reuse and Integration, USA, 2007.<br />
[24] Z. Zhang and Y. Kadobayashi, “A holistic perspective on<br />
understanding and breaking botnets: Challenges and<br />
countermeasures,” <strong>Journal</strong> <strong>of</strong> the National Institute <strong>of</strong><br />
Information and Communications Technology, vol. 55, no.<br />
2, pp. 43-59, 2008.<br />
[25] S. Alireza, F. Maryam, and A. Rodina, “Architecture for<br />
applying data mining and visualization on network flow for<br />
botnet traffic detection,” Proc. 2009 International<br />
Conference on Computer Technology and Development,<br />
Cairo, Egypt, 2009.
[26] G. Gu, J. Zhang, and W. Lee, “BotSniffer: Detecting<br />
Botnet command and control channels in network traffic,”<br />
Proc. 15th Annual Network and Distributed System<br />
Security Symposium (NDSS’08), USA, 2008.<br />
[27] G. Gu, J. Zhang, and R. Perdisci, “Botminer: Clustering<br />
analysis of network traffic for protocol- and structure-independent<br />
Botnet detection,” Proc. 17th USENIX<br />
Security Symposium (Security’08), USA, 2008.<br />
[28] G. Gu, Correlation-based Botnet Detection in Enterprise<br />
<strong>Networks</strong>, PhD Thesis, Georgia Institute <strong>of</strong> Technology,<br />
USA, 2008.<br />
[29] B.V. Dasarathy, “A special issue on information fusion in<br />
computer security,” Information Fusion, vol. 10, no. 4, pp.<br />
271, 2009.<br />
[30] Y. Niu, Q. Zheng, and H. Peng, “Network security<br />
management based on data fusion technology,” Proc. 7th<br />
International Conference on Computer-Aided Industrial<br />
Design and Conceptual Design, China, 2006.<br />
[31] B.V. Dasarathy, “Decision Fusion,” IEEE Computer<br />
Society Press, 1994.<br />
[32] H. Wang and Z. Gong, “Role-based collaborative<br />
information collection model for botnet detection,” Proc.<br />
2010 International Symposium on Collaborative<br />
Technologies and Systems (CTS 2010), Chicago, USA,<br />
2010.<br />
[33] A.P. Dempster, “Upper and lower probabilities induced by<br />
a multivalued mapping,” Ann. Math. Statist., 1967, pp.<br />
325-339.<br />
[34] G. Shafer, A Mathematical Theory <strong>of</strong> Evidence, Princeton<br />
University Press, Princeton and London, 1976.<br />
[35] Qi Chen, Uwe Aickelin, “Dempster-Shafer for Anomaly<br />
Detection,” Proc. the International Conference on Data<br />
Mining (DMIN 2006), Las Vegas, USA, 2006, pp. 232-<br />
238.<br />
[36] L. Ma, L. Yang, and J. Wang, “Research on Security<br />
Information Fusion from Multiple Heterogeneous<br />
Sensors,” <strong>Journal</strong> <strong>of</strong> System Simulation, vol. 20, no. 4, pp.<br />
181-187, 2008.<br />
[37] A. Ramachandran, N. Feamster, and D. Dagon, “Revealing<br />
Botnet membership using DNSBL counterintelligence,”<br />
Proc. USENIX SRUTI’06, 2006.<br />
[38] S. Kondo and N. Sato, “Botnet traffic detection techniques<br />
by C&C session classification using SVM,” Proc. 2nd<br />
International Workshop on Security, Japan, 2007.<br />
[39] M. Rajab, J. Zarfoss, and F. Monrose, “My botnet is bigger<br />
than yours (maybe, better than yours): Why size estimates<br />
remain challenging,” Proc. 1st Workshop on Hot Topics in<br />
Understanding Botnets (HotBots 2007), Boston, USA,<br />
2007.<br />
HaiLong Wang was born in Jilin Province, China,<br />
in May 1981. He received the B.E. degree in electronic<br />
engineering from the Department of<br />
Electronic Engineering, Naval University<br />
of Engineering, Wuhan, China, in 2004.<br />
His research interests include network and<br />
information security, distributed<br />
computing, and computer network<br />
architecture.<br />
He is currently working toward the Ph.D. degree at the<br />
School of Computer, National University of Defense<br />
Technology, Changsha, China.<br />
Jie Hou was born in Hebei Province, China,<br />
in July 1983. She received the B.E. degree in communication<br />
engineering from the Department of<br />
Communication Engineering, Chinese<br />
People’s Armed Police Force Institute of<br />
Engineering, Xi’an, China, in 2005. Her<br />
research interests include next-generation<br />
computer network architecture and network<br />
and information security.<br />
She is currently working toward the<br />
Ph.D. degree at the School of Computer, National University of<br />
Defense Technology, Changsha, China.<br />
ZhengHu Gong was born in Hunan Province, China,<br />
in August 1945. He received the B.E. degree in electronic<br />
engineering from the Department of<br />
Electronic Engineering, Tsinghua<br />
University, Beijing, China, in 1970. His<br />
research interests include computer networks<br />
and communication, network security, and<br />
computer network architecture.<br />
He is currently a Professor with the<br />
School of Computer, National University of Defense<br />
Technology, Changsha, China.
XOEM plus OWL-based STEP Product<br />
Information Uniform Description and<br />
Implementation<br />
Chengfeng Jian<br />
Zhejiang University <strong>of</strong> Technology, Hangzhou, 310023, China<br />
Email: jiancf@zjut.edu.cn<br />
Haizhong Meng<br />
Zhejiang University <strong>of</strong> Technology, Hangzhou, 310023, China<br />
Email: mhz_1986@126.com<br />
Abstract—To address the current inconsistencies in OWL-based<br />
STEP descriptions, mapping rules between<br />
EXPRESS and OWL are established on the basis of a uniform<br />
semantic model named XOEM+OWL. The<br />
implementation method of a STEP-OWL converter is then put<br />
forward, and corresponding examples are shown.<br />
Index Terms—OWL, STEP, XOEM, EXPRESS<br />
I. INTRODUCTION<br />
With the development of the semantic web and the<br />
semantic grid, knowledge sharing and exchange of<br />
product information over the Internet has become a main<br />
research focus. Currently, many approaches<br />
realize the semantic description of STEP [1]<br />
by means of semantic web languages such as RDF, DAML and OWL<br />
[2] [3-5]. In summary, these studies follow<br />
the same idea as the XML data representation of STEP:<br />
they try to use RDF or OWL to<br />
replace the EXPRESS description. The limitation of this<br />
approach is that, unlike data format conversion,<br />
an OWL semantic description can take many forms. For the<br />
same kind of product information, OWL can describe<br />
the internal semantics in many different ways, and<br />
even within the same OWL language it is<br />
difficult to reach a standardized, semantically<br />
consistent understanding. Therefore, realizing a uniform<br />
semantic description of product information through<br />
OWL description alone is difficult, which is<br />
why these studies have found it hard to progress further.<br />
Overall, although STEP has been expressed in<br />
OWL, this has mainly been achieved by matching EXPRESS and<br />
OWL syntax. The resulting<br />
mappings between ontology definitions and descriptions<br />
lack consistency, a unified model, and<br />
appropriate constraint definitions.<br />
project number: 60603087<br />
doi:10.4304/jnw.6.12.1662-1667<br />
In this paper, we address the OWL-based semantic<br />
description of STEP. Mapping rules between EXPRESS and<br />
OWL are established on the basis of a uniform semantic<br />
model named XOEM+OWL. The implementation<br />
method of a STEP-OWL converter is then put forward, and<br />
corresponding examples are shown.<br />
II. XOEM+OWL-BASED SCHEMA MAPPING<br />
XOEM [6] is the data model of the XML-based STEP<br />
representation. It is difficult to map directly<br />
between XOEM and OWL because OWL belongs to the<br />
semantic layer and XOEM belongs to the data layer.<br />
XOEM has a strong capability for describing data<br />
objects but a weak capability for reasoning about<br />
constraints. It is therefore necessary to build a model that<br />
realizes the conversion from XOEM and introduces the<br />
OWL pattern graph. This model is called XOEM+OWL [7].<br />
Following object-oriented (OO) concepts, Table I shows the<br />
comparison.<br />
The XOEM+OWL model is based on the XOEM model,<br />
so the following definitions can be taken over from<br />
XOEM:<br />
Object := Atomic Object | Complex Object<br />
Atomic Object := (oid, label_name, attribute_type,<br />
attribute_value)<br />
Complex Object := (oid, label_name, Reference)<br />
Reference := (link, oid, label_name)
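The object definitions above can be sketched as plain Java classes. This is only an illustration of the data structure: the class names, the `refer` helper and the use of a list of references in a complex object are assumptions that do not come from the XOEM definition itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Object := Atomic Object | Complex Object. Every object has an oid and a label name. */
abstract class XoemObject {
    final String oid;
    final String labelName;

    XoemObject(String oid, String labelName) {
        this.oid = oid;
        this.labelName = labelName;
    }
}

/** Atomic Object := (oid, label_name, attribute_type, attribute_value). */
final class AtomicObject extends XoemObject {
    final String attributeType;
    final String attributeValue;

    AtomicObject(String oid, String labelName, String attributeType, String attributeValue) {
        super(oid, labelName);
        this.attributeType = attributeType;
        this.attributeValue = attributeValue;
    }
}

/** Reference := (link, oid, label_name): a named link to the object with the given oid. */
final class Reference {
    final String link;
    final String oid;
    final String labelName;

    Reference(String link, String oid, String labelName) {
        this.link = link;
        this.oid = oid;
        this.labelName = labelName;
    }
}

/** Complex Object := (oid, label_name, Reference). */
final class ComplexObject extends XoemObject {
    // Assumption: a complex object may hold several references, so a list is used.
    final List<Reference> references = new ArrayList<>();

    ComplexObject(String oid, String labelName) {
        super(oid, labelName);
    }

    /** Hypothetical helper that records a named link to another object. */
    ComplexObject refer(String link, XoemObject target) {
        references.add(new Reference(link, target.oid, target.labelName));
        return this;
    }
}
```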
TABLE I.<br />
EXPRESS-XML-OWL<br />
OO | EXPRESS | XML | OWL<br />
Object | Entity | Element type | Class<br />
Object instance | Entity instance | Element | Individual<br />
Object property | Entity attribute | Element | ObjectProperty/DatatypeProperty<br />
Method | ENTITY Function | Element | Class<br />
Declaration | Entity Schema | DTD | Ontologies<br />
Relationship | Complex Constraint express | Hierarchy | Complex Constraint express<br />
Definition 1.<br />
Given a directed graph G = (V, E).<br />
Assumption: $v_0, v_1, \ldots, v_i, \ldots, v_n \in V$; $e_1, e_2, \ldots, e_n \in E$.<br />
Exists $r$: $d > 0$, $r \in V$, $v_i \in (V - \{r\})$, $i = 0, 1, \ldots, n$.<br />
Definition 2.<br />
Given a directed graph G(V, E, r).<br />
Exist $G(V^i, E^i)$, $v_i \in V$.<br />
$V^i = \{ v_j \mid v_j \in V \wedge v_k, v_j \}$;<br />
$E^i = \{ \langle v_j, v_k \rangle \mid v_j \in V^i \wedge v_k \in V^i \wedge \langle v_j, v_k \rangle \in E \}$;<br />
Rule 1.<br />
For an XOEM+OWL object, a Node of the directed<br />
graph represents an Object. It is mapped to a<br />
Class in OWL.<br />
Rule 2.<br />
For a property of an XOEM+OWL object, an Edge of<br />
the directed graph represents a Property. It<br />
corresponds to a property of the Class or to the<br />
“hasSubClass” relation among classes in OWL.<br />
Convention 1: If the relevant concepts or data types<br />
of EXPRESS can be expressed directly in OWL, they are<br />
expressed using the OWL keywords in priority, to ensure<br />
accuracy when reasoning tools are used.<br />
Convention 2: If the relevant concepts or data types of<br />
EXPRESS cannot be expressed directly in OWL but can be<br />
expressed by combining relevant OWL concepts for<br />
the same purpose while preserving the accuracy of the<br />
semantics, the combination approach is preferred.<br />
Convention 3: If the above rules cannot be achieved or<br />
are difficult to achieve, the original (literal)<br />
translation is used.<br />
According to the above description, Table II shows the<br />
corresponding schema graph relations.<br />
TABLE II.<br />
XOEM+OWL-BASED SCHEMA GRAPH DESCRIPTION<br />
OWL | Schema Graph<br />
Class | Node<br />
Property with basic datatypes as range (Attribute) | Node with edge joining it to the class with name “hasProperty”<br />
Property with other class as range (Attribute) | Edge between the two class nodes<br />
Individual | Node with edge joining it to the class with name “hasInstance”<br />
Class – subclass relationship | Edge between class node to subclass node with name “hasSubClass”<br />
III. MAPPING RELATIONSHIP BETWEEN OWL AND<br />
EXPRESS EXPRESSION<br />
A. SCHEMA definition<br />
A SCHEMA is defined as a collection of STEP ENTITY<br />
and type declarations, which can refer to each other for the<br />
purpose of type reuse. The definition of SCHEMA<br />
corresponds to the Ontology in OWL, as part #1 of<br />
Figure 1 shows.<br />
B. Basic data type definition<br />
The EBNF expression of the basic data types is shown in Figure 2.<br />
OWL uses the embedded XML Schema data types, so the mapping is as follows:<br />
Simple data types are mapped directly to the xsd<br />
data types of XML Schema.<br />
Constructed data types are mapped to owl:oneOf.<br />
Aggregate data types are mapped to an owl:Class<br />
aggregate with the attributes (lower boundary, upper boundary,<br />
repetitiveness, whether ordered, storage type).<br />
TypeDecl::=’’<br />
TYPE_HEAD::=TYPE_ID+ TYPE_ID::MarkupDecl*<br />
TYPE_BODY::=TYPE_DECLARATION+SMarkupDecl*<br />
TYPE_DECLARATION::=’’<br />
BASE_TYPE::=SimpleTypes|ConstuctedTypes|AggregationTypes|TypeRef<br />
WHERE CLAUSE::=’WHERE’|RuleDecl<br />
Figure2. The EBNF expression of basic data type in EXPRESS<br />
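The rule that simple data types map directly to xsd data types can be sketched as a lookup table. The paper does not list the individual pairings, so every pairing below is an assumption made for illustration. LOGICAL is deliberately left unmapped because it has three values and no direct xsd counterpart. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class SimpleTypeMapping {
    private static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Assumed pairings between EXPRESS simple types and XML Schema datatypes.
    private static final Map<String, String> EXPRESS_TO_XSD = new LinkedHashMap<>();
    static {
        EXPRESS_TO_XSD.put("INTEGER", XSD + "integer");
        EXPRESS_TO_XSD.put("REAL", XSD + "double");
        EXPRESS_TO_XSD.put("NUMBER", XSD + "decimal");
        EXPRESS_TO_XSD.put("STRING", XSD + "string");
        EXPRESS_TO_XSD.put("BOOLEAN", XSD + "boolean");
        EXPRESS_TO_XSD.put("BINARY", XSD + "hexBinary");
    }

    /** Returns the xsd datatype IRI, or null when no direct counterpart is assumed. */
    static String toXsd(String expressType) {
        return EXPRESS_TO_XSD.get(expressType.trim().toUpperCase(Locale.ROOT));
    }
}
```

Types without a direct counterpart would fall under Convention 2 or Convention 3 rather than this table.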
SCHEMA config_control_design; #1<br />
Entity action; #2<br />
name:label; #3<br />
description: OPTIONAL text; #4<br />
…<br />
DERIVE<br />
scl: REAL:=NVL(scale, 1); #5<br />
;<br />
#1 to #5: corresponding OWL statements (markup not preserved in this copy)<br />
Figure1. Mapping Example between EXPRESS and OWL<br />
C. Entity definition<br />
ENTITY is an important concept in EXPRESS, so the<br />
mapping of entities is the most important one. In EXPRESS,<br />
the definition of an entity is shown in Figure 3 (the EBNF<br />
expression of entity in EXPRESS). The concept of class<br />
in OWL can be treated as equivalent to that of entity, so in this paper<br />
we map an entity to a class. In OWL, however, attributes<br />
and classes are defined separately, while in<br />
EXPRESS they are defined together. To resolve<br />
property name conflicts, we prefix the attribute name<br />
with the entity name.<br />
①Entity name<br />
The entity name is mapped to an owl:Class; see Figure 1, #2.<br />
②Entity inheritance<br />
We adopt rdfs:subClassOf to represent<br />
‘SUPERTYPE OF’ and ‘SUBTYPE OF’.<br />
③Simple Attribute<br />
We call an attribute a Simple Attribute if its type<br />
is a simple data type or another entity. If the type is a<br />
simple data type, it is mapped to owl:DatatypeProperty; otherwise it is mapped to owl:ObjectProperty, as in Figure 1, #3.<br />
④Aggregate Attribute<br />
For an Aggregate Attribute, first define the Attribute class<br />
as a subclass of the class in 3) of 2.3.2, then set the<br />
Attribute’s lower boundary, upper boundary,<br />
repetitiveness, order, and storage type.<br />
⑤OPTIONAL Attribute<br />
An optional attribute is not only mapped to<br />
owl:DatatypeProperty or owl:ObjectProperty but also<br />
given an attribute cardinality constraint<br />
(maxCardinality), as shown in Figure 1, #4.<br />
⑥DERIVE Attribute<br />
A DERIVE attribute is not only mapped to<br />
owl:DatatypeProperty or owl:ObjectProperty but also<br />
given an attribute constraint (allValuesFrom), as<br />
shown in Figure 1, #5.<br />
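The entity rules above can be sketched as a small generator of OWL RDF/XML fragments. This is an illustration, not the converter's actual output: the underscore separator between entity and attribute name, the class and method names, and the choice of xsd:string as the range are all assumptions.

```java
final class EntityToOwl {
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    /** Prefixes the attribute with its entity name to avoid property name conflicts. */
    static String propertyId(String entity, String attribute) {
        return entity + "_" + attribute; // the "_" separator is an assumption
    }

    /** Entity name -> owl:Class. */
    static String owlClass(String entity) {
        return "<owl:Class rdf:ID=\"" + entity + "\"/>";
    }

    /** Simple attribute with a simple data type -> owl:DatatypeProperty. */
    static String datatypeProperty(String entity, String attribute, String xsdType) {
        return "<owl:DatatypeProperty rdf:ID=\"" + propertyId(entity, attribute) + "\">\n"
             + "  <rdfs:domain rdf:resource=\"#" + entity + "\"/>\n"
             + "  <rdfs:range rdf:resource=\"" + XSD + xsdType + "\"/>\n"
             + "</owl:DatatypeProperty>";
    }

    /** OPTIONAL attribute: at most one value, stated as a maxCardinality restriction. */
    static String optionalRestriction(String entity, String attribute) {
        return "<owl:Class rdf:about=\"#" + entity + "\">\n"
             + "  <rdfs:subClassOf>\n"
             + "    <owl:Restriction>\n"
             + "      <owl:onProperty rdf:resource=\"#" + propertyId(entity, attribute) + "\"/>\n"
             + "      <owl:maxCardinality rdf:datatype=\"" + XSD
             + "nonNegativeInteger\">1</owl:maxCardinality>\n"
             + "    </owl:Restriction>\n"
             + "  </rdfs:subClassOf>\n"
             + "</owl:Class>";
    }

    public static void main(String[] args) {
        // Entity "action" with the optional attribute "description" from Figure 1.
        System.out.println(owlClass("action"));
        System.out.println(datatypeProperty("action", "description", "string"));
        System.out.println(optionalRestriction("action", "description"));
    }
}
```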
EntityDecl∷= ’’<br />
ENTITY_HEAD∷=ENTITY_ID S INHERITANCE?<br />
ENTITY_BODY∷=ENTITY_DECLARATION + S<br />
MarkupDecl*<br />
ENTITY_DECLARATION∷=ENTITY_AttrDecl *<br />
SENTITY_ClauseDecl?<br />
ENTITY_ClauseDecl∷=INVERSE_ClAUSE |<br />
UNIQUE_ClAUSE | WHERE_ClAUSE<br />
WHERE_ClAUSE∷=’WHERE’ | RuleDecl<br />
UNIQUE_ClAUSE∷=’UNIQUE’ S Unique_Rule+<br />
ENTITY_AttrDecl∷=Explicit_AttrDecl |<br />
Derive_AttrDecl | Inverse_AttrDecl<br />
Figure3. The EBNF expression of entity in EXPRESS<br />
D. Function and rule definition<br />
Functions and rules contain a wealth of mathematical<br />
operations and constraint mechanisms on objects, but<br />
the expressiveness of OWL in this respect is limited, so<br />
we adopt the literal translation with SWRL according to<br />
Convention 3.<br />
In addition to the above, EXPRESS has many other<br />
concepts; their mapping methods are similar.<br />
Ⅳ. DESIGN AND IMPLEMENTATION OF STEP-OWL<br />
CONVERTER<br />
A. Conversion <strong>of</strong> EXPRESS-OWL<br />
The method for mapping an EXPRESS file to an OWL file has<br />
been described in detail in Part 2, so the most important<br />
task in implementing the EXPRESS-OWL file<br />
conversion is lexical analysis. We complete the<br />
conversion in two steps: a pre-converter<br />
and a post-converter.<br />
① Pre-converter<br />
The pre-converter parses the EXPRESS file into<br />
JAVA classes (Figure 4) according to the established<br />
EXPRESS keyword vocabulary file (Figure 5).
ENTITY<br />
entityName : String<br />
superEntity : List<br />
subEntity : List<br />
SimpleAttribute : Map<br />
deriveAttribute : Map<br />
condition : Map<br />
Figure4. ENTITY class diagram<br />
public interface Vocabulary<br />
{<br />
// Reserved words of the EXPRESS language, one constant per keyword.<br />
public final static String ABSTRACT = "ABSTRACT";<br />
public final static String AGGREGATE = "AGGREGATE";<br />
public final static String ALIAS = "ALIAS";<br />
public final static String AS = "AS";<br />
public final static String BAG = "BAG";<br />
public final static String BEGIN = "BEGIN";<br />
public final static String BINARY = "BINARY";<br />
public final static String BOOLEAN = "BOOLEAN";<br />
public final static String CASE = "CASE";<br />
…<br />
Figure5. Part <strong>of</strong> EXPRESS keyword vocabulary<br />
The conversion methods are roughly the same, so we use<br />
entity conversion as an example. According to the EBNF<br />
description of entities and the characteristics of the<br />
EXPRESS entity definition in Figure 3, the<br />
keywords of an EXPRESS entity definition are<br />
ENTITY, END_ENTITY, DERIVE, INVERSE,<br />
WHERE, SUPERTYPE OF and SUBTYPE OF. The pre-converter’s<br />
processing flow is shown in Figure 6.<br />
init → read a line of the EXPRESS file → has keyword ENTITY? (no: read the next line; yes: save the information and continue reading lines) → has keyword END_ENTITY? (no: keep reading; yes: split the string at ';' and write it to the JAVA class) → document end? (no: read the next line; yes: end)<br />
Figure6. Pre-converter’s physical process<br />
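The flow of Figure 6 can be sketched as follows. This is a simplified illustration, not the converter's code: it assumes upper-case keywords at the start of a line, and the class and method names are hypothetical. Instead of filling the ENTITY class of Figure 4, it only returns the ';'-separated statements of each entity block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PreConverter {
    /** Returns, for each ENTITY ... END_ENTITY block, its ';'-separated statements. */
    static List<List<String>> extractEntities(List<String> lines) {
        List<List<String>> entities = new ArrayList<>();
        StringBuilder current = null; // null while outside an entity block
        for (String raw : lines) {
            String line = raw.trim();
            if (current == null) {
                if (line.startsWith("ENTITY")) {          // has keyword ENTITY?
                    current = new StringBuilder(line);
                }
            } else if (line.startsWith("END_ENTITY")) {   // has keyword END_ENTITY?
                List<String> statements = new ArrayList<>();
                for (String part : current.toString().split(";")) {
                    if (!part.trim().isEmpty()) {
                        statements.add(part.trim());
                    }
                }
                entities.add(statements);
                current = null;
            } else {
                current.append(' ').append(line);         // save and keep reading
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> file = Arrays.asList(
            "SCHEMA config_control_design;",
            "ENTITY action;",
            "  name : label;",
            "  description : OPTIONAL text;",
            "END_ENTITY;",
            "END_SCHEMA;");
        System.out.println(extractEntities(file));
    }
}
```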
② Post-converter<br />
The post-converter is a document writer that builds<br />
on the output of the pre-converter and generates OWL<br />
documents according to the mapping method in Part 2.<br />
Figure 7 shows part of the STEP-OWL converter’s<br />
conversion result for STEP AP203, displayed in Protégé.<br />
Figure7. AP203 converted entity relationship results in Protégé<br />
B. STEP Part21 file conversion<br />
A STEP Part21 [8] [9] file can be divided into two parts:<br />
HEADER and DATA. The HEADER section describes the file name<br />
and the application protocol that the file references; the DATA section<br />
is composed of a number of data instances, each<br />
consisting of an ID, "=", and a function statement.<br />
Although the data structure is uniform, the<br />
statements describe a variety of functions, so the focus of the<br />
conversion is a design that can cope with the diversity of the<br />
data instances.<br />
①Lexical Analysis<br />
The STEP file is read from left to right: the<br />
character stream is scanned and words are identified according to the<br />
word formation rules. This step divides words into data<br />
instances ("#" plus a number), variable values<br />
(integer, string, data value), and reserved words (the special<br />
characters in a Part21 physical<br />
file).<br />
②Syntax Analysis<br />
The task of syntax analysis is to combine the word sequence<br />
from the lexical analysis into grammatical phrases<br />
such as "program", "statement" and<br />
"expression". Syntax analysis checks whether the STEP file is<br />
structurally correct and analyzes the expression phrases<br />
hierarchically.<br />
③Semantic Analysis<br />
Semantic analysis translates the syntax mapping<br />
based on the lexical and syntax analysis. Using<br />
the keywords generated by syntax analysis, we look<br />
them up in the STEP Application Protocol library and<br />
insert them into the file at the appropriate location according to the<br />
conversion rules. Take the data instance #5 =<br />
AXIS2_PLACEMENT_3D ('NONE', #6, #7, #8); as an<br />
example.<br />
Step 1. Divide the data instance into #5, =,<br />
AXIS2_PLACEMENT_3D, (, 'NONE', #6, #7, #8, ).<br />
Step 2. Decompose the words from Step 1 hierarchically<br />
to get #5 -> AXIS2_PLACEMENT_3D -> 'NONE',<br />
#6, #7, #8.<br />
Step 3. The semantic processor looks up<br />
AXIS2_PLACEMENT_3D in the AP203 definition and maps the<br />
values accordingly: 'NONE' is the value of the property<br />
‘name’, which is inherited from representation_item; #6,<br />
#7 and #8 are the values of ‘location’ (inherited<br />
from placement), ‘axis’ and ‘ref_direction’.<br />
Through the above three steps, the data instance #5 =<br />
AXIS2_PLACEMENT_3D ('NONE', #6, #7, #8); is<br />
converted to an OWL file by the STEP-OWL converter; see<br />
Figure 8.<br />
(OWL markup for data instance #5 not preserved in this copy; the only surviving value is NONE)<br />
Figure8. Data instance # 5<br />
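The three analysis steps for the data instance #5 can be sketched with a regular expression and a lookup table. This is an illustrative simplification. The attribute table stands in for a real lookup in the AP203 schema and holds only the one entity discussed in the text. The naive comma split does not handle nested parentheses or commas inside strings. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Part21Instance {
    // Lexical and syntax step: "#id = ENTITY_NAME ( arguments ) ;"
    private static final Pattern INSTANCE =
        Pattern.compile("^\\s*(#\\d+)\\s*=\\s*([A-Z0-9_]+)\\s*\\((.*)\\)\\s*;\\s*$");

    // Semantic step: attribute order per entity (stand-in for the AP203 definition).
    private static final Map<String, List<String>> ATTRIBUTES = new LinkedHashMap<>();
    static {
        ATTRIBUTES.put("AXIS2_PLACEMENT_3D",
            Arrays.asList("name", "location", "axis", "ref_direction"));
    }

    /** Maps one simple data instance to attribute-name/value pairs. */
    static Map<String, String> parse(String line) {
        Matcher m = INSTANCE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a simple data instance: " + line);
        }
        String entity = m.group(2);
        List<String> names = ATTRIBUTES.get(entity);
        if (names == null) {
            throw new IllegalArgumentException("unknown entity: " + entity);
        }
        String[] values = m.group(3).split(",");
        if (values.length != names.size()) {
            throw new IllegalArgumentException("attribute count mismatch for " + entity);
        }
        Map<String, String> result = new LinkedHashMap<>();
        result.put("id", m.group(1));
        result.put("entity", entity);
        for (int i = 0; i < values.length; i++) {
            result.put(names.get(i), values[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("#5 = AXIS2_PLACEMENT_3D ('NONE', #6, #7, #8);"));
    }
}
```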
C. Example<br />
Examples of the conversion results produced by the STEP-OWL<br />
converter for a STEP Part21 file are shown in Figure 9<br />
and Figure 10.<br />
Figure9. AP203 in OWL format<br />
Figure10. STEP Part21 in OWL format<br />
V. CONCLUSION<br />
The XOEM+OWL-based STEP product information<br />
description realizes the semantic description while<br />
maintaining semantic consistency and effectiveness.<br />
Some issues still need further study, such<br />
as the semantic consistency of Entity Functions and<br />
Procedures expressed with SWRL.<br />
ACKNOWLEDGMENT<br />
This work was supported by the National Natural Science Foundation<br />
of China (No. 60603087) and the Project of the Science and<br />
Technology Department of Zhejiang Province (No.<br />
2009C320076).<br />
REFERENCES<br />
[1] ISO10303-28, Industrial automation systems and<br />
integration, Product data representation and exchange,<br />
part28: Implementation methods: XML representations <strong>of</strong><br />
EXPRESS schemas and data.<br />
[2] OWL Web Ontology Language,<br />
http://www.w3.org/TR/owl-features/<br />
[3] Pan, Wen-Lin, “A formal EXPRESS-to-OWL mapping<br />
algorithm, ” Key Engineering Materials, vol.419, pp. 689-<br />
692, 2010.<br />
[4] Zhao, W., Liu, J.K, “OWL/SWRL representation<br />
methodology for EXPRESS-driven product information<br />
model. Part I. Implementation methodology, ” Computers<br />
in Industry, vol.59, pp. 580-589, August, 2008.<br />
[5] Ricardo Jardim-Goncalves, Nicolas Figay, Adolfo Steiger-<br />
Garcao, “Enabling interoperability <strong>of</strong> STEP Application<br />
Protocols at meta-data and knowledge level, ”<br />
International <strong>Journal</strong> <strong>of</strong> Technology Management, vol.36,<br />
pp.402-421, April, 2006.<br />
[6] Jian Cheng-Feng, Tan Jian-Rong, “Description and<br />
Identification of STEP Product Data with XML,” Journal<br />
of Computer-Aided Design and Computer Graphics, vol.13, pp.<br />
983-990, November, 2001.<br />
[7] Jian Cheng-Feng, Zhang Mei-yu, “A Uniform Product<br />
Knowledge Representation Semantic Model,” 2006 IEEE /<br />
WIC / ACM International Conference on Web Intelligence,<br />
Hong Kong, pp.953-956, December, 2006.<br />
[8] ISO 10303-11 Industrial automation systems and<br />
integration, Product data representation and exchange,<br />
Part11: Description methods: The EXPRESS language<br />
reference manual.<br />
[9] ISO 10303-21 Industrial automation systems and<br />
Integration, Product data representation and exchange,<br />
Part21: Clear text encoding <strong>of</strong> the exchange structure.<br />
Chengfeng Jian was born in Zhejiang Province,<br />
China, in June 1973. He received the Ph.D. degree<br />
from Zhejiang University. His<br />
research interests include CAD/PDM and<br />
Semantic Web/Semantic Grid.<br />
He is an associate professor in the Department of<br />
Computer Science and Technology,<br />
Zhejiang University of Technology.<br />
Haizhong Meng Zhejiang Province,<br />
China. Birthdate: July, 1986. BA.,<br />
graduated from Zhejiang Sci-Tech<br />
University. And research interests on<br />
Semantic Web and STEP<br />
He is currently a postgraduate student<br />
<strong>of</strong> Dept. Computer Science and<br />
Technology Zhejiang University <strong>of</strong><br />
Technology.
1668 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />
Design of Greenhouse Control System Based on<br />
Wireless Sensor Networks and AVR<br />
Microcontroller<br />
Yongxian Song<br />
The Institute of Electronic Engineering, Huaihai Institute of Technology, Lianyungang, 222005, China<br />
Email: soyox@126.com<br />
Chenglong Gong, Yuan Feng, Juanli Ma and Xianjin Zhang<br />
The Institute of Electronic Engineering, Huaihai Institute of Technology, Lianyungang, 222005, China<br />
Email: soyox@163.com<br />
Abstract—To accurately determine the growth conditions of<br />
greenhouse crops, a system based on an AVR single-chip<br />
microcontroller and wireless sensor networks is developed.<br />
It transfers data through wireless transceiver devices, so<br />
no electric wiring has to be installed and the system<br />
structure is simple. The monitoring and management center<br />
can control the temperature and humidity of the greenhouse,<br />
measure the carbon dioxide content, and collect information<br />
such as the illumination intensity. In addition, the system<br />
adopts multilevel energy storage and combines energy<br />
management with energy transfer, so that the energy<br />
collected by the solar cells is used reasonably and a<br />
self-managing energy supply system is established. The<br />
system has the advantages of low power consumption, low<br />
cost, good robustness, and flexible extensibility. It<br />
provides an effective tool for monitoring the greenhouse<br />
environment and for analysis and decision-making.<br />
Index Terms—wireless sensor networks, AVR, greenhouse<br />
I. INTRODUCTION<br />
A greenhouse is a facility that modifies the plant growth<br />
environment: it creates the best conditions for plant<br />
growth and shields plants from the changing seasons and<br />
severe weather outside [4-5]. A greenhouse measurement and<br />
control system obtains the optimum conditions for crop<br />
growth by adjusting environmental factors such as<br />
temperature, humidity, light, and CO2 concentration while<br />
making full use of natural resources, in order to increase<br />
crop yield, improve quality, regulate the growth period,<br />
and improve economic efficiency. Such a system is complex:<br />
it must provide automatic monitoring of the various<br />
greenhouse parameters, information processing, real-time<br />
control, and online optimization. In developed countries,<br />
greenhouse measurement and control systems have made<br />
considerable progress and have reached the level of<br />
comprehensive multi-factor control, but importing existing<br />
foreign systems is very expensive and their maintenance is<br />
inconvenient. In recent years, China has carried out many<br />
studies on greenhouse structure and control and has made<br />
many achievements, but most greenhouse measurement and<br />
control systems are cable-based, so the wiring is complex<br />
and it is hard to improve system efficiency. With the rapid<br />
development of low-cost, low-power sensors and wireless<br />
communication technology, the conditions for building<br />
wireless greenhouse measurement and control systems have<br />
become mature, which is important for realizing<br />
agricultural modernization [1-3]. To meet the need for fast<br />
and accurate acquisition of greenhouse environment<br />
information, this paper further studies the collection,<br />
processing, and transmission of greenhouse environment<br />
information and develops a greenhouse measurement and<br />
control system based on an AVR microcontroller and wireless<br />
sensor networks. The system has high practical value for<br />
bringing information technology and automation to<br />
large-scale greenhouse monitoring and for improving work<br />
efficiency.<br />
Manuscript received March 5, 2011; revised March 25, 2011;<br />
accepted April 10, 2011.<br />
doi:10.4304/jnw.6.12.1668-1674<br />
II. THE GENERAL STRUCTURE OF THE SYSTEM<br />
The greenhouse measurement and control system is composed<br />
of the monitoring center, sensor nodes, and control<br />
equipment. Sensor nodes are deployed throughout the<br />
greenhouse; they periodically acquire greenhouse<br />
environment information and send it to the control center.<br />
The control center analyzes the data it has received, makes<br />
the relevant decisions, and sends control messages to the<br />
greenhouse control equipment, which regulates the<br />
greenhouse environment parameters to obtain the best growth<br />
environment for the crops. A modern greenhouse is very<br />
large, so the system adopts a hierarchical structure.<br />
Assuming that the greenhouse is a rectangular area, the<br />
overall structure of the measurement system is shown in<br />
Fig. 1.<br />
Figure 1. System structure of the greenhouse WSN<br />
measurement and control system<br />
In Fig. 1, the rectangular greenhouse is divided into<br />
several measurement and control areas of equal size. Each<br />
measurement and control area is managed by a base station<br />
and is divided into many non-overlapping virtual grids of<br />
the same size. The sensor nodes deployed in a virtual grid<br />
form a cluster, and each cluster includes a cluster head<br />
(sink node) and some cluster member nodes. The cluster head<br />
is chosen from the member nodes by a cluster head election<br />
algorithm. The cluster member nodes consist of sensor<br />
nodes, which collect environmental data, and control nodes,<br />
which drive actuators to adjust environmental parameters.<br />
A control node does not take part in cluster head election;<br />
it receives from the cluster head the commands sent by the<br />
monitoring center and executes the corresponding control<br />
operations. The cluster head, sensor nodes, and control<br />
nodes form a star network, which carries out data<br />
acquisition and control of the greenhouse environment.<br />
Collected data is transmitted directly from the sensor<br />
nodes to the cluster head; the cluster heads forward the<br />
data to the base station over multiple hops; finally, the<br />
base station packages the data of each cluster head and<br />
transfers it to the monitoring center. The base station is<br />
the relay between the monitoring center and the greenhouse<br />
WSN nodes, and it realizes network control by managing all<br />
the nodes of a single greenhouse measurement and control<br />
area. The monitoring center is both the master console of<br />
multiple greenhouse networks and the data center of the<br />
measurement and control system, and it is in charge of the<br />
control and management of the entire system.<br />
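The division of a measurement and control area into<br />
virtual grids, with the nodes of each grid forming one<br />
cluster, can be sketched as follows. This is a minimal<br />
illustration, not the authors' implementation: the function<br />
names, the node coordinates, and the 25 m cell size are<br />
assumptions chosen for the example.<br />

```python
from collections import defaultdict


def grid_cell(x, y, cell_size, cols, rows):
    """Return the (row, col) of the virtual grid that contains point (x, y)."""
    if not (0 <= x <= cols * cell_size and 0 <= y <= rows * cell_size):
        raise ValueError("node lies outside the measurement and control area")
    # Clamp so that a node exactly on the far edge falls into the last cell.
    col = min(int(x // cell_size), cols - 1)
    row = min(int(y // cell_size), rows - 1)
    return row, col


def build_clusters(nodes, cell_size, cols, rows):
    """Group node ids by virtual grid; each group is one cluster."""
    clusters = defaultdict(list)
    for node_id, (x, y) in nodes.items():
        clusters[grid_cell(x, y, cell_size, cols, rows)].append(node_id)
    return dict(clusters)


# Hypothetical node positions in metres inside a 4 x 4 grid of 25 m cells.
nodes = {"n1": (3.0, 4.0), "n2": (27.5, 4.0), "n3": (99.0, 99.0), "n4": (10.0, 20.0)}
clusters = build_clusters(nodes, cell_size=25.0, cols=4, rows=4)
```

Cluster head election would then run separately inside each<br />
list of node ids.<br />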
III. GREENHOUSE WIRELESS SENSOR NETWORK NODE<br />
DESIGN<br />
The greenhouse wireless sensor network measurement and<br />
control system consists of two types of nodes, namely<br />
sensor nodes and sink nodes. A sensor node is composed of a<br />
CPU module, a wireless communication module, a power supply<br />
module, a sensor module, and a position switch that sets<br />
the node's physical location information. A sink node<br />
contains four modules: a CPU module, a wireless<br />
communication module, a continuous power supply module, and<br />
a serial communication module.<br />
A. Sensor node module design<br />
A sensor node is composed of a CPU module, a wireless<br />
communication module, a sensor module, a position switch,<br />
and an energy supply module. Its structure is shown in<br />
Fig. 2. The sensor module collects information in the<br />
monitored area and transfers the data; depending on the<br />
application requirements, it can use temperature, humidity,<br />
light, carbon dioxide concentration, and other sensors. The<br />
processor module controls the operation of the sensor node<br />
and stores and processes both the data collected by the<br />
node itself and the data forwarded by other nodes. The<br />
wireless communication module handles wireless<br />
communication with other nodes, exchanging control<br />
information and sending and receiving acquired data. The<br />
position setting switch is used to set the specific<br />
physical location of a sensor node in the greenhouse. The<br />
energy supply module provides the energy that the sensor<br />
node needs to work; in this paper, a solar self-supply<br />
module powers the node.<br />
Figure 2. Sensor node structure chart<br />
Figure 3. Sink node structure chart<br />
B. Sink node module design<br />
The sink node mainly gathers and fuses the data of the<br />
sensor nodes within the communication network and performs<br />
protocol conversion between uplink and downlink<br />
communication. It issues the monitoring tasks of the<br />
management node and forwards the collected data to the<br />
external network through a serial port. It is both an<br />
enhanced sensor node and a special gateway device that has<br />
no monitoring function and only has a wireless<br />
communication interface. Its structure is shown in Fig. 3.<br />
It is composed of a power supply module, a storage module,<br />
a processor module, a node communication module, a serial<br />
interface communication module, and so on. Because the sink<br />
node has to process the data of many sensor nodes, it works<br />
for long periods and sleeps only briefly, so battery energy<br />
cannot meet its energy consumption; therefore, in this<br />
paper the sink node is also powered by the solar<br />
self-supply module.<br />
C. Power supply module<br />
To solve the energy supply problem of the sensor nodes,<br />
this paper adopts a solar energy supply system, whose<br />
structure is shown in Fig. 4. As Fig. 4 shows, the power<br />
supply module consists of an energy collector, an energy<br />
storage, a backup energy storage, and a power management<br />
and control section. The energy collector is composed of<br />
solar panels and converts solar energy into electrical<br />
energy. The energy storage is the main-level energy store;<br />
it is built from supercapacitors, stores the energy<br />
collected by the solar cells, and supplies energy to the<br />
wireless sensor network nodes. The backup energy storage is<br />
a lithium battery and supplies the system in an emergency.<br />
The power management and control section monitors the<br />
status of the energy stores, selects the power supply<br />
according to their energy status, and uses the solar cell<br />
to replenish the energy stores.<br />
Figure 4. Solar self-supply module structure<br />
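The decision made by the power management and control<br />
section can be illustrated with a short sketch. The paper<br />
gives no numeric thresholds or switching rules, so the<br />
voltage limits and the function names below are<br />
hypothetical; the sketch only shows the idea of preferring<br />
the supercapacitor store and falling back to the lithium<br />
battery in an emergency.<br />

```python
from enum import Enum


class Source(Enum):
    SUPERCAPACITOR = "supercapacitor"   # main-level energy store
    LITHIUM_BACKUP = "lithium backup"   # emergency store
    NONE = "none"                       # nothing usable, node must sleep


# Hypothetical thresholds in volts; the paper does not state any values.
SUPERCAP_MIN_V = 2.2
SUPERCAP_FULL_V = 5.0
LITHIUM_MIN_V = 3.0


def select_source(supercap_v, lithium_v):
    """Pick the store that should power the node for the next period."""
    if supercap_v >= SUPERCAP_MIN_V:
        return Source.SUPERCAPACITOR
    if lithium_v >= LITHIUM_MIN_V:
        return Source.LITHIUM_BACKUP
    return Source.NONE


def should_charge(supercap_v, solar_available):
    """Replenish the main store whenever the panel delivers power and it is not full."""
    return solar_available and supercap_v < SUPERCAP_FULL_V
```

For example, `select_source(1.0, 3.7)` falls back to the<br />
lithium battery because the supercapacitor is below its<br />
assumed minimum voltage.<br />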
IV. THE DESIGN OF MONITORING CENTER<br />
The monitoring center controls the operation of the whole<br />
network through the base stations of all measurement and<br />
control areas. Its main tasks include sending control<br />
commands to the network, collecting and processing the<br />
monitoring data of each node in the network, storing the<br />
data in a database, and querying and analyzing historical<br />
data. The monitoring center is mainly composed of a PC and<br />
a wireless communication module. The hardware structure is<br />
shown in Fig. 5.<br />
In Fig. 5, the PC serves as the host computer and a CC2430<br />
serves as the wireless communication module; the two<br />
communicate through a serial port. The main functions of<br />
the monitoring center are described below.<br />
1. Network management and control, such as starting or<br />
stopping network operation and configuring network<br />
parameters. The network parameters include the data<br />
acquisition frequency of the sensor nodes, the frequency of<br />
submitting data to the base station, the length of each<br />
task time slot, the routing probability vector, and so on.<br />
The monitoring center can also query the operating state<br />
and environmental data and send control commands to the<br />
control nodes.<br />
2. Data storage. The monitoring center needs to preserve<br />
historical monitoring data for queries; this function is<br />
realized through the database.<br />
3. Data analysis and decision support. The monitoring data<br />
is analyzed by an agricultural expert system to establish<br />
the most suitable greenhouse environment control strategy.<br />
The base station of a measurement and control area not only<br />
controls all nodes of the area but is also the<br />
communication hub between the monitoring center and the<br />
area, mainly providing data forwarding and data buffering.<br />
Figure 5. Monitoring center hardware structure<br />
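The network parameters listed under function 1 could be<br />
held in a small configuration record such as the one<br />
sketched below. The field names, units, slot names, and all<br />
numeric values are hypothetical and serve only to show how<br />
the parameters fit together and how they might be checked<br />
before being sent to the network.<br />

```python
from dataclasses import dataclass, field


@dataclass
class NetworkParameters:
    """Parameters the monitoring center configures (field names are hypothetical)."""
    acquisition_period_s: float          # how often a sensor node samples
    report_period_s: float               # how often data is submitted to the base station
    slot_lengths_s: dict                 # length of each task time slot
    routing_probabilities: list = field(default_factory=list)

    def validate(self):
        if self.acquisition_period_s <= 0 or self.report_period_s <= 0:
            raise ValueError("periods must be positive")
        if any(length <= 0 for length in self.slot_lengths_s.values()):
            raise ValueError("every time slot needs a positive length")
        if self.routing_probabilities:
            if any(p < 0 for p in self.routing_probabilities):
                raise ValueError("routing probabilities must be non-negative")
            if abs(sum(self.routing_probabilities) - 1.0) > 1e-9:
                raise ValueError("routing probabilities must sum to 1")
        return True


# Illustrative values only; none of them come from the paper.
params = NetworkParameters(
    acquisition_period_s=60.0,
    report_period_s=300.0,
    slot_lengths_s={"idle": 2.0, "formation": 5.0, "intra_cluster": 10.0,
                    "inter_cluster": 10.0, "routing": 3.0},
    routing_probabilities=[0.0, 0.0, 0.25, 0.5, 0.25, 0.0, 0.0, 0.0],
)
is_valid = params.validate()
```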
V. SYSTEM SOFTWARE<br />
A. System software design<br />
The system software follows a modular design and is mainly<br />
composed of the greenhouse data collection system and the<br />
wireless control system. In the data acquisition system,<br />
each wireless sensor node acquires information about its<br />
surrounding environment and transfers it to the sink node<br />
over the wireless network. The sink node fuses the data and<br />
sends the fused data message to the controller. Meanwhile,<br />
the sink node receives instructions from the controller and<br />
forwards them to the sensor nodes. The flowchart of the<br />
system software is shown in Fig. 6.<br />
Figure 6. System software flowchart<br />
B. Software design of the monitoring center<br />
The monitoring center sends the system start commands in<br />
the spare time slot (Tidle) and receives the network<br />
monitoring data of each node in the inter-cluster<br />
communication time slot (Tinter). If necessary, other<br />
management and control commands can be sent in the spare<br />
time slot and the routing time slot. In the network<br />
formation time slot and the intra-cluster communication<br />
time slot, every node in the greenhouse is busy with<br />
networking and does not listen for commands from the<br />
control center, so the monitoring center sends no<br />
management and control commands in these slots and instead<br />
carries out some data processing tasks. Microsoft Access is<br />
adopted as the monitoring center database. The program<br />
flowchart of the monitoring center in the spare time slot<br />
is shown in Fig. 7.<br />
Figure 7. Program flowchart of the monitoring center in the<br />
spare time slot<br />
In the spare time slot, the monitoring center mainly<br />
performs the system start-up functions. If the system is<br />
starting for the first time, it must first connect to the<br />
database. The monitoring center then sends start commands<br />
to the base stations of all measurement and control areas<br />
in the greenhouse. If no confirmation is received from a<br />
base station and the retransmission limit has not been<br />
exceeded, the start command is resent. If the<br />
retransmission limit is exceeded, the fault diagnosis<br />
module is run. If the confirmation frame returned by the<br />
base station is received and the spare time slot is not<br />
over, the monitoring center can carry out other management<br />
and control commands.<br />
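The start-up handshake described above is a bounded<br />
retransmission loop, which can be sketched as follows. The<br />
callables `send_start` and `wait_for_ack`, the default limit<br />
of three retransmissions, and the `FakeLink` test double are<br />
all assumptions made for the illustration; the paper does<br />
not specify the serial protocol or the limit.<br />

```python
def start_base_station(send_start, wait_for_ack, max_retransmissions=3):
    """Return True if the base station confirmed, False if diagnosis is needed."""
    # One initial transmission plus up to max_retransmissions resends.
    for _attempt in range(1 + max_retransmissions):
        send_start()
        if wait_for_ack():
            return True
    return False


def start_network(base_stations, max_retransmissions=3):
    """base_stations maps an area name to a (send_start, wait_for_ack) pair."""
    failed = []
    for name, (send_start, wait_for_ack) in base_stations.items():
        if not start_base_station(send_start, wait_for_ack, max_retransmissions):
            failed.append(name)  # these areas go to the fault diagnosis module
    return failed


class FakeLink:
    """Stand-in for a base station link that confirms on a given attempt."""

    def __init__(self, ack_on_attempt):
        self.ack_on_attempt = ack_on_attempt  # None means it never confirms
        self.sent = 0

    def send_start(self):
        self.sent += 1

    def wait_for_ack(self):
        return self.ack_on_attempt is not None and self.sent >= self.ack_on_attempt


good, slow, dead = FakeLink(1), FakeLink(3), FakeLink(None)
failed = start_network({
    "area-1": (good.send_start, good.wait_for_ack),
    "area-2": (slow.send_start, slow.wait_for_ack),
    "area-3": (dead.send_start, dead.wait_for_ack),
})
```

In this run only `area-3` never confirms, so it is the one<br />
that would be passed to fault diagnosis.<br />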
In the inter-cluster communication time slot, the main task<br />
of the monitoring center is to collect the data submitted<br />
by the greenhouse WSN and store it in the database. If<br />
users issue management and control requests, these may be<br />
executed with priority. The program flowchart of the<br />
monitoring center in the inter-cluster communication time<br />
slot is shown in Fig. 8.<br />
Figure 8. Program flowchart of the monitoring center in the<br />
inter-cluster communication time slot<br />
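A minimal sketch of this time slot is given below. It uses<br />
Python's built-in `sqlite3` module as a stand-in for the<br />
Microsoft Access database named in the paper, and the table<br />
layout, frame format, and command names are hypothetical.<br />
User requests are sent before each received frame is<br />
stored, which is one simple way to give them priority.<br />

```python
import sqlite3
from collections import deque


def run_inter_cluster_slot(db, incoming_frames, user_commands, send_command):
    """Store submitted data, serving pending user commands first."""
    pending = deque(user_commands)
    for frame in incoming_frames:
        # Management and control requests take priority over storage.
        while pending:
            send_command(pending.popleft())
        db.execute(
            "INSERT INTO readings (node_id, quantity, value) VALUES (?, ?, ?)",
            frame,
        )
    while pending:  # commands still waiting when no frame arrived
        send_command(pending.popleft())
    db.commit()


# In-memory database and made-up frames, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (node_id TEXT, quantity TEXT, value REAL)")
sent = []
frames = [("ch-01", "temperature", 24.5), ("ch-01", "humidity", 61.0)]
run_inter_cluster_slot(db, frames, ["open_vent"], sent.append)
stored = db.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```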
C. Node deployment algorithm of the WSN-based greenhouse<br />
measurement and control system<br />
In the greenhouse WSN measurement and control system, the<br />
sensor nodes deployed in the greenhouse periodically<br />
collect various environmental data and send them to the<br />
control center by multi-hop communication, so the system is<br />
a typical centralized data collection network. In such a<br />
system, the nodes near the base station forward large<br />
quantities of data and die prematurely, which partitions<br />
the network or even paralyzes it completely. This energy<br />
consumption hotspot results from the unbalanced load<br />
distribution among the nodes and is called the funnel<br />
effect [6-7]. This paper addresses the funnel effect of the<br />
greenhouse WSN measurement and control system with<br />
redundant node technology. Taking a single greenhouse<br />
measurement and control area as the research object and the<br />
next-hop routing probability of a node as the fuzzy weight<br />
of an edge, it introduces fuzzy graph theory to calculate<br />
the probability that data travels from a source cluster<br />
head to a destination cluster head in m hops. This yields<br />
the data load distribution of the network in the<br />
measurement and control area, on which the redundant nodes<br />
deployment algorithm (RNDA) based on cluster load balancing<br />
is designed. To balance the network load, the algorithm<br />
uses three means: multi-path routing, redundant node<br />
deployment, and cluster head election. The key to RNDA is<br />
determining the routing probability vector $P_v$ of each<br />
cluster head, from which the network topology can be<br />
constructed. In the greenhouse WSN measurement and control<br />
system, the $P_v$ of cluster head $v$ is preset according<br />
to the geographical location of the node and serves as the<br />
basis of the routing algorithm. When the network begins to<br />
run, all kinds of nodes communicate with each other using<br />
the same preset $P_v$. If a neighboring cluster with which<br />
a cluster head can communicate can no longer produce a<br />
cluster head because the energy of all its nodes is<br />
exhausted, the cluster head topology changes and the $P_v$<br />
of the cluster head should be adjusted. The inter-cluster<br />
communication model is shown in Fig. 9. For convenience of<br />
description, the monitoring area is divided into a 5 × 5<br />
grid; the number of grids can be set freely in simulation.<br />
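To give a feel for the m-hop calculation, the sketch below<br />
treats the routing probabilities as an ordinary Markov<br />
transition matrix and raises it to the m-th power. This is<br />
a simplified stand-in and not the fuzzy graph computation<br />
used in the paper; the four-cluster-head topology and its<br />
probabilities are hypothetical.<br />

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def m_hop_matrix(p, m):
    """Entry [i][j] is the probability of being at head j after m hops from head i."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, p)
    return result


# Hypothetical topology: head 0 splits its traffic between heads 1 and 2,
# which both forward to head 3 next to the base station (absorbing state).
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
two_hop = m_hop_matrix(P, 2)
```

Here `two_hop[0][3]` collects both two-hop paths from head<br />
0 to head 3, which shows how load concentrates on the<br />
cluster heads closest to the base station.<br />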
$$P_v = \{\, p_v(e_{v1}),\ p_v(e_{v2}),\ p_v(e_{v3}),\ p_v(e_{v4}),\ p_v(e_{v5}),\ p_v(e_{v6}),\ p_v(e_{v7}),\ p_v(e_{v8}) \,\}$$<br />
Figure 9. Inter-cluster communication model<br />
Fig. 9(b) shows that each cluster head has at most eight<br />
routing directions, that is, $P_v$ has 8 components.<br />
Depending on the category of the cluster head, routing<br />
probability values are assigned to one or several of these<br />
directions. The routing probabilities $P_v(e)$ can be<br />
chosen freely, provided that they sum to 1. In Fig. 9(a),<br />
the cluster heads are divided by geographical position into<br />
hot cluster heads H (black dots), boundary cluster heads,<br />
general cluster heads (hollow circles), and so on.<br />
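How a cluster head might use such a vector can be sketched<br />
as follows: weights for the eight directions are normalised<br />
so that they sum to 1, and the next hop is drawn at random<br />
according to them. The direction labels, the example<br />
weights, and the function names are assumptions made for<br />
illustration; the paper does not give the routing code.<br />

```python
import random

# Labels for the eight possible routing directions of a cluster head.
DIRECTIONS = ("e_v1", "e_v2", "e_v3", "e_v4", "e_v5", "e_v6", "e_v7", "e_v8")


def make_routing_vector(weights):
    """Normalise non-negative weights for the eight directions so they sum to 1."""
    if len(weights) != len(DIRECTIONS):
        raise ValueError("one weight per routing direction is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one direction needs a positive weight")
    return [w / total for w in weights]


def choose_next_hop(routing_vector, rng=random):
    """Draw one direction according to the routing probability vector."""
    return rng.choices(DIRECTIONS, weights=routing_vector, k=1)[0]


# Hypothetical cluster head that only routes toward three neighbours.
p_v = make_routing_vector([0, 0, 1, 2, 1, 0, 0, 0])
rng = random.Random(0)
hops = [choose_next_hop(p_v, rng) for _ in range(1000)]
```

Directions with probability 0 are never selected, so<br />
setting only a few components of the vector restricts a<br />
cluster head to a subset of its neighbours.<br />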
The simulation considers how the network lifetime is<br />
affected by whether or not the cluster heads adopt a data<br />
fusion strategy. The main purpose of WSN data fusion is to<br />
reduce the amount of network data by merging the redundant<br />
information of the sensor nodes. In the simulation<br />
experiments, data fusion is carried out at the cluster head<br />
nodes, and the data fusion coefficient is assumed to be 1<br />
($\sigma = 1$) when the data fusion strategy is not<br />
executed. If the data fusion strategy is adopted, different<br />
data fusion coefficients are chosen according to the degree<br />
of fusion. Because the sensor nodes here are homogeneous,<br />
the type of information they collect is the same, and,<br />
statistically, environmental parameters do not differ much<br />
within a small range. Therefore, the data of all child<br />
nodes in one grid are fused into a single datum that<br />
describes the environmental information of the grid (e.g.,<br />
temperature, humidity). In the simulation experiments, the<br />
data fusion coefficient is assumed to be $\sigma = 1/a$<br />
when the data fusion strategy is adopted, where $a$ is the<br />
number of active nodes inside the grid; $a$ is set to 5 in<br />
all the following simulations. An M-file program<br />
implementing the algorithm was written in Matlab 7.0 to<br />
study the performance of RNDA and to compare it with<br />
uniform deployment. In the uniform deployment mode, the<br />
redundant nodes are evenly distributed among the clusters,<br />
and the network operates in three-task-slot mode.<br />
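The two coefficient settings can be made concrete with a<br />
small worked example. It assumes, as a reading that the<br />
paper does not state explicitly, that a cluster head<br />
receives one data unit $D_0$ per round from each of its $a$<br />
active nodes and forwards $\sigma$ times the received<br />
volume:<br />

```latex
\begin{align*}
D_{\text{out}} &= \sigma \, a \, D_0 \\
\sigma = 1,\ a = 5 &\;\Rightarrow\; D_{\text{out}} = 5 D_0 \\
\sigma = 1/a,\ a = 5 &\;\Rightarrow\; D_{\text{out}} = D_0
\end{align*}
```

Under this assumption, fusion reduces the traffic that a<br />
cluster head sends toward the base station by a factor of<br />
$a$.<br />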
1. Fig. 10 shows the results for the 4 × 4 grid, in which<br />
$d = 25$ m, the communication distance within the cluster<br />
is $d_{CI} = \sqrt{2}\,d$, and $d_{CO} = 2 d_{CI}$. In<br />
Fig. 10(a) the data fusion coefficient is $\sigma = 1/a$;<br />
in Fig. 10(b) it is $\sigma = 1$.<br />
Figure 10. Impact of the redundant nodes on the network<br />
lifetime (4 × 4 grid): (a) data fusion coefficient<br />
$\sigma = 1/a$; (b) data fusion coefficient $\sigma = 1$.<br />
Axes: redundant nodes, network lifetime/round. Curves:<br />
uniform deployment, RNDA deployment.<br />
2. Fig. 11 shows the results for the 5 × 5 grid, in which<br />
$d = 20$ m. In Fig. 11(a) the data fusion coefficient is<br />
$\sigma = 1/a$; in Fig. 11(b) it is $\sigma = 1$.<br />
Figure 11. Impact of the redundant nodes on the network<br />
lifetime (5 × 5 grid): (a) data fusion coefficient<br />
$\sigma = 1/a$; (b) data fusion coefficient $\sigma = 1$.<br />
As the figures above show, the network lifetime with the<br />
data fusion strategy is roughly 2 to 3 times that without<br />
it. The number of virtual grids also affects the network<br />
lifetime: the more virtual grids the monitoring area is<br />
divided into, the greater the amount of network data and<br />
the shorter the network lifetime. Comparing RNDA with the<br />
uniform deployment mode, Fig. 11(a) shows that the network<br />
lifetime is improved by 35.8 percent at points A and B. To<br />
reach the same network lifetime, RNDA saves many redundant<br />
nodes: compared with the uniform mode, RNDA deploys only<br />
24% of the redundant nodes when the network lifetime is<br />
$3.5 \times 10^{4}$ rounds.<br />
VI. CONCLUSION<br />
According to the characteristics of modern greenhouse<br />
production, this paper introduces wireless sensor network<br />
technology into the greenhouse wireless measurement and<br />
control system, so that the whole greenhouse system can be<br />
adjusted automatically by combining wireless sensor network<br />
technology with greenhouse control technology. In hardware,<br />
the WSN nodes are mainly composed of an ATmega128L<br />
microcontroller and a CC2420 wireless transceiver chip. In<br />
software, a modular design is adopted and the deployment of<br />
the sensor nodes is analyzed in depth. The simulation<br />
results show that the proposed algorithm can effectively<br />
prolong the network lifetime.<br />
REFERENCES<br />
[1] Du Xiaoming, Chen Yan. The Realization of Greenhouse<br />
Controlling System Based on Wireless Sensor Network[J].<br />
Journal of Agricultural Mechanization Research,<br />
2009(6): 141-144.<br />
[2] Qiao Xiaojun, Zhang Xin, Wang Cheng, et al. Application<br />
of the wireless sensor networks in agriculture[J].<br />
Transactions of the CSAE, 2005, 9(21): 232-234.<br />
[3] S.L. Speetjens, H.J.J. Janssen, et al. Methodic design<br />
of a measurement and control system for climate control in<br />
horticulture[J]. Computers and Electronics in Agriculture,<br />
2008, (64): 162-172.<br />
[4] Wang Linji. The Design of Realizing Change Temperature<br />
Control in Greenhouse by PLC[J]. Electrical Engineering,<br />
2008, 5: 81-83.<br />
[5] Liu Yanzheng, Teng Guanghui, Liu Shirong. The problem<br />
of the control system for Greenhouse Climate[J]. Chinese<br />
Agricultural Science Bulletin, 2007, 23: 154-157.<br />
[6] C. Y. Wan, S. B. Eisenman, A. T. Campbell, et al.<br />
Overload traffic management for sensor networks[J]. ACM<br />
Transactions on Sensor Networks, 2007, 3, Article No. 18.<br />
[7] G. S. Ahn, E. Miluzzo, A. T. Campbell, et al.<br />
Funneling-MAC: A Localized, Sink-Oriented MAC For Boosting<br />
Fidelity in Sensor Networks[C]. Proceedings of the 4th<br />
international conference on Embedded networked sensor<br />
systems. New York: ACM, 2006: 293-306.<br />
[8] Li Nan, Liu Chengliang, Li Yanming, Zhang Jiabao, Zhu<br />
Anning. Development of remote monitoring system for soil<br />
moisture based on 3S technology alliance[J]. Transactions<br />
of the CSAE, 2010, 26(4): 169-173.<br />
[9] P. Santi, J. Simon. Silence Is Golden with High<br />
Probability: Maintaining a Connected Backbone in<br />
Wireless Sensor Network[C]. 1st European Workshop on<br />
Wireless Sensor Networks. Berlin: wireless sensor<br />
networks, proceedings, 2004: 106-121.<br />
[10] F. Chen, P. Jiang, Q. He. Phased waking coverage scheme<br />
based on hibernation of redundant nodes for wireless<br />
sensor networks[C]. Proceedings-International Symposium<br />
on Computer Science and Computational Technology. NJ:<br />
Institute of Electrical and Electronics Engineers Computer<br />
Society, 2008: 709-713.<br />
[11] Z.M. Li, L. Lei. Sensor Node Deployment in Wireless<br />
Sensor Networks Based on Improved Particle Swarm<br />
Optimization[C]. Proceedings of 2009 IEEE International<br />
Conference on Applied Superconductivity and<br />
Electromagnetic Devices, 2009: 25-27.<br />
[12] J.H. Tarng, B.W. Chuang, P.C. Liu. A relay node<br />
deployment method for disconnected wireless sensor<br />
networks: Applied in indoor environments[J]. Journal of<br />
Network and Computer Applications, 2009, 32: 652-659.<br />
[13] Y.C. Wang, Y.C. Tseng. Distributed Deployment Schemes<br />
for Mobile Wireless Sensor Networks to Ensure Multilevel<br />
Coverage[J]. IEEE Transactions on Parallel and Distributed<br />
Systems, 2008, 19(9): 1280-1294.<br />
[14] P. Gajbhive, A. Mahajan. A Survey of Architecture and<br />
Node deployment in Wireless Sensor Network[C]. 1st<br />
International Conference on the Applications of Digital<br />
Information and Web Technologies, ICADIWT 2008: 426-430.<br />
[15] W.T. Xu, X.H. Hao, C.L. Dang. Connectivity Probability<br />
Based on Star Type Deployment Strategy for Wireless<br />
Sensor Networks[C]. Proceedings of the 7th World<br />
Congress on Intelligent Control and Automation,<br />
2008: 1738-1742.<br />
Yongxian Song was born in Xuzhou on April 1, 1975. He<br />
received the B.S. degree in Applied Electronic Technology<br />
from Huaihai Institute of Technology, Lianyungang, China,<br />
in 1997, and the M.S. degree in Control Theory and Control<br />
Engineering from Jiangsu University, Zhenjiang, China, in<br />
2006. Since 2009, he has been studying for the Ph.D. degree<br />
in Control Theory and Control Engineering at Jiangsu<br />
University, Zhenjiang, China. Since 2006, he has been a<br />
teacher at Huaihai Institute of Technology, Lianyungang,<br />
China. His current research interests include signal<br />
processing, intelligent control, and industrial control.<br />
Chenglong Gong, male, was born in 1964. He received the<br />
B.S. degree in Automatic Control from the University of<br />
Electronic Science and Technology, Chengdu, China, in 1984,<br />
and the M.S. degree in Automation Control from China<br />
University of Mining and Technology, Xuzhou, China, in<br />
1988.<br />
He is currently a professor in the Department of Electronic<br />
Engineering of Huaihai Institute of Technology, Lianyungang<br />
222005, China. His main research interests are automatic<br />
measurement, control and system theory, and computer<br />
network applications.<br />
Yuan Feng was born in Lianyungang on March 28, 1978. He<br />
received the B.S. degree in Computer Hardware and<br />
Application from Huaihai Institute of Technology,<br />
Lianyungang, China, in 1999, and the M.S. degree in<br />
Industrial Control from Nanjing University of Science,<br />
Nanjing, China, in 2007. Since 1999, he has been a teacher<br />
at Huaihai Institute of Technology, Lianyungang, China. His<br />
current research interests include signal processing and<br />
computer control technology.<br />
Juanli Ma, female, was born in 1976 and is a lecturer. From<br />
1995 to 1999 she studied electrical automation at Gansu<br />
University of Technology and obtained a bachelor's degree.<br />
From 2004 to 2007 she studied control theory and control<br />
engineering at Northwestern Polytechnical University and<br />
obtained a Master's degree in Engineering. Since 1999, she<br />
has been working at Huaihai Institute of Technology.<br />
Xianjin Zhang was born in Suqian in 1975. He received the<br />
B.S. degree in Applied Electronic Technology from Guilin<br />
University of Electronic Technology, Guilin, China, in<br />
1998, and the M.S. degree in Power Electronic and Control<br />
Engineering from Nanjing University of Aeronautics &<br />
Astronautics, Nanjing, China, in 2005. Since 2005, he has<br />
been a teacher at Huaihai Institute of Technology,<br />
Lianyungang, China. His current research interests include<br />
electric and electronic conversion techniques.
JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1675<br />
Simulation of Networked Control System based<br />
on Smith Compensator and Single Neuron<br />
Incomplete Differential Forward PID<br />
Haitao Zhang<br />
Electronic and Information Engineering College, Henan University of Science and Technology, Luoyang, China<br />
Email: zhang_haitao@163.com<br />
Zhen Li<br />
Electronic and Information Engineering College, Henan University of Science and Technology, Luoyang, China<br />
Email: lizhenzhen1228@163.com<br />
Abstract—For networked control systems with random<br />
time delay in the forward and feedback channels, a<br />
controller based on a Smith compensator and single neuron<br />
incomplete differential forward PID is presented. First,<br />
using the root locus method and Simulink simulation software,<br />
the influence of network time delay on system<br />
stability and dynamic performance is analyzed. Then,<br />
a Smith compensation model combined with the incomplete<br />
differential forward PID control algorithm is established.<br />
Compared with existing Smith compensators, the proposed<br />
control model is easy to implement and achieves<br />
better control performance when the compensator model<br />
is mismatched. Finally, a simulation study on a<br />
DC motor is carried out, and the results show the<br />
effectiveness of the proposed method.<br />
Index Terms—networked control system; Smith<br />
compensator; incomplete differential forward PID; single<br />
neuron<br />
I. INTRODUCTION<br />
With the extensive application of large-scale control<br />
systems, the networked control system (NCS) has attracted<br />
the attention of many researchers. In a networked<br />
control system, communication among controllers,<br />
sensors and actuators is performed through the network.<br />
Compared with traditional control, it offers<br />
resource sharing, high reliability, low<br />
cost, and ease of maintenance and extension. However, since the<br />
carrying capacity and communication<br />
bandwidth of the network are fixed and limited, collision and<br />
retransmission of information are inevitable, which<br />
causes network-induced delay during<br />
information transmission. The network-induced delay<br />
degrades the real-time capability of the system and<br />
can even lead to instability. At present, for the<br />
random network-induced delay, two main research<br />
Manuscript received Mar. 1, 2011; revised Apr. 1, 2011; accepted<br />
Apr. 12, 2011.<br />
Project number: 61040010<br />
© 2011 ACADEMY PUBLISHER<br />
doi:10.4304/jnw.6.12.1675-1681<br />
methods are adopted in NCS design: the deterministic<br />
method and the stochastic method. The deterministic<br />
method converts the random delay to a fixed delay by<br />
introducing a data buffer, and then uses existing methods to<br />
design the controller[1][2]. However, this approach<br />
artificially extends the delay seen by the controller<br />
and lowers the control performance of the system. The<br />
stochastic method is based on a random discrete-time<br />
model. Nilsson discusses the design of an LQG optimal controller<br />
within the framework of discrete control systems<br />
in which the independent random delay is less than one<br />
sampling period and obeys a Markov<br />
distribution[3], but this method requires the<br />
probability characteristics of the time delay,<br />
including mean, variance and other properties, to be known in advance. The<br />
amount of computation is so large that it is not easy to<br />
implement. Hu proposes the use of stochastic optimal<br />
control and optimal state estimation methods[4]; the<br />
method is mainly used when the time delay<br />
is more than one sampling period. Bauer uses a Smith<br />
predictor to compensate the time delay in the networked<br />
control system; the control structure is simple, but<br />
the exact value of the network delay must be known<br />
in advance[5].<br />
The rest of the paper is organized as follows. In<br />
Section 2, we present the system structure of the NCS and<br />
analyze the influence of network-induced delay on<br />
system stability and dynamic performance. Section 3<br />
presents a design method for NCS based on a Smith<br />
compensator and a single neuron incomplete differential<br />
forward PID controller. Section 4 gives simulation<br />
results for a DC motor model, which<br />
show the effectiveness of the proposed method. Finally, a<br />
brief summary is given in Section 5.<br />
II. SYSTEM DESCRIPTION<br />
A. System Structure<br />
The basic structure of the networked control system is<br />
shown in Figure 1. The controller, actuator and sensor<br />
transmit data over the network, so there are essentially<br />
three kinds of delay in the system: the<br />
communication delay $\tau^{sc}$ between the sensor and the<br />
controller, the computational delay $\tau^{c}$ in the controller, and the<br />
communication delay $\tau^{ca}$ between the controller and the<br />
actuator. Because $\tau^{c}$ is very small, it is usually<br />
merged into $\tau^{ca}$, so the system delay is<br />
expressed as $\tau = \tau^{sc} + \tau^{ca}$. To analyze the<br />
effect of network delay, we treat the networked control<br />
system as a continuous-time system; a typical block diagram<br />
is shown in Figure 2, where $R(s)$, $U(s)$, $Y(s)$ and<br />
$E(s) = R(s) - Y(s)$ are the reference, control,<br />
output, and error signals in the $s$ domain, respectively.<br />
$G_c(s)$ is the transfer function of the controller, and $G_p(s)$ is<br />
the transfer function of the controlled object.<br />
Figure 1. The basic structure of networked control system<br />
[Block diagram: $R(s)$, $E(s)$, $G_c(s)$, $U(s)$, forward-channel delay $e^{-\tau^{ca} s}$, $G_p(s)$, $Y(s)$; feedback-channel delay $e^{-\tau^{sc} s}$]<br />
Figure 2. A typical block diagram of networked control systems<br />
The transfer function of the closed-loop system shown<br />
in Figure 2 can be expressed as follows:<br />
$$\frac{Y(s)}{R(s)} = \frac{G_c(s) G_p(s) e^{-\tau^{ca} s}}{1 + G_c(s) G_p(s) e^{-\tau^{sc} s} e^{-\tau^{ca} s}} \quad (1)$$<br />
In Figure 2, $\tau^{ca}$ and $\tau^{sc}$ are the time delays<br />
of the forward and feedback channels, respectively. $\tau^{ca}$ delays the action of the control<br />
signal on the controlled object, so the<br />
response of the system lags behind its input.<br />
$\tau^{sc}$ delays the generation of a new<br />
control signal.<br />
To analyze the closed-loop control system<br />
under network delay, a typical approach is to<br />
approximate the delays with a rational function [6]:<br />
$$e^{-\tau s} \cong \left(1 + \frac{\tau s}{n}\right)^{-n} \quad (2)$$<br />
where $\tau$ may be $\tau^{ca}$ or $\tau^{sc}$.<br />
Because the primary branches of the root locus of a<br />
control system usually contain the dominant eigenvalues<br />
of the system, this approximation is adequate for<br />
practical applications.<br />
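The quality of approximation (2) can be checked numerically on the imaginary axis. The following is a minimal sketch, not part of the original paper; the function names `delay_exact` and `delay_approx` and the sample values of $\tau$ and $\omega$ are illustrative assumptions.<br />

```python
import cmath


def delay_exact(tau, s):
    """Exact pure delay term e^(-tau*s)."""
    return cmath.exp(-tau * s)


def delay_approx(tau, s, n=4):
    """Rational approximation (1 + tau*s/n)^(-n) from formula (2)."""
    return (1 + tau * s / n) ** (-n)


if __name__ == "__main__":
    tau = 0.1  # illustrative delay in seconds (assumption)
    for omega in (1.0, 5.0, 20.0):
        s = complex(0.0, omega)
        err = abs(delay_exact(tau, s) - delay_approx(tau, s))
        print(f"omega = {omega:5.1f} rad/s  |error| = {err:.5f}")
```

The error grows with $\omega\tau$ and shrinks as $n$ increases. Unlike the exact delay, whose magnitude is always 1, the approximation also attenuates the signal slightly.<br />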
B. System Stability Analysis<br />
In this paper, we use a DC motor as the controlled object<br />
to analyze system stability. The transfer function of<br />
the controlled plant is as follows[7]:<br />
$$G_p(s) = \frac{2029.826}{(s + 26.29)(s + 2.296)} \quad (3)$$<br />
The controller uses PID control, with the transfer function<br />
$$G_c(s) = \frac{K_p\left((K_d/K_p)s^2 + s + (K_i/K_p)\right)}{s} = K_p\left(T_d s + 1 + \frac{1}{T_i s}\right) \quad (4)$$<br />
where $K_p$, $T_d$ and $T_i$ are the proportional gain,<br />
differential time constant and integral time constant,<br />
respectively.<br />
Using formulas (1) to (4) and selecting the<br />
controller parameters $K_p = 0.1701$, $T_d = 0$, $T_i = 0.45$ and $n = 4$,<br />
the open-loop transfer function is expressed as<br />
follows:<br />
$$\begin{aligned} G_c(s) e^{-\tau^{ca} s} G_p(s) e^{-\tau^{sc} s} &\cong G_c(s) G_p(s) \left(1 + \frac{\tau s}{n}\right)^{-n} \\ &= K_p\left(T_d s + 1 + \frac{1}{T_i s}\right) G_p(s) \frac{1}{\left(\frac{\tau s}{n} + 1\right)^{n}} \\ &= \frac{0.1701 (s + 2.222)}{s} \cdot \frac{2029.826}{(s + 26.29)(s + 2.296)} \cdot \frac{1}{\left(\frac{\tau s}{4} + 1\right)^{4}} \\ &= \frac{345.2734 (s + 2.222)}{s (s + 26.29)(s + 2.296)\left(\frac{\tau s}{4} + 1\right)^{4}} \end{aligned} \quad (5)$$<br />
As formula (5) shows, the delay approximation adds a fourfold<br />
open-loop negative real pole at $s = -4/\tau$, which moves from<br />
negative infinity toward 0 as $\tau$ increases from 0 to<br />
positive infinity. These poles raise the system<br />
order, change the distribution of the root locus on the<br />
real axis and shift the root locus to the right, which is<br />
disadvantageous to the stability of the system.<br />
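The loss of stability margin with growing $\tau$ can be illustrated numerically. The sketch below is not the authors' Matlab root-locus program: it swaps in a frequency-domain gain-margin check on the open-loop transfer function of formula (5), and the function names, the bisection bracket and the sample delays are assumptions made for illustration.<br />

```python
import math


def loop_gain(omega, tau, n=4):
    """|L(j*omega)| for the open-loop transfer function of formula (5)."""
    s = complex(0.0, omega)
    L = 345.2734 * (s + 2.222) / (
        s * (s + 26.29) * (s + 2.296) * (tau * s / n + 1) ** n
    )
    return abs(L)


def loop_phase(omega, tau, n=4):
    """Unwrapped phase of L(j*omega) in radians, summed factor by factor."""
    return (
        math.atan(omega / 2.222)      # zero at s = -2.222
        - math.pi / 2                 # integrator
        - math.atan(omega / 26.29)    # plant pole
        - math.atan(omega / 2.296)    # plant pole
        - n * math.atan(omega * tau / n)  # n-fold delay pole at s = -n/tau
    )


def gain_margin(tau, n=4):
    """Gain margin 1/|L| at the frequency where the phase crosses -180 deg."""
    lo, hi = 1e-3, 1e4  # assumed bracket in rad/s: phase > -pi at lo, < -pi at hi
    for _ in range(100):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic frequency scale
        if loop_phase(mid, tau, n) > -math.pi:
            lo = mid
        else:
            hi = mid
    return 1.0 / loop_gain(lo, tau, n)


if __name__ == "__main__":
    for tau in (0.05, 0.1, 0.2, 0.5):
        print(f"tau = {tau:4.2f} s  gain margin = {gain_margin(tau):.3f}")
```

For this loop shape, a gain margin above 1 indicates a stable closed loop and a value below 1 an unstable one. The margin computed this way falls steadily as $\tau$ grows, which agrees with the rightward shift of the root locus described above.<br />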
[Root locus plot: real axis versus imaginary axis, with curves for $\tau = 0.1$, $\tau = 0.2$ and $\tau = 0.5$]<br />
Figure 3. Primary branches of root locus with different delay<br />
We select different time delays to analyze the<br />
networked control system, which provides a reference<br />
for analyzing the effect of delay on such systems.<br />
The primary branches of the root locus for different<br />
time delays, computed by a Matlab program, are shown in Figure 3.<br />
As Figure 3 shows, the stability region of the system<br />
shrinks as the delay increases.<br />
C. Dynamic Performance Analysis of System<br />
From the previous subsection, we know that the stable region<br />
shrinks as the time delay increases. In this subsection,<br />
we analyze the influence of random time delay on<br />
the performance of networked control systems in<br />
Simulink. The simulation model is shown in Figure 4,<br />
with the following parameters: the input signal is 50 rad/s,<br />
the sampling period is 10 ms, the control algorithm<br />
is formula (4), the model of the controlled object is<br />
formula (3), and the time delay obeys a uniform distribution,<br />
simulated by a Gaussian random<br />
number generator and a network delay module.<br />
Figure 4. Simulation model of network control system<br />
Under a typical input signal, the<br />
performance criteria that reflect the time response of a<br />
control system consist of two parts: static<br />
performance criteria and dynamic performance<br />
criteria. We choose the mean square error $MSE$, the<br />
overshoot $M_P$ and the adjusting time $t_s$ to reflect the tracking<br />
error, control accuracy, stability and rapidity of the<br />
system response. The performance cost<br />
function is as follows[8]:<br />
$$J = \omega_1 J_1 + \omega_2 J_2 + \omega_3 J_3 \quad (6)$$<br />
$$J_1 = \begin{cases} (MSE - MSE_0)^2, & MSE > MSE_0 \\ 0, & MSE \le MSE_0 \end{cases} \qquad J_2 = \begin{cases} (M_P - M_{P0})^2, & M_P > M_{P0} \\ 0, & M_P \le M_{P0} \end{cases} \qquad J_3 = \begin{cases} (t_s - t_{s0})^2, & t_s > t_{s0} \\ 0, & t_s \le t_{s0} \end{cases} \quad (7)$$<br />
where<br />
$$MSE = \frac{1}{N} \sum_{k=0}^{N} e^2(k) \quad (8)$$<br />
$MSE$ represents the mean square error of the system, and<br />
$e(k) = y(k) - r(k)$ represents the output error of the system<br />
at $t = kh$,<br />
where $k$ is the sampling index and<br />
$h$ is the sampling period. $MSE_0$, $M_{P0}$ and $t_{s0}$ are the<br />
nominal mean square error, nominal overshoot and nominal<br />
adjusting time when the system has no<br />
time delay. $J_1$, $J_2$ and $J_3$ are the performance criteria<br />
measuring how far $MSE$, $M_P$ and $t_s$ deviate from their nominal values,<br />
and satisfy $J_1 = J_2 = J_3 = 0$ when the system<br />
has no time delay. $\omega_1$, $\omega_2$ and $\omega_3$ are the weight coefficients<br />
of $J_1$, $J_2$ and $J_3$ respectively; they range from 0 to 1<br />
and satisfy $\omega_1 + \omega_2 + \omega_3 = 1$.<br />
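Formulas (6) to (8) translate directly into a few lines of code. The following is a minimal sketch; the function names, the equal default weights and the tuple layout are illustrative assumptions, the nominal values are those reported in the paper for the delay-free system, and the mean is taken over the number of samples supplied, which differs slightly from the index range written in formula (8).<br />

```python
def mean_square_error(errors):
    """Mean of squared errors e(k) over the supplied samples, as in formula (8)."""
    return sum(e * e for e in errors) / len(errors)


def penalty(value, nominal):
    """One term of formula (7): squared excess over the nominal value, else 0."""
    return (value - nominal) ** 2 if value > nominal else 0.0


def cost(mse, overshoot, settling_time,
         weights=(1 / 3, 1 / 3, 1 / 3),      # assumed default, sums to 1
         nominal=(0.00595, 0.05, 0.309)):    # MSE_0, M_P0 (5%), t_s0 from the paper
    """Weighted performance cost J of formula (6)."""
    terms = (
        penalty(mse, nominal[0]),
        penalty(overshoot, nominal[1]),
        penalty(settling_time, nominal[2]),
    )
    return sum(w * j for w, j in zip(weights, terms))


if __name__ == "__main__":
    # Overshoot of 15% with nominal MSE and settling time, weighting J_2 only.
    print(cost(0.00595, 0.15, 0.309, weights=(0.0, 1.0, 0.0)))
```

Setting one weight to 1 and the others to 0, as in the three cases discussed next, isolates a single criterion.<br />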
When the system has no time delay, we obtain the step<br />
response curve of the system by running the<br />
simulation model of Figure 4 and compute the nominal values<br />
of $MSE$, $M_P$ and $t_s$:<br />
$MSE_0 = 0.00595$, $M_{P0} = 5\%$,<br />
$t_{s0} = 0.309$, and $J = 0$.<br />
When $\omega_1 = 1$ and $\omega_2 = \omega_3 = 0$, $J = J_1$, and the cost<br />
function reflects the response process of the system and its<br />
relative stability at steady state. Its<br />
variation with time delay is shown in Figure 5(a).<br />
When $\omega_2 = 1$ and $\omega_1 = \omega_3 = 0$, $J = J_2$, and the cost<br />
function reflects the stability of the system. Its<br />
variation with time delay is shown in Figure 5(b).<br />
When $\omega_3 = 1$ and $\omega_1 = \omega_2 = 0$, $J = J_3$, and the cost<br />
function reflects the rapidity of the system response. Its<br />
variation with time delay is shown in Figure<br />
5(c).<br />
As Figure 5 shows, time delay lowers the<br />
stability and dynamic performance of the system. If $\tau < 12s$,<br />
the control accuracy drops slightly, but the<br />
system still has good stability and dynamic<br />
performance. If $\tau \ge 12s$, the dynamic performance,<br />
stability and control<br />
accuracy of the system all become poor. If the time delay<br />
reaches $40s$, the system becomes unstable.<br />
Figure 5. The waveforms of $J_1$, $J_2$ and $J_3$ with delay change<br />
In the following section, we improve the<br />
performance of the networked control system by introducing a<br />
Smith compensator and single neuron incomplete<br />
differentiation on the basis of classical PID control.<br />
III. SMITH COMPENSATOR AND SINGLE NEURON<br />
INCOMPLETE DIFFERENTIAL FORWARD PID CONTROLLER<br />
A. The Principle of Smith Compensator<br />
The control system model with a Smith compensator is<br />
shown in Figure 6.<br />
Figure 6. Structure of Smith predictor<br />
$G_c(s)$ is the transfer function of the controller, and $G_p(s) e^{-\tau_p s}$<br />
is the transfer function of the controlled object with<br />
pure time delay. $G_p'(s)\left(1 - e^{-\tau_p' s}\right)$ is the compensation function<br />
introduced by the Smith compensator. The<br />
closed-loop transfer function is then expressed as<br />
follows:<br />
$$\frac{Y(s)}{R(s)} = \frac{G_c(s) G_p(s) e^{-\tau_p s}}{1 + G_c(s) G_p'(s) + G_c(s)\left(G_p(s) e^{-\tau_p s} - G_p'(s) e^{-\tau_p' s}\right)} \quad (9)$$<br />
When the system satisfies $G_p'(s) = G_p(s)$ and $\tau_p' = \tau_p$,<br />
formula (9) simplifies to the following<br />
relation:<br />
$$\frac{Y(s)}{R(s)} = \frac{G_c(s) G_p(s)}{1 + G_c(s) G_p(s)} e^{-\tau_p s} \quad (10)$$<br />
Its characteristic equation is as follows:<br />
$$1 + G_c(s) G_p(s) = 0 \quad (11)$$<br />
Formula (11) does not include the pure time delay, so the<br />
Smith compensator eliminates the influence of pure time<br />
delay on system stability, which could otherwise make the<br />
control system unstable.<br />
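The reduction from formula (9) to formula (10) can be verified numerically at any complex frequency. The sketch below is illustrative only: it reuses the DC motor plant of formula (3) and the PI settings $K_p = 0.1701$, $T_i = 0.45$ from Section II, while the function names, the test point $s$ and the delay values are assumptions.<br />

```python
import cmath


def Gp(s):
    """DC motor plant of formula (3)."""
    return 2029.826 / ((s + 26.29) * (s + 2.296))


def Gc(s, Kp=0.1701, Ti=0.45):
    """PI controller (formula (4) with T_d = 0)."""
    return Kp * (1 + 1 / (Ti * s))


def smith_closed_loop(s, tau, tau_model, plant=Gp, model=Gp):
    """Closed-loop transfer function of formula (9)."""
    gc = Gc(s)
    num = gc * plant(s) * cmath.exp(-tau * s)
    den = 1 + gc * model(s) + gc * (
        plant(s) * cmath.exp(-tau * s) - model(s) * cmath.exp(-tau_model * s)
    )
    return num / den


def matched_closed_loop(s, tau):
    """Closed-loop transfer function of formula (10) (perfect model)."""
    gc = Gc(s)
    return gc * Gp(s) / (1 + gc * Gp(s)) * cmath.exp(-tau * s)


if __name__ == "__main__":
    s = complex(0.5, 2.0)  # arbitrary test point (assumption)
    print(abs(smith_closed_loop(s, 0.03, 0.03) - matched_closed_loop(s, 0.03)))
    print(abs(smith_closed_loop(s, 0.03, 0.05) - matched_closed_loop(s, 0.03)))
```

With a matched model the two expressions agree to rounding error; once the modeled delay differs from the true delay, the delay terms no longer cancel and the responses diverge.<br />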
B. System Structure of NCS<br />
The Smith compensator is a control algorithm<br />
commonly used in systems with time delay. To reduce<br />
the influence of time delay on networked control system<br />
performance, the Smith compensator has been introduced<br />
into the networked control system; the resulting<br />
structure is shown in<br />
Figure 7.<br />
Figure 7. The structure of networked control system with Smith<br />
compensator<br />
In Figure 7, $G_p(s)$ is the transfer function of the controlled<br />
object, $G_p'(s)$ is the predicted model of the controlled object,<br />
and $\tau^{cam}$ and $\tau^{scm}$ are the predicted models of the time delays<br />
$\tau^{ca}$ and $\tau^{sc}$, respectively. The closed-loop transfer<br />
function of the system is then expressed as follows:<br />
$$\frac{Y(s)}{R(s)} = \frac{G_c(s) e^{-\tau^{ca} s} G_p(s)}{1 + G_c(s) G_p'(s) + G_c(s)\left(e^{-\tau^{ca} s} G_p(s) e^{-\tau^{sc} s} - e^{-\tau^{cam} s} G_p'(s) e^{-\tau^{scm} s}\right)} \quad (12)$$<br />
When the system designed with the Smith compensator<br />
has no model mismatch, i.e. $G_p'(s) = G_p(s)$,<br />
$\tau^{scm} = \tau^{sc}$ and $\tau^{cam} = \tau^{ca}$, the transfer function of the system<br />
shown in Figure 7 simplifies as follows:<br />
$$\frac{Y(s)}{R(s)} = \frac{G_c(s) G_p(s)}{1 + G_c(s) G_p(s)} e^{-\tau^{ca} s} \quad (13)$$<br />
In this case, the networked control system can be<br />
simplified to Figure 8. This figure shows that, when the<br />
mathematical model of the object is exact, no<br />
pure time delay remains inside the closed<br />
loop after the Smith compensator is added, so the delay no<br />
longer affects the characteristic equation of the system.<br />
Compared with a control system with no network-induced<br />
delay, the response is simply postponed by the<br />
time delay $\tau^{ca}$. For this reason, after adding the Smith<br />
compensator, the control quality is improved and<br />
the stability of the system can be ensured.<br />
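The step from formula (12) to formula (13) is a direct substitution. The short derivation below is a worked sketch in the notation of this section, added for illustration and not taken from the original paper.<br />

```latex
% Matched model: G_p'(s) = G_p(s), \tau^{cam} = \tau^{ca}, \tau^{scm} = \tau^{sc}.
% The bracketed mismatch term in the denominator of (12) vanishes:
\begin{aligned}
e^{-\tau^{ca} s} G_p(s) e^{-\tau^{sc} s} - e^{-\tau^{cam} s} G_p'(s) e^{-\tau^{scm} s}
  &= e^{-(\tau^{ca} + \tau^{sc}) s}\left(G_p(s) - G_p(s)\right) = 0, \\
1 + G_c(s) G_p'(s) + G_c(s) \cdot 0 &= 1 + G_c(s) G_p(s), \\
\frac{Y(s)}{R(s)} &= \frac{G_c(s) G_p(s)}{1 + G_c(s) G_p(s)}\, e^{-\tau^{ca} s}.
\end{aligned}
```

Only the forward-channel delay $\tau^{ca}$ survives, and it appears outside the loop as a pure output shift.<br />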
[Block diagram: $R(s)$, $G_c(s)$, $G_p(s)$, delay $e^{-\tau^{ca} s}$, $Y(s)$]<br />
Figure 8. The simplified diagram when the model is matched<br />
C. Related Research of Smith Compensator<br />
Since the Smith compensator relies on accurate<br />
mathematical models of the controlled object and the network<br />
delay, random network-induced delay and disturbances<br />
cause the model used in the Smith compensator to mismatch<br />
the controlled object.<br />
For these reasons, it is difficult to compensate<br />
network delay well using the Smith<br />
compensator alone. To overcome the impact of these<br />
factors, effective additional control measures are<br />
necessary. Researchers have presented many improved<br />
methods, which fall into two categories: structural<br />
improvement[9] and parameter tuning[10].<br />
To apply the Smith predictor to networked<br />
control systems and achieve satisfactory control,<br />
Du Feng presents two new kinds of Smith compensator.<br />
One is a double dynamic Smith compensator<br />
for the pure time delay of the controlled object and the network<br />
delay. This structure does not need to measure, predict<br />
or identify the time delay online, and suits<br />
networked systems with random, time-varying, uncertain<br />
time delay[11]. The other moves the pure time delay<br />
of the controlled object and the time delay between the<br />
controller and actuator in the forward path out of the<br />
closed control loop, and completely eliminates the time<br />
delay between the transducer and actuator in the feedback<br />
control loop. Then the feedback channel need not be<br />
scheduled to adjust network flow, so the<br />
network bandwidth can be utilized effectively, and the<br />
robustness of the system to packet loss in the<br />
feedback control loop is improved[12].<br />
Sujuan Wang et al. regard the network and the controlled<br />
object as a time-varying controlled system, estimate the<br />
time delay of the system using fading-memory LSM, and<br />
compensate it with a Smith compensator. By combining<br />
immune feedback control with PI control, they obtain a fuzzy<br />
immune PI controller that adjusts the parameters of<br />
the PI controller according to the change of the control<br />
quantity, and thus overcomes the<br />
control error caused by errors in the time delay<br />
estimate[13-14].<br />
For networked systems with long time delay,<br />
Huiying Chen et al. combine the Smith compensator with a T-S<br />
fuzzy model and obtain a PI controller with Smith<br />
compensator. The approach gives the closed-loop system<br />
better stability and robustness[15].<br />
Peng Chen et al. construct an adaptive Smith<br />
compensator, which compensates the long time<br />
delay of an NCS based on an IP network by adding a first-order<br />
filter on the feedback loop. This approach<br />
efficiently eliminates the negative effects caused by time delay<br />
and achieves better robustness[16-17].<br />
Reiquan Lin et al. give a design method for a neuron<br />
adaptive controller based on the Smith compensator. They<br />
study its application to an electric heating furnace by<br />
simulation, and show that this controller<br />
efficiently makes up for the poor robustness<br />
and poor disturbance rejection of the conventional Smith<br />
compensator[18-19].<br />
As the basic unit of neural networks, the neuron has self-learning<br />
ability and adaptability. A<br />
control algorithm built on a neuron is simple,<br />
easy to realize, and robust. Moreover, its<br />
most prominent characteristic is that it does not<br />
need accurate identification of the controlled object<br />
or of its structure and parameters. The<br />
design of a single neuron adaptive controller therefore does not need<br />
system modeling. Considering these characteristics of the<br />
single neuron, we present an incomplete differential<br />
forward PID algorithm based on a single neuron, and<br />
apply it to the Smith compensator so as to balance<br />
ease of implementation against the control performance of the<br />
existing Smith compensator.<br />
D. Single neuron incomplete differential forward PID<br />
The single neuron model is shown in Figure 9. The<br />
single neuron is an information processing cell with<br />
many inputs and a single output. $x_1$, $x_2$, …, $x_n$ are the<br />
inputs of the neuron, and $\omega_1$, $\omega_2$, …, $\omega_n$ are the respective weights<br />
of the inputs $x_1$, $x_2$, …, $x_n$. $\theta$ is the threshold of the<br />
neuron, $f[\cdot]$ is the excitation function, and $y_P$ is the output<br />
of the neuron, which can be expressed by the following<br />
formula:<br />
$$y_P = f\left[\sum_{i=1}^{n} \omega_i x_i - \theta\right] \quad (14)$$<br />
Figure 9. Model of single neuron<br />
The single neuron model has been applied to PID<br />
control systems. Figure 10 shows the model structure of a<br />
single neuron PID control system.<br />
Figure 10. Structure of single neuron PID<br />
In Figure 10, $r$ and $y$ are the input and<br />
output of the system, respectively, with $e(k) = r(k) - y(k)$.<br />
$x_1$, $x_2$ and $x_3$ are the inputs of the neuron and satisfy the<br />
following relations:<br />
$$x_1(k) = e(k) - e(k-1), \qquad x_2(k) = e(k), \qquad x_3(k) = e(k) - 2e(k-1) + e(k-2) \quad (15)$$<br />
$\omega_1$, $\omega_2$ and $\omega_3$ are the weights of the<br />
inputs $x_1$, $x_2$ and $x_3$, respectively.<br />
Let the proportional coefficient be $K$, with<br />
$K > 0$. The output of the controller can then be<br />
expressed as follows:<br />
$$u(k) = u(k-1) + K \sum_{i=1}^{3} \omega_i(k) x_i(k) \quad (16)$$<br />
In the single neuron control algorithm, the<br />
coefficient $K$ reflects the adjusting amplitude. Generally,<br />
a larger error calls for a larger adjusting amplitude<br />
to satisfy the rapidity requirement of the<br />
system, and a smaller error calls for a<br />
smaller adjusting amplitude to satisfy the stability requirement<br />
of the system.<br />
We use the Delta learning rule to learn the<br />
weights; it can be expressed as follows:<br />
$$\Delta\omega_{ij}(k) = \eta\left[d_i(k) - o_i(k)\right] o_j(k) \quad (17)$$<br />
where $\Delta\omega_{ij}$ is the weight increment from the $i$-th to the<br />
$j$-th unit, $\eta$ is the learning rate, $o_i$ and $o_j$ are the<br />
activation values of $i$ and $j$, respectively, and $d_i$ is the expected<br />
output value.<br />
The single neuron PID control method implements<br />
adaptive control of the system by adjusting the weight<br />
coefficients. To ensure the convergence and<br />
robustness of the learning method, we normalize formulas<br />
(16) and (17) and obtain the following expressions:<br />
$$u(k) = u(k-1) + K \sum_{i=1}^{3} w_i'(k) x_i(k) \quad (18)$$<br />
$$w_i'(k) = \frac{w_i(k)}{\sum_{j=1}^{3} \lVert w_j(k) \rVert} \quad (19)$$<br />
$$w_1(k+1) = w_1(k) + \eta_P e(k) x_1(k), \qquad w_2(k+1) = w_2(k) + \eta_I e(k) x_2(k), \qquad w_3(k+1) = w_3(k) + \eta_D e(k) x_3(k) \quad (20)$$<br />
where $\eta_P$, $\eta_I$ and $\eta_D$ are the proportional,<br />
integral and differential learning coefficients, respectively.<br />
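The update cycle of formulas (15) and (18) to (20) can be sketched as a small class. This is a minimal illustration, not the authors' Simulink implementation: the class and attribute names are hypothetical, the default parameters are the values quoted in the simulation section of the paper, and zero initial conditions for $e$ and $u$ are an assumption.<br />

```python
class SingleNeuronPID:
    """Single neuron PID with normalized weights, formulas (15), (18)-(20)."""

    def __init__(self, K=0.2, eta_p=5.0, eta_i=0.03, eta_d=1.5, w0=0.1):
        self.K = K
        self.eta = (eta_p, eta_i, eta_d)
        self.w = [w0, w0, w0]   # w_1(0), w_2(0), w_3(0)
        self.e1 = 0.0           # e(k-1), assumed zero at start
        self.e2 = 0.0           # e(k-2), assumed zero at start
        self.u = 0.0            # u(k-1), assumed zero at start

    def step(self, r, y):
        e = r - y
        # Neuron inputs, formula (15).
        x = (e - self.e1, e, e - 2.0 * self.e1 + self.e2)
        # Normalized weights, formula (19); guard against an all-zero sum.
        norm = sum(abs(w) for w in self.w) or 1.0
        w_n = [w / norm for w in self.w]
        # Incremental control law, formula (18).
        self.u += self.K * sum(wi * xi for wi, xi in zip(w_n, x))
        # Weight learning, formula (20).
        self.w = [w + eta * e * xi for w, eta, xi in zip(self.w, self.eta, x)]
        # Shift the error history.
        self.e2, self.e1 = self.e1, e
        return self.u


if __name__ == "__main__":
    ctrl = SingleNeuronPID()
    for k in range(3):
        print(k, ctrl.step(r=1.0, y=0.0))
```

The control value is computed with the weights $w_i(k)$ before they are updated to $w_i(k+1)$, matching the time indices in formulas (18) and (20).<br />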
The single neuron PID controller improves on the<br />
traditional PID controller and overcomes its sensitivity<br />
to parameter changes. It has<br />
better learning ability and easily ensures real-time<br />
capability[20]. In addition, the system achieves better<br />
control when the object model is<br />
mismatched.<br />
The differential term is sensitive to changes of the<br />
input value and to random disturbance; the<br />
incomplete differential forward PID controller<br />
remedies this deficiency.<br />
The incomplete differential forward PID<br />
differentiates only the feedback value and adds a first-order<br />
filter, so a change of the input value does not affect the<br />
derivative action, and a change of the output value does not<br />
produce a very large control value. We therefore combine<br />
the merits of the single neuron and the incomplete differential<br />
forward PID, and design the single neuron incomplete<br />
differential forward PID controller.<br />
The model of single neuron incomplete differential<br />
forward PID control is shown in Figure 11.<br />
Figure 11. Structure of incomplete differential forward PID control<br />
Since we use the incomplete differential, $x_1$, $x_2$ and $x_3$<br />
should satisfy the following relations:<br />
$$x_1(k) = e(k) - e(k-1), \qquad x_2(k) = e(k), \qquad x_3(k) = y_1(k) - 2y_1(k-1) + y_1(k-2) = \Delta^2 y_1(k) \quad (21)$$<br />
$w_1$, $w_2$ and $w_3$ are the weights of the neuron and<br />
still satisfy equation (20). The control formula still<br />
satisfies equations (18)-(20).<br />
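The neuron inputs of formula (21) can be sketched as follows. The text does not state the filter equation (it is part of Figure 11), so the first-order filter $y_1(k) = \gamma y_1(k-1) + (1 - \gamma) y(k)$ used here, the use of the incomplete differential coefficient $\gamma$ as its parameter, the zero initial conditions and all names are assumptions for illustration.<br />

```python
class ForwardDifferentialInputs:
    """Neuron inputs x_1, x_2, x_3 of formula (21).

    The derivative channel acts on the filtered feedback y_1 only.
    Assumed filter (not given in the text): y1(k) = gamma*y1(k-1) + (1-gamma)*y(k).
    """

    def __init__(self, gamma=0.1):
        self.gamma = gamma
        self.e1 = 0.0     # e(k-1), assumed zero at start
        self.y1_1 = 0.0   # y_1(k-1), assumed zero at start
        self.y1_2 = 0.0   # y_1(k-2), assumed zero at start

    def update(self, r, y):
        e = r - y
        # Assumed first-order low-pass filter on the feedback value.
        y1 = self.gamma * self.y1_1 + (1.0 - self.gamma) * y
        # Formula (21): x_3 is the second difference of y_1, not of e.
        x = (e - self.e1, e, y1 - 2.0 * self.y1_1 + self.y1_2)
        self.e1 = e
        self.y1_2, self.y1_1 = self.y1_1, y1
        return x


if __name__ == "__main__":
    inputs = ForwardDifferentialInputs(gamma=0.5)
    for k in range(3):
        print(k, inputs.update(r=0.0, y=1.0))
```

Because $x_3$ depends only on the filtered measurement, a step in the reference $r$ changes $x_1$ and $x_2$ but not the derivative channel.<br />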
The neuron controller uses an incremental algorithm, so<br />
the differential time constant and the learned<br />
weights satisfy the relation<br />
$$\frac{w_3}{w_1} = \frac{w_3'}{w_1'} = \frac{T_d}{h} \quad (22)$$<br />
In this paper, single neuron control and incomplete<br />
differential forward PID control, both widely<br />
applied in practice, are introduced into the control<br />
system with Smith compensator so as to improve the<br />
robustness of the controller.<br />
IV. SIMULATION<br />
To verify the effectiveness of the method, we use a DC<br />
motor as the controlled object and simulate in the<br />
Matlab/Simulink environment. The sampling period is<br />
T = 10 ms, the reference input is r = 50 rad/s, and the network delays<br />
in the forward and feedback channels are produced by the Gaussian<br />
random generator in the Simulink toolbox. The initial values<br />
of the neuron weights are $w_1(0) = w_2(0) = w_3(0) = 0.1$, the<br />
learning rates of the neuron are $\eta_P = 5$, $\eta_I = 0.03$, $\eta_D = 1.5$, the<br />
proportional coefficient is $K = 0.2$, and the incomplete<br />
differential coefficient is $\gamma = 0.1$. Using simple PID,<br />
PID control with Smith compensator, and single<br />
neuron incomplete differential forward PID with Smith<br />
compensator respectively, we observe the step responses<br />
under different random delays and under<br />
model mismatch. The results are shown in Figure 12 to Figure<br />
14.<br />
Figure 12. The response when the mean of delay is 5 ms<br />
1-PID; 2-PID control with Smith compensator; 3-Single neuron<br />
incomplete differential forward PID with Smith compensator<br />
Figure 13. The response when the mean of delay is 30 ms<br />
1-PID; 2-PID control with Smith compensator; 3-Single neuron<br />
incomplete differential forward PID with Smith compensator<br />
Figure 14. The response when the object model is 2500/(s^2 + 30s + 80)<br />
1-PID; 2-PID control with Smith compensator; 3-Single neuron<br />
incomplete differential forward PID with Smith compensator<br />
The simulation results show that, when the delay is<br />
small, all of the algorithms achieve stable control<br />
performance.<br />
As the delay increases, the control effects of these<br />
methods differ significantly. In the case of simple PID<br />
control, there is obvious oscillation in the response curve.
The system reaches the stable state, but the response time<br />
becomes longer, the response becomes slower, and the<br />
overshoot becomes bigger. The other two methods, however,<br />
still reach the stable state quickly, and the speed of<br />
their response is not affected by the delay. This shows<br />
the validity of the Smith compensator.<br />
When the model of the Smith compensator does not fully<br />
match the object model, the simple PID control system<br />
oscillates seriously and does not reach the stable state within<br />
the simulation time. The PID control system with Smith<br />
compensator reaches the stable state, but its response time is<br />
longer and it shows small oscillation. Model mismatch, however,<br />
makes no great difference to the single neuron<br />
incomplete differential forward PID with Smith<br />
compensator. This shows the validity of the proposed<br />
method.<br />
V. CONCLUSION<br />
By combining the Smith compensator with the single neuron<br />
incomplete differential forward PID algorithm, the static<br />
and dynamic performances of networked control<br />
systems are improved. The proposed method is easy to<br />
implement, and the simulation results show that the<br />
method achieves a better control effect than the conventional<br />
Smith compensator.<br />
ACKNOWLEDGMENT<br />
This work was supported by Project 61040010 of the<br />
National Science Foundation of China.<br />
REFERENCES<br />
[1] R. Luck, A. Ray, “An observer based compensator for<br />
distributed delays”, Automatica, vol.26, No.5, pp903-908,<br />
May 1990.<br />
[2] Z. X. Yu, H. T. Chen, Y. J. Wang, “Research on Markov<br />
Delay Characteristic-Based Closed Loop Network Control<br />
System”, Control Theory and Applications, vol.19, No.2,<br />
pp263-267, February 2002.<br />
[3] J. Nilsson, “Real-time Control Systems with Delays”,<br />
Lund, Sweden: Lund Institute of Technology, 1998.<br />
[4] S. S. Hu, Q. X. Zhu, “Stochastic Optimal Control and<br />
Analysis of Stability of Networked Control Systems with<br />
long delay”, Automatica, vol.39, No.11, pp1877-1884,<br />
July 2003.<br />
[5] P. H. Bauer, M. Sichitiu, C. Lorand, et al., “Total Delay<br />
Compensation in LAN Control Systems and Implications<br />
for Scheduling”, Proc. of the American Control<br />
Conference, Arlington, vol.6, pp4300-4305, 2001.<br />
[6] Y. Tipsuwan, M. Y. Chow, “Gain Scheduler Middleware:<br />
a Methodology to Enable Existing Controllers for<br />
Networked Control and Teleoperation-Part I: Networked<br />
control”, IEEE Transactions on Industrial Electronics,<br />
vol.5, No.6, pp1218-122, 2004<br />
[7] Y. Tipsuwan, M. Y. Chow, “Control Methodologies in<br />
Networked Control Systems”, Control Engineering<br />
Practice, vol.11, No.10, pp1099-1111, October, 2003.<br />
[8] Y. Tipsuwan, M. Y. Chow, “On the Gain Scheduling for<br />
Networked PI Controller Over IP Network”,<br />
Mechatronics, vol.9, No.3, pp491-498, 2004<br />
[9] K. Watanabe, “A process-model control for linear system<br />
with delay”, IEEE Transaction on Automatic Control,<br />
vol.26,No.6, pp261-1269, 1981<br />
[10] J. J. Liu, W. M. Ni, Y. P. Yang, “New method for<br />
designing robust Smith predictor”, Journal of TsingHua<br />
University (Science and Technology), vol.39, No.9, pp54-57,<br />
1999.<br />
[11] F. Du, Q. Q. Qian, W. C. Du, “Networked Control<br />
Systems Based on New Smith Predictor”, Journal of<br />
Southwest Jiaotong University, vol.4, No.1, pp65-69, 2010.<br />
[12] F. Du, “Research of Networked Control Systems Based on<br />
New Smith Predictor”, Chengdu, Southwest Jiaotong<br />
University, 2008.<br />
[13] T. N. Shi, S. J. Wang, H. W. Fang, “Fuzzy Immune PI<br />
Control of Networked Control System Based on Prediction<br />
Compensation”, Journal of TianJin University, vol.42,<br />
No.11, pp959-964, 2009.<br />
[14] S. J. Wang, “Fuzzy Immune PI Control of Networked<br />
Control System Based on Prediction Compensation”,<br />
Tianjin: Tianjin University, 2008.<br />
[15] H. Y. Chen, Q. Guan, W. L. Wang, “Design of a fuzzy PI<br />
controller with Smith predictor for networked control<br />
systems with long time delay”, vol.33, No.4, pp418-420,<br />
2005.<br />
[16] P. Chen, L. K. Dai, “Adaptive Smith compensator for<br />
NCSs over IP networks”, Control Theory and Application,<br />
vol.23, No.1, pp115-118, 2006.<br />
[17] P. Chen, “Modeling and Controller Design for NCS over<br />
IP Network”, Zhejiang: Zhejiang University, 2005.<br />
[18] R. Q. Lin, G. W. Lin, “Models and simulation of neuron<br />
PID applied in electric oven based on MATLAB<br />
language”, vol. 30, No.1, pp55-58, 2002.<br />
[19] R. Q. Lin, F. W. Yang, “Realization <strong>of</strong> a Class <strong>of</strong> Neuron<br />
Controller Based on Smith Predictor”, Information and<br />
Control, vol.33, No.2, pp137-140, 2004.<br />
[20] Y. H. Tao, “New PID Control and Application”, Beijing:<br />
Mechanic Industry Press, 1998.<br />
Haitao Zhang was born in Henan Province, China, in<br />
November 1972. He received the Ph.D. degree in Control<br />
Theory and Control Engineering from the Institute of<br />
Automation, Chinese Academy of Sciences. His research<br />
interests include intelligent control and computer<br />
application technology.<br />
He is an associate professor at the Electronic and Information<br />
Engineering College, Henan University of Science and Technology.<br />
Zhen Li was born in Henan Province, China, in<br />
January 1987. She received the B.S. degree in Automation<br />
from the Electronic and Information Engineering College,<br />
Henan University of Science and Technology, China.<br />
She is a graduate student at the Electronic<br />
and Information Engineering College,<br />
Henan University of Science and<br />
Technology. Her research interests include<br />
networked control systems.
A Web Crawler System Design Based on<br />
Distributed Technology<br />
Shaojun Zhong<br />
Jiangxi University of Science and Technology / Faculty of Science, Ganzhou, China<br />
infor2000@qq.com<br />
Zhijuan Deng<br />
Jiangxi University of Science and Technology / Faculty of Science, Ganzhou, China<br />
66162815@qq.com<br />
Abstract—A practical distributed web crawler architecture<br />
is designed. A distributed cooperative grasping algorithm<br />
is put forward to solve the problem of distributed Web<br />
Crawler grasping. Log structure and Hash structure are<br />
combined into a large-scale web page store structure,<br />
which meets both the need for a large number of<br />
random accesses and the need to add new pages.<br />
Experimental results show that the performance,<br />
scalability, and load balance of the distributed Web<br />
Crawler are improved.<br />
Index Terms—Search Engine, Web Crawler, Grasping<br />
Strategy, Distributed System<br />
I. INTRODUCTION<br />
The production, transmission, collection and query of<br />
information are among the most basic human activities.<br />
For information carried in writing, libraries, their<br />
cataloguing systems and professionals have traditionally<br />
helped us quickly find the information we need at the<br />
granularity of a “book” or an “article”. With the<br />
development of computer and information technology came<br />
the field of Information Retrieval (IR) and full-text<br />
retrieval systems for books and literature, which make it<br />
convenient to obtain relevant information at the<br />
granularity of “key words”.<br />
The openness of the World Wide Web and the widespread<br />
accessibility of the information on it greatly encourage<br />
people to create content, while bringing new development<br />
opportunities and technological challenges for<br />
information retrieval on the World Wide Web.<br />
The scale of traditional IR is relatively limited, and the<br />
retrieved objects usually undergo strict screening and<br />
preprocessing. The number of queries it responds to is<br />
generally not very large. However, an information query<br />
system that provides services on the web (here, a search<br />
engine) differs from traditional IR in both scale and<br />
response time. A search engine has to deal with<br />
large-scale information (information pours in, and some of<br />
it is even fake) and a great number of accesses, while<br />
still responding quickly.<br />
doi:10.4304/jnw.6.12.1682-1689<br />
A search engine is an application system that is<br />
developed on the basis of IR, suits the features of the web (or WWW)<br />
and provides an information query service. A search engine is<br />
generally defined as a kind of software system used on the<br />
web, which collects and discovers information with<br />
certain strategies, processes and organizes the<br />
information, and finally offers a web information query<br />
service to users. How does a software system like a search<br />
engine work? Viewed as a software system working on a data set,<br />
the data it operates on includes not only unpredictable user<br />
queries but also a huge and constantly changing number of<br />
web pages, and these web pages do not come to the system<br />
automatically: the system has to grasp them. But in the<br />
face of a large number of user queries, it is impossible for<br />
the system to “search” the web online whenever there is a<br />
query. So the basis for a large-scale search engine should<br />
be a batch of web pages gathered beforehand [1].<br />
The program that gathers these web pages is called a Web<br />
Crawler. As a front-end part of the search engine, it is an<br />
important object of study. Like the propulsion system of a<br />
carrier rocket in aerospace, the Web Crawler is the<br />
basis of the search engine: all of the data the search engine<br />
holds comes from the Web Crawler, which must work in a<br />
smart, reasonable and powerful way.<br />
The search engine is one of the most high-end and complex<br />
Internet technologies, and all companies keep the core<br />
technology to themselves. Some big companies already<br />
have mature solutions for large web crawlers and<br />
have put them into use. However, these large<br />
search engines can only provide ordinary users with<br />
common, non-customized search services. They cannot<br />
take the various requirements of different users into<br />
consideration, and single web crawlers fall short in<br />
many cases. The flexible customization, acquisition speed<br />
and scale of distributed web crawlers help to satisfy<br />
people’s growing demand for user-oriented web information.<br />
Therefore, this paper presents a distributed design method<br />
for a web crawler, and strives to achieve a robust, scalable<br />
and efficient hybrid strategy for a distributed search<br />
engine.<br />
II. CORE TECHNOLOGY OF DISTRIBUTED WEB CRAWLER
A. Priority Strategy of Webpage Grasping<br />
The priority strategy of web page grasping determines the<br />
grasping efficiency. Grasping strategies can be roughly<br />
divided into three kinds: depth-first, breadth-first and<br />
best-first. The depth-first strategy can be employed when<br />
the amount of information is not huge. However, given the<br />
rapid development of the Internet and the massive amount<br />
of web data, a depth-first strategy will inevitably run<br />
into a huge amount of data. Therefore, search engines<br />
generally adopt the breadth-first and best-first<br />
strategies, as well as some improved variants of them [2].<br />
B. Diameter of the World Wide Web<br />
The diameter of the World Wide Web, or ‘Web Diameter’,<br />
is defined as follows: if d denotes the length of the shortest<br />
path from web page u to web page v, then the average of d<br />
over all pairs of connected pages on<br />
the World Wide Web is called the Web Diameter. According<br />
to this definition and calculations on large-scale sets of web<br />
pages, the Web Diameter is about 17 [3].<br />
The calculation formula of the Web Diameter is<br />
$$d = 0.35 + 2.06 \log(N) \tag{1}$$
where N is the number of web pages.<br />
A study shows that the diameter of China’s World Wide<br />
Web is 16.26 [4]; that is, if there is a path between two<br />
web pages, one page can be reached from the other in fewer<br />
than 17 clicks on average, as shown in Figure 1.<br />
Figure 1. Diagram of Diameter of the World Wide Web<br />
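Formula (1) is easy to evaluate for a given page count. The sketch below assumes that<br />
the logarithm is base 10, which the text does not state, and the function name<br />
`web_diameter` is illustrative.<br />

```python
import math

def web_diameter(n_pages):
    """Evaluate d = 0.35 + 2.06 * log(N) from formula (1).

    Assumption: log is the base-10 logarithm; the text does not name the base.
    """
    return 0.35 + 2.06 * math.log10(n_pages)

if __name__ == "__main__":
    # Print the estimate for a few hypothetical web sizes.
    for n in (10**6, 10**8, 10**9):
        print(n, round(web_diameter(n), 2))
```

Because the growth is logarithmic, a hundredfold increase in the number of pages adds<br />
only about four clicks to the estimate under this assumption.<br />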
After analyzing the Diameter of the World Wide Web,<br />
the following two conclusions are obtained:<br />
(1) The traversal algorithm affects the crawler’s<br />
efficiency to a large extent. The page structure of the World<br />
Wide Web is not as deep as we might imagine, but it is<br />
unexpectedly wide. Therefore, the crawler generally adopts<br />
breadth-first traversal. The importance of web pages is<br />
another reason: breadth-first traversal helps to grasp the<br />
more important web pages.<br />
(2) The World Wide Web is so complex that a chosen<br />
grasping path is not guaranteed to be the best one. To deal<br />
with this problem, the diameter of the web needs to be<br />
fully considered, and a "depth-first strategy" should be<br />
adopted to control the grasping depth. In this way, the<br />
problem can be addressed [5].<br />
Let’s look at the following example:<br />
Suppose that, starting from seed site A, seed site B and seed<br />
site C, there are three paths to web page P, with lengths<br />
3, 19 and 127 respectively (CostAP=3; CostBP=19;<br />
CostCP=127). Grasping web page P from seed site A is<br />
very quick, while seed sites B and C reach P only after a<br />
long path, which is apparently not economical.<br />
Figure 2. Path cost diagram of different seed sites<br />
To prevent the Crawler from unlimited breadth-first<br />
grasping, the depth must be limited. Once this depth is<br />
reached, grasping stops. The value of this<br />
depth is the diameter of the World Wide Web.<br />
When grasping stops at the maximum depth, the excessively<br />
deep pages that remain ungrasped can be expected to be<br />
reached from other seed sites in a more economical way. For<br />
example, seed sites B and C stop grasping once they reach a<br />
depth of 17, leaving web page P to be grasped by<br />
the Crawler starting from seed site A. It is not<br />
hard to see that limiting the grasping depth removes the<br />
conditions for infinite loops: any loops that exist<br />
stop after a limited number of iterations. Moreover, the<br />
combination of the depth strategy and the breadth-first<br />
strategy effectively guarantees locality during grasping:<br />
the Crawler mostly grasps web pages under the same domain<br />
name, while web pages under other<br />
domain names are rare [6].<br />
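The depth-limited breadth-first idea above can be sketched as follows. This is only an<br />
illustration: the `links` dictionary stands in for fetching a page and extracting its<br />
links, and the function name `crawl_breadth_first` is hypothetical.<br />

```python
from collections import deque

def crawl_breadth_first(seeds, links, max_depth=17):
    """Breadth-first traversal from several seed sites with a depth limit.

    seeds     -- iterable of start URLs
    links     -- dict mapping a URL to the URLs it links to (a stand-in for
                 downloading and parsing the page; an assumption of this sketch)
    max_depth -- pages at this depth are recorded but not expanded
    Returns a dict mapping each reached URL to its distance from the nearest seed.
    """
    depth = {}
    queue = deque()
    for seed in seeds:
        if seed not in depth:
            depth[seed] = 0
            queue.append(seed)
    while queue:
        url = queue.popleft()
        if depth[url] >= max_depth:
            continue  # depth limit reached: do not follow links any further
        for target in links.get(url, []):
            if target not in depth:  # each page is recorded only once
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth
```

Because every page is recorded once and expansion stops at `max_depth`, a cycle in the<br />
link graph cannot keep the traversal running forever.<br />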
C. Judgement of the Web Importance<br />
While maintaining the priority strategy of web page<br />
grasping, the Crawler should grasp important web pages<br />
first, so that the more important web pages are covered with<br />
limited resources. Which web pages are more important, and<br />
how is importance measured?<br />
The measure of importance is decided by the following<br />
three aspects: IB (P), IL (P) and ID (P).<br />
1) IB(P)<br />
It is mainly decided by the number and quality of back<br />
links. First, the more back links a web page has, the more<br />
it is recognized by other pages. It will also have more<br />
opportunities to be visited by net citizens, so its<br />
importance is more evident. Second, the more it is pointed<br />
to by important web pages, the more important it is. The<br />
classic counterexample is cheating web pages, which<br />
artificially set up many back links pointing to themselves<br />
to inflate their importance. If the quality of back links is<br />
not considered, a local optimum is obtained rather than a<br />
global optimum.<br />
2) IL (P)<br />
It is a function of the URL string and examines only<br />
the string itself. IL (P) is realized mainly through some<br />
patterns; for example, it attaches more importance to URLs<br />
containing ‘com’ or ‘home’. It also regards a URL<br />
with fewer slashes as more important.<br />
3) ID(P)<br />
ID (P) reflects the link distance from the seed site set to<br />
the web page: under the breadth-first traversal rule, the<br />
page can be reached through links from a seed site. ID (P)<br />
is another important index of a web page. The closer a page<br />
is to a seed site, the more opportunities it has to be<br />
visited. Therefore, it is more important, and the seed sites<br />
are where the most important web pages are. The farther a<br />
page is from the seed sites, the less important it is.<br />
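Of the three measures, IL (P) is the simplest to illustrate because it looks only at the<br />
URL string. The sketch below is a hypothetical scoring rule, not the paper's formula: the<br />
keyword list follows the text, but the weights and the name `url_importance` are<br />
assumptions chosen for illustration.<br />

```python
from urllib.parse import urlparse

def url_importance(url, keywords=("com", "home")):
    """Hypothetical IL(P)-style score computed from the URL string alone.

    One point is added for each keyword found in the URL, and a bonus that
    shrinks as the path contains more slashes. The weights are assumptions.
    """
    text = url.lower()
    score = sum(1.0 for keyword in keywords if keyword in text)
    slashes = urlparse(url).path.count("/")
    score += 1.0 / (1 + slashes)  # fewer slashes give a larger bonus
    return score
```

A crawler could use such a score to order its waiting queue so that short, home-page-like<br />
URLs are grasped before deeply nested ones.<br />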
D. Non-Repeated Grasping Strategy<br />
A large number of mirrored web pages is another important<br />
characteristic of the web. According to statistics gathered by<br />
the Google system on 24 million pages, 22% of the web pages are<br />
mirrors. The existence of many duplicated web pages<br />
is unfavorable to users’ queries. It not only wastes the<br />
storage space of search engines, but also decreases<br />
system efficiency.<br />
There are two reasons for repeated collection. On the one<br />
hand, the collecting program does not clearly record the<br />
visited URLs. On the other hand, domain names and IP<br />
addresses have a multiple correspondence. The first problem<br />
can be solved by keeping a record of the visited URLs and<br />
comparing every new URL against it. The second problem is<br />
relatively complex, because different URLs may refer to the<br />
same IP.<br />
There are four kinds of correspondence<br />
between domain names and IP addresses, namely<br />
one-to-one, one-to-many, many-to-one and<br />
many-to-many. A one-to-one relationship won't cause<br />
repeated collection, but the others are likely to do so.<br />
1) Algorithm Based on B-tree<br />
Due to the huge number of web pages, web page<br />
grasping consumes network bandwidth, machines, time and<br />
so on. Repeated grasping of the same web page<br />
greatly reduces the efficiency of the system, so the<br />
Crawler system should adopt a strategy that avoids<br />
repeated grasping and ensures that a web page is<br />
grasped only once in a certain period of time [7].<br />
A B-tree is a kind of balanced multiway search tree.<br />
The file system of an operating system uses the<br />
B-tree search algorithm, which can also be used to<br />
design the URL matching algorithm that avoids repeated<br />
grasping in the Crawler. A B-tree is either empty or a multiway<br />
tree. A B-tree of order m must meet the following<br />
requirements:<br />
(1) Each node has at most m subtrees;<br />
(2) If the root node is not a leaf node, it has at least two<br />
subtrees;<br />
(3) All non-terminal nodes except the root have at least<br />
⌈m/2⌉ subtrees;<br />
(4) All non-terminal nodes contain the following<br />
data: (n, A0, K1, A1, K2, A2, …, Kn, An)<br />
Each node includes n pointers pointing to each<br />
keyword record. Ki(i=1, …, n) is keyword and<br />
Ki
make sure whether it has been grasped by looking into<br />
the Hash table.<br />
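The hash-table lookup mentioned above can be sketched with a set of URL digests. This is<br />
an illustration under assumptions: the class name `VisitedUrls`, the choice of SHA-256<br />
and the in-memory set are not taken from the paper.<br />

```python
import hashlib

class VisitedUrls:
    """Hash-based record of URLs that have already been grasped (a sketch)."""

    def __init__(self):
        self._seen = set()  # fixed-size digests instead of full URL strings

    @staticmethod
    def _key(url):
        # SHA-256 is an assumption; any well-distributed hash would serve.
        return hashlib.sha256(url.strip().encode("utf-8")).digest()

    def add_if_new(self, url):
        """Return True and record the URL if it has not been seen before."""
        key = self._key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A crawler would call `add_if_new` before fetching a page and skip the URL whenever the<br />
call returns False.<br />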
E. Webpage Revisiting Strategy<br />
The popularity of the web results from the information<br />
it brings. Information is constantly changing, so<br />
web pages are inevitably updated, and<br />
information grasped earlier may be out of date or of<br />
no use at all. A strategy is thus needed to solve the<br />
problem of the timeliness of information; it is called the<br />
webpage revisiting strategy. Through revisiting, the<br />
stored web pages can keep pace with the changes of the World<br />
Wide Web.<br />
In 2000, Cho and Garcia-Molina of Stanford<br />
University randomly chose 500,000 web page samples<br />
and found that 23% of the web pages were updated on a<br />
daily basis, while 40% of the web pages with .com as the<br />
suffix of their domain names were updated every day.<br />
The half-life of web pages is 10 days. In addition, the study<br />
shows that the change process of web pages can be<br />
modeled as a Poisson process [8].<br />
To describe the Poisson process model, let $X(t)$ denote<br />
the number of changes of a web page in the<br />
period $(0, t)$. A Poisson process with parameter $\lambda$<br />
has the following property.<br />
For $s \ge 0$, $t \ge 0$, the random variable $X(s+t) - X(s)$<br />
follows a Poisson distribution, namely<br />
$$\Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda t)^k e^{-\lambda t}}{k!} \tag{1}$$
in which $k = 0, 1, 2, \ldots$<br />
The expected value of the random variable $X(s+t) - X(s)$ is $\lambda t$:<br />
$$E[X(s+t) - X(s)] = \lambda t \tag{2}$$
It can be proved through a simple method. Suppose<br />
that the time interval is 1; then<br />
$$E[X(t+1) - X(t)] = \sum_{k=0}^{\infty} k \Pr\{X(t+1) - X(t) = k\} = \sum_{k=1}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!} = \lambda \tag{3}$$
in which $\sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = e^{\lambda}$. With (2), it can be obtained that<br />
$$E[X(t+1) - X(t)] = \lambda t = \lambda \tag{4}$$
Through the trace analysis of 500,000 random web<br />
pages, Cho and Garcia-Molina came to the important<br />
conclusion that the updates of most web pages follow a<br />
Poisson distribution [9].<br />
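The Poisson model can be turned into small helper functions for planning revisits. The<br />
sketch below assumes a change rate λ per unit time that has been estimated elsewhere;<br />
the function names are illustrative and the rate used in the printout is hypothetical.<br />

```python
import math

def poisson_pmf(k, lam, t=1.0):
    """Probability of exactly k changes in an interval of length t."""
    mu = lam * t
    return mu ** k * math.exp(-mu) / math.factorial(k)

def prob_changed(lam, t=1.0):
    """Probability of at least one change in an interval of length t,
    i.e. 1 minus the probability of zero changes."""
    return 1.0 - math.exp(-lam * t)

if __name__ == "__main__":
    lam = 0.1  # hypothetical rate: one change every ten days on average
    for days in (1, 7, 30):
        print(days, round(prob_changed(lam, days), 3))
```

The longer a page goes without a revisit, the closer `prob_changed` gets to 1, which is<br />
one way to decide which stored pages are most likely to be stale.<br />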
F. Robots Protocol<br />
The Robots Protocol is a standard that a Web Crawler should<br />
conscientiously observe; the Robots.txt document is its<br />
main content. In general, Crawler writers<br />
observe this protocol. A Crawler can still acquire web<br />
information without observing the Robots.txt standard, but if<br />
a webmaster finds that a Crawler causes problems, he will<br />
contact its owner through its identification, or even prevent<br />
this Web Crawler from extracting some web pages in<br />
other ways. So Crawler developers should conscientiously<br />
observe this protocol [10].<br />
After entering a website, a web spider first visits<br />
the text file that contains the Robots Protocol, which is<br />
usually in the root directory of the web server, such as<br />
www.163.com/Robots.txt. With the protocol file<br />
Robots.txt, webmasters can define the directories that no Web<br />
Crawler may visit, or the specific directories that certain<br />
Web Crawlers may not visit [11]. For instance, if a webmaster<br />
does not want the executable directory and temporary file<br />
directory of a website to be searched by search engines,<br />
these two directories can be defined as directories<br />
that deny access.<br />
The file format of Robots.txt is as follows.<br />
User-agent:<br />
It is the name of the Crawler. If the file “Robots.txt”<br />
contains more than one User-agent record, more than one<br />
Crawler is restricted by this protocol; the file must have<br />
at least one User-agent record. If the value of this record<br />
is set to *, the protocol applies to any Crawler. The<br />
file “Robots.txt” can contain only one record of the form<br />
“User-agent:*”.<br />
Disallow:<br />
It describes a URL that should not be<br />
visited. This URL can be a complete path or a prefix of one.<br />
Any URL that starts with the value of Disallow cannot be<br />
visited by the Robot [12].<br />
For example:<br />
A: “Disallow:/help” means that the Crawler may grasp neither<br />
/help.html nor /help/index.html.<br />
B: “Disallow:/help/” means that the Crawler can grasp<br />
/help.html but cannot grasp /help/index.html.<br />
C: If the Disallow record is empty, all pages of the<br />
website can be grasped by the Crawler. The file<br />
“/robots.txt” must contain at least one Disallow record. If<br />
“/robots.txt” is an empty file, the website is open to any<br />
Crawler and can be grasped.<br />
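Cases A to C can be checked with the robots.txt parser in the Python standard library.<br />
The sketch below feeds rule lines directly to `urllib.robotparser.RobotFileParser`; the<br />
host `example.com`, the agent name `ExampleCrawler` and the helper `allowed` are<br />
placeholders introduced for illustration.<br />

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, url, agent="ExampleCrawler"):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already held in memory
    return parser.can_fetch(agent, url)

RULE_A = ["User-agent: *", "Disallow: /help"]    # case A: prefix /help
RULE_B = ["User-agent: *", "Disallow: /help/"]   # case B: directory /help/
RULE_C = ["User-agent: *", "Disallow:"]          # case C: empty Disallow

if __name__ == "__main__":
    for name, rule in (("A", RULE_A), ("B", RULE_B), ("C", RULE_C)):
        for path in ("/help.html", "/help/index.html"):
            print(name, path, allowed(rule, "http://example.com" + path))
```

In a running crawler the rules would normally be loaded from the site itself with the<br />
parser's `set_url` and `read` methods before any page of that site is requested.<br />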
Apart from observing the Robots Protocol, a Crawler should<br />
plan its grasping intensity reasonably, reducing the<br />
intensity during the daytime and moderately increasing<br />
it at night, when the load on Web hosts is low.<br />
Because of the time difference,<br />
when it is daytime in the Eastern Hemisphere, it is night in<br />
the Western Hemisphere. So the Crawler can increase the<br />
intensity of grasping American and European websites<br />
during the day, and increase the intensity of grasping<br />
websites of its own country at night [13].<br />
Even so, a Crawler inevitably brings some trouble to<br />
the Web hosts it visits. So a program that monitors<br />
website grasping is indispensable. This<br />
program records the grasping traffic of every website to<br />
avoid the problems caused when the grasping intensity is<br />
occasionally excessive.
III. A DISTRIBUTED DESIGN OF WEB CRAWLER SYSTEM<br />
Thousands of WWW servers on the web form a mass of<br />
information through the links between them, and the<br />
connections between hosts are relatively<br />
independent. A single-processor system is restricted by its<br />
CPU capacity, disk storage capacity,<br />
network bandwidth and so on. It cannot handle<br />
such huge amounts of<br />
information, let alone keep up with the rapid<br />
growth of web information, so distributed technology<br />
becomes the choice. The distributed system design<br />
pursues the following goals: (1) The grasping ability of a<br />
single machine should not decrease much when the<br />
number of grasping machines increases, i.e. the<br />
communication and management overhead of the system<br />
should be minimized while load balance is<br />
pursued. (2) With actual operation in mind, dynamic<br />
configuration of the system should be considered, i.e.<br />
one or more machines can be added or removed<br />
during operation.<br />
A. A Distributed Structure Design of Web Crawler<br />
System<br />
To design a robust and efficient web crawler, the task<br />
must be distributed across multiple<br />
machines for concurrent processing. The huge number of web<br />
pages is distributed independently across the network,<br />
which makes concurrent access both possible and<br />
reasonable. Meanwhile, concurrent<br />
distribution saves network bandwidth resources.<br />
Besides, in order to improve the recall ratio, precision and<br />
search speed of the whole system, the internal search<br />
algorithm should have a certain degree of intelligence.<br />
Therefore, the distributed web crawler adopts the<br />
structure design shown in Figure 3.<br />
The core of system distribution is data distribution. The chief dispatcher is responsible for distributing URLs to every distributed crawler. The distributed crawlers grasp webpages over the HTTP protocol. To improve speed, hundreds of distributed crawlers can usually be launched simultaneously. The crawlers analyze and process the collected web pages in parallel, extract URL links and other relevant information, and submit them to their respective dispatchers, which in turn submit them to the chief dispatcher.<br />
B. Basic Process Design for a Distributed Web Crawler Grasping<br />
Figure 4 is a simplified flow chart that shows only the error-free path of page processing. In this process, the web crawler starts working as soon as one URL is added to the waiting queue. As long as the waiting queue holds at least one webpage, or the crawler is still processing a webpage, the crawler program keeps running. When the waiting queue is empty and no webpage is being processed, the web crawler stops working.<br />
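The queue-driven behaviour described above can be sketched in a few lines. This is only a minimal single-process illustration under stated assumptions, not the authors' implementation: `fetch_links` is a hypothetical callable that stands in for downloading a page and extracting its URLs, and `max_pages` is an assumed safety limit that the text does not mention.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=1000):
    """Process URLs until the waiting queue is empty (error-free path only).

    fetch_links is a hypothetical callable: url -> iterable of linked URLs.
    """
    waiting = deque(seed_urls)   # the waiting queue
    seen = set(seed_urls)        # URLs that were ever queued
    processed = []               # pages handled so far, in order
    while waiting and len(processed) < max_pages:
        url = waiting.popleft()
        # "Processing" a page here means extracting its outgoing links.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                waiting.append(link)
        processed.append(url)
    # The loop ends when nothing is waiting and nothing is being processed.
    return processed


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c", "a"], "c": []}
    print(crawl(["a"], lambda u: toy_web.get(u, [])))
```

Because the sketch is single-threaded, "a page is still being processed" is implicit in the loop body; a concurrent crawler would need an explicit in-progress counter next to the queue.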
C. The Design of a Cooperative Grasping Algorithm of the Distributed Web Crawler<br />
When multiple crawlers grasp at the same time, how to decompose the workload becomes the major problem. If the division is not clear, several crawlers are likely to grasp the same webpage, which causes additional expense. There are two options to solve it.<br />
Scheme 1: Decompose by the web host's IP address and make each crawler grasp only the webpages within a certain range of addresses.<br />
Scheme 2: Decompose by the domain names of websites and make each crawler grasp only the webpages of a certain set of domain names.<br />
The World Wide Web locates a host by its IP address in the network infrastructure, but an IP address in dotted-decimal notation is hard to remember, so domain names are used and mapped to IP addresses. Because domain names are so convenient for people, a problem arises: many domain names may correspond to the same IP address. Small and medium-sized websites usually use this method to provide different Web services, mainly for economic reasons, since only one server is needed. Large websites such as Sina, Sohu and other portals, in contrast, generally adopt load-balancing technology, which means that the same domain name corresponds to many IP addresses. In this way, the robustness of the system is enhanced and load balance is achieved.<br />
Figure 3. A distributed structure design of web crawler system<br />
Figure 4. Basic process design for a distributed web crawler grasping<br />
© 2011 ACADEMY PUBLISHER<br />
Given that many domain names may correspond to the same IP address and that one domain name may correspond to many IP addresses, a fairly good approach is to decompose tasks by domain name: as long as the web pages of large websites are not grasped repeatedly, some repeated grasping of small websites is acceptable. This allocation method assigns domain names to different crawlers, and each crawler grasps only the web pages of its “appointed” set of domain names. For example, sina.com.cn is “appointed” to spider1, jxust.cn to spider2 and sim.jx.cn to spider3.<br />
The main differences between the two schemes can be seen from the following two examples.<br />
Suppose that we have 3 spiders and 2 websites to analyze, www.jxust.cn and www.sim.jx.cn. They have different domain names but the same IP address (218.87.136.5). Their homepages are http://www.jxust.cn/index.html and http://www.sim.jx.cn/index.html; after DNS resolution, both are actually http://218.87.136.5/index.html. The domain decomposition scheme makes spider2 and spider3 each grasp this page, so it is grasped twice. However, since this site holds relatively little information, the loss caused by the repeated grasping can be tolerated.<br />
The IP decomposition scheme allocates grasping tasks differently. For example, sina.com.cn (71.5.7.138) is “appointed” to spider1, sina.com.cn (71.5.6.136) to spider2, and both jxust.cn (218.87.136.5) and sim.jx.cn (218.87.136.5) to spider3. Under this allocation, different domains pointing to the same IP cause no repetition, because the grasping tasks of jxust.cn and sim.jx.cn are both completed by spider3. However, sina.com.cn corresponds to several IPs, which are allocated to spider1 and spider2 respectively, so the grasping tasks of spider1 and spider2 duplicate each other. Sina is a large-scale website, so the loss resulting from this repeated grasping is huge.<br />
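The two examples can be restated as a small check of where duplicate grasping may occur under each scheme. This is an illustrative sketch only: the `DNS` table simply copies the domain-to-IP pairs quoted in the text, and the two helper functions are hypothetical names introduced here.

```python
# Domain -> IP addresses, copied from the examples in the text.
DNS = {
    "sina.com.cn": ["71.5.7.138", "71.5.6.136"],
    "jxust.cn": ["218.87.136.5"],
    "sim.jx.cn": ["218.87.136.5"],
}


def duplicates_under_domain_scheme(dns):
    """IPs shared by several domains: one host may be grasped once per domain."""
    domains_by_ip = {}
    for name, ips in dns.items():
        for ip in ips:
            domains_by_ip.setdefault(ip, set()).add(name)
    return {ip: names for ip, names in domains_by_ip.items() if len(names) > 1}


def duplicates_under_ip_scheme(dns):
    """Domains served from several IPs: one site may be grasped once per IP."""
    return {name: ips for name, ips in dns.items() if len(ips) > 1}


if __name__ == "__main__":
    print("domain scheme:", duplicates_under_domain_scheme(DNS))
    print("IP scheme:", duplicates_under_ip_scheme(DNS))
```

Under the domain scheme only the small shared host is affected, whereas under the IP scheme the affected site is the large portal, which matches the argument made in the text.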
The comparison shows that the domain decomposition strategy is more reasonable, because it takes large websites into consideration. Therefore, in the crawler system, a general scheduler decomposes the grasping tasks by domain name and dispatches web pages to the different crawlers accordingly. A formal description of the scheduling is as follows:<br />
First, suppose that n crawlers can work concurrently, and define a function domain that extracts the domain name of a URL, for example:<br />
http://news.163.com/20090824/08116145133.shtml<br />
domain(URL) = news.163.com<br />
(1) For any URL, use the function domain to extract the domain name of the URL.<br />
(2) Apply the MD5 signature function to the domain name: MD5(domain(URL)).<br />
(3) Take the MD5 signature value modulo n: int spider_no = MD5(domain(URL)) % n.<br />
(4) Allocate this URL to the crawler numbered spider_no to grasp.<br />
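The four steps above can be sketched as follows. The names `domain` and `spider_no` follow the text; using `urllib.parse.urlsplit` to extract the host name and reading the 128-bit MD5 digest as a big-endian integer are assumptions of this sketch, since the paper does not specify either detail.

```python
import hashlib
from urllib.parse import urlsplit


def domain(url):
    """Step (1): extract the domain (host) name of a URL."""
    return urlsplit(url).hostname


def spider_no(url, n):
    """Steps (2)-(3): MD5 signature of the domain name, taken modulo n."""
    digest = hashlib.md5(domain(url).encode("utf-8")).digest()
    # Assumption: the 16-byte digest is read as one big-endian integer.
    return int.from_bytes(digest, "big") % n


if __name__ == "__main__":
    url = "http://news.163.com/20090824/08116145133.shtml"
    # Step (4): the URL is handed to the crawler with this number.
    print(domain(url), "-> crawler", spider_no(url, 3))
```

Because the signature depends only on the domain name, every URL of one domain lands on the same crawler, which is what prevents large sites from being grasped twice.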
A modulo operation divides a universal set into several equivalence classes: the union of the equivalence classes equals the universal set, and an element of one equivalence class never belongs to another. Formally, the equivalence relation can be expressed as follows.<br />
Let U be the universal set, partitioned into S1, S2, …, Sn by a certain equivalence relation. The partition satisfies the following two conditions:<br />
(1) S1∪S2∪...∪Sn=U<br />
(2) if (a∈Si) & (b∈Sj) & (Si≠Sj) then a≠b<br />
Generally, n is chosen as an integral power of 2. The remainder modulo 4, 8, 16, 32, … can then be obtained rapidly with a bitwise AND (&), i.e. int spider_no = MD5(domain(URL)) & (n-1). This shortcut is valid only when n is an integral power of 2.<br />
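A short check makes the restriction concrete. The helper name `mod_pow2` is hypothetical; the sketch only verifies that the bitwise AND agrees with the ordinary remainder when n is a power of 2, and rejects other values of n.

```python
def mod_pow2(value, n):
    """Return value mod n using a bitwise AND; n must be a power of 2."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive integral power of 2")
    return value & (n - 1)


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        assert all(mod_pow2(v, n) == v % n for v in range(1000))
    # For n = 6 the shortcut would be wrong: 10 & 5 is 0, but 10 % 6 is 4.
    print(10 & (6 - 1), 10 % 6)
```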
D. Large-Scale Web Storage Structure Design<br />
The World Wide Web keeps changing all the time, so a web page database must be able to delete the stored version of a page once the page itself has been deleted. Such deletions may leave storage voids. An update can be understood as a deletion followed by an addition, with the new version appended to the web database. Therefore, some disk space compaction technology has to be adopted to reclaim the storage voids. Besides, updating and reading should be mutually exclusive to avoid synchronization errors. A good page storage structure can therefore bring excellent access performance.<br />
Combining the log structure and the Hash structure so as to exploit the advantages of both is quite a good choice. For a new web page, the page's signature is calculated from its URL. Then, through a modulo computation, the page is mapped to a unit of the Hash table, with each Hash table unit corresponding to the location of one log file. For example, newly added pages that the Hash function maps to Hash[1] are written to the log file Log1. To read an already stored page at random by its URL, the same Hash function calculation maps the URL to the specific log file, and the B-tree index on that log file is then searched for the corresponding page document. The random access performance is equivalent to, or even slightly better than, that of the pure log structure, because the number of files to be searched in a random access is greatly reduced. What is most worth mentioning is that this approach supports batch writing, which is a great improvement over the pure Hash structure. A write queue is added to each log file, and a batch is written only when the queue has accumulated a certain number of pages, as shown in Figure 5.<br />
Figure 5. Batch writing-in of newly added pages<br />
Figure 6. n node distributed cooperative grasping system performance decreases as time varies<br />
A Hash table turns the uncertain insertion position of newly added web pages into a certain one. Therefore, with the addition of an insertion queue, pages can be inserted into the target log files in batch mode. Through the Hash function decomposition, each log in the Hash-Log structure is far smaller than the single log of the log structure, and at the same time much larger than a Hash bucket of the Hash structure.<br />
Besides, it must be ensured that each log can be held in memory. So to determine the size of the Hash table in Hash-Log, it is necessary to consider the size of the actual physical memory and the number of web pages that need to be stored.<br />
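The Hash-Log idea can be illustrated with a toy in-memory model. Everything here is an assumption made for illustration: the class name `HashLogStore`, the use of MD5 on the URL as the page signature, Python lists standing in for log files, and a dictionary standing in for the B-tree index on each log.

```python
import hashlib


class HashLogStore:
    """Toy Hash-Log page store: one log, index and write queue per hash slot."""

    def __init__(self, buckets=4, batch_size=3):
        self.batch_size = batch_size
        self.logs = [[] for _ in range(buckets)]     # stand-ins for log files
        self.indexes = [{} for _ in range(buckets)]  # stand-ins for B-tree indexes
        self.queues = [[] for _ in range(buckets)]   # pending writes per log

    def _slot(self, url):
        # Signature of the URL, reduced modulo the hash table size.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % len(self.logs)

    def put(self, url, page):
        slot = self._slot(url)
        self.queues[slot].append((url, page))
        if len(self.queues[slot]) >= self.batch_size:
            self.flush(slot)  # write the accumulated batch in one go

    def flush(self, slot):
        log, index = self.logs[slot], self.indexes[slot]
        for url, page in self.queues[slot]:
            index[url] = len(log)  # an update appends and repoints the index
            log.append(page)
        self.queues[slot].clear()

    def get(self, url):
        slot = self._slot(url)
        for queued_url, page in reversed(self.queues[slot]):
            if queued_url == url:
                return page  # newest version is still waiting in the queue
        position = self.indexes[slot].get(url)
        return None if position is None else self.logs[slot][position]
```

A call such as `HashLogStore(buckets=16, batch_size=100)` would correspond to 16 log files that each receive writes in batches of 100 pages; the old versions left behind by updates are the storage voids that compaction would later reclaim.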
Table I gives a qualitative evaluation of the three ways of storing web pages.<br />
To sum up, when there is little random access, the log structure is the best way to store web pages. When a great deal of random access is likely and many new web pages must be added, Hash-Log is the more suitable way. It effectively supports distributed storage by spreading web pages over the storage nodes, which increases the reliability and stability of web page storage when multiple machines are used in a larger environment. The overall search results are not affected much even if one storage node fails.<br />
TABLE I.<br />
QUALITATIVE EVALUATION OF THREE STORAGE WAYS OF WEB PAGES<br />
| | Log structure | Structure based on Hash | Hash-Log |<br />
| Ordered access | ++ | - | + |<br />
| Random access | +- | ++ | + |<br />
| Increase webpages | ++ | - | + |<br />
IV. SYSTEM EVALUATIONS<br />
A. Operating System Environment<br />
TABLE Ⅱ.<br />
HARDWARE DEVICES<br />
| Device type | Device purposes | Device configuration | Count |<br />
| Lenovo M6000 | Server | Intel P4 3.2GHz/ Memory 1G/ HardDisk 160G/ NIC 10M-100M | 1 |<br />
| Lenovo T2900V | Client | Intel Celeron 2GHz/ Memory 1G/ HardDisk 160G/ NIC 1000M | 3 |<br />
| Lenovo 428E | Client | Intel P4 3.2GHz/ Memory 512M/ HardDisk 800G/ NIC 10M-100M | 10 |<br />
B. Performance Evaluation<br />
The statistics show that grasping rates and the number of webpages grasped differ considerably between sites. Both depend on the access speed of each site, and some sites place restrictions on crawlers, including speed restrictions as well as some web accessibility restrictions. With three nodes grasping cooperatively, the average rate exceeds 7 pages per second for every site listed in Table III, which is a very satisfactory result.<br />
TABLE Ⅲ.<br />
GRASPING CONDITIONS OF SOME FAMOUS WEBSITES<br />
| Web Site | Web Size | Count of Webs | Crawl Time (Hour) | Average Rate (Pages/Second) | Crawl Nodes |<br />
| 163.com | 10.5G | 180596 | 5 | 10.033 | 3 |<br />
| sina.com.cn | 9.3G | 132769 | 5 | 7.376 | 3 |<br />
| yahoo.com.cn | 8.7G | 150101 | 5 | 8.339 | 3 |<br />
| qq.com | 8.2G | 142969 | 5 | 7.943 | 3 |<br />
| sohu.com | 7.7G | 131094 | 5 | 7.283 | 3 |<br />
C. Scalability Evaluation<br />
A system with good scalability achieves linear growth in performance as cost is added. It is also easy to scale down or expand.<br />
The influence of the number of cooperative grasping nodes on the grasping result is examined below. Figure 6 shows the results of four systems of different scale (with 1, 2, 4 and 10 cooperative grasping nodes respectively) during the first 10 hours. The abscissa denotes the running time of the crawler system in hours, and the ordinate represents the accumulated number of grasped webpages.<br />
Figure 6 shows that as the number of cooperative grasping nodes increases, the basic system performance increases linearly. Therefore, this distributed system has good scalability and stability.<br />
D. Task Load Balance Evaluation<br />
The load balance of the system is based on the cooperative grasping algorithm of the distributed web crawler, which uses a Hash function to allocate URLs dynamically among the nodes. Whether load balance has been attained cannot be judged from the number of URLs allocated to each node at a single moment of the process. Instead, the whole grasping process is divided into several phases in time, and all phases are analyzed to evaluate the effect of load balance. The experiment is carried out with 3 nodes cooperatively grasping 163.com. Table IV shows the URL distribution of each node over 5 hours of system operation.<br />
TABLE Ⅳ.<br />
THE VARIATION OF THE NUMBER OF URL ALLOCATION OF THE 3 NODES AS TIME VARIES<br />
| Nodes | 10 Minutes | 30 Minutes | 1 Hour | 2 Hours | 5 Hours |<br />
| Node 1 | 1893 | 5574 | 11753 | 22961 | 53742 |<br />
| Node 2 | 1967 | 5782 | 12587 | 24323 | 57933 |<br />
| Node 3 | 1952 | 5637 | 12127 | 22939 | 55161 |<br />
Table IV shows that each node has grasped a basically equal number of webpages. The load balance of the distributed web crawler has therefore reached the expected basic objective.<br />
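The claim of basically equal allocation can be quantified from the figures reported for the three nodes. The following sketch uses the per-node URL counts given for each running time; the balance measure chosen here (largest relative deviation from the mean) is an assumption of this illustration, not a metric defined in the paper.

```python
# URLs allocated to Node 1, Node 2 and Node 3 at each running time (Table IV).
URLS_ALLOCATED = {
    "10 minutes": [1893, 1967, 1952],
    "30 minutes": [5574, 5782, 5637],
    "1 hour": [11753, 12587, 12127],
    "2 hours": [22961, 24323, 22939],
    "5 hours": [53742, 57933, 55161],
}


def max_relative_deviation(counts):
    """Largest distance of any node from the mean, as a fraction of the mean."""
    mean = sum(counts) / len(counts)
    return max(abs(c - mean) for c in counts) / mean


if __name__ == "__main__":
    for phase, counts in URLS_ALLOCATED.items():
        print(f"{phase}: {max_relative_deviation(counts):.1%}")
```

With these figures no node departs from the mean by more than 5% in any phase, which supports the qualitative statement that the load is basically balanced.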
REFERENCES<br />
[1] Li Xiaoming, Yan Hongfei, Wang Jimin, Search Engine: Principle, Technology and System. Beijing: Science Press, 2005.<br />
[2] M. Najork, J. Wiener, “Breadth-first search crawling yields high-quality pages,” in 10th International World Wide Web Conference, 2001.<br />
[3] Reka Albert, Hawoong Jeong, Albert-Laszlo Barabasi, “Diameter of the World-Wide Web,” Nature 401, pp. 130-131, 1999.<br />
[4] Li Xiaoming, “Estimation of the Number of Static Web Pages in China,” PKU_CS_NET_TR2002006, 2002.<br />
[5] A. Broder, R. Kumar, F. Maghoul, A. Tomkins, J. Wiener, “Graph structure in the web: experiments and models,” presented at Proceedings of the 9th World-Wide Web Conference, Amsterdam, 2000.<br />
[6] A. Arasu, J. Cho, H. Garcia-Molina, “Searching the Web,” ACM Transactions on Internet Technology, pp. 42.<br />
[7] Narayanan Shivakumar, Hector Garcia-Molina, “Finding near-replicas of documents on the web,” Web DB 1998, pp. 204-212.<br />
[8] J. Cho, H. Garcia-Molina, “Estimating Frequency of Change,” ACM Transactions on Internet Technology, Vol. 3, 2003.<br />
[9] A Standard for Robot Exclusion [EB/OL], http://www.robotstxt.org/wc/norobots.html<br />
[10] J. Talim, Z. Liu, Ph. Nain, E. G. Coffman, “Controlling the robots of Web search engines,” Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, Cambridge, Massachusetts, United States, 2001.<br />
[11] Junghoo Cho, Hector Garcia-Molina, “Parallel crawlers,” in Proceedings of the eleventh international conference on World Wide Web, Honolulu, Hawaii, USA, ACM Press, pp. 124-135, 2002.<br />
[12] Paolo Boldi, Bruno Codenotti, Massimo Santini and Sebastiano Vigna, UbiCrawler: A Scalable Fully Distributed WebCrawler, 2003.<br />
[13] Yan Hongfei, “Primary Exploration on Design, Realization and Application of Extensible Web Information Collection System,” Beijing University Doctoral Dissertation, 2002.<br />
A Ranking Method of Retrieval Results Based on Web Comprehending<br />
Zhijuan Deng<br />
Jiangxi University of Science and Technology/ Faculty of Science, Ganzhou, China<br />
66162815@qq.com<br />
Shaojun Zhong<br />
Jiangxi University of Science and Technology/ Faculty of Science, Ganzhou, China<br />
infor2000@qq.com<br />
Abstract—This paper puts forward a method for calculating the query similarity of webpage search results based on Web comprehending. Given a user's query input, the method uses Web comprehending technology to display the important web pages closest to the query on the first page of the result list, so that users are more satisfied with the response of the search engine; it pursues recall ratio while ensuring precision at the same time.<br />
Index Terms—similarity, Web comprehending, search<br />
engine, search results<br />
I. INTRODUCTION<br />
As is known to all, the scale of the World Wide Web is enormous. According to the analysis of Lawrence and Giles, the number of pages on the World Wide Web doubles every two years. As early as 1998, researchers estimated the scale of the World Wide Web to be on the order of a billion pages. Extrapolated in this way, the scale of the present World Wide Web has reached the order of ten billion pages [1]. A search engine should display a list of information related to the query strings that users input. Although the precision of a search strategy based on keyword matching is very high, it ignores many semantically related words, which limits its ability to offer users effective information.<br />
A topic-specific search engine is devoted to offering users more comprehensive and professional service related to its topic. It combines with Web comprehending to offer correlation search and actively offers users webpage search results that are highly professional and highly relevant. For a topic-specific search engine, if the query words that users input are common, the number of pages related to the input is large. If the captured pages are displayed without any ranking rule, many unimportant or weakly related pages will appear near the front. Some investigations show that, when querying with a search engine, most users pay attention only to the first page of results and do not look at the following pages. Consequently, if the first page is unimportant or weakly related, precision suffers and users will be dissatisfied with the results of the search engine. Ranking the pages obtained by searching is therefore very necessary.<br />
doi:10.4304/jnw.6.12.1690-1696<br />
However, the pages obtained through exact-match queries and the pages obtained through Web comprehending technology are different. In order to follow the principle of displaying important pages on the first page, a quantitative basis should be offered to distinguish these two kinds of pages when ranking. That is, the method of calculating the similarity value between query words and documents should be changed accordingly, starting from exact matching.<br />
II. CALCULATION OF THE SIMILARITY BETWEEN WEB<br />
COMPREHENDING TECHNOLOGY AND SEARCH RESULTS<br />
A. Comprehending the Importance of Web Pages with PageRank Technology<br />
PageRank is a link analysis method put forward in 1998 by Sergey Brin and Lawrence Page, then doctoral candidates at Stanford University. It evaluates all pages, assigns every page a value that measures its importance, and this value is finally used in the ranking of search results [2].<br />
Specifically, PageRank assumes that a surfer browses for several steps by following links, then jumps to a random starting page and again browses by following links, so the value of a page is decided by how frequently it is visited during such random surfing. The basic ideas of the PageRank algorithm are as follows: (1) if a page is referenced by many other pages, it is probably an important page; (2) if a page is not referenced many times but is referenced by an important page, it is probably also an important page; (3) the importance of a page is divided evenly among, and transferred to, the pages that it references. Based on the link structure of the entire Web, PageRank calculates the importance of all pages, assuming that users can visit the entire network via the hyperlinks between pages. But the Web graph is usually not strongly connected, so PageRank applies the random surfing model: with probability 1−d, where d is the damping coefficient, the visitor jumps randomly to another node of the Web graph, which is equivalent to adding a link between two pages that have no link. The application of PageRank in Google has proved that integrating the PR value of the results, after analyzing and comprehending Web page links, really can greatly improve the precision of search results [3].<br />
The value of PageRank is defined as follows:<br />
Assume that pages T1…Tn link to page A (that is, T1…Tn reference page A). The parameter d is a damping coefficient set between 0 and 1. C(A) is defined as the number of links going out of page A. The PageRank value of page A is then given by (1).<br />
$$PR(A) = (1-d) + d \times \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)} \qquad (1)$$<br />
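Formula (1) is usually evaluated by repeated substitution until the values settle. The sketch below is a minimal illustration under stated assumptions: the three-page link graph, the value d = 0.85 and the fixed iteration count are all chosen here for the example and do not come from the paper.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    links maps each page to the list of pages it references.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    pr = {page: 1.0 for page in pages}  # assumed starting value for every page
    for _ in range(iterations):
        updated = {}
        for a in pages:
            incoming = sum(
                pr[t] / len(links[t])       # PR(T) / C(T)
                for t in pages
                if a in links.get(t, ())    # T references A
            )
            updated[a] = (1 - d) + d * incoming
        pr = updated
    return pr


if __name__ == "__main__":
    toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical graph
    for page, value in pagerank(toy_links).items():
        print(page, round(value, 3))
```

Note that formula (1) as written is not normalised: when every page has at least one outgoing link, the values produced by this sketch sum to the number of pages, and dividing each value by the page count turns them into a distribution that sums to 1.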
PageRank values form a probability distribution over the entire set of web pages, so the sum of the PageRank values of all web pages is 1. In the formula, PR(A) is the PageRank value of page A; d is the damping factor,<br />
0
similarity. 3) Rank the documents by their similarity to the given query word and return them to users. In this way the vector space model reports the comprehended Web document content back to the retrieval system to optimize users’ queries.<br />
C. Utilize Latent Semantic Analysis to Comprehend<br />
Web Documents<br />
Latent semantic indexing (LSI), also called latent semantic analysis (LSA), was put forward to improve the effect of the vector space model. The basis of LSI is the feature item-text matrix. Singular value decomposition is applied to this matrix to obtain the latent semantic structure model.<br />
The object of LSI is the relation between the words in texts: latent associations are implied in the context usage patterns of the terms. Statistical computation over a large number of texts is used to find these latent associations. The method needs no explicit semantic coding and relies only on the relation of objects to their context. It uses the latent semantics to represent words as well as texts, and thereby eliminates the correlation between words and simplifies the text vectors. Because there is strong correlation between words, LSI exploits it by applying a statistical transform to the context usage patterns of the words in the text collection, obtaining a new semantic space [8].<br />
Some infrequent usages of words should be removed from the main semantic structure, for example misused words, unrelated words that occasionally occur in the same document, and “noise” words that cannot represent the topic of a text, such as very high-frequency and very low-frequency words. Truncated singular value decomposition is used to reduce the dimension, which filters the information and removes the noise. LSI projects the high-dimensional representation of texts and words into a low-dimensional latent semantic space, which reduces the dimension of the problem; at the same time, the low-dimensional representation reveals the semantic relations between words and texts [9].<br />
D. Singular Value Decomposition<br />
Latent semantic indexing mainly applies the Singular Value Decomposition (SVD) of a matrix. SVD is a common method in mathematical statistics, mainly used for unconstrained least squares problems, matrix rank estimation and canonical correlation analysis.<br />
Definition of singular values: suppose A is a real m×n matrix; the arithmetic square roots of the non-zero eigenvalues of the n×n square matrix $A^T A$ are the singular values of matrix A.<br />
Singular value decomposition theorem: suppose $A \in R^{m \times n}$ has rank r; then there exist an m×m orthogonal matrix U and an n×n orthogonal matrix V such that<br />
$$U^T A V = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix},$$<br />
and<br />
$$A = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T$$<br />
is the singular value decomposition of matrix A.<br />
What is used in information retrieval is a special form of singular value decomposition, because the matrix to be decomposed in information retrieval is usually a high-order sparse matrix [10].<br />
More precisely, suppose the vocabulary-text matrix A is a sparse matrix with m rows and n columns, where m>>n and rank(A)=r. By the singular value decomposition theorem, the singular value decomposition of A is $A = T_0 S_0 D_0^T$.<br />
The columns of T0 are orthogonal and of length 1, that is, $T_0^T T_0 = I$; the column vectors of T0 are called the left singular vectors of matrix A. S0 is called the canonical form of the singular values of matrix A, a diagonal matrix $\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r \ge \lambda_{r+1} \ge \dots = 0$, in which $\lambda_i$ is the i-th singular value of A.<br />
The columns of D0 are orthogonal and of length 1, that is, $D_0^T D_0 = I$; the column vectors of D0 are called the right singular vectors of matrix A.<br />
In general, in $A = T_0 S_0 D_0^T$ the matrices T0, S0 and D0 are all of full rank and together carry all the information of the original matrix A. The advantage of the singular value decomposition is that smaller matrices can be used for a best-fit approximation [11]. If the diagonal elements of S0 are sorted by size, the k largest singular values are kept and the others are set to 0, then the resulting matrix, denoted Ak, is an approximation of the original matrix A with rank k. It can be proved that among all matrices of rank k, Ak is the one closest to A in the Frobenius norm. Once the zeros have been introduced, S0 can be simplified by deleting the corresponding rows and columns, which gives a new diagonal matrix S; likewise, taking the first k columns of T0 and D0 gives the matrices T and D respectively. The rank-k approximation matrix Ak of A can then be constructed.<br />
$$A \approx A_k = T S D^T \qquad (4)$$<br />
This is the optimal rank-k model in the least-squares sense, and it can be used to estimate the necessary data.<br />
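As a small worked illustration of the truncation step, consider a diagonal matrix whose decomposition can be read off directly. The matrix is a hypothetical toy example chosen for easy arithmetic, not data from the paper, and it ignores the m>>n setting.<br />

```latex
A = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
T_0 = D_0 = I, \qquad S_0 = \mathrm{diag}(3, 1)

% Keep only the largest singular value (k = 1):
A_1 = T S D^T
    = \begin{bmatrix} 1 \\ 0 \end{bmatrix} (3) \begin{bmatrix} 1 & 0 \end{bmatrix}
    = \begin{bmatrix} 3 & 0 \\ 0 & 0 \end{bmatrix}

% The approximation error equals the discarded singular value:
\| A - A_1 \|_F = 1 = \lambda_2
```

In this example the retained singular value accounts for 3/(3+1) = 75% of the sum of singular values.<br />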
The choice of the dimension factor k affects the efficiency of the semantic space model: too small a k loses useful information, while too large a k makes the computation expensive. In general, given $\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r \ge \lambda_{r+1} \ge \dots = 0$, k is chosen to satisfy the contribution rate inequality<br />
$$\sum_{i=1}^{k} \lambda_i \Big/ \sum_{i=1}^{r} \lambda_i \ge \theta \quad (\theta \text{ can be } 40\%, 50\%) \qquad (5)$$<br />
Here θ is the threshold on the share of the original information that is retained. The contribution rate inequality is borrowed from the corresponding concept in factor analysis and measures how well the k-dimensional space represents the whole space.<br />
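The contribution rate inequality (5) can be turned into a small selection routine. The following is a minimal sketch, assuming the singular values are already available as a plain list; the function name `choose_rank` and the sample values are illustrative and do not come from the paper.<br />

```python
def choose_rank(singular_values, theta):
    """Return the smallest k whose leading singular values reach the share theta.

    Implements sum_{i<=k} lambda_i / sum_{i<=r} lambda_i >= theta, as in (5).
    """
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    # Keep the r positive singular values, largest first.
    values = sorted((v for v in singular_values if v > 0), reverse=True)
    if not values:
        raise ValueError("at least one positive singular value is required")
    total = sum(values)
    running = 0.0
    for k, value in enumerate(values, start=1):
        running += value
        if running / total >= theta:
            return k
    return len(values)


# Hypothetical singular values: 4/10 < 0.5 but (4 + 3)/10 >= 0.5, so k is 2.
example_k = choose_rank([4.0, 3.0, 2.0, 1.0], theta=0.5)
```

A larger θ keeps more dimensions, which matches the trade-off described above between losing information and increasing the computational cost.<br />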
Figure 2. Singular value decomposition diagram <strong>of</strong><br />
vocabulary-text matrix<br />
For the approximation matrix Ak, the row vectors of T are called vocabulary vectors and the row vectors of D are called text vectors. Text retrieval and other text processing built on these vectors is known as latent semantic indexing (LSI). Vocabulary vectors and text vectors are projected into the same low-dimensional (k-dimensional) space, which is called the latent semantic space. Figure 3 shows an example of vocabulary and texts in the latent semantic space.<br />
Figure 3. Expression <strong>of</strong> vocabulary and text in latent semantic<br />
space.<br />
By applying singular value decomposition and selecting a rank-k approximation matrix, LSI effectively addresses the problems of synonymy and polysemy. Consider the terms “computer”, “computing machine”, “programming” and “home”: “computer” and “computing machine” are synonyms, “programming” is related to both of them, and “home” is unrelated to the other three words.<br />
In a keyword-based retrieval system, if “computer” does not appear literally in a text, then a query for “computer” returns neither the texts containing “computing machine” nor those containing “home”. Users, however, expect a query for “computer” to find texts about “computing machine”, and possibly also texts about “programming”, whose degree of association is lower than that of “computing machine”; they do not expect texts about “home”.<br />
In the latent semantic space obtained by singular value decomposition, latent semantic indexing can express the inner relations between these words well: the contexts of “computer”, “computing machine” and “programming” are consistent to some degree, so these words lie close to one another and far from “home”, which brings out the semantic relations between words.<br />
The same holds between vocabulary and text and between text and text [12]. Generally, a fairly small value of k suffices: the resulting semantic space represents most of the information in the original matrix A, while information regarded as “noise” is removed. In addition, the rank-k approximation has far fewer terms than the original high-dimensional sparse m×n matrix. The smaller matrices reduce computational complexity, which helps to improve retrieval efficiency.<br />
E. Calculation <strong>of</strong> Similarity Relation in Latent Semantic<br />
Indexing<br />
There are three important relations in semantic space:<br />
vocabulary and vocabulary, text and text, vocabulary and<br />
text. Because the approximation matrix Ak <strong>of</strong> primitive<br />
matrix A represents the most important and reliable latent<br />
semantic space in matrix A, vocabularies and texts are all<br />
projected into the same space, the similarity relation <strong>of</strong><br />
the three relations can be expediently calculated by virtue<br />
<strong>of</strong> approximation matrix T, S and D [10].<br />
1) Compare two vocabularies (forward multiplication).<br />
$$A_k \times A_k^T = T \times S \times D^T \times D \times S \times T^T = T \times S^2 \times T^T \qquad (6)$$<br />
Here $D^T \times D = I$, because the columns of D are orthonormal. The entry in row i, column j of (6) represents the similarity between vocabulary i and vocabulary j.<br />
2) Compare two texts (backward multiplication).<br />
$$A_k^T \times A_k = D \times S \times T^T \times T \times S \times D^T = D \times S^2 \times D^T \qquad (7)$$<br />
Here $T^T \times T = I$, because the columns of T are orthonormal. The entry in row i, column j of (7) represents the similarity between text i and text j.<br />
3) Compare a vocabulary and a text, using the approximation matrix Ak of the original matrix A.<br />
$$A_k = T \times S \times D^T \qquad (8)$$<br />
4) The similarity between a user’s query and the texts.<br />
In retrieval, a query may consist of vocabularies, texts or any combination of both. The system first preprocesses the query, generates a query vector q from word frequency information, treats it as a “pseudo-text” and represents it in the k-dimensional semantic space. With q as the original query vector, its representation in the k-dimensional semantic space is $q^* = q^T T S^{-1}$. The similarity between $q^*$ and the text vectors can then be calculated in the k-dimensional space; three common formulas are as follows:<br />
a) Inner-product formula<br />
$$Sim_1(q^*, d_j) = \sum_{i=1}^{k} d_{ji} \times q_i^* \qquad (9)$$<br />
b) Cosine formula<br />
$$Sim_2(q^*, d_j) = \frac{\sum_{i=1}^{k} d_{ji} \times q_i^*}{\sqrt{\sum_{i=1}^{k} d_{ji}^2} \times \sqrt{\sum_{i=1}^{k} (q_i^*)^2}} \qquad (10)$$<br />
c) Pearson formula<br />
$$Sim_3(q^*, d_j) = \frac{\sum_{i=1}^{k} (d_{ji} - \bar{d}_j)(q_i^* - \bar{q}^*)}{\sqrt{\sum_{i=1}^{k} (d_{ji} - \bar{d}_j)^2} \times \sqrt{\sum_{i=1}^{k} (q_i^* - \bar{q}^*)^2}} \qquad (11)$$<br />
In (9), (10) and (11), $q_i^*$ is the weight of the i-th vocabulary of the query vector, $d_{ji}$ is the weight of the i-th vocabulary of the j-th text vector, $\bar{d}_j$ and $\bar{q}^*$ are the mean weights of the text vector and the query vector, and k is the dimension of the semantic space. Finally, the texts are ranked by similarity and the resulting text list is returned to the user in response to the query.<br />
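The three similarity measures (9) to (11) and the final ranking step can be sketched in a few lines. This is an illustrative sketch only: vectors are plain Python lists of equal length k, the Pearson measure is computed as the cosine of the mean-centred vectors, and the function names are assumptions rather than names used in the paper.<br />

```python
import math


def inner_product(q, d):
    """Formula (9): sum of component-wise products."""
    return sum(qi * di for qi, di in zip(q, d))


def cosine(q, d):
    """Formula (10): inner product divided by the product of the vector norms."""
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return inner_product(q, d) / norm if norm else 0.0


def pearson(q, d):
    """Formula (11): cosine of the two vectors after subtracting their means."""
    q_mean = sum(q) / len(q)
    d_mean = sum(d) / len(d)
    return cosine([x - q_mean for x in q], [x - d_mean for x in d])


def rank_texts(q_star, texts, sim=cosine):
    """Return text indices ordered by decreasing similarity to q_star."""
    scores = [sim(q_star, d) for d in texts]
    return sorted(range(len(texts)), key=lambda j: scores[j], reverse=True)


# Hypothetical 2-dimensional semantic space with three text vectors.
order = rank_texts([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
```

Here `order` lists the text that points in the same direction as the query first and the orthogonal text last.<br />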
Ⅲ. IMPROVED RANKING METHOD OF WEB PAGE<br />
RETRIEVAL RESULTS<br />
A. The Ranking Method <strong>of</strong> Search Results under Web<br />
Comprehending<br />
Quantifying the importance and relevance of web pages is the basis for ranking them. As shown above, the PageRank value of a web page reflects its importance, so only the link relations within the web page set are needed, according to (12).<br />
$$PR = (1 - d) + d \times \sum_{i=1}^{n} \left( \frac{PR(T_i)}{C(T_i)} \right) \qquad (12)$$<br />
Once the PageRank values of the web pages have been obtained, query similarity is calculated only for the subset of pages with high scores, which greatly reduces the number of vectors to be built and of vector similarities to be calculated.<br />
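Formula (12) is usually evaluated by fixed-point iteration over the link graph. The sketch below assumes the standard reading of the symbols (d is the damping factor, Ti are the pages linking to the page being scored, C(Ti) is the number of outgoing links of Ti); the function names, the default d = 0.85 and the iteration count are illustrative assumptions, not values given in the paper.<br />

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR = (1 - d) + d * sum(PR(Ti) / C(Ti)) over a link graph.

    links maps each page to the pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # Use sets so that duplicate links are counted once in C(Ti).
    graph = {page: set(links.get(page, ())) for page in pages}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(targets)
                for src, targets in graph.items()
                if page in targets
            )
            for page in pages
        }
    return pr


def top_pages(pr, n):
    """Return the n pages with the largest PageRank values."""
    return sorted(pr, key=pr.get, reverse=True)[:n]


# Hypothetical three-page graph: A and B both link to C.
scores = pagerank({"A": ["C"], "B": ["C"], "C": []})
best = top_pages(scores, 1)
```

Restricting the later similarity computation to `top_pages(...)` corresponds to calculating query similarity only for the high-scoring pages.<br />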
After word segmentation, document d and query q are each reduced to a set of vocabularies. Let Σ = {t1, t2, …, tN} be the dictionary, where ti is a lexical item and N is the size of the dictionary, so<br />
$$d = \{ t_1^{m_1}, t_2^{m_2}, \dots, t_N^{m_N} \}$$<br />
$$q = \{ t_1^{n_1}, t_2^{n_2}, \dots, t_N^{n_N} \}$$<br />
In the above formulas, mi and ni (i=1,2,…,N) represent the weights of the corresponding words. Because the dictionary is fixed, the vectors can be represented by the weight values alone.<br />
$$d = \{ m_1, m_2, \dots, m_N \}$$<br />
$$q = \{ n_1, n_2, \dots, n_N \}$$<br />
The weights are calculated with the typical TF*IDF scheme, and the vector representation of a document is obtained by normalizing mi.<br />
$$d = (w_1, w_2, \dots, w_N)$$<br />
where<br />
$$w_i = TF_i \times IDF_i = \frac{m_i}{\sum m_i} \times \lg\left(\frac{M}{k_i}\right) \qquad (13)$$<br />
In (13), ki is the number of documents in the document set that contain the lexical item ti, and M is the size of the document set. In this way, the weight values of all feature items of a document are obtained. In the same way, query q can be expressed as the weight values of its feature items.<br />
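The weighting in (13) can be illustrated on a toy corpus. The sketch below assumes that mi is the raw number of occurrences of ti in a document and that documents are already segmented into token lists; both are assumptions made for illustration, as are the function name and the sample tokens.<br />

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (dictionary, weight vectors) with w_i = (m_i / sum m_i) * lg(M / k_i)."""
    dictionary = sorted({term for doc in documents for term in doc})
    M = len(documents)  # size of the document set
    # k_i: number of documents that contain term t_i
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # m_i: occurrences of t_i in this document
        total = sum(counts.values())
        vectors.append([
            (counts[term] / total) * math.log10(M / doc_freq[term]) if total else 0.0
            for term in dictionary
        ])
    return dictionary, vectors


# Hypothetical two-document corpus.
terms, weights = tfidf_vectors([["flag", "parade"], ["flag", "party"]])
```

A term that occurs in every document, such as "flag" here, receives weight 0 because lg(M / ki) = lg(1) = 0, so it does not influence the later similarity calculation.<br />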
The similarity between a document and the query string ultimately decides the display order of the web pages. The cosine formula is applied to calculate the similarity between the user’s query and the texts.<br />
$$Sim(q, d) = \frac{\sum_{i=1}^{k} d_i \times q_i}{\sqrt{\sum_{i=1}^{k} d_i^2} \times \sqrt{\sum_{i=1}^{k} q_i^2}} \qquad (14)$$<br />
The cosine of the angle between the document weight vector and the query weight vector is the similarity value of document d and query q, and the web page ranking is decided by this value.<br />
B. The Improvement <strong>of</strong> Similarity Calculation Formula<br />
A web page retrieved by exactly matching the query words is, after all, different from a web page retrieved through semantically relevant words. To follow the principle of presenting the best results prominently on the first page, the ranking needs a quantitative basis for distinguishing the two kinds of web page. That means the method of calculating the similarity value between query words and text should change correspondingly.<br />
Assume the query content has been preliminarily filtered on input. To respect the query strings entered by users, the traditional cosine formula is still used to calculate the similarity between query q and the texts obtained by complete (exact) matching.<br />
For documents obtained through semantically related words, the query similarity should be reduced appropriately. The analysis above shows that the similarity between vocabularies can be calculated by forward multiplication of the matrix. After the product matrix is normalized, the vocabulary similarity θ (0≤θ≤1) is taken as a reduction factor for the similarity between the query vector and the documents obtained through semantically related terms. On this basis, this paper puts forward a new formula for the similarity between a Web document and the query vector. It evolves from the cosine formula and covers both the documents obtained from the original index terms and those obtained from semantically related terms. The formula is as follows.<br />
$$Sim(q, d) = \begin{cases} \dfrac{\sum_{i=1}^{k} (d_i \times q_i)}{\sqrt{\sum_{i=1}^{k} d_i^2} \times \sqrt{\sum_{i=1}^{k} q_i^2}} & \text{(documents obtained from the original index terms)} \\[2ex] \theta \times \dfrac{\sum_{i=1}^{k} (d_i \times q_i)}{\sqrt{\sum_{i=1}^{k} d_i^2} \times \sqrt{\sum_{i=1}^{k} q_i^2}} & \text{(documents obtained from semantically related terms)} \end{cases} \quad (0 \le \theta \le 1) \qquad (15)$$<br />
In (15), θ is the threshold obtained in the singular value decomposition of the matrix when latent semantic similarity is calculated; it is the share of the original information retained and satisfies the contribution rate inequality $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{r} \lambda_i \ge \theta$. The contribution rate inequality measures how well the k-dimensional subspace represents the entire space.<br />
Applying the improved similarity formula to the search results for a user’s query takes precision into account while pursuing recall. Ranking web pages by the similarity value of the above formula can make users more satisfied with the response of the search engine.<br />
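The piecewise rule of formula (15) amounts to a cosine similarity that is scaled by θ when a document was reached only through a semantically related term. A minimal sketch follows; the flag `exact_match`, the function names and the sample vectors are assumptions introduced for illustration.<br />

```python
import math


def cosine(q, d):
    """Traditional cosine similarity between two weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return dot / norm if norm else 0.0


def improved_similarity(q, d, exact_match, theta):
    """Formula (15): plain cosine for exact matches, theta * cosine otherwise."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    base = cosine(q, d)
    return base if exact_match else theta * base


# Two hypothetical documents with identical vectors but different origins.
query = [1.0, 1.0]
exact_score = improved_similarity(query, [1.0, 1.0], exact_match=True, theta=0.5)
related_score = improved_similarity(query, [1.0, 1.0], exact_match=False, theta=0.5)
```

With identical weight vectors, the document found through a related term scores half as much as the exact match, so it is ranked lower in the result list.<br />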
C. Experimental Results and Their Analysis<br />
The experiment used a Web document set on the topic “The 60th Anniversary of National Day”, with “National Day”, “military parade” and other query words. Two web page ranking algorithms were implemented in the retrieval system, one based on the traditional cosine formula and one based on the improved similarity formula described above. Because common search engines display 10 to 20 pages on the first page of the result list, the experiment tracked how users browsed the returned list and recorded the number of user clicks within the first 10 pages and the first 20 pages in order to analyze user satisfaction. The detailed data are shown in Table I.<br />
TABLE I.<br />
COMPARISON OF USERS’ NUMBER OF CLICKING THE FIRST 10 PAGES / THE FIRST 20 PAGES<br />
Query Words | Traditional Cosine Formula | Improved Similarity Calculation Formula<br />
National Day | 7/10 | 7/12<br />
Hoisting the Flag | 5/6 | 6/13<br />
Evening Party | 4/7 | 8/9<br />
Military Parade | 5/11 | 10/15<br />
The table shows that, for the same query words, user satisfaction differs between the web page sets produced by the two similarity formulas. The precision within the first 10 pages and the first 20 pages obtained with the improved similarity formula is at least as high for every query and higher in most cases. The pages in the result list therefore meet users’ query requests better, and users are more satisfied with the list. From the users’ point of view, the content returned on the first page of the list is closer to their query words, so the retrieval quality of the system is higher. Applying the improved similarity formula thus makes the query similarity of pages more precise, which optimizes the result list and improves retrieval quality.<br />
C. Conduct retrieval by relevant inverted indexing file<br />
The retrieval process extracts query words from the user’s query string, matches them against the index wordlist of the inverted file and generates the result set.<br />
Concretely, after the user inputs a query string, the retrieval system first performs word segmentation on the Chinese characters in the string, removes stop words and punctuation, and extracts the query words. It then searches the inverted file provided by the indexing system and matches the query words. For each successfully matched index item, it reads the word frequency information and the inverted list of the index word, and records the numbers of the documents containing the index word together with the positions of the index word. Next, depending on whether the user has selected the option on the retrieval interface to display semantically relevant documents, it decides whether to check the relevant-word attribute. If the user chooses to display relevant results, it reads the pointer field for the semantically relevant words of the index word and obtains the information about each semantically relevant word through its in-list offset. After that, it calculates the PageRank score of the original web page corresponding to each document obtained, chooses the documents with higher scores and calculates their query similarity. Finally, it prepares the list for display and decides the display order of the web pages according to the similarity values.<br />
Displaying the result list also includes showing the abstract and the snapshot of each web page; moreover, query words contained in the web page title, abstract and snapshot should be highlighted. The positions to highlight can be taken from the position list of the index word stored in the inverted file.<br />
The retrieval system receives the query string input by the user; the algorithm for searching the relevant inverted file and obtaining the search result is as follows:<br />
Input: query string q<br />
Output: web page result list<br />
Algorithm: Searcher<br />
1. Initialize the web page set Res=Φ;<br />
2. Perform word segmentation on q and delete stop words to obtain the query vector expression q={t1,t2,…,tm};<br />
3. For each ti∈q do<br />
4. Initialize the set of result text numbers Rid=Φ;<br />
5. Match term by term in the wordlist of the relevant inverted file to find the index word ti;<br />
6. Read the inverted list of index item ti and store the document numbers in Rid; record the word frequency and the occurrence positions of the item;<br />
7. If the user has chosen the option to show related results && the attribute linked list of semantically relevant words of ti is non-empty, then<br />
8. Read the in-list offset attribute of the semantically relevant words to find the term t-ri;<br />
9. Record the information about the relevant item t-ri as in step 6;<br />
10. End if<br />
11. Search the web page indexing file by the document numbers in Rid and store the web pages obtained in Res;<br />
12. Calculate the PageRank values of the pages in the Rid set by formula (1);<br />
13. Choose the web pages with larger PageRank values and calculate their query similarity by formula (15);<br />
14. End for<br />
15. Rank the web pages in Res by similarity to decide the display order of the list;<br />
16. Display the web page title, abstract and snapshot in the result list;<br />
17. Return the result list;<br />
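The Searcher steps can be condensed into a short in-memory sketch. It is a simplification under stated assumptions, not the paper's implementation: word segmentation and stop-word removal are assumed to be done already, the inverted file and the related-word lists are plain dictionaries, PageRank values are precomputed, the PageRank filter is applied once after the loop rather than per term, and all names (`search`, `inverted_index`, `related_terms`, `top_n`) are hypothetical.<br />

```python
def search(query_terms, inverted_index, related_terms, pagerank, similarity,
           show_related=False, top_n=20):
    """Return document ids ordered by decreasing similarity score.

    inverted_index: term -> {doc_id: [positions]}
    related_terms:  term -> list of semantically related terms
    pagerank:       doc_id -> precomputed PageRank value
    similarity:     callable (doc_id, exact_match) -> score, e.g. formula (15)
    """
    exact = set()    # documents matched by an original query term
    related = set()  # documents matched only through a related term
    for term in query_terms:                          # steps 3 to 6
        exact.update(inverted_index.get(term, {}))
        if show_related:                              # steps 7 to 10
            for rel in related_terms.get(term, []):
                related.update(inverted_index.get(rel, {}))
    related -= exact
    candidates = exact | related                      # step 11
    # Steps 12 and 13: keep only the pages with the largest PageRank values.
    kept = sorted(candidates, key=lambda doc: pagerank.get(doc, 0.0),
                  reverse=True)[:top_n]
    scored = [(similarity(doc, doc in exact), doc) for doc in kept]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # step 15
    return [doc for _, doc in scored]


# Hypothetical index: document 1 contains "computer", document 2 a related term.
index = {"computer": {1: [0]}, "computing": {2: [3]}, "home": {3: [1]}}
related_map = {"computer": ["computing"]}
ranks = {1: 1.0, 2: 0.5, 3: 0.2}
score = lambda doc, exact_match: 1.0 if exact_match else 0.5
results = search(["computer"], index, related_map, ranks, score, show_related=True)
```

With `show_related=False` only document 1 is returned; with the option enabled, document 2 is added after it because its score is scaled down.<br />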
REFERENCES<br />
[1] Xiaoming Li, Hongfei Run, Jiming Wang, Search<br />
Engine-Principle, Technology and System. Beijing:<br />
Science Press, 2005.<br />
[2] Page L, Brin S, Motwani R, The pagerank citation<br />
ranking:Bringing order to the web. Stanford Digital<br />
Libraries SIDL-WP, 1999.<br />
[3] Haveliwala T H, “Topic-sensitive PageRank,” Proceedings of the 11th International World Wide Web Conference. Hawaii, pp. 517-526, 2002.<br />
[4] Kleinberg J, “Authoritative sources in a hyperlinked environment,” Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco, California, pp. 668-677, 1998.<br />
[5] Xing Wenpu, Ghorbani A, “Weighted PageRank Algorithm,” Communication Networks and Services Research, Proceedings of Second Annual Conference. pp. 305-314, 2004.<br />
[6] Hai Liu, Yuanyuan Wang, Xueren Zhang, “Study <strong>of</strong> Text<br />
Retrieval Problems Based on Latent Semantic Space,”<br />
Information Science, vol. 5, pp. 748-753, 2007.<br />
[7] Yuchang Lu, Mingyu Lu, “The Analysis and Construction<br />
<strong>of</strong> Word Weight Function in Sector Space Method,”<br />
<strong>Journal</strong> <strong>of</strong> Computer Research and Development, vol. 10,<br />
pp. 1205-1210, 2002.<br />
[8] Jiang Lu, “Study <strong>of</strong> the Application <strong>of</strong> Latent Semantic<br />
Analysis in Text Information Retrieval,” Wuhan:<br />
Huazhong University <strong>of</strong> Science and Technology, vol. 4,<br />
pp. 21-22, 2005.<br />
[9] Todd A. Letsche, Michael W. Berry, “Large-Scale Information Retrieval with Latent Semantic Indexing,” Information Science, vol. 1, pp. 105-137, 1997.<br />
[10] Nicholas Lester, Justin Zobel, Hugh Williams, “Efficient Online Index Maintenance for Contiguous Inverted Lists,” Information Processing and Management, vol. 4, pp. 916-933, 2006.<br />
[11] Jiang Jiahui, Matrix Theoretical Basis. Dalian: Dalian<br />
University <strong>of</strong> Technology, pp. 65, 1995.<br />
[12] Sheng Jun, “Study on Markov Network Retrieval Model Based on Latent Semantics,” Nanchang: Dissertation from Jiangxi Normal University, pp. 5-13, 2006.<br />
[13] Foltz P W, “The Measurement of Textual Coherence with Latent Semantic Analysis,” Discourse Processes, vol. 1, pp. 285-307, 1998.<br />
Zhijuan Deng, female, a native of Ganzhou, Jiangxi Province, was born in Nov. 1979. She is an instructor in the Faculty of Science, Jiangxi University of Science and Technology. Her research interests include Web information mining and software project management.<br />
Shaojun Zhong was born in Guzhou, China, in Oct. 1979. He is now a Lecturer in the Faculty of Science, Jiangxi University of Science and Technology, China. His research interests include data mining, network technology, and intelligent computation.<br />
An Encryption Scheme with Hidden Keyword<br />
Search for Outsourced Database<br />
Xiaoming Wang<br />
Jinan University/ Department <strong>of</strong> Computer Science, Guangzhou, 510632, China<br />
Email: wxm_gz@hotmail.com<br />
Guoxiang Yao and Zhen Zhang<br />
Jinan University/ Department <strong>of</strong> Computer Science, Guangzhou, 510632, China<br />
Abstract—An encryption scheme with hidden keyword<br />
search is proposed for outsourced databases. The proposed scheme employs both a pseudorandom function and a polynomial function in order to reduce computation and storage overhead. It not only provides controlled searching, hidden searching and provable secrecy for encryption, but also supports dynamic changes to the group of permitted users; adding and removing users is transparent to the other users, since they are not involved in the process. Moreover, no interaction is needed between database owner and server, server and user, or database owner and user when the decryption key is set up. Each user only needs to receive messages to set up their decryption key, and can then query over the encrypted data and decrypt it. Therefore, the proposed scheme is more efficient and more practical for outsourced databases.<br />
Index Terms—outsource database, hidden keyword search,<br />
added and revoked users<br />
I. INTRODUCTION<br />
The management of large databases is quite expensive, as it needs not only storage capacity but also skilled personnel. One solution to this problem is the outsourced database, in which data owners store their data with a third-party service provider (server) that is not trusted. The server provides services to the users of the database. The main problem in outsourced database systems is that sensitive data are stored on a third-party site which is not under the data owner’s direct control; thus, data privacy and security can be put at risk. To protect resources from being disclosed to the server and to outside attackers, as well as to realize access control on the server side, encryption is used to protect the sensitive data. By encrypting the data, the database owner ensures that no one except the permitted users can read the data. Although this solution can protect the data from outside attackers and the server, the fundamental problem is that searching over encrypted data is very difficult, and it is hard to protect user privacy when performing queries over encrypted data.<br />
Manuscript received Mar. 1, 2011; revised April 5, 2011; accepted April 12, 2011.<br />
doi:10.4304/jnw.6.12.1697-1704<br />
To resolve this problem, a solution is needed that enables the user to search over the encrypted domain in such a way that the server does not learn any unauthorized information by performing the search. In 2000, Song et al.[1] first studied a secure keyword search scheme using a symmetric cipher. In their scheme, a user stores encrypted data in a non-trusted database and later searches the data with a keyword encrypted under the user’s secret key. Their techniques provide provable secrecy for encryption, in the sense that the non-trusted server cannot learn anything about the plaintext given only the ciphertext. Their scheme is simple and fast. However, it applies only to the private-key setting, in which a user owns the data and wishes to upload it to a third-party database that the user does not trust; it cannot be used for practical applications such as an email routing system or an outsourced database [11]. In their scheme, only the user himself can search the encrypted data. To allow another user to search for a word, either the encryption key must be disclosed to that user, or a list of potential locations where the word might occur must be disclosed to the server. If the server is allowed to search for too many words, it may be able to use statistical techniques to start learning important information about the documents. One possible defense is to change the key periodically and re-encrypt the data under the new key. As a result, the user must re-encrypt the data and transmit it to the server over a finite-bandwidth channel. If the owner does not have the resources stored locally, a further preliminary step is needed to re-acquire them from the service, decrypt them, encrypt them again and transmit them to the server over that channel; this involves a large performance overhead and becomes practically impossible.<br />
In an outsourced database, the permitted users are allowed to search and read the data stored at the external server by the data owner. They wish to retrieve or search for data without revealing to the server which data it is. To meet these requirements, an encryption scheme with hidden keyword search for outsourced databases is proposed, based on Song et al.’s scheme. Unlike Song et al.’s scheme, which allows only the user himself to search the encrypted data, the proposed scheme allows a group of permitted users to search and read the data stored at the external server by the data owner. We employ both a polynomial function and a pseudorandom function in order to reduce computation and storage overhead. The proposed scheme not only provides controlled searching, hidden searching and provable secrecy for encryption, but also supports dynamic changes to the group of permitted users; adding and removing users is transparent to the other users, since they are not involved in the process. Moreover, no interaction is needed between database owner and server, server and user, or database owner and user when the decryption key is set up and updated. Each user only needs to receive messages to set up their decryption key, and can then query over the encrypted data and decrypt it. Therefore, the proposed scheme is more efficient and more practical for outsourced databases.<br />
The rest <strong>of</strong> the paper is organized as follows: Section 2<br />
presents related works. An encryption scheme with<br />
hidden keyword search for outsourced database is<br />
presented in section 3. In section 4, the security and<br />
properties <strong>of</strong> the proposed protocol are analyzed. Finally,<br />
the concluding remarks are given.<br />
II. RELATED WORK<br />
Some existing schemes for encrypted outsourced<br />
databases [2-5] assume that the entire database is encrypted<br />
with a single key and that the users are granted this key.<br />
This assumption only protects data on the server side, and<br />
the users have complete access to the database. In the real<br />
world, however, complete access to the encrypted<br />
outsourced data is not acceptable; users should have only<br />
selective access to the encrypted data. Moreover, when the<br />
authorization policy is updated, these proposals require<br />
re-encrypting the resources and resending them to the<br />
service. If the owner does not store the resources locally, a<br />
further preliminary step is needed to re-acquire them from<br />
the service and decrypt them. Because the data travel over<br />
a channel of finite capacity and a large number of new<br />
decryption keys are frequently transmitted to all the<br />
authorized users, the performance overhead is high and<br />
becomes practically prohibitive for large databases<br />
accessed by a dynamic group of users. To resolve these<br />
problems, Vimercati et al. [6] proposed the over-encryption<br />
approach, which avoids shipping resources back to the<br />
owner for re-encryption when security requirements<br />
change. In their scheme, the resources are encrypted by the<br />
owner to provide initial protection and are encrypted again<br />
by the outsourced server to reflect policy modifications.<br />
One potential limitation of the over-encryption scheme is<br />
that it may require publishing too many tokens when the<br />
number of users is large [7]. In 2008, Liu et al. [7] proposed<br />
a new key-assignment approach based on secret sharing. In<br />
their scheme, resources are divided into different sets based<br />
on access control lists, and each set corresponds to a distinct<br />
encryption key. Users use their corresponding key to derive<br />
the encryption key in order to access the resource.<br />
However, we consider Liu et al.’s scheme insecure<br />
against collusion attack. If two users share a resource, their<br />
scheme treats the two users as a subset and builds a linear<br />
equation in two variables from which the encryption key is<br />
derived. It randomly chooses points (x, y) on this equation<br />
and assigns them to the users as key pairs, and it chooses<br />
another point (xpub, ypub) on the equation and publishes it<br />
as a public token. Each user combines the public token<br />
(xpub, ypub) with his key pair to derive the decryption key.<br />
However, a user can also reconstruct the linear equation<br />
from the public token (xpub, ypub) and his key pair. He can<br />
then compute many valid key pairs for unauthorized users,<br />
who can thus access the resource.<br />
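The collusion weakness described above can be illustrated with a toy sketch.<br />
It assumes, purely for illustration, that the linear equation has the form<br />
y = a*x + key (mod Q) with the encryption key as the intercept; the modulus,<br />
the secret values and the helper names are hypothetical and are not taken<br />
from Liu et al.'s paper.<br />

```python
# Toy illustration only: the line y = a*x + key (mod Q) and all numbers are assumptions.
Q = 1019  # small prime modulus, chosen for readability


def line_from_points(p1, p2):
    """Recover slope a and intercept (the key) of y = a*x + key mod Q."""
    (x1, y1), (x2, y2) = p1, p2
    a = (y1 - y2) * pow((x1 - x2) % Q, -1, Q) % Q  # modular inverse, Python 3.8+
    key = (y1 - a * x1) % Q
    return a, key


def forge_key_pair(a, key, x_new):
    """Produce a fresh point on the same line; it behaves like an issued key pair."""
    return x_new, (a * x_new + key) % Q


# Owner side (hidden from users): the secret line.
SECRET_A, SECRET_KEY = 321, 777
user_pair = (10, (SECRET_A * 10 + SECRET_KEY) % Q)
public_token = (55, (SECRET_A * 55 + SECRET_KEY) % Q)

# One legitimate user rebuilds the whole line from his pair and the public token...
a, key = line_from_points(user_pair, public_token)
# ...and mints a key pair for an outsider, who then derives the same key.
forged = forge_key_pair(a, key, x_new=200)
_, outsider_key = line_from_points(forged, public_token)
```

Two points determine a line, so a single authorized user already holds<br />
enough information to issue working key pairs to anyone.<br />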
Database encryption prevents unauthorized users,<br />
including intruders breaking into a network and database<br />
administrators, from seeing sensitive data in databases.<br />
However, it is hard to protect user privacy when queries<br />
are performed over encrypted data. To resolve this problem,<br />
keyword search over encrypted data has received close<br />
attention in various environments such as encrypted web<br />
hard-systems, intelligent email routing and encrypted<br />
vendor systems. In 2000, Song et al. [1] proposed a<br />
technique for keyword search on encrypted data based on a<br />
symmetric cipher. It addresses the search problem between<br />
a user and an untrusted server, and the authors gave some<br />
practical solutions. In 2004, Golle et al. [8] first proposed<br />
the notion of conjunctive keyword searchable encryption<br />
and presented a solution to this problem. They defined a<br />
security model for conjunctive keyword search over<br />
encrypted data and provided two secure constructions. In<br />
2008, Wang et al. [9] gave the first keyword field-free<br />
conjunctive keyword search scheme, which answers the<br />
open problem posed by Golle et al. In their scheme, the<br />
target ciphertext includes a keyword set, and the user<br />
generates a trapdoor that consists of a keyword set. Subset<br />
keyword search means that if the keyword set of the target<br />
ciphertext includes the keyword set of the trapdoor, then the<br />
trapdoor and the ciphertext match. However, reference [10]<br />
points out that there is a mistake in the proof of Golle et<br />
al.’s scheme.<br />
The schemes above were constructed in a symmetric-key<br />
setting. In this setting, a user encrypts his private data and<br />
stores it at a remote server. The user can then retrieve his<br />
private data containing a particular keyword from the<br />
remote storage. However, these systems cannot be used in<br />
practical applications such as email routing systems and<br />
outsourced databases [11].<br />
In 2005, Boneh et al. [12] first proposed Public Key<br />
Encryption with Keyword Search (PEKS). With PEKS, a<br />
sender stores encrypted data on a server, and the receiver<br />
makes a trapdoor for a keyword and sends the trapdoor to<br />
the server. The server can then test whether the ciphertext<br />
and the trapdoor were made with the same keyword. If the<br />
keywords in the ciphertext and the trapdoor are the same,<br />
the server sends the ciphertext to the receiver. Byun et al.<br />
[13] showed that the PEKS scheme is insecure against<br />
off-line keyword-guessing attacks.<br />
That is, given a trapdoor, an attacker can learn which<br />
keyword was used to generate the trapdoor. Since users<br />
usually query commonly used keywords with low entropy,<br />
keyword-guessing attacks are meaningful.<br />
Rhee et al. [14] and Tang and Camenisch et al. [15] gave<br />
ways to cope with this attack. Later, many papers were<br />
published to extend PEKS. However, PEKS only supports<br />
an efficient remote storage system for a designated<br />
receiver; it does not provide an efficient remote storage<br />
system for a number of users.<br />
III. PROPOSED SCHEME<br />
A. System model<br />
The proposed scheme uses the DAS (database-as-a-service)<br />
model. The system includes three entities: data owner,<br />
server and user. This model is most suitable for a<br />
one-to-many group in which there is a single database<br />
owner and a large number of users. The database owner is<br />
responsible for producing, distributing and updating<br />
encryption keys. The server is responsible for producing<br />
the query result on the encrypted data and sending the<br />
encrypted result to the user. The user decrypts the result<br />
from the server with the decryption key in order to get the<br />
plaintext result.<br />
We assume that the data owner defines an access<br />
control policy to regulate access to the distributed<br />
resources. All the users of the outsourced database are<br />
divided into different groups according to their access<br />
privilege. Users with the same database access privilege<br />
are grouped together and can access the same part of the<br />
outsourced data. The outsourced database is protected<br />
with encryption. For the sake of simplicity, we assume<br />
that the encryption operations refer to a single group.<br />
B. Setup<br />
Let $p$, $q$ be distinct large primes with $q \mid (p-1)$, and let $g$ be a<br />
generator of order $q$ in $GF(p)$. Let $F$ be a pseudorandom<br />
function, $f$ an additional pseudorandom function keyed<br />
independently of $F$, $G$ a pseudorandom generator, and $H$ a<br />
secure hash function. We write $f_\tau(x)$ for the result of<br />
applying $f$ to input $x$ with secret key $\tau$.<br />
The database owner, the server and each user $u_i$ hold the key<br />
pairs $(\varepsilon_o,\ y_o = g^{\varepsilon_o} \bmod p)$, $(\varepsilon_s,\ y_s = g^{\varepsilon_s} \bmod p)$ and<br />
$(\varepsilon_i,\ y_i = g^{\varepsilon_i} \bmod p)$, respectively. In the setup<br />
phase, the encrypted key is established by the database<br />
owner and is sent to each group user. Without loss of<br />
generality, assume that a group contains a set of privileged<br />
users $U = (u_1, u_2, \ldots, u_n)$. The setup consists of the following<br />
steps.<br />
(1) The database owner chooses at random a polynomial<br />
$$f(x) = a_0 + a_1 x + \cdots + a_{m-1} x^{m-1} \bmod q, \tag{1}$$<br />
where $k = f(0) = a_0$, the $a_i$ are the coefficients of $f(x)$, and $m$ $(m > n)$<br />
is a large positive integer.<br />
(2) The database owner chooses a random integer $\alpha$ and<br />
a set of random integers $D_j = (d_1, d_2, \ldots, d_m)$, and computes<br />
$k_i$ for each group user $u_i$ as follows:<br />
$$k_i = f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l} \bmod q, \quad v = g^{\alpha} \bmod p, \tag{2}$$<br />
$$\delta_i = y_i^{\varepsilon_o} \bmod p, \quad V_i = E_{\delta_i}(k_i), \quad y_i = g^{k_i} \bmod p, \tag{3}$$<br />
$$z_i = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_i}{d_j - x_i}} \bmod p, \tag{4}$$<br />
for $1 \le i \le n$. The owner sends $V_i$ to each group user $u_i$ and then<br />
publishes $(v, z_i\ (i = 1, \ldots, n))$. Here $E_{\delta_i}\{\cdot\}$ denotes encryption<br />
with key $\delta_i$ under a symmetric encryption<br />
algorithm such as AES, and $x_i$ is the identifier of group user $u_i$.<br />
(3) On receiving $V_i$, the group user $u_i$ computes<br />
$$\delta_i = y_o^{\varepsilon_i} \bmod p \tag{5}$$<br />
and then decrypts $V_i$, thus obtaining $k_i$.<br />
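The setup steps can be checked numerically with a minimal sketch. The<br />
tiny group (P = 2039, Q = 1019), the fixed coefficients, identifiers and<br />
private keys, and the helper names are all assumptions chosen for<br />
readability; a real deployment would draw them at random from a much<br />
larger group. The symmetric encryption of $k_i$ is omitted, and only the<br />
agreement on the transport key $\delta_i$ is checked.<br />

```python
# Toy parameters (assumptions): P = 2Q + 1 with Q prime, far too small for real use.
P, Q = 2039, 1019
G = pow(2, (P - 1) // Q, P)  # element of order Q in Z_P^*


def poly_eval(coeffs, x):
    """Evaluate f(x) = a_0 + a_1 x + ... + a_{m-1} x^{m-1} mod Q (Horner's rule)."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % Q
    return acc


def inv(x):
    """Modular inverse mod Q (three-argument pow with -1 needs Python 3.8+)."""
    return pow(x % Q, -1, Q)


def user_share(coeffs, d, x_i):
    """k_i = f(x_i) * prod_l (-d_l)/(x_i - d_l) mod Q, as in equation (2)."""
    k_i = poly_eval(coeffs, x_i)
    for d_l in d:
        k_i = k_i * (-d_l) * inv(x_i - d_l) % Q
    return k_i


def public_exponent(coeffs, d, x_i):
    """Exponent of z_i in equation (4), without the factor alpha."""
    total = 0
    for j, d_j in enumerate(d):
        term = poly_eval(coeffs, d_j) * (-x_i) * inv(d_j - x_i) % Q
        for l, d_l in enumerate(d):
            if l != j:
                term = term * (-d_l) * inv(d_j - d_l) % Q
        total = (total + term) % Q
    return total


# Owner's secrets; fixed here for reproducibility, random in practice.
coeffs = [123, 45, 67, 89]   # a_0..a_{m-1} with m = 4, so k = f(0) = 123
d = [5, 7, 11]               # the m - 1 interpolation points d_l
alpha = 321
eps_o = 17                   # owner's private key
y_o = pow(G, eps_o, P)

users = {"u1": (1, 29), "u2": (2, 31), "u3": (3, 37)}  # name -> (x_i, eps_i)

v = pow(G, alpha, P)
sigma = pow(G, coeffs[0] * alpha % Q, P)   # sigma = g^{k*alpha} mod p

recovered = {}
for name, (x_i, eps_i) in users.items():
    k_i = user_share(coeffs, d, x_i)
    z_i = pow(G, alpha * public_exponent(coeffs, d, x_i) % Q, P)
    # Owner and user derive the same transport key delta_i (equations (3) and (5)).
    y_i = pow(G, eps_i, P)
    assert pow(y_i, eps_o, P) == pow(y_o, eps_i, P)
    # The user combines the public values (v, z_i) with his private share k_i.
    recovered[name] = z_i * pow(v, k_i, P) % P
```

Every user ends up with the same value $z_i v^{k_i} \bmod p$, which is the<br />
key $\sigma$ used in the encryption step.<br />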
C. Encryption<br />
When the database owner encrypts data $M$ that contains<br />
the sequence of words $w_1, w_2, \ldots, w_l$, he performs the<br />
following steps:<br />
(1) The database owner computes<br />
$$\sigma = g^{k\alpha} \bmod p, \quad X_i = H(w_i, \sigma), \quad c_i = f_\tau(X_i) \tag{6}$$<br />
for $1 \le i \le l$, where $X_i$ is $n$ bits long, then generates a<br />
sequence of pseudorandom values $e_i$ using the<br />
pseudorandom generator $G$, where each $e_i$ is $n-m$ bits long.<br />
Finally the database owner computes $F_{c_i}(e_i)$, appends<br />
$F_{c_i}(e_i)$ to $e_i$, and obtains the $n$-bit value<br />
$B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle$.<br />
(2) The database owner computes<br />
$$CT_i = X_i \oplus B_i, \quad C = M \oplus H(\sigma), \tag{7}$$<br />
and sends $\{C, CT_i\}$ to the server.<br />
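A minimal sketch of the encryption step follows. The instantiations are<br />
assumptions made for illustration: SHA-256 stands in for $H$, truncated<br />
HMAC-SHA256 stands in for both $F$ and $f$, an HMAC in counter mode stands<br />
in for the generator $G$, and the sizes are n = 256 and m = 128 bits. The<br />
byte encoding of $\sigma$ and the single-block message are simplifications.<br />

```python
import hashlib
import hmac

N_BYTES, M_BYTES = 32, 16   # n = 256 bits, m = 128 bits (assumed sizes)


def H(*parts):
    """Hash H, instantiated here with SHA-256 over length-prefixed parts."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()


def prf(key, data, out_len):
    """PRF stand-in: HMAC-SHA256 truncated to out_len bytes."""
    return hmac.new(key, data, hashlib.sha256).digest()[:out_len]


def xor(a, b):
    """Bytewise XOR, truncated to the shorter input."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(message, words, sigma, tau, seed):
    """Return (C, [CT_1..CT_l]) following equations (6) and (7)."""
    assert len(message) <= N_BYTES, "sketch handles a single hash-sized block"
    cts = []
    for i, w in enumerate(words):
        x_i = H(w, sigma)                         # X_i = H(w_i, sigma)
        c_i = prf(tau, x_i, N_BYTES)              # c_i = f_tau(X_i)
        e_i = prf(seed, i.to_bytes(4, "big"), N_BYTES - M_BYTES)  # output of G
        b_i = e_i + prf(c_i, e_i, M_BYTES)        # B_i = <e_i || F_{c_i}(e_i)>
        cts.append(xor(x_i, b_i))                 # CT_i = X_i xor B_i
    c = xor(message, H(sigma))                    # C = M xor H(sigma)
    return c, cts


sigma = (1234).to_bytes(2, "big")                 # encoding of sigma is assumed
C, CT = encrypt(b"salary=4200", [b"alice", b"payroll"], sigma, b"tau-key", b"seed")
```

Each $CT_i$ hides the keyword hash $X_i$ under the mask $B_i$, whose second<br />
half can be recomputed only by someone who knows $c_i$.<br />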
D. Trapdoor<br />
When a group user $u_i$ needs to query the outsourced<br />
database for data containing the words $w_1, w_2, \ldots, w_l$, he<br />
generates a trapdoor as follows. He computes<br />
$$\sigma' = z_i v^{k_i} \bmod p, \quad X_i' = H(w_i, \sigma'), \quad c_i' = f_\tau(X_i'), \tag{8}$$<br />
forms the trapdoor $T_i = \langle c_i', X_i' \rangle$ for $i = 1, 2, \ldots, l$, and sends $T_i$ to<br />
the server.<br />
E. Test<br />
On receiving $T_i$, the server computes $B_i' = CT_i \oplus X_i'$ and<br />
splits $B_i'$ into two parts, $B_i' = \langle e_i' \,\|\, r \rangle$, where $e_i'$ denotes<br />
the first $n-m$ bits of $B_i'$ and $r$ denotes the last $m$ bits of $B_i'$.<br />
Then the server computes $F_{c_i'}(e_i')$ and tests whether<br />
$F_{c_i'}(e_i') = r$. If the equality holds, then $T_i$ is correct, and the<br />
server sends $\{C, CT_i\}$ to group user $u_i$.<br />
The group user $u_i$ computes<br />
$$M = C \oplus H(\sigma') \tag{9}$$<br />
and thus obtains the data $M$.<br />
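The trapdoor generation and the server-side test can be sketched<br />
together. As an illustration only, SHA-256 stands in for $H$ and truncated<br />
HMAC-SHA256 for $F$ and $f$, with n = 256 and m = 128 bits; the sketch also<br />
assumes that the querying user holds the PRF key $\tau$. All helper names<br />
are hypothetical.<br />

```python
import hashlib
import hmac

N_BYTES, M_BYTES = 32, 16   # n = 256 bits, m = 128 bits (assumed sizes)


def H(*parts):
    """Hash H, instantiated here with SHA-256 over length-prefixed parts."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()


def prf(key, data, out_len):
    """PRF stand-in for F and f: HMAC-SHA256 truncated to out_len bytes."""
    return hmac.new(key, data, hashlib.sha256).digest()[:out_len]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def word_ciphertext(word, sigma, tau, e_i):
    """Owner side: CT_i = X_i xor <e_i || F_{c_i}(e_i)>."""
    x_i = H(word, sigma)
    c_i = prf(tau, x_i, N_BYTES)
    return xor(x_i, e_i + prf(c_i, e_i, M_BYTES))


def trapdoor(word, sigma_prime, tau):
    """User side: T = <c', X'> from equation (8)."""
    x_prime = H(word, sigma_prime)
    return prf(tau, x_prime, N_BYTES), x_prime


def server_test(ct_i, trap):
    """Server side: unmask B', split it as <e' || r>, check F_{c'}(e') == r."""
    c_prime, x_prime = trap
    b_prime = xor(ct_i, x_prime)
    e_prime, r = b_prime[:N_BYTES - M_BYTES], b_prime[N_BYTES - M_BYTES:]
    return hmac.compare_digest(prf(c_prime, e_prime, M_BYTES), r)


sigma, tau = b"sigma-bytes", b"tau-key"            # encodings are assumed
e = prf(b"seed", b"0", N_BYTES - M_BYTES)          # stand-in for a value from G
ct = word_ciphertext(b"payroll", sigma, tau, e)

hit = server_test(ct, trapdoor(b"payroll", sigma, tau))
miss = server_test(ct, trapdoor(b"invoice", sigma, tau))
wrong_key = server_test(ct, trapdoor(b"payroll", b"other-sigma", tau))
```

The server never sees the keyword itself; it only learns whether the<br />
check value $r$ is consistent with the trapdoor. A mismatching keyword or a<br />
wrong $\sigma'$ passes the test only with probability about $2^{-m}$.<br />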
F. Adding group users<br />
Adding a new group user $u_{new}$ to a group does not<br />
require updating the decryption key. To add $u_{new}$ to a<br />
group, the database owner first picks an unused identifier<br />
$x_{new}$ and computes<br />
$$k_{new} = f(x_{new}) \prod_{l=1}^{m-1} \frac{-d_l}{x_{new} - d_l} \bmod q, \tag{10}$$<br />
$$\delta_{new} = y_{new}^{\varepsilon_o} \bmod p, \quad V_{new} = E_{\delta_{new}}(k_{new}), \tag{11}$$<br />
$$z_{new} = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_{new}}{d_j - x_{new}}} \bmod p, \tag{12}$$<br />
then sends $V_{new}$ to the new group user $u_{new}$ and publishes $z_{new}$.<br />
Upon receiving $V_{new}$, $u_{new}$ computes $\delta_{new} = y_o^{\varepsilon_{new}} \bmod p$,<br />
decrypts $V_{new}$ with $\delta_{new}$ and obtains the secret key $k_{new}$.<br />
The group member $u_{new}$ can then access the server, query the<br />
outsourced database and decrypt the ciphertext received<br />
from the server by using his secret key $k_{new}$.<br />
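The following sketch illustrates why enrolling a user leaves existing<br />
members untouched. The toy group, the fixed secrets and the helper name<br />
`enroll` are assumptions for illustration; the encrypted delivery of<br />
$k_{new}$ is omitted.<br />

```python
# Toy group: P = 2Q + 1, both prime; sizes are illustrative assumptions only.
P, Q = 2039, 1019
G = pow(2, (P - 1) // Q, P)      # element of order Q


def f(coeffs, x):
    """Owner's secret polynomial f(x) mod Q, coefficients a_0..a_{m-1}."""
    return sum(a * pow(x, e, Q) for e, a in enumerate(coeffs)) % Q


def inv(x):
    return pow(x % Q, -1, Q)     # modular inverse, Python 3.8+


def enroll(coeffs, d, alpha, x_new):
    """Return (k_new, z_new) for identifier x_new, per equations (10) and (12)."""
    k_new = f(coeffs, x_new)
    for d_l in d:
        k_new = k_new * (-d_l) * inv(x_new - d_l) % Q
    exponent = 0
    for j, d_j in enumerate(d):
        term = f(coeffs, d_j) * (-x_new) * inv(d_j - x_new) % Q
        for l, d_l in enumerate(d):
            if l != j:
                term = term * (-d_l) * inv(d_j - d_l) % Q
        exponent = (exponent + term) % Q
    return k_new, pow(G, alpha * exponent % Q, P)


# Owner's secrets, fixed for the demo (random in practice).
coeffs, d, alpha = [123, 45, 67, 89], [5, 7, 11], 321
v = pow(G, alpha, P)
sigma = pow(G, coeffs[0] * alpha % Q, P)

k_1, z_1 = enroll(coeffs, d, alpha, x_new=1)      # existing member
k_new, z_new = enroll(coeffs, d, alpha, x_new=4)  # newcomer with an unused identifier

sigma_existing = z_1 * pow(v, k_1, P) % P
sigma_newcomer = z_new * pow(v, k_new, P) % P
```

The newcomer recovers the same $\sigma$ from his own pair, while $v$, $\sigma$ and<br />
the values held by existing members stay as they were.<br />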
G. Removing group users<br />
When a group user $u_B$ is removed from a group, the<br />
encryption key $\sigma$ has to be changed in order to prevent<br />
the removed group user from querying and reading the<br />
restricted data. In the following, a bar marks the updated<br />
value of a quantity. The database owner and the server go<br />
through the following steps:<br />
(1) The database owner chooses a random integer $\rho$<br />
and computes, for $t = 1, \ldots, n$, $t \ne B$ and $i = 1, 2, \ldots, l$,<br />
$$\bar{z}_t = g^{\rho \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_t}{d_j - x_t}} \bmod p, \tag{13}$$<br />
$$v^{new} = g^{\rho} \bmod p, \quad \bar{\sigma} = g^{k\rho} \bmod p, \tag{14}$$<br />
$$\bar{X}_i = H(w_i, \bar{\sigma}), \quad Y_i = H(w_i, \bar{\sigma}) \oplus H(w_i, \sigma), \tag{15}$$<br />
$$s = H(\bar{\sigma}) \oplus H(\sigma), \quad \delta_s = y_s^{\varepsilon_o} \bmod p, \tag{16}$$<br />
$$V_s = E_{\delta_s}(Y_1 \,\|\, Y_2 \,\|\, \cdots \,\|\, Y_l \,\|\, s), \tag{17}$$<br />
then publishes $(\bar{z}_t\ (t = 1, 2, \ldots, n,\ t \ne B),\ v^{new})$ and sends $V_s$<br />
to the server.<br />
(2) The server first computes $\delta_s = y_o^{\varepsilon_s} \bmod p$ and decrypts<br />
$V_s$ to get $(Y_i, s)$, where $i = 1, 2, \ldots, l$, then computes<br />
$$\bar{C} = C \oplus s = M \oplus H(\bar{\sigma}), \tag{18}$$<br />
$$\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \bar{\sigma}). \tag{19}$$<br />
$\{\bar{C}, \overline{CT}_i\}$ is stored at the outsourced server.<br />
A non-removed user $u_t$ can compute $\bar{\sigma} = \bar{z}_t (v^{new})^{k_t} \bmod p$,<br />
and can then generate a valid query trapdoor $T_i$ and recover<br />
the decryption key $\bar{\sigma}$, since $\bar{\sigma} = \bar{z}_t (v^{new})^{k_t} = g^{\rho k} \bmod p$.<br />
However, the removed user $u_B$ cannot get $\bar{\sigma}$, since $u_B$<br />
cannot obtain $\bar{z}_B$ from the public information $\bar{z}_t$ $(t = 1, 2, \ldots, n,$<br />
$t \ne B)$. If $u_B$ computes $\sigma'$ using his $k_B$ together with the old $z_B$ or<br />
another user's $\bar{z}_t$, then $\sigma' \ne \bar{\sigma}$, since<br />
$$\bar{z}_t (v^{new})^{k_B} \ne g^{k\rho} \quad \text{and} \quad z_B (v^{new})^{k_B} \ne g^{k\rho} = \bar{\sigma} \bmod p. \tag{20}$$<br />
Therefore, the removed user $u_B$ cannot recover the<br />
decryption key $\bar{\sigma}$, and he is prevented from generating a<br />
query trapdoor and obtaining data; thus he is removed<br />
from the group.<br />
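The key update on removal can be illustrated numerically. The sketch<br />
covers only the recovery of the new key: a remaining user obtains it from<br />
the republished values, while the removed user's two obvious attempts<br />
fail. The toy group, the fixed secrets (m = 3, two users) and the helper<br />
names are assumptions for illustration.<br />

```python
# Toy group: P = 2Q + 1, both prime; all values are illustrative assumptions.
P, Q = 2039, 1019
G = pow(2, (P - 1) // Q, P)      # element of order Q


def f(coeffs, x):
    """Owner's secret polynomial f(x) mod Q."""
    return sum(a * pow(x, e, Q) for e, a in enumerate(coeffs)) % Q


def inv(x):
    return pow(x % Q, -1, Q)     # modular inverse, Python 3.8+


def share(coeffs, d, x):
    """Private share k_x = f(x) * prod_l (-d_l)/(x - d_l) mod Q."""
    k = f(coeffs, x)
    for d_l in d:
        k = k * (-d_l) * inv(x - d_l) % Q
    return k


def pub_exp(coeffs, d, x):
    """Public exponent for identifier x (the sum inside equations (4) and (13))."""
    total = 0
    for j, d_j in enumerate(d):
        term = f(coeffs, d_j) * (-x) * inv(d_j - x) % Q
        for l, d_l in enumerate(d):
            if l != j:
                term = term * (-d_l) * inv(d_j - d_l) % Q
        total = (total + term) % Q
    return total


coeffs, d = [123, 45, 67], [5, 7]     # m = 3, k = f(0) = 123
alpha, rho = 321, 654                 # old and new random exponents
x_t, x_B = 1, 2                       # remaining user u_t, removed user u_B

k_t, k_B = share(coeffs, d, x_t), share(coeffs, d, x_B)
z_B_old = pow(G, alpha * pub_exp(coeffs, d, x_B) % Q, P)

# Owner re-keys: new v, new sigma, and new z only for the remaining user.
v_new = pow(G, rho, P)
sigma_bar = pow(G, coeffs[0] * rho % Q, P)
z_bar_t = pow(G, rho * pub_exp(coeffs, d, x_t) % Q, P)

sigma_kept = z_bar_t * pow(v_new, k_t, P) % P        # remaining user succeeds
attempt_old_z = z_B_old * pow(v_new, k_B, P) % P     # removed user, old z_B
attempt_other_z = z_bar_t * pow(v_new, k_B, P) % P   # removed user, another user's z
```

With these particular values both attempts by the removed user miss the<br />
new key, while the remaining user recovers it without receiving any new<br />
secret.<br />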
IV. ANALYSIS<br />
A. Correctness<br />
Lemma 1. For a given ciphertext $\{C, CT_i\}$, if the<br />
database owner follows the correct encryption procedure,<br />
then any privileged group user can correctly generate a<br />
query trapdoor and decrypt the ciphertext to obtain data<br />
$M$.<br />
Proof: Because<br />
$$\sigma' = z_i v^{k_i} = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_i}{d_j - x_i} + \alpha f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l}} = g^{\alpha k} = \sigma \bmod p,$$<br />
$$X_i' = H(w_i, \sigma') = H(w_i, \sigma) = X_i, \quad c_i' = f_\tau(X_i') = c_i,$$<br />
$$B_i' = \langle e_i' \,\|\, r \rangle = CT_i \oplus X_i' = X_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle,$$<br />
we have $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct.<br />
Because $\sigma' = \sigma$, we obtain<br />
$$C \oplus H(\sigma') = M \oplus H(\sigma) \oplus H(\sigma') = M. \qquad \square$$<br />
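The step from the exponent to $g^{\alpha k}$ rests on Lagrange interpolation,<br />
which the proof leaves implicit. A short derivation, assuming the $m$<br />
points $x_i, d_1, \ldots, d_{m-1}$ are distinct and nonzero modulo $q$, and<br />
writing $S_i$ (a label introduced here) for the public part of the exponent:<br />

```latex
% Lagrange interpolation of f (degree at most m-1) through the m points
% x_i, d_1, ..., d_{m-1}, evaluated at x = 0.
\begin{align*}
k = f(0) &= \underbrace{f(x_i)\prod_{l=1}^{m-1}\frac{-d_l}{x_i-d_l}}_{k_i}
  \;+\; \underbrace{\sum_{j=1}^{m-1} f(d_j)\,\frac{-x_i}{d_j-x_i}
        \prod_{l=1,\,l\neq j}^{m-1}\frac{-d_l}{d_j-d_l}}_{S_i} \pmod q,\\[4pt]
z_i\,v^{k_i} &= g^{\alpha S_i}\,g^{\alpha k_i}
  = g^{\alpha\,(S_i+k_i)} = g^{\alpha k} = \sigma \pmod p .
\end{align*}
```

Reducing the exponent modulo $q$ is valid because $g$ has order $q$.<br />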
Lemma 2. For a given ciphertext $\{C, CT_i\}$, if the<br />
database owner and the server follow the correct procedure<br />
for removing a group user, then equations (18) and (19) hold.<br />
Proof: Equation (18) holds since<br />
$$s = H(\bar{\sigma}) \oplus H(\sigma),$$<br />
$$\bar{C} = C \oplus s = M \oplus H(\sigma) \oplus H(\bar{\sigma}) \oplus H(\sigma) = M \oplus H(\bar{\sigma}).$$<br />
Equation (19) holds since<br />
$$Y_i = H(w_i, \bar{\sigma}) \oplus H(w_i, \sigma),$$<br />
$$\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \sigma) \oplus H(w_i, \bar{\sigma}) \oplus H(w_i, \sigma) = B_i \oplus H(w_i, \bar{\sigma}).$$<br />
□<br />
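The two XOR identities of this lemma can be checked directly. The<br />
sketch below assumes SHA-256 as the hash $H$, byte-string encodings of<br />
the old and new keys, and an arbitrary 32-byte mask in place of $B_i$; all<br />
names are hypothetical.<br />

```python
import hashlib


def H(*parts):
    """Hash H, instantiated here with SHA-256 over length-prefixed parts."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()


def xor(a, b):
    """Bytewise XOR, truncated to the shorter input."""
    return bytes(x ^ y for x, y in zip(a, b))


old_sigma, new_sigma = b"sigma-old", b"sigma-new"   # assumed byte encodings
M, word = b"confidential row", b"payroll"
B = H(b"arbitrary mask")                            # stands in for B_i

# Ciphertext as stored before the removal.
C = xor(M, H(old_sigma))
CT = xor(H(word, old_sigma), B)

# Owner computes the update values; only these are sent to the server.
s = xor(H(new_sigma), H(old_sigma))
Y = xor(H(word, new_sigma), H(word, old_sigma))

# Server applies them without learning either key or the plaintext.
C_bar = xor(C, s)
CT_bar = xor(CT, Y)
```

The server re-keys the stored ciphertext in place: the old masks cancel<br />
and the new ones take their position.<br />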
Lemma 3. If $u_i$ is a non-removed user, then he can<br />
generate a valid query trapdoor and decrypt the ciphertext<br />
to obtain data $M$.<br />
Proof: Because $u_i$ has<br />
$$k_i = f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l} \bmod q,$$<br />
$u_i$ can compute the correct $\bar{\sigma}$:<br />
$$\sigma' = \bar{z}_i (v^{new})^{k_i} = g^{\rho \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_i}{d_j - x_i} + \rho f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l}} = g^{\rho k} = \bar{\sigma} \bmod p.$$<br />
He therefore computes<br />
$$X_i' = H(w_i, \sigma') = H(w_i, \bar{\sigma}), \quad c_i' = f_\tau(X_i')$$<br />
and generates the trapdoor $T_i = \langle c_i', X_i' \rangle$.<br />
By equations (15) and (19), $\overline{CT}_i = B_i \oplus \bar{X}_i$, so<br />
$$X_i' = H(w_i, \bar{\sigma}) = \bar{X}_i,$$<br />
$$B_i' = \langle e_i' \,\|\, r \rangle = \overline{CT}_i \oplus X_i' = \bar{X}_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle.$$<br />
Then $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct,<br />
and the data is recovered from $M = \bar{C} \oplus H(\bar{\sigma})$ since<br />
$$\bar{C} = C \oplus s = M \oplus H(\bar{\sigma}). \qquad \square$$<br />
Lemma 4. If $u_B$ is a removed user, then he cannot<br />
generate a valid query trapdoor or recover the<br />
decryption key $\bar{\sigma}$.<br />
Proof: Because<br />
$$\sigma' = z_B (v^{new})^{k_B} = g^{\alpha \sum_{j=1}^{m-1} f(d_j) \prod_{l=1,\, l \ne j}^{m-1} \frac{-d_l}{d_j - d_l} \cdot \frac{-x_B}{d_j - x_B} + \rho f(x_B) \prod_{l=1}^{m-1} \frac{-d_l}{x_B - d_l}} \ne \bar{\sigma} \bmod p,$$<br />
the removed user $u_B$ cannot compute the<br />
decryption key $\bar{\sigma}$, and he is prevented from generating a<br />
query trapdoor and obtaining data. □<br />
B. Security Pro<strong>of</strong><br />
The security of the proposed scheme is based on the<br />
security of the pseudorandom functions and on the computational<br />
Diffie-Hellman problem (CDHP). To show that the proposed<br />
scheme is secure, we first state a useful lemma. Due to<br />
space considerations, we omit the proof of the lemma and<br />
refer the reader to the full version of [1].<br />
Lemma 1 [1]: If $F$ is a $(t, l, e_F)$-secure pseudorandom<br />
function, $f$ is a $(t, l, e_f)$-secure pseudorandom function, $G$ is<br />
a $(t, e_G)$-secure pseudorandom generator, and the key<br />
material is chosen as described above, then the algorithm<br />
described above for generating the sequence is a<br />
$(t - \psi, e_H)$-secure pseudorandom generator,<br />
where $e_H = l \cdot e_F + e_f + e_G + l(l-1)/(2|X|)$ and $X = \{0,1\}^{n-m}$.<br />
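To get a feeling for the size of this bound, the expression for $e_H$ can be<br />
evaluated with exact rational arithmetic. The sketch reads the last term<br />
as the birthday-style quantity $l(l-1)/(2|X|)$, which is an interpretation<br />
of the formula above; the function name and the sample advantages are<br />
hypothetical.<br />

```python
from fractions import Fraction


def e_h(l, e_F, e_f, e_G, n_minus_m_bits):
    """e_H = l*e_F + e_f + e_G + l(l-1)/(2|X|), with |X| = 2^(n-m)."""
    size_x = 2 ** n_minus_m_bits
    return l * e_F + e_f + e_G + Fraction(l * (l - 1), 2 * size_x)


# Hypothetical inputs: 1000 keywords, each primitive 2^-80 secure, n - m = 128 bits.
eps = Fraction(1, 2 ** 80)
bound = e_h(1000, eps, eps, eps, 128)
```

With these inputs the collision term is negligible next to the $l \cdot e_F$<br />
term, so the bound is dominated by the security of $F$.<br />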
Definition 1: An encryption scheme with hidden<br />
keyword search for outsourced databases is semantically<br />
secure against chosen keyword attack if $F$, $f$ are secure<br />
pseudorandom functions, $G$ is a secure pseudorandom<br />
generator, $H$ is a secure one-way hash function, and there<br />
exists no polynomial-time adversary with a non-negligible<br />
advantage in the following game:<br />
(1) Setup: A challenger C first generates the system<br />
parameters, the data owner’s key pair and $n$ group users’ key<br />
pairs as in Section III-B. The challenger C gives the<br />
system parameters and public keys to an adversary A.<br />
Phase 1: The adversary A issues the following kinds of<br />
queries adaptively:<br />
(2) Encryption queries: A produces a message $M$ and<br />
sends an encryption query for $M$ to C. A is given the<br />
result $\{CT_i, C\}$ of encryption with input $(M, k, w_i)$ by C.<br />
(3) Trapdoor queries: The adversary A makes trapdoor<br />
queries for any keyword of his choice to the challenger C.<br />
If the trapdoor is valid, C returns the result; otherwise,<br />
C returns the symbol $\perp$.<br />
(4) Challenge: The adversary A produces two keywords<br />
$w_0$ and $w_1$. The challenger C chooses a random bit<br />
$b \in \{0, 1\}$, computes a trapdoor $T_i$ for $w_b$ with input $(z_i, k_i, v)$<br />
and gives it to the adversary A as a challenge.<br />
Phase 2: The adversary A issues new queries as in<br />
Phase 1. It is not allowed to make a trapdoor query for the<br />
target challenge $T_i$.<br />
Guess: At the end of the game, A outputs a bit $b'$. The<br />
adversary A wins this game if $b' = b$. The advantage of A<br />
is defined as $Adv(A) = \Pr[b' = b] - 1/2$.<br />
Theorem 1: The proposed encryption scheme with<br />
hidden keyword search is $(t, \varepsilon)$-secure against chosen<br />
keyword attacks if $F$, $f$ are secure pseudorandom<br />
functions, $G$ is a secure pseudorandom generator, $H$ is a<br />
secure one-way hash function, and there exists no<br />
polynomial-time algorithm that solves CDHP with<br />
$(t_1, \varepsilon_1)$. Here $t$ denotes the running time and $\varepsilon$ the<br />
advantage with which the adversary A succeeds.<br />
Proof: Assume that there exists a $(t, \varepsilon)$-adversary A<br />
that can break the encryption scheme with hidden<br />
keyword search in the game of Definition 1. In the<br />
following, we demonstrate how to use A to construct<br />
a $(t_1, \varepsilon_1)$-algorithm $\eta_1$ that breaks the one-way hash function<br />
with advantage $\varepsilon_1$. $\eta_1$ simulates the challenger C and<br />
interacts with A as follows:<br />
Phase 1: The adversary A issues the following kinds of<br />
queries adaptively:<br />
(1) Setup: $\eta_1$ outputs the system parameters and the data<br />
owner’s key pair, which are the same as those in Definition 1, and<br />
the group users’ public keys ($y_i = g^{\vartheta_i} \bmod p$, $i = 1, 2, \ldots, n$,<br />
$v = g^{\zeta} \bmod p$), where $\zeta$, $\vartheta_i$ are random integers. After<br />
receiving the system parameters, the data owner’s public key<br />
and $z_i$, A outputs the target $u_i \in U$ with public key $(v, z_i)$.<br />
(2) Encryption queries: For an encryption query on a<br />
message $M$ chosen by A, $\eta_1$ first computes<br />
$$\sigma = g^{k\alpha} \bmod p, \quad X_i = H(w_i, \sigma), \quad c_i = f_\tau(X_i)$$<br />
for $1 \le i \le l$, where $X_i$ is $n$ bits long, then generates a<br />
sequence of pseudorandom values $e_i$ using the<br />
pseudorandom generator $G$, where each $e_i$ is $n-m$ bits long.<br />
Finally $\eta_1$ computes $F_{c_i}(e_i)$, appends<br />
$F_{c_i}(e_i)$ to $e_i$, and obtains the $n$-bit value<br />
$B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle$.<br />
It then computes<br />
$$CT_i = X_i \oplus B_i, \quad C = M \oplus H(\sigma).$$<br />
Finally $\{C, CT_i\}$ is returned as the encryption<br />
result of this query.<br />
(3) Trapdoor queries: For a trapdoor query for $w_i$,<br />
the algorithm $\eta_1$ first computes<br />
$$\sigma' = z_i v^{k_i} \bmod p, \quad X_i' = H(w_i, \sigma'), \quad c_i' = f_\tau(X_i'),$$<br />
and generates the trapdoor $T_i = \langle c_i', X_i' \rangle$.<br />
By the setting of $CT_i$ above, we have<br />
$$\sigma' = z_i v^{k_i} = g^{\alpha k} \bmod p = \sigma,$$<br />
$$X_i' = H(w_i, \sigma') = H(w_i, \sigma) = X_i, \quad c_i' = f_\tau(X_i') = c_i,$$<br />
$$B_i' = \langle e_i' \,\|\, r \rangle = CT_i \oplus X_i' = X_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle,$$<br />
so $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct.<br />
Hence, $T_i$ is a valid trapdoor for $w_i$, and $\eta_1$ outputs $T_i$.<br />
Otherwise, it returns the symbol $\perp$.<br />
(4) Challenge: The adversary A produces two<br />
keywords $w_0$ and $w_1$. The challenger C chooses a random<br />
bit $b \in \{0, 1\}$ and computes a trapdoor as follows:<br />
$$\sigma' = z_i v^{\vartheta_i} \bmod p, \quad X_i' = H(w_i, \sigma'), \quad c_i' = f_\tau(X_i').$$<br />
The trapdoor $T_i = \langle w_b, c_i', X_i' \rangle$ is returned as the<br />
result of this query.<br />
Recall that $v = g^{\zeta} \bmod p$ and $y_i = g^{\vartheta_i} \bmod p$. If<br />
$$\sigma' = z_i v^{\vartheta_i} = g^{\vartheta_i \zeta} \bmod p,$$<br />
then $\sigma'$ is indeed a random trapdoor of $w_b$. If $\sigma'$ is a<br />
random integer, then the last element of $T_i$ is a random<br />
element and therefore $\sigma'$ is independent of $b$.<br />
Phase 2: The adversary A issues new queries as in<br />
Phase 1. It is not allowed to make a trapdoor query for the<br />
target challenge $T_i$.<br />
Analysis: If $\sigma' = g^{\vartheta_i \zeta} \bmod p$, the adversary A’s view<br />
in the simulated experiment is distributed identically to<br />
A’s view in the real experiment. Hence,<br />
$$\Pr[\eta_1 = 1] = \Pr[b = b'].$$<br />
On the other hand, when $\sigma'$ is uniformly distributed<br />
in $Z_p$, the adversary A has no information about the value<br />
of $b$, and hence the probability that it outputs $b' = b$ is at<br />
most 1/2. Therefore, the advantage of $\eta_1$ is<br />
$$Adv(\eta_1) = \varepsilon_1 \ge \Pr[b = b'] - 1/2 \ge \varepsilon. \qquad \square$$<br />
C. Security Analysis<br />
(1) The proposed scheme provides data confidentiality,<br />
in the sense that the untrusted server cannot learn<br />
anything about the contents of the owner’s outsourced data<br />
when given only the ciphertext, since server<br />
administrators do not know the encryption key $\sigma$. For the<br />
same reason, outsiders cannot read the owner’s<br />
outsourced data. Authorized users, however, can<br />
access the outsourced data since they obtain the<br />
decryption key $\sigma$. An authorized user can access only<br />
the part that the owner allows him to see and cannot<br />
access the whole database, because the outsourced database is<br />
divided into different groups $G_i$ and the data of<br />
different groups are encrypted with different keys. A user<br />
$u_i$ who is granted access to group $G_i$’s resources by the<br />
database owner can obtain only group $G_i$’s<br />
decryption key sent by the database owner and<br />
cannot obtain other groups’ decryption keys. Therefore,<br />
the proposed scheme assures that no one except the<br />
permitted users can search over the encrypted data and read<br />
the data, and thus it satisfies data<br />
confidentiality.<br />
(2) In the proposed scheme, a removed user can no longer<br />
search and access restricted data. When a user<br />
$u_B$ is removed from a group, the database owner<br />
replaces the encryption key $\sigma$ with a new key $\bar{\sigma}$, and the server updates<br />
the encrypted data with the values $(Y_i, s)$ sent by the data owner,<br />
that is, $\bar{C} = C \oplus s = M \oplus H(\bar{\sigma})$ (see equation (18)) and<br />
$\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \bar{\sigma})$ (see equation (19)). It<br />
is computationally infeasible for the revoked user $u_B$ to<br />
get any information about $\bar{\sigma}$. Therefore, the removed<br />
user $u_B$ cannot recover the decryption key $\bar{\sigma}$, and thus he is<br />
prevented from accessing the restricted data.<br />
(3) The proposed scheme can resist collusion attack. In<br />
the proposed scheme, even if all users collude and give<br />
their secret shares $k_i$ to each other, they cannot reconstruct<br />
the polynomial $f(x)$, since they obtain at most $n$ shares,<br />
which is fewer than the threshold $m$. Therefore,<br />
the proposed scheme can resist collusion attack.<br />
Moreover, a user cannot use the public information<br />
together with his key pair to derive the decryption key,<br />
since the public information is $z_i$ and not $f(d_j)$.<br />
(4) The proposed scheme achieves hidden<br />
searching. To search for keyword $w_i$, the user computes the<br />
trapdoor $T_i = \langle c_i, X_i \rangle$ and sends it to the server, where<br />
$$\sigma = z_i v^{k_i} \bmod p, \quad X_i = H(w_i, \sigma), \quad c_i = f_\tau(X_i).$$<br />
The server searches for $w_i$ in the ciphertext according to $T_i$,<br />
which does not reveal $w_i$ itself. Therefore, the<br />
proposed scheme allows a user to ask the server to search for<br />
keyword $w_i$ without revealing the keyword $w_i$ to the<br />
server.<br />
(5) The proposed scheme achieves controlled<br />
searching. In the proposed scheme, only privileged group<br />
users can ask the server to search for keyword $w_i$, since other<br />
users do not know $\sigma$ and cannot generate a valid trapdoor $T_i$.<br />
D. Performance Analysis<br />
(1) The proposed scheme supports dynamic changes<br />
to the group of permitted users, and these changes are<br />
transparent to the other users, who are not<br />
involved when users are added or removed. When a new user<br />
is granted access to a resource, that is, added to a group,<br />
there is no need to re-encrypt the resource or to update the<br />
decryption keys of the users in the group. When a new user<br />
is added to a group, the new user’s decryption key is encrypted<br />
and sent by the database owner to the new group user. Upon<br />
receiving the decryption key, the new group user can<br />
access the server, query the outsourced database and<br />
decrypt the result received from the server. Therefore, the<br />
proposed scheme can easily, efficiently and quickly grant<br />
new users access to a resource.<br />
In order to remove a user from the resource, the data<br />
owner only needs to replace the encryption key $\sigma$ with a new<br />
key $\bar{\sigma}$, and the resource is re-encrypted under $\bar{\sigma}$ by the<br />
server without $\bar{\sigma}$ being revealed to the server (see Section III).<br />
The secret key of each remaining user who<br />
can access the resource need not be updated, since these users can recover $\bar{\sigma}$<br />
from the public information. Therefore, the proposed<br />
scheme can remove a user from the group easily, efficiently<br />
and quickly.<br />
In many previous schemes, however, when a group<br />
user is removed from the group, the database owner has<br />
to re-encrypt the data, transmit the encrypted data to the server,<br />
and transmit a large number of new decryption keys to all the<br />
authorized users. If a large encrypted database is<br />
frequently transmitted to the server over a channel of finite<br />
capacity and a large number of new decryption keys are frequently<br />
transmitted to all the authorized users, the performance<br />
overhead is high and becomes<br />
practically prohibitive for large databases accessed by a<br />
dynamic group of users. With the proposed scheme, the data<br />
owner does not have to re-encrypt the data with a new key,<br />
transmit the re-encrypted data to the server, or transmit new<br />
decryption keys to all the authorized users. Therefore,<br />
the proposed scheme is efficient and practical for<br />
large databases accessed by a dynamic group of users.<br />
(2) In the proposed scheme, the major computation in the<br />
system is shifted from the user to the database owner and<br />
can be done in the initialization phase. In terms of<br />
efficiency, the computation cost for each user to recover the secret<br />
key $\sigma$ is only one multiplication and one modular<br />
exponentiation. Because a<br />
symmetric encryption algorithm is used, the computational cost<br />
of trapdoor generation, encryption and decryption is also low;<br />
therefore, the efficiency of the proposed scheme is high.<br />
The storage overhead consists only of a key of constant<br />
size for each user; therefore, the storage overhead of the<br />
scheme is very low. Moreover, the scheme does not<br />
require any interaction between database owner and<br />
server, server and user, or database owner and user<br />
when the decryption key is set up or updated.<br />
V. CONCLUSION<br />
In this paper, we have presented an efficient and<br />
secure encryption scheme with hidden keyword search<br />
for outsourced databases. We have also analyzed its security and<br />
performance and shown that the scheme is secure and<br />
practical for outsourced databases. Whenever the<br />
group of permitted users changes, the data owner does not<br />
need to re-encrypt the data, transmit the encrypted data to the<br />
server, or send a great number of new decryption keys to all the<br />
authorized users. Adding or removing a user is also simple,<br />
quick and efficient. The proposed scheme can ensure the<br />
privacy and confidentiality of sensitive data against both<br />
inside attackers and outside attackers.<br />
ACKNOWLEDGMENT<br />
This work was supported in part by grant 61070164<br />
from the National Natural Science Foundation of China;<br />
by grant 81510632010000022 from the Natural Science<br />
Foundation of Guangdong Province, China; and by grants<br />
2010B010600025 and 2010A032000002 from the Science<br />
and Technology Planning Project of Guangdong Province,<br />
China.<br />
REFERENCES<br />
[1] D. Song, D. Wagner, and A. Perrig. Practical techniques for<br />
searching on encrypted data. In: IEEE Symposium on<br />
Research in Security and Privacy 2000, pp. 44–55.<br />
[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order<br />
preserving encryption for numeric data. In Proc. of ACM<br />
SIGMOD 2004, Paris, France, June 2004.<br />
[3] E. Damiani, S. De Capitani di Vimercati, S. Foresti,<br />
S. Jajodia, S. Paraboschi, and P. Samarati. Metadata<br />
management in outsourced encrypted databases. In Proc. of<br />
the 2nd VLDB Workshop on Secure Data Management<br />
(SDM’05), Trondheim, Norway, September 2005.<br />
[4] R. Brinkman, J. Doumen, and W. Jonker. Using secret<br />
sharing for searching in encrypted data. In Proc. of the<br />
Secure Data Management Workshop, Toronto, Canada,<br />
August 2004.<br />
[5] S. Paraboschi and P. Samarati. Modeling and assessing<br />
inference exposure in encrypted databases. ACM<br />
Transactions on Information and System Security, 8(1),<br />
pp. 119–152, February 2005.<br />
[6] S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S.<br />
Paraboschi, and P. Samarati. Over-encryption:<br />
Management of access control evolution on outsourced<br />
data. In VLDB, 2007.<br />
[7] S. Liu, W. Li, and L. Y. Wang. Towards efficient over-encryption<br />
in outsourced databases using secret sharing. New<br />
Technologies, Mobility and Security, pp. 1–5, 2008.<br />
[8] P. Golle, J. Staddon, and B. Waters. Secure conjunctive search<br />
over encrypted data. In: ACNS 2004, Lecture Notes in<br />
Computer Science, vol. 3089. Springer; 2004. pp. 31–45.<br />
[9] P. Wang, H. Wang, and J. Pieprzyk. Keyword field-free<br />
conjunctive keyword searches on encrypted data and<br />
extension for dynamic groups. In: CANS 2008, Lecture<br />
Notes in Computer Science, vol. 5339. Springer; 2008. pp.<br />
178–95.<br />
[10] B. Zhang and F. Zhang. An efficient public key encryption with<br />
conjunctive-subset keywords search. Journal of Network<br />
and Computer Applications 34 (2011), pp. 262–267.<br />
[11] Y. H. Hwang and P. J. Lee. Public key encryption with<br />
conjunctive keyword search and its extension to a multiuser<br />
system. Lecture Notes in Computer Science, 2007,<br />
Volume 4575/2007, 2-22.<br />
[12] J. W. Byun, H. S. Rhee, H. A. Park, and D. H. Lee. Off-line keyword<br />
guessing attacks on recent keyword search schemes over<br />
encrypted data. In: Proceedings of SDM’06. LNCS, vol.<br />
4165, pp. 75–83.<br />
[13] H. S. Rhee, J. H. Park, W. Susilo, and D. H. Lee. Trapdoor<br />
security in a searchable public-key encryption scheme<br />
with a designated tester. Journal of Systems and Software<br />
2010, 83(5), pp. 763–71.<br />
[14] Q. Tang. Revisit the concept of PEKS: problems and a<br />
possible solution. Technical report TR-CTIT-08-54, Centre<br />
for Telematics and Information Technology, University of<br />
Twente, Enschede. ISSN 1381-3625, 2008.<br />
[15] J. Camenisch, M. Kohlweiss, A. Rial, and C. Sheedy. Blind and<br />
anonymous identity-based encryption and authorised<br />
private searches on public key encrypted data. In: PKC,<br />
Lecture Notes in Computer Science, vol. 433, 2009. pp.<br />
196–214.<br />
Xiaoming Wang received her Ph.D. degree from the Department of<br />
Mathematics of Nankai University in 2003. She is a professor<br />
in the Department of Computer Science, Jinan University. Her<br />
research areas include database security, cryptography, network<br />
security, etc.<br />
JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1705<br />
A Method of Object-based De-duplication<br />
Fang Yan<br />
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China<br />
School of Information, BeiJing WuZi University, BeiJing, China<br />
Email: yanfang.joy@gmail.com<br />
YuAn Tan<br />
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China<br />
Email: victortan@yeah.net<br />
Abstract—Today, the world is increasingly awash in<br />
unstructured data, not only because of the<br />
Internet, but also because data that used to be collected on<br />
paper or on media such as film, DVDs and compact discs has<br />
moved online [1]. Most of this data is unstructured and in<br />
diverse formats such as e-mail, documents, graphics,<br />
images, and videos. In managing the complexity and<br />
scalability of unstructured data, object storage has a clear<br />
advantage. Object-based data de-duplication is currently the<br />
most advanced method and is an effective solution for<br />
detecting duplicate data. It can detect common embedded<br />
data on the first backup across completely unrelated files,<br />
even when the physical block layout changes. However,<br />
almost all current research on data de-duplication<br />
does not consider the content of different file types and<br />
has no knowledge of the backup data format. It<br />
has been shown that such methods cannot achieve optimal<br />
performance for compound files.<br />
In our proposed system, we first extract objects from<br />
files; Object_IDs are then obtained by applying a hash<br />
function to the objects. The resulting Object_IDs are used<br />
as indexing keys in a B+ tree-like index structure. Thus,<br />
we avoid the need for a full object index, and the search time<br />
for duplicate objects is reduced to O(log n). We introduce the<br />
new concept of a duplicate object resolver. The object<br />
resolver mediates access to all the objects and is a central<br />
point for managing all the metadata and indexes for all the<br />
objects. All objects are addressable by their IDs, which are<br />
globally unique. The resolver stores metadata in<br />
triple format. This improved metadata management<br />
strategy allows us to set, add and resolve object properties<br />
with high flexibility, and allows the same<br />
metadata to be reused among duplicate objects.<br />
Index Terms—data de-duplication, object-based, backup,<br />
object index, metadata<br />
I. MOTIVATION<br />
Limited storage capacity is increasingly becoming the<br />
bottleneck of IT systems. There are two main reasons:<br />
first, the information revolution has led to far more data<br />
than in the past, and a flood of new data is produced all the<br />
time; second, as computing and storage capacity<br />
increase, people tend to save all data permanently, and<br />
physical capacity must be purchased for all allocated<br />
storage. Under this trend, more and more storage systems<br />
are under pressure to hold huge amounts of data,<br />
and the cost of storage has often reached a<br />
shocking degree. To address these problems, data de-duplication<br />
technology is used to effectively reduce the<br />
duplication of user data in daily backups, so the volume of backup<br />
data is greatly reduced [2, 3].<br />
© 2011 ACADEMY PUBLISHER<br />
doi:10.4304/jnw.6.12.1705-1712<br />
Broadly speaking, there are three approaches to<br />
data de-duplication: file-level data de-duplication,<br />
block-level data de-duplication and object-level<br />
data de-duplication.<br />
File-level de-duplication is the most basic form of de-duplication.<br />
It identifies identical files and stores<br />
them only once. Also known as Single Instance Storage,<br />
it is perhaps the easiest approach to<br />
implement. Its weak point is that if a file is changed<br />
by even a single byte, the entire file needs to be stored<br />
again [4]. If a file is changed and saved under a different<br />
name, the entire file will also be backed up again. This<br />
happens more often than one may think.<br />
Disk-based backup commonly uses block-level<br />
data de-duplication, in which identical blocks from<br />
different files are stored only once. Block-level de-duplication<br />
generally includes three steps: chunking,<br />
computing the hash, and finding and storing the unique chunk data.<br />
The backup file is partitioned into multiple data chunks, and<br />
duplicate chunks are identified by comparing their fingerprints, which<br />
are hash values computed by a hash function. If an identical<br />
data chunk is found, a pointer to the chunk already<br />
stored is inserted into the index node of the backup file;<br />
only non-repeated data chunks are stored. The<br />
biggest differences among current block<br />
de-duplication implementations are the use of fixed-size data<br />
chunks versus variable-sized data chunks, and the use of<br />
sliding windows versus fixed offsets to define the address of a<br />
chunk. Fixed-size chunking partitions files<br />
into fixed-size data chunks, and the chunk size is typically<br />
equal to the physical block size of the storage device, for<br />
example 8KB or 16KB. To tolerate shifted<br />
contents, variable-sized chunking breaks a<br />
file into a sequence of chunks so that chunk boundaries<br />
are determined by the local contents of the file, in<br />
contrast to using fixed-size chunks [5]. The Basic Sliding<br />
Window Algorithm [6] is the prototypical variable-sized<br />
chunking algorithm.<br />
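As an illustration of the fixed-size variant described above, the following<br />
minimal Python sketch chunks a byte string, fingerprints each chunk and stores<br />
each unique chunk only once. The function names, the in-memory store dictionary<br />
and the choice of SHA-1 as the fingerprint are assumptions made for this<br />
example; they are not taken from a particular backup product.<br />

```python
import hashlib


def dedup_fixed_chunks(data: bytes, chunk_size: int, store: dict) -> list:
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns the ordered list of fingerprints needed to rebuild the data.
    `store` maps fingerprint -> chunk bytes (a stand-in for real storage).
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint not in store:      # only non-repeated chunks are stored
            store[fingerprint] = chunk
        recipe.append(fingerprint)        # a duplicate becomes a pointer only
    return recipe


def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its fingerprint list."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

Because chunk boundaries sit at fixed offsets, inserting a single byte at the<br />
front of the data shifts every later chunk, which is the weakness that<br />
variable-sized chunking is meant to address.<br />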
File-level and block-level de-duplication are most<br />
useful in backup workflows<br />
where the exact same set of files is archived routinely<br />
and the change rate of the files is relatively low. In<br />
these workflows, the files are backed up regardless of<br />
whether they have changed or not, so it is highly likely<br />
that many blocks are common<br />
from one backup to another. In general,<br />
these techniques work well for text-based or simple<br />
content and do not work very well for compound file<br />
formats and workflows. Furthermore, in online<br />
versioning schemes such as snapshots, or in backup<br />
workflows where only the modified files are backed up,<br />
the likelihood of finding common blocks is very low.<br />
In such schemes, block de-duplication will not<br />
yield any benefit, and existing technologies for online<br />
archives (backups), snapshots and mirroring become<br />
expensive.<br />
This paper presents an object-based data de-duplication<br />
solution to these problems. In our<br />
proposed system, after file type detection, we first<br />
extract objects from files. Object_IDs are then obtained<br />
from the size and content of each object by<br />
applying a hash function. The object resolver is a central<br />
point for managing all the metadata and indexes for all<br />
the objects. The advantage of object-based data de-duplication<br />
is that even if the physical layout of a file<br />
changes, which can happen with a simple save operation,<br />
the logical objects can still be detected and stored only<br />
once. Unlike file-level and block-level technologies,<br />
object-based de-duplication chunks the file into well-known<br />
logical objects such as images, paragraphs,<br />
worksheets, slides, etc.<br />
II. SYSTEM ARCHITECTURE<br />
In many cases, because the same files or different<br />
versions of the same information are reused, objects in<br />
compound files have the same name and location.<br />
In other cases, how the relevant documents were created is<br />
unknown, so we first parse the file before extracting<br />
objects. Accordingly, the system architecture is<br />
shown in Figure 1.<br />
[Figure 1 components: input files, file parser, object extractor, duplicate object resolver, storage, file update log, metadata]<br />
Figure 1. Object-based data de-duplication system structure<br />
The system includes four major components: file parser,<br />
object extractor, duplicate object resolver and storage.<br />
Input file formats may include .pdf, .ppt, .doc, .jpg, etc.<br />
A. File Parser<br />
The system parses a file to determine whether it is<br />
compound or primitive and to determine the file type and<br />
attributes. It also determines the boundaries of the<br />
primitive objects within a compound file.<br />
We divide files into two categories: compound objects<br />
and atomic objects. A compound object, such as a ZIP file,<br />
a PPT file or a Word document, encapsulates a number of<br />
other objects. Compound objects are typically encoded<br />
representations of the union of their contained objects.<br />
There may be as many as 20 kinds of file extensions and<br />
more than 10 file encoding formats. Primitive (atomic)<br />
objects are the most basic representations of discrete data<br />
structures such as images, executable files, etc.<br />
B. Object Extractor<br />
• Step 1: extract objects<br />
Atomic objects, such as JPEG images, CAD<br />
drawings and AVI clips, go directly to Step 2.<br />
Compound files differ in the specific<br />
document headers that they use to identify the encoded<br />
sections and objects. The object extraction process is<br />
recursive: layer after layer<br />
is uncovered until the lowest-level atomic object is<br />
reached. Some compound files, such as PPT files, do not contain<br />
clearly delimited elements like HTML tags. So for<br />
different types of compound documents, objects should<br />
be extracted using different algorithms. Sometimes<br />
the file header can be analyzed<br />
to determine the potential<br />
combination of objects and the object encoding format. For<br />
example, TIFF images have specific header information<br />
that describes the representation of the image and the<br />
compression algorithm that may have been used.<br />
• Step 2: compute object fingerprints<br />
Using a collision-resistant hash function such as SHA-1,<br />
each atomic object is assigned a globally unique 160-bit<br />
identifier called an object ID (Object Identifier).<br />
The fingerprint starts with 32 bits that hold the size of the<br />
object. The size is cheap to obtain, and objects<br />
of different sizes are clearly not the same. The remaining<br />
128 bits are computed by running a hash function over the<br />
contents of the object. The object ID is used not only for<br />
verification, but also as a unique virtual address<br />
for requesting and locating objects: the<br />
underlying storage mechanism stores objects<br />
under their fingerprints and retrieves them by that name,<br />
so the index is independent of the actual storage layout.<br />
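The fingerprint layout described in Step 2 can be sketched in Python as<br />
follows. This is a minimal sketch under stated assumptions: the paper does not<br />
name the 128-bit hash function, so the example truncates SHA-256 to 16 bytes,<br />
and the big-endian byte order of the size prefix and the name object_id are<br />
also choices made only for illustration.<br />

```python
import hashlib
import struct


def object_id(content: bytes) -> bytes:
    """Build a 160-bit object ID: 32-bit size prefix + 128-bit content hash.

    Assumptions (not from the paper): big-endian size field, and SHA-256
    truncated to 128 bits as the content hash.
    """
    # struct.pack raises struct.error if the size does not fit in 32 bits.
    size_prefix = struct.pack(">I", len(content))
    content_hash = hashlib.sha256(content).digest()[:16]
    return size_prefix + content_hash
```

Placing the size first means that two objects of different sizes already<br />
differ in the first four bytes of their IDs, before any hash bits are compared.<br />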
C. Duplicate object resolver<br />
The duplicate object resolver mediates access to all the<br />
objects and is a central point for managing all the<br />
metadata and indexes for all the objects. The resolver<br />
knows the total set of objects. All objects are addressable<br />
by their IDs, which are globally unique. The resolver<br />
is a singleton object that may be instantiated in multiple<br />
threads or processes, all of which access the same underlying<br />
data storage and synchronization engine. The resolver<br />
provides the following services:<br />
1) Metadata Services<br />
Metadata is an abstract concept that can exist<br />
independently of the data itself. Many<br />
variations are possible for each object, and each<br />
object requires different parameters. Rather than having<br />
different constructors for each object type, the resolver<br />
maintains consistency and flexibility by following a very<br />
simple pattern:<br />
• We define metadata as a set of statements about<br />
objects, expressed in triple notation (Subject,<br />
Attribute, Value), where Subject is the object_ID<br />
the statement is made about. An Attribute can be<br />
any kind of property or relationship, such as the size<br />
of an object, the number of the file from which the object<br />
was extracted, or a timestamp. A Value is<br />
the value of the attribute, which is either some<br />
textual value or another object_ID. All metadata<br />
reduce to the triple representation.<br />
• Using this system, we are able to store arbitrary<br />
attributes about any object. The triples show that<br />
the object has all these attributes and their values.<br />
We call these ''relations'' or ''facts''. In the following,<br />
Obj represents the global object domain.<br />
• This flexibility allows duplicate objects to share the<br />
same metadata and allows different storage<br />
strategies for different types of objects,<br />
while allowing third parties to extend the types of<br />
object properties or to introduce a new type to<br />
improve de-duplication efficiency.<br />
• The duplicate object resolver constructs the object<br />
index tree based on these facts and relations, and<br />
stores object metadata in triple storage format.<br />
This includes setting, adding and resolving<br />
attributes for a given object_ID.<br />
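The set, add and resolve operations on (Subject, Attribute, Value) triples<br />
listed above can be sketched as a small in-memory Python class. The class name<br />
TripleStore, its method names and the dictionary layout are hypothetical and<br />
chosen only for this example; the paper does not specify the resolver's<br />
interface.<br />

```python
from collections import defaultdict


class TripleStore:
    """Hypothetical in-memory store of (subject, attribute, value) facts."""

    def __init__(self):
        # subject (object_ID) -> {attribute: [values]}
        self._facts = defaultdict(dict)

    def set(self, subject, attribute, value):
        """Replace all values of an attribute with a single value."""
        self._facts[subject][attribute] = [value]

    def add(self, subject, attribute, value):
        """Append one more value to an attribute."""
        self._facts[subject].setdefault(attribute, []).append(value)

    def resolve(self, subject, attribute):
        """Return all values recorded for (subject, attribute)."""
        return list(self._facts.get(subject, {}).get(attribute, []))

    def triples(self, subject):
        """Return every fact about a subject as (subject, attribute, value)."""
        facts = self._facts.get(subject, {})
        return [(subject, a, v) for a, values in facts.items() for v in values]
```

Because duplicate objects share one object_ID, they also share one set of<br />
facts; recording a second file that contains the object is a single add call.<br />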
2) Object index and object de-duplication services<br />
First, note that two objects must have the same<br />
encoding format to be compared for identity; otherwise,<br />
they can only be compared<br />
approximately. Encoded files have this<br />
property: two documents that appear to contain similar or<br />
identical information may be represented by totally different<br />
bits on the storage medium. Most common compound file<br />
formats use different encoding schemes [7]. Thus, we should<br />
compare duplicate objects according to the encoding format<br />
of the object content.<br />
Indexing plays an important role in the de-duplication<br />
process. In this work, the duplicate object resolver<br />
builds and searches a B+ tree-like structure for object<br />
indexing (see Section IV) to identify two or more duplicate<br />
atomic objects from one or more files.<br />
III. OBJECT EXTRACTION GRANULARITY<br />
Two similar large objects may differ in only one<br />
byte within a large body of data, but because the index is<br />
based on hash codes, this difference<br />
prevents de-duplication. Therefore, the object de-duplication<br />
granularity can be chosen according to the object type during<br />
de-duplication processing. We classify the object content type into text,<br />
images, audio, video and executable programs. Here we<br />
introduce the object size threshold, which<br />
can be used as the basis for object extraction.<br />
The object size threshold is determined as follows:<br />
A. Generate a sample file collection<br />
We randomly select one or two backup file sets from the<br />
backup system as the sample file collection and place it<br />
in the storage pool.<br />
B. Sample objects classification<br />
The system extracts and analyzes objects according to<br />
the different file types; sample objects of the same type<br />
are placed in the same collection.<br />
C. Determine the range of candidate size thresholds<br />
Objects of different sizes are clearly not the same.<br />
Suppose there are n objects in the sample object collection.<br />
The distribution of object sizes in the collection is represented<br />
by the set S:<br />
$S = \{s_1, s_2, \ldots, s_k\},\quad k \le n,\quad s_i \ne s_{i+1},\quad 1 \le i \le k$ (1)<br />
Let $d_{\min} = \mathrm{MIN}(s_1, s_2, \ldots, s_k)$ represent the<br />
minimum object size in the sample object collection.<br />
Let $d_{\max} = \mathrm{MAX}(s_1, s_2, \ldots, s_k)$ represent the<br />
maximum object size in the sample object collection.<br />
Determine the range of candidate size thresholds:<br />
$D = [d_1, d_2, \ldots, d_m],\quad 1 \le m \le k$ (2)<br />
To be consistent with the specified minimum average<br />
block size of 256B in the backup system, the candidate<br />
thresholds satisfy the following conditions ((3)~(6)):<br />
$d_1 = d_{\min}$ (3)<br />
if $(d_{\min} + \varepsilon) \bmod 256 = 0$ then $d_2 = \min(d_{\min} + \varepsilon),\ \varepsilon = 1, 2, 3, \ldots$ (4)<br />
$d_{i+1} = d_i + 256,\quad 2 \le i \le m-2$ (5)<br />
if $(d_{\max} + \varepsilon) \bmod 256 = 0$ then $d_m = \min(d_{\max} + \varepsilon),\ \varepsilon = 1, 2, 3, \ldots$ (6)<br />
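Conditions (3) to (6) can be read as: start at the minimum object size, jump<br />
to the next multiple of 256 above it, continue in steps of 256, and stop at the<br />
first multiple of 256 above the maximum object size. The Python sketch below<br />
follows that reading; the function name and the step parameter are assumptions<br />
made for illustration.<br />

```python
def candidate_thresholds(sizes, step=256):
    """Return the candidate size thresholds d_1..d_m for a list of object sizes.

    Reading of conditions (3)-(6), stated as an assumption:
      d_1 = d_min,
      d_2 = smallest multiple of `step` strictly greater than d_min,
      then steps of `step`,
      d_m = smallest multiple of `step` strictly greater than d_max.
    """
    d_min, d_max = min(sizes), max(sizes)
    d_2 = (d_min // step + 1) * step
    d_m = (d_max // step + 1) * step
    return [d_min] + list(range(d_2, d_m + 1, step))
```

For sample sizes 245, 1010, 1067 and 3035 bytes, the candidates start at 245,<br />
continue with 256, 512 and so on, and end at 3072.<br />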
D. Generate object size thresholds<br />
For each type of object in the sample collection,<br />
the system traverses the range of candidate thresholds.<br />
For each candidate threshold, an object whose size is larger than the<br />
threshold is divided into smaller objects<br />
of the threshold size. We then calculate the data<br />
compression ratio (DCR) produced by the candidate<br />
threshold using the following<br />
equation:<br />
$\mathrm{DCR} = \dfrac{\text{Initial Dedup\_ObjTS}}{\text{Dedup\_ObjTS}}$ (7)<br />
where:<br />
Initial Dedup_ObjTS is the total amount of data after<br />
de-duplication based on the size of the original objects;<br />
Dedup_ObjTS is the total amount of data after de-duplication<br />
based on the candidate threshold value.<br />
The candidate threshold that produces the maximum DCR<br />
is selected as the size threshold for the particular object<br />
type.<br />
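The threshold selection can be sketched in Python as follows. This is a<br />
minimal sketch, assuming that the objects of one type are given as non-empty<br />
byte strings, that splitting means cutting an object into consecutive pieces of<br />
the threshold size, and that SHA-1 is used to detect duplicates; the function<br />
names are hypothetical.<br />

```python
import hashlib


def split(obj: bytes, threshold: int) -> list:
    """Cut an object larger than the threshold into threshold-sized pieces."""
    if len(obj) <= threshold:
        return [obj]
    return [obj[i:i + threshold] for i in range(0, len(obj), threshold)]


def dedup_size(pieces) -> int:
    """Total number of bytes left after storing each distinct piece once."""
    unique = {}
    for piece in pieces:
        unique.setdefault(hashlib.sha1(piece).digest(), len(piece))
    return sum(unique.values())


def best_threshold(objects, candidates):
    """Return (threshold, DCR) for the candidate with the maximum DCR, Eq. (7)."""
    initial = dedup_size(objects)          # Initial Dedup_ObjTS
    best, best_dcr = None, 0.0
    for threshold in candidates:
        pieces = [p for obj in objects for p in split(obj, threshold)]
        dcr = initial / dedup_size(pieces)  # Dedup_ObjTS in the denominator
        if dcr > best_dcr:
            best, best_dcr = threshold, dcr
    return best, best_dcr
```

A DCR above 1 means that splitting at that threshold exposes duplicate pieces<br />
that whole-object comparison would miss.<br />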
E. Save threshold<br />
We establish a mapping between each<br />
type of object and the corresponding size threshold, and<br />
save it in the object-type threshold library.<br />
IV. OBJECT INDEX MECHANISM<br />
In a de-duplication system, data block<br />
comparison is the most frequent operation, because<br />
the most important task in de-duplication is to compare<br />
all the data blocks to determine whether the data has already been<br />
stored. The traditional method of comparing data blocks<br />
generally uses a hash value database that retains<br />
a unique hash value for each block. However, the complexity of a<br />
hash query is generally of linear or logarithmic order; that is,<br />
as the data size grows, the efficiency of data<br />
block comparison gradually decreases. In a large-scale<br />
de-duplication system, this has a great impact<br />
on the system and lowers its operating<br />
efficiency. Therefore, the main problem in a data<br />
de-duplication system is how to use a fast data comparison<br />
technique whose efficiency is<br />
independent of the size of the backup data, so as to improve the<br />
operating efficiency of large-scale backup systems [8, 10].<br />
Our proposed object index mechanism for data de-duplication<br />
is based on a B+ tree index structure. The<br />
optimal search time is O(log n), which is more efficient<br />
than the O(n) of full indexing. The duplicate object resolver<br />
constructs the index tree from the extracted object<br />
fingerprints and object information. Thanks to the<br />
properties of the B+ tree, the<br />
left and right sub-trees of every non-leaf node are<br />
balanced. Compared with binary search over a contiguous<br />
memory region, its advantage is that changing the B+ tree<br />
(inserting and deleting nodes) does not require moving large<br />
segments of data in memory and usually incurs only a constant<br />
overhead.<br />
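To make the comparison with binary search concrete, the Python sketch below<br />
keeps (size, hash) keys in a sorted list and looks them up with the standard<br />
bisect module. It is deliberately not a B+ tree: lookups are O(log n), but<br />
every insertion shifts list elements, which is exactly the cost that the B+<br />
tree avoids. The class and method names are hypothetical.<br />

```python
import bisect


class SortedObjectIndex:
    """Sorted-array stand-in for the object index (not a B+ tree)."""

    def __init__(self):
        self._keys = []        # sorted (size, content_hash) pairs
        self._locations = []   # storage location for the key at the same position

    def lookup(self, size, content_hash):
        """Binary search; returns the stored location or None."""
        key = (size, content_hash)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._locations[i]
        return None

    def insert(self, size, content_hash, location):
        """Insert a new key; returns False if the object is a duplicate."""
        key = (size, content_hash)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return False
        self._keys.insert(i, key)            # O(n) shift of later elements
        self._locations.insert(i, location)
        return True
```

Ordering keys by size first mirrors the fingerprint layout, in which the<br />
object size precedes the content hash.<br />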
The proposed indexing mechanism is shown in Figure 2,<br />
where Object_IDn is the object identifier, Object_IDn's<br />
MetaData is the metadata for a particular object, and Objectn is<br />
the content of the object.<br />
[Figure 2: a B+ tree-like index whose index nodes hold ordered Object_IDs (for example Object_ID5, Object_ID10, Object_ID20, Object_ID27, ……); each Object_ID leads to its metadata node, to an object relation node such as "file123,file789,Object_ID5", and to the object content nodes (Object5, Object7, ……)]<br />
Figure 2. The object index structure<br />
The path of an object index contains the following types of nodes:<br />
A. Object index node<br />
An object index node consists of object<br />
identifiers (Object_IDn). The objects in each node are<br />
ordered according to their size.<br />
B. Object metadata node<br />
The metadata record the object identifier, object<br />
size, object type, object encoding format, the object's<br />
location in the document, etc., and are stored in the<br />
form of triples. Object metadata can be stored in an<br />
external SQL server.<br />
C. Object Relation node<br />
The object relation node describes the<br />
relationship between two objects. Relations are stored in a<br />
file format in which each line contains the object hash code and the<br />
filename. In practice, it is much more<br />
efficient to refer to the filename and its long directory<br />
path via a short index number into a separate table of<br />
filenames stored in a database [9].<br />
File numbers that refer to an identical object are<br />
listed with the number of the file whose copy of the object<br />
has been stored first, then the number of the file that<br />
contains the duplicate object, followed by the fingerprint of<br />
the identical object. In effect, the relation nodes implicitly<br />
include the desired file-file similarity pairs. In the<br />
future, we can use the well-known union-find algorithm<br />
to determine clusters of interconnected files and then<br />
compare the similarity of the files.<br />
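The future use of union-find mentioned above can be sketched in Python. The<br />
sketch assumes that each relation record is a tuple of (stored file number,<br />
duplicate file number, object fingerprint), following the record layout<br />
described for the relation node; the function names are hypothetical.<br />

```python
def find(parent, x):
    """Return the representative of x, compressing the path as we go."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the clusters that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def file_clusters(relations):
    """Group file numbers that are linked by at least one shared object."""
    parent = {}
    for stored_file, duplicate_file, _fingerprint in relations:
        union(parent, stored_file, duplicate_file)
    clusters = {}
    for file_number in list(parent):
        clusters.setdefault(find(parent, file_number), set()).add(file_number)
    return sorted(sorted(cluster) for cluster in clusters.values())
```

Files that end up in the same cluster share objects directly or through a<br />
chain of other files, so they are natural candidates for a later similarity<br />
comparison.<br />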
D. Object Content node<br />
The object content node stores the contents of<br />
the object.<br />
V. OBJECT DE-DUPLICATION PROCESS<br />
For different file types, such as .pdf, .doc, .ppt, .txt,<br />
.zip, .rar and .tar, perform the following steps:<br />
• Step 1: Accept the input file;<br />
• Step 2: Analyze the file type;<br />
• Step 3: Extract objects from the file and compute their<br />
Object_IDs;<br />
• Step 4: Check whether duplicate objects exist<br />
by comparing object fingerprints, composed<br />
of the object size and hash code, using the efficient object<br />
indexing mechanism;<br />
• Step 5: If the object is a duplicate, update the object<br />
relation node. Otherwise, insert the object index<br />
node and metadata, then store the new data.<br />
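Steps 3 to 5 can be sketched in Python as follows. This is a minimal sketch<br />
under stated assumptions: object extraction (Steps 1 and 2) is taken as already<br />
done, so each file arrives as a list of byte strings; a plain dictionary stands<br />
in for the B+ tree index; and the ID is written as the decimal size followed by<br />
a SHA-1 hex digest. All names are hypothetical.<br />

```python
import hashlib


def object_id(content: bytes) -> str:
    """Fingerprint made of the object size and a content hash (assumed SHA-1)."""
    return f"{len(content)}:{hashlib.sha1(content).hexdigest()}"


def deduplicate(file_number, objects, index, store, relations):
    """Run Steps 3-5 for the objects already extracted from one file.

    index:     object ID -> metadata dict (stand-in for the index tree)
    store:     object ID -> object content
    relations: list of (stored file number, duplicate file number, object ID)
    """
    for content in objects:
        oid = object_id(content)                       # Step 3
        if oid in index:                               # Step 4
            first_file = index[oid]["filenum"]
            relations.append((first_file, file_number, oid))   # Step 5, duplicate
        else:
            index[oid] = {"filenum": file_number, "size": len(content)}
            store[oid] = content                       # Step 5, new object
```

In this sketch only the first occurrence of an object is written to the store;<br />
every later occurrence adds one relation record and no data.<br />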
[Figure 3: objects extracted from File237 and File169; the object size in bytes is given in brackets, followed by the object content hash]<br />
File237: a (245) 9D321418, b (1010) B34F2C12, c (1067) 313F3C20, d (3035) 805C4E32<br />
File169: a (2569) 4E312FF8, b (3035) 805C4E32, c (1010) B34F2C12, d (1965) 32B5C804E<br />
Duplicate objects: 805C4E32 (3035) and B34F2C12 (1010)<br />
Figure 3. Object extraction diagram<br />
1710 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />
Figures 3 to 5 show the object de-duplication process. The example<br />
includes a PDF file (File237 in Figure 3) and a PPT file (File169 in<br />
Figure 3). The contents boxed by a dashed line represent a unit that<br />
can be treated as an independent object. As shown, objects a, b, c and d<br />
are extracted from each file (the number in brackets is the object size<br />
in bytes). A content hash is calculated for each object.<br />
Assume that the system has already stored the objects in File237.<br />
Before the objects in File169 are inserted, the structure of the<br />
object index tree is as shown in Figure 4.<br />
Figure 4. Object index tree<br />
File169 contains two duplicate objects. The object index tree after the insert operation is shown in Figure 5.<br />
[Figure 4 and Figure 5 content: object index tree nodes]<br />
Index keys: 1010B34F..., 106732B5.., 196532BC.., 2459D32..., 25694E31.., 3035805C..<br />
Object Metadata node: Obj:ID= 1010B34F2C127EBDF18526F6323F3E2D2E3D; Obj:ID:filenum=237; Obj:ID:type = txt; Obj:ID:stored= 5bdbf7bcd8a540cb9af0fd7e4d0e2c9e<br />
Object Metadata node: Obj:ID= 3035805C4E32BF559232DDA4D1FBF161D068; Obj:ID:filenum=237; Obj:ID:type = image; Obj:ID:stored= 479ef7bce9n340cb9af0fd7e4d0e18a<br />
Object Relation node: 237,169,1010B34F2C127EBDF18526F6323F3E2D2E3D<br />
Object Relation node: 237,169,3035805C4E32BF559232DDA4D1FBF161D068<br />
Object content: Object_a, Object_b, ……, Object_c, Object_d<br />
Figure 5. Object index tree<br />
The above example shows that object-based de-duplication<br />
can detect common embedded data across unrelated files,<br />
even when the physical block layout changes. Block-level<br />
de-duplication, by contrast, has no idea where a logical<br />
object begins and where it ends. As a result, the chunking<br />
process splits the images in files. Because an image sits at a<br />
different position in each file, the duplicate data goes<br />
undetected.<br />
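The effect described above can be reproduced with a small experiment. This is a sketch under assumptions: the byte strings are synthetic stand-ins for real files, the 64-byte block size is chosen only to keep the example small, and the function names are hypothetical.<br />

```python
import hashlib


def fixed_block_fingerprints(data, block_size):
    """SHA-1 fingerprints of consecutive fixed-size blocks of data."""
    return {hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}


def object_fingerprint(obj):
    """Object fingerprint as (size, SHA-1 hex digest)."""
    return (len(obj), hashlib.sha1(obj).hexdigest())


# Synthetic stand-in for an image embedded in two otherwise unrelated files.
image = bytes(range(256)) * 8
file_a = b"A" * 100 + image  # image starts at byte offset 100
file_b = b"B" * 37 + image   # same image at a different offset

# Block level: the different offsets misalign every block boundary.
shared_blocks = (fixed_block_fingerprints(file_a, 64)
                 & fixed_block_fingerprints(file_b, 64))

# Object level: the image is extracted first, so its position is irrelevant.
same_object = (object_fingerprint(file_a[100:])
               == object_fingerprint(file_b[37:]))
```

Here `shared_blocks` is empty although both files contain the same 2048-byte image, while `same_object` is true.<br />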
VI. EVALUATION<br />
This paper focuses on one evaluation aspect for data backup:<br />
the de-duplication ratio achieved by our proposed method.<br />
We chose two representative data sets. The first was a<br />
collection of compound files; a compound file often contains<br />
text, figures, audio or video clips. The details of data set 1<br />
are described in Table I, where "# of files" is the number of<br />
files. The second was a collection of source code, which is<br />
typically versioned; this data set consisted of 450 versions,<br />
from 1.2.1 to 2.5.75, with a total size of 26 GB.<br />
We used the two data sets and four full backups for our<br />
evaluation. We performed three kinds of de-duplication:<br />
file-level, block-level and object-level. SHA-1 is used as the<br />
hash algorithm; it generates a 160-bit fingerprint for each<br />
file, chunk or object. Block-level de-duplication uses<br />
fixed-size blocks, for which we chose 16 KB. The experimental<br />
results are shown in Figure 6 and Figure 7.<br />
TABLE I.<br />
BACKUP DATASET1<br />
Backup  Type  Size (KB)  # of files<br />
1st  PDF  4,113,862  6020<br />
1st  PPT  335,006  562<br />
2nd  PDF  1,113,862  1420<br />
2nd  PPT  34,019  108<br />
3rd  PDF  5,002,635  6421<br />
3rd  PPT  310,006  511<br />
4th  PDF  263,943  2501<br />
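As an illustration of how such a measurement could be set up, the sketch below computes a de-duplication ratio for a list of units (whole files, fixed 16 KB blocks, or extracted objects) using SHA-1 fingerprints. The ratio definition used here, the fraction of bytes eliminated, is an assumption, since the text does not state its formula; the function names are hypothetical.<br />

```python
import hashlib

BLOCK_SIZE = 16 * 1024  # 16 KB fixed-size blocks, as chosen in the text


def fixed_blocks(data, size=BLOCK_SIZE):
    """Split data into consecutive fixed-size blocks (last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedup_ratio(units):
    """Fraction of bytes eliminated when identical units are stored once.

    Assumed definition: (total bytes - unique bytes) / total bytes.
    """
    total = sum(len(u) for u in units)
    seen = set()
    unique = 0
    for u in units:
        fp = hashlib.sha1(u).digest()  # 20 bytes, i.e. a 160-bit fingerprint
        if fp not in seen:
            seen.add(fp)
            unique += len(u)
    return (total - unique) / total if total else 0.0
```

Passing whole files gives a file-level figure, and passing the output of `fixed_blocks` for each file gives a block-level figure under the same assumed definition.<br />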
We can draw a few conclusions from the results. The<br />
improvement differs between the data sets. Object-based<br />
de-duplication effectively improves the de-duplication ratio<br />
on data set 1, because it mainly benefits unstructured data<br />
sets. According to our experiments, its improvement on data<br />
set 2 over block-level and file-level de-duplication is not<br />
obvious.<br />
Note that our implementation is currently a research<br />
prototype rather than a production-quality storage<br />
de-duplication system. Hence, our experimental results should<br />
not be used for absolute comparison with other storage<br />
de-duplication systems. We will carry out more comprehensive<br />
experiments in future work, especially on data indexing and<br />
metadata management.<br />
VII. CONCLUSION AND FUTURE WORK<br />
Existing file-based and block-based data de-duplication<br />
technology is well suited to text and simple content, but<br />
not to compound documents. This paper proposes an<br />
object-based de-duplication framework and an efficient<br />
object index mechanism that speeds up the search for<br />
duplicate objects. It can detect common embedded data in<br />
the first backup across completely unrelated files, even when<br />
the physical block layout changes. As a result, object-based<br />
de-duplication is more efficient than block-based<br />
de-duplication for compound files.<br />
Future work includes: a) implementing the framework;<br />
b) improving the processing speed by moving most<br />
computations to the graphics processing unit (GPU), which<br />
we expect will reduce the time spent on intensive<br />
computations such as object extraction and fingerprint<br />
computation.<br />
[Figure 6 plot: Deduplication Ratios (0% to 45%), x-axis 1 to 4; series: fixed-block, whole file, object]<br />
Figure 6. De-duplication efficiency comparison of data set 1<br />
[Figure 7 plot: Deduplication Ratios (0% to 45%), x-axis 1 to 4; series: fixed-block, whole file, object]<br />
Figure 7. De-duplication efficiency comparison of data set 2<br />
REFERENCES<br />
[1] Dell Product Group, Object Storage — A Fresh Approach<br />
to Long-Term File Storage, A Dell Technical White Paper.<br />
[2] Tony A, Biggar H. Data De-Duplication and Disk-to-Disk<br />
Backup Systems: Technical and Business Considerations.<br />
The Enterprise Strategy Group Technical Report. 2007.<br />
[3] Biggar H. Experiencing in Data De-Duplication:<br />
Improving Efficiency and Reducing Capacity<br />
Requirements. The Enterprise Strategy Group Technical<br />
Report. 2007.<br />
[4] William J. Bolosky, Scott Corbin, David Goebel, and<br />
John R. Douceur, Single Instance Storage in Windows<br />
2000, In Proceedings of the 4th USENIX Windows Systems<br />
Symposium, Volume 4, USENIX Association, Berkeley, CA,<br />
USA, 2000.<br />
[5] An in-depth look at data deduplication methods, The<br />
Enterprise Strategy Group Technical Report,<br />
www.falconstor.com.<br />
[6] A. Muthitacharoen, B. Chen, and D. Mazieres. A<br />
low-bandwidth network file system. In Proceedings of the<br />
18th ACM Symposium on Operating Systems Principles<br />
(SOSP’01), pages 174–187, Banff, Canada, October 2001.<br />
[7] Goutham Rao, San Jose, Eric Brueggemann, Carter<br />
George, Object deduplication and application aware<br />
snapshots, patent application publication, US, 2010.<br />
[8] Zhu B, Kai L, Patterson H. Avoiding the disk bottleneck in<br />
the data domain deduplication file system. In: Proceedings<br />
of the 6th USENIX Conference on File and Storage<br />
Technologies. 2008.<br />
[9] George Forman, Kave Eshghi, Stephane Chiocchetti,<br />
Finding Similar Files in Large Document Repositories. In<br />
the 11th ACM SIGKDD International Conference on<br />
Knowledge Discovery and Data Mining (KDD’05),<br />
Chicago, USA, August 2005.<br />
[10] R. Bayer and E. McCreight, "Organization and Maintenance<br />
of Large Ordered Indices", Acta Informatica, Volume 1,<br />
Springer, Berlin/Heidelberg, New York, 1972, pp. 173-189.<br />
[11] S. Walter, T. Thiago, M. Carla and Jr. Wagner Meira, "A<br />
Scalable Parallel Deduplication Algorithm", 19th<br />
International Symposium on Computer Architecture and<br />
High Performance Computing, IEEE Computer Society,<br />
Brazil, 2007, pp. 79-86.<br />
[12] W. You et al., "PRUN: Eliminating Information<br />
Redundancy for Large Scale Data Backup System",<br />
International Conference on Computational Sciences and<br />
Its Applications (ICCSA 2008), IEEE Computer Society,<br />
Italy, 2008.<br />
[13] V. Henson and R. Henderson. Guidelines for Using<br />
Compare-by-Hash. Forthcoming, 2005.<br />
http://infohost.nmt.edu/~val/review/hash2.html<br />
[14] Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise<br />
G, Camble P. Sparse indexing: large scale, inline<br />
deduplication using sampling and locality. In: Proceedings<br />
of the 7th USENIX Conference on File and Storage<br />
Technologies. 2009.<br />
[15] Quinlan S, Dorward S. Venti: a new approach to archival<br />
storage. In Proceedings of the Conference on File and<br />
Storage Technologies. 2002, 89–101.<br />
Fang YAN, born in 1980, is a Ph.D. candidate at Beijing<br />
Institute of Technology, Beijing, China. Her research<br />
interests include data de-duplication and network storage.<br />
She is a senior lecturer in the Dept. of Information,<br />
Beijing Wuzi University.<br />
Yuan TAN, Beijing, China, born in 1972, holds a Ph.D. in<br />
computer science. His current research interests include<br />
information security and network storage.<br />
He is a Professor and Ph.D. supervisor at Beijing Institute<br />
of Technology, Beijing, China, and a senior member of the<br />
China Computer Federation.<br />
Analysis on E-consumers’ Purchasing Behavior<br />
Based on Data-driving Model<br />
Lijuan Huang<br />
Information Management College of Jiangxi University of Finance and Economics, Nanchang, 330013, China<br />
Email: huanglijuan66s@126.com<br />
Abstract—The Internet, with its vast sea of online purchasing<br />
data, makes the research model of e-consumers’ purchasing<br />
behavior very different from traditional ones. First, this<br />
paper proposes three kinds of research models of consumers’<br />
purchasing behavior and points out that the data-driving<br />
model is the best one for analyzing e-consumers’ purchasing<br />
behavior on the Internet. Second, it adopts the improved<br />
SOFM Neural Network as the tool of the data-driving model to<br />
analyze e-consumers’ purchasing behavior in Internet<br />
marketing in detail. Lastly, experimental results demonstrate<br />
that the method offers better visualization, exactness and<br />
robustness. Because consumers’ purchasing behavior analysis<br />
based on the SOFM Neural Network is a comparatively novel<br />
method, the research results in this paper are for reference<br />
only.<br />
Index Terms—Internet marketing, purchasing behavior,<br />
neural network, data-driving model<br />
I. INTRODUCTION<br />
Research on the characteristics of consumers’ purchasing<br />
behavior dates back to eighteenth-century England. At that<br />
time, large numbers of farmers poured into cities. These new<br />
urban residents showed faith in products that could<br />
demonstrate their social status, and their faith in and<br />
attitude toward these products drew people’s attention to<br />
consumer behavior[1]. Research on consumer behavior<br />
originated and developed from a western paper named Consumer<br />
Analysis, published by Guest in the Annual Review of<br />
Psychology in 1962 [2]. Afterwards, many celebrated scholars<br />
did active work on the characteristics of consumer behavior.<br />
For example, Engel, Kotler and Cliff Allen proposed the T-I-K<br />
model of consumer behavior in 1993; Solomon, Schiffman and<br />
Kanuk raised the U-S-E model of consumer behavior in 1999;<br />
and J. Paul Peter and Jerry C. Olson presented the S-C-T model<br />
of consumer behavior in 2000 [3-6]. But this research followed<br />
either an experience-driving or a theory-driving research<br />
model. The author believes that, with the development of<br />
modern science and technology, and especially of neural<br />
network technology, data mining, artificial intelligence and<br />
multi-disciplinary technology, research models of consumer<br />
behavior should include a data-driving model besides the<br />
experience-driving and theory-driving research models. These<br />
three kinds of research models are described in Table I.<br />
© 2011 ACADEMY PUBLISHER<br />
doi:10.4304/jnw.6.12.1713-1718<br />
TABLE I.<br />
RESEARCH MODEL OF CONSUMERS’ PURCHASING BEHAVIOR<br />
Method 1: Experience-driving model<br />
The researcher communicates with consumers through speech,<br />
facial expression and other body language, and then analyzes<br />
consumers’ purchasing behavior based on the researcher’s own<br />
experience. However, in the virtual world of the Internet, there is a<br />
large amount of data about e-consumers’ purchasing behavior and the<br />
researcher loses the chance to communicate with consumers face to<br />
face, so analysis of consumers’ purchasing behavior based on the<br />
experience-driving model loses its effect.<br />
Method 2: Theory-driving model<br />
The research steps of the theory-driving model are shown in Fig. 1.<br />
In this kind of research mode, the researcher first derives a<br />
theoretical model from purchasing behavior theories; then uses<br />
purchasing data to test and modify the model repeatedly; and finally<br />
deduces and analyzes consumers’ purchasing behavior based on the<br />
final model. This kind of research mode usually yields an unreliable<br />
analysis result because the purchasing behavior theories are<br />
imperfect or even wrong.<br />
Method 3: Data-driving model<br />
The research steps of the data-driving model are shown in Fig. 2.<br />
In this kind of research mode, the researcher first selects an<br />
appropriate intelligent algorithm; then a model is drawn from<br />
purchasing data and modified repeatedly using these data; and finally<br />
the researcher deduces and analyzes consumers’ purchasing behavior<br />
based on the final model. The data-driving model is based on real<br />
data rather than personal experience or pure theory, and it realizes<br />
the scientific idea of letting the data speak for themselves. So, the<br />
result of analyzing consumers’ purchasing behavior is more<br />
scientific, objective and fair.<br />
Table I shows that it is difficult to adopt the experience-driving<br />
model to analyze the characteristics of online consumer purchase<br />
behavior, and that adopting the theory-driving model or the<br />
data-driving model may be appropriate. As seen in Table I, Fig. 1 and<br />
Fig. 2, adopting the data-driving model is more objective, scientific<br />
and unbiased than adopting the experience-driving or theory-driving<br />
model for analyzing e-consumers’ purchasing behavior.<br />
[Figure 1 diagram labels: Purchasing behavior theories; ① Model deducting; Model; ② Model modifying; Data warehouse; ③ Drawing conclusion from the model; Feature of purchasing behavior]<br />
Figure 1. Analyzing consumers’ purchasing behavior based on theory-driving model<br />
[Figure 2 diagram labels: ① Algorithm selecting (e.g. DM, ANN); Data warehouse; Model; ② Model modifying; ③ Drawing conclusion from the model; Feature of purchasing behavior]<br />
Figure 2. Analyzing consumers’ purchasing behavior based on data-driving model<br />
Therefore, the data-driving model is the most suitable for<br />
analyzing the characteristics of online consumers’ purchasing<br />
behavior. Because all input data come from the consumers, it<br />
also fully reflects the idea that “the consumer is God”.<br />
Because the Self-Organizing Feature Map Neural Network (SOFM<br />
NN) is a typical data-driving model, this paper takes the SOFM<br />
NN as a tool to analyze e-consumers’ purchasing behavior. The<br />
basic principles of the SOFM NN are described as follows.<br />
II. BASIC PRINCIPLES OF THE SOFM NEURAL NETWORK<br />
In 1981, the Finnish scholar Teuvo Kohonen first raised the<br />
concept of the SOFM NN[7], which simulates the way the brain<br />
responds to different kinds of input signals (e.g. light<br />
signals, sound signals) and automatically sorts these input<br />
signals into different zones of the brain layer[8]. By<br />
inputting a large amount of consumers’ purchasing data into<br />
the SOFM NN, e-consumers can be objectively, scientifically<br />
and automatically clustered into different groups based on the<br />
similarity of their purchasing data. This means minimizing the<br />
difference between consumers in the same group and maximizing<br />
the difference between different groups. Analyzing the<br />
distinct features of these consumer groups helps to design<br />
targeted marketing strategies for promotion, service, price,<br />
etc. It avoids the risk of applying uniform strategies to all<br />
consumers, which incurs a high cost for unimportant consumers,<br />
and the risk of applying an unscientifically ranked service<br />
that loses potential VIP consumers.<br />
A. Topology Structure of the SOFM NN<br />
The typical SOFM NN (see Fig. 3) forms the topology structure<br />
of the input signals on a one-dimensional or two-dimensional<br />
cellular array [8], so the SOFM NN is able to extract the<br />
features of the input signals’ model[9]. The SOFM NN commonly<br />
includes only a one-dimensional or two-dimensional array, but<br />
it can also be extended to handle a multi-dimensional cellular<br />
array [10-12]. To improve the stability and operating<br />
efficiency of the SOFM NN, we add a feedback loop to the<br />
traditional SOFM NN to obtain the improved SOFM NN (see<br />
Fig. 4).<br />
[Figure 3 diagram labels: Input Layer; Competitive Layer; Victorious neuron]<br />
Figure 3. Topology structure of the traditional SOFM NN<br />
[Figure 4 diagram labels: Input layer; Competitive layer; feedback loop; Victorious neuron]<br />
Figure 4. Topology structure of the improved SOFM NN<br />
The improved SOFM NN is composed of the<br />
following four parts.<br />
• Cellular array for recognizing: this mainly receives<br />
the input signals and forms the “discrimination<br />
function” used to recognize them.<br />
• Mechanism for comparing and choosing: this<br />
compares these “discrimination functions” and<br />
chooses the processing unit with the stronger<br />
functional output signal.<br />
• Local inter-connection and inter-action: this<br />
stimulates both the chosen signal processing unit<br />
and its nearby signal processing units.<br />
• Self-adapting process: this modifies the parameters<br />
of the stimulated processing units so as to increase<br />
the output value of the given “discrimination<br />
function”.<br />
B. The SOFM NN’s Algorithm<br />
The SOFM NN’s algorithm is described as follows.<br />
1) Initialization: choose the “nearby neuron” set $S_j(0)$ of output neuron j. The connection weight $w_{i,j}(0)$ between input neuron i and output neuron j is computed as in equation (1).<br />
$$w_{i,j}(0) = \frac{1}{n} \sum_{X \in S(k)} X_{PAM} \tag{1}$$<br />
2) Calculating the Euclidean distance: the Euclidean distance is the distance between the input sample and every output neuron j. The distance $d_j(t)$ is calculated as in equation (2).<br />
$$d_j(t) = \ln(\lVert X - w_j \rVert) = \ln\left(\sum_{i=1}^{n} [x_i(t) - w_{i,j}(t)]^2\right) \tag{2}$$<br />
3) Defining a neighborhood function: the neighborhood function $S_j(t)$ is expressed in equation (3), where $S_j(t)$ decreases as time goes on.<br />
$$S_j(t) = S_j(0)\exp\left(-\frac{d_j(t)}{2\sigma^2}\right) \tag{3}$$<br />
4) Working out the minimum distance: the minimum distance $\min(d_j)$ among the corresponding neurons is calculated as in equation (4).<br />
$$\min(d_j) = \arg\min_j \sum_{i=1}^{n} [x_i(t) - w_{i,j}(t)]^2 \tag{4}$$<br />
5) Setting the learning rate: the learning rate $\eta$ is computed according to equation (5), where $\eta$ decreases to zero as time t goes on.<br />
$$\eta(t) = \eta(0)\exp\left(-\frac{t}{\tau}\right) \tag{5}$$<br />
6) Modifying the weight value: the weight variation $\Delta w_{i,j}(t)$ is computed as in equation (6). When $\Delta w_{i,j}(t)$ reduces to zero, the topology structure of the SOFM NN is most stable.<br />
$$\Delta w_{i,j}(t) = \begin{cases} \eta(t)[x_i(t) - w_{i,j}(t)], & X \in S(k) \\ 0, & X \notin S(k) \end{cases} \tag{6}$$<br />
7) Offering new learning samples to repeat the learning process above, with $t \leftarrow t+1$, until $\eta(t)$ decreases to 0 or becomes small enough, at which point the network learning process is terminated.<br />
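The training loop in steps 4) to 7) can be sketched in plain Python as below. This is a generic self-organizing map sketch, not the improved network with the feedback loop. Several choices are assumptions made for illustration: random initial weights instead of equation (1), a rectangular grid, a neighborhood given by a radius that shrinks exponentially instead of equation (3), and all function and parameter names. The logarithm in equation (2) is omitted because it does not change which neuron is closest.<br />

```python
import math
import random


def best_matching_unit(weights, x):
    """Index j of the neuron with the smallest squared distance to x (eq. 4)."""
    def sqdist(w):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return min(range(len(weights)), key=lambda j: sqdist(weights[j]))


def train_som(samples, rows, cols, steps, eta0=0.5, tau=None, seed=0):
    """Train a rows x cols map; returns one weight vector per output neuron."""
    rng = random.Random(seed)
    dim = len(samples[0])
    tau = tau or steps / 4.0            # assumed decay constant
    sigma0 = max(rows, cols) / 2.0      # assumed initial neighborhood radius
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: random initial weights in [0, 1) rather than equation (1).
    weights = [[rng.random() for _ in range(dim)] for _ in coords]
    for t in range(steps):
        x = samples[rng.randrange(len(samples))]
        eta = eta0 * math.exp(-t / tau)          # learning rate, eq. (5)
        radius = sigma0 * math.exp(-t / tau)     # shrinking neighborhood
        win_r, win_c = coords[best_matching_unit(weights, x)]
        for j, (r, c) in enumerate(coords):
            # Neurons outside the neighborhood are left unchanged (eq. 6).
            if math.hypot(r - win_r, c - win_c) <= radius:
                w = weights[j]
                for i in range(dim):
                    w[i] += eta * (x[i] - w[i])  # weight update, eq. (6)
    return weights
```

After training, calling `best_matching_unit(weights, x)` for each sample gives the index of its winning neuron, which serves as the cluster label.<br />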
III. AN EXAMPLE OF ANALYZING E-CONSUMERS’<br />
PURCHASING BEHAVIOR<br />
Because selling books is a typical form of e-business,<br />
this paper takes the consumers of a book business website<br />
as an example to analyze e-consumers’ purchasing<br />
behavior[13].<br />
A. Main Clustering Variables<br />
Most customer data come from the online transaction<br />
records of a famous book website (dingdang.com) in<br />
China[1]. These data can be divided into two groups:<br />
customers’ attribute data and transaction data.<br />
Customers’ basic attribute data mainly include:<br />
customer’s name, gender, age, income, educational<br />
status, occupation, city, marital status, enrolment time,<br />
home address, hobby, etc. Transaction data mainly<br />
include: shopping time, frequency of shopping,<br />
consumption of shopping, product name, price, way of<br />
paying (e.g. cash on delivery, cash on postage and credit<br />
card), latest shopping time, etc.<br />
The main clustering variables of the SOFM neural network<br />
are listed in Table II, where variables labeled with (*)<br />
are used as clustering variables.<br />
TABLE II.<br />
MAIN CLUSTERING VARIABLES OF THE SOFM NEURAL NETWORK<br />
Variable  Meaning<br />
x1 (*)  Total amount of purchase<br />
x2 (*)  Monthly income<br />
x3 (*)  Frequency of shopping<br />
x4 (*)  Latest time of shopping<br />
x5  Age<br />
x6  Gender<br />
x7  Educational status<br />
x8  District<br />
B. Sample Data of Consumers’ Behavior<br />
There are 5000 sample records, but limited by the<br />
length of this paper, we list only part of the samples<br />
in Table III, where the capitalized variables denote values<br />
standardized to the domain [0, 1].<br />
TABLE III.<br />
CONSUMING SAMPLE DATA FROM E-MARKET<br />
Cust-ID  Total amount of purchase (X1)  Monthly income (X2)  Frequency of shopping (X3)  Latest shopping time (X4)<br />
1001  0.9260  0.9454  0.9720  0.9335<br />
1002  0.7549  0.6950  0.7273  0.6918<br />
1003  0.8118  0.8975  0.7586  0.7324<br />
1004  0.7982  0.6825  0.8180  0.7817<br />
1005  0.6532  0.5816  0.6609  0.5141<br />
…  …  …  …  …<br />
Sample data can be normalized with a system function such as<br />
premnmx() or with a user-defined function. In this paper, we<br />
adopt the Min-Max standardization method shown in equation (7),<br />
which maps each variable into the domain [0, 1].<br />
$$X(i) = \frac{x(i) - \min\{x(i)\}}{\max\{x(i)\} - \min\{x(i)\}} \tag{7}$$<br />
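Equation (7) can be written as a short user-defined function. The sketch below is illustrative only: the function name is hypothetical, and returning zeros for a constant column is an assumption made to avoid division by zero.<br />

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1] following equation (7)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Each clustering variable (x1 to x4) would be normalized separately over all sample records, so that the smallest value of a variable maps to 0 and the largest to 1.<br />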
C. Design of the SOFM Neural Network<br />
1) Topology Structure<br />
There are three kinds of topology structures:<br />
rectangular, hexagonal and random. The three<br />
corresponding functions (namely gridtop(), hextop()<br />
and randtop()) describe the topology structure of the<br />
neuron areas [14]. Here we take the 6*4 random<br />
topology structure (shown in Fig. 5).<br />
Figure 5. 6*4 Random topology structure<br />
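To make the difference between the rectangular and hexagonal layouts concrete, the sketch below generates neuron positions for both. These are illustrative helper functions with hypothetical names, not the toolbox functions named above, and the random layout is not sketched here. The hexagonal variant assumes unit spacing between neighboring neurons, with every other row shifted by half a unit.<br />

```python
import math


def grid_positions(cols, rows):
    """Neuron positions on a rectangular lattice with unit spacing."""
    return [(float(c), float(r)) for r in range(rows) for c in range(cols)]


def hex_positions(cols, rows):
    """Neuron positions on a hexagonal lattice with unit spacing.

    Odd rows are shifted right by 0.5 and rows are sqrt(3)/2 apart, so
    each neuron is at distance 1 from its nearest neighbors.
    """
    dy = math.sqrt(3) / 2
    return [(c + 0.5 * (r % 2), r * dy)
            for r in range(rows) for c in range(cols)]
```

For a 6*4 map, both functions return 24 positions; only the neighborhood geometry differs.<br />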
2) Main Program Code<br />
We first use the function newsom() to create a SOFM<br />
neural network; then we use the functions train() and<br />
sim() to train and simulate the newly created network in<br />
turn. Different numbers of training steps affect the<br />
efficiency of self-recognition differently. Here, we set the<br />
training steps to 1000, 3000, 5000 and 10000 and observe<br />
the clustering efficiency for each. The main program code<br />
is shown as follows:<br />
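% Create a 6x4 SOFM with a random topology; X holds one sample per column<br />
net=newsom(minmax(X),[6,4],'randtop');<br />
a=[1000 3000 5000 10000];  % training steps to compare<br />
yc=rands(1,10);  % placeholder, overwritten inside the loop<br />
for i=1:4<br />
net.trainParam.epochs=a(i);<br />
net=train(net,X);  % training continues on the same network<br />
figure;<br />
w1=net.IW{1,1}  % weight matrix of the competitive layer (displayed)<br />
plotsom(w1,net.layers{1}.distances);<br />
y=sim(net,X);<br />
yc=vec2ind(y)  % index of the winning neuron for each sample (displayed)<br />
end<br />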
D. Analysis on the Result of Training and Computing<br />
1) Network’s Weight Value Structure<br />
The performance of the SOFM neural network differs<br />
greatly with the number of training steps. In this paper,<br />
we take only four different numbers of training steps,<br />
namely 1000, 3000, 5000 and 10000; the corresponding<br />
network weight value structures are shown in Fig. 6,<br />
Fig. 7, Fig. 8 and Fig. 9 respectively.<br />
Figure 6. Network’s weight value structure (training steps: 1000)<br />
Figure 7. Network’s weight value structure (training steps: 3000)<br />
Figure 8. Network’s weight value structure (training steps: 5000)<br />
Figure 9. Network’s weight value structure (training steps: 10000)<br />
From the above four figures, we can see that the<br />
network’s weight value structure reaches a comparatively<br />
stable state when the number of training steps is 5000 or 10000.<br />
2) Network’s Clustering Result<br />
By training and simulating with the four different<br />
numbers of training steps, we also obtain the clustering<br />
result shown in Table IV, where only 20 sample records<br />
are listed for demonstration.<br />
Training<br />
Steps<br />
1000<br />
3000<br />
5000<br />
10000<br />
Training<br />
Steps<br />
1000<br />
3000<br />
5000<br />
10000<br />
TABLE IV.<br />
NETWORK’S CLUSTERING RESULT TABLE<br />
1 2 3 4 5<br />
11 12 13 14 15<br />
8 20 8 20 8<br />
15 20 20 8 15<br />
16 19 13 19 13<br />
13 7 19 13 16<br />
11 18 7 19 7<br />
7 11 13 11 13<br />
6 7 8 9 10<br />
16 17 18 19 20<br />
15 20 8 8 20<br />
8 20 8 8 15<br />
7 19 13 7 19<br />
19 13 19 16 7<br />
13 19 7 18 19<br />
18 11 13 19 7<br />
From Table IV, we can find some rules as<br />
follows:<br />
• When the training steps are 1000, all the<br />
samples are divided into 1 group.<br />
• When the training steps are 3000, all the<br />
samples are divided into 2 groups.<br />
• When the training steps are 5000, all the<br />
samples are divided into 3 groups.<br />
• When the training steps are 10000, all the<br />
samples are divided into 3 groups.<br />
From Fig. 10, we can also see that a single neuron’s<br />
error surface has a unique minimum, so the structure of<br />
the above improved SOFM NN is comparatively stable.<br />
This means that the customer clustering is robust.<br />
Figure 10. Single Neuron’s Error<br />
3) Customers’ Recognition and the Corresponding<br />
Marketing Strategies<br />
According to the above four network weight value<br />
structure figures (namely Fig. 6-9) and the network’s<br />
clustering result table (namely Table IV), we can reach<br />
a further conclusion: when the number of training steps is<br />
5000 or more, the samples are steadily clustered into<br />
3 groups. Observing these 3 groups and analyzing the<br />
customers’ purchasing behavior, we find that each group has<br />
its own special features, as illustrated in Table V, where<br />
3 distinct marketing strategies are suggested, aimed at these<br />
3 groups’ special features. Recognizing customers’ features<br />
and applying differentiated marketing strategies can help to<br />
reach a win-win result between customers and the business<br />
website, increase the loyalty of customers (esp. VIPs), and<br />
maximize the profit of e-marketing.<br />
TABLE V.<br />
ANALYSIS RESULT OF E-CONSUMERS’ PURCHASING BEHAVIOR<br />
Cluster NO 1: Customers (5.71%), Consumption (0.13%)<br />
• Features of consumers’ purchasing behavior: occasional<br />
customers. Most of them are teenagers who come from<br />
different districts of the nation; their total amount of<br />
purchase is low, with low income and low shopping<br />
frequency. Most of them have a low-level educational<br />
status.<br />
• Marketing strategy: these customers deserve the normal<br />
service, such as accumulating points for discounts and<br />
receiving book information through e-mail, but without<br />
free reading of e-books on the Internet.<br />
Cluster NO 2: Customers (74.71%), Consumption (17.86%)<br />
• Features of consumers’ purchasing behavior: main<br />
customers. Most of them are youths who come from<br />
different districts of the nation; their total amount of<br />
purchase is higher, with middle-level income and higher<br />
shopping frequency. They have a middle-level<br />
educational status.<br />
• Marketing strategy: these customers deserve the<br />
middle-class service, such as ordering individualized<br />
book information through e-mail, accumulating points for<br />
a higher discount, reading some e-books free on the<br />
Internet once the amount of purchase reaches a certain<br />
point, receiving free e-cards or e-flowers on their<br />
birthday, and so on.<br />
Cluster NO 3: Customers (19.58%), Consumption (82.01%)<br />
• Features of consumers’ purchasing behavior: mostly VIPs,<br />
who are young women who often come from highly<br />
developed districts or remote places; their total amount<br />
of purchase is the highest, with high or low income and<br />
the highest shopping frequency. Most of the VIPs have<br />
a middle-level or high-level educational status.<br />
• Marketing strategy: these customers deserve the top<br />
service, such as VIP service with free private<br />
cyberspace and the fastest green passage, downloading or<br />
reading some e-books free on the Internet, receiving the<br />
latest book catalogue in both paper and e-mail form,<br />
the biggest free cards and best flowers on their birthday,<br />
the highest discount, and so on.<br />
Table V is consistent with the Pareto 80/20 Principle:<br />
about 20% of all customers are VIPs (Cluster No. 3, 19.58%), and<br />
they contribute about 80% of consumption (82.01%). The table<br />
also reveals some interesting phenomena. For example, VIPs are<br />
not necessarily customers with high income, most VIPs are<br />
young women rather than men, and VIP customers come not only<br />
from developed regions but also from less developed regions.<br />
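As a quick arithmetic check of the 80/20 reading of Table V, the following Python sketch divides each cluster's consumption share by its customer share. The percentages are copied from Table V; the dictionary layout and the helper name `intensity` are illustrative assumptions, not part of the original analysis.<br />

```python
# Customer and consumption shares (in percent) copied from Table V.
clusters = {
    1: {"customers": 5.71, "consumption": 0.13},
    2: {"customers": 74.71, "consumption": 17.86},
    3: {"customers": 19.58, "consumption": 82.01},
}


def intensity(cluster):
    """Consumption share divided by customer share (1.0 means proportional)."""
    return cluster["consumption"] / cluster["customers"]


for number, cluster in clusters.items():
    print(f"Cluster {number}: intensity = {intensity(cluster):.2f}")
```

A value well above 1 for Cluster 3 and well below 1 for Cluster 1 is what the Pareto-style reading of the table would lead one to expect.<br />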
IV. CONCLUSION<br />
The economist Christopher pointed out that in today’s<br />
unpredictable business competition, the market is no<br />
longer on the sellers’ side but on the buyers’ side [14]:<br />
“the customer is god”. Therefore, accurately analyzing consumers’<br />
purchasing behavior on the Internet, and making sound Internet<br />
marketing strategies for sales promotion accordingly, are key<br />
factors in securing the profit of an e-business website. To<br />
analyze e-consumers’ purchasing behavior, this paper proposes and<br />
compares three kinds of research models and points out that<br />
the data-driving model is the best of them for this purpose.<br />
The SOFM NN is a typical data-driving model, so this paper<br />
improves the traditional SOFM NN and uses the improved version<br />
as a tool to analyze e-consumers’ purchasing behavior. Because<br />
analysis of e-consumers’ purchasing behavior based on the<br />
SOFM Neural Network is a comparatively novel method,<br />
the results of this paper are offered for reference only.<br />
ACKNOWLEDGMENT<br />
The author thanks the anonymous reviewers for their<br />
valuable remarks and comments. This work is supported<br />
by 2010 National Social Science Fund <strong>of</strong> China (Grant<br />
No. 10BGL028), National Natural Science Fund <strong>of</strong> China<br />
(Grant No. 70861002), China Postdoctoral Science Fund<br />
(Grant No. 200902535), 2010 Science and Technology<br />
Project <strong>of</strong> education department <strong>of</strong> Jiangxi Province<br />
(Grant No. GJJ10430), and 2010 Social Science Planning<br />
Project <strong>of</strong> Jiangxi Province (Grant No. 10GL35).<br />
REFERENCES<br />
[1] H.Rubost, “Consumer Behavior <strong>of</strong> Online Procurement<br />
and Book Supply Chain,” Service Operations, Logistics,<br />
and Informatics. May 2005. pp 49-66.<br />
[2] J.P.Peter, and J.C. Olson, “Consumer Behavior and<br />
Marketing Strategy,” McGraw-Hill Press, 2009.<br />
[3] Blanca Hernández, Julio Jiménez, and M. José, “Customer<br />
behavior in electronic commerce: The moderating effect <strong>of</strong><br />
e-purchasing,” <strong>Journal</strong> <strong>of</strong> Business Research, Volume 63,<br />
Issues 9-10, September-October 2010, pp. 964-971.<br />
[4] H. J. Chang, L. P. Hung and C. L. Ho, “An anticipation<br />
model <strong>of</strong> potential customers’ purchasing behavior based<br />
on clustering analysis and association rules analysis,”<br />
Expert Systems with Applications, Vol.32, Issue 3, April<br />
2007, pp. 753-764.<br />
[5] P.W Engel, “A View Coming from Database Management<br />
<strong>of</strong> Consumer’s Behavior,” New York: Dryden Press, 2008<br />
[6] I. C. Yeh, C. H. Lien, T. M. Ting, Y. Y Wang and C. M.<br />
Tu, “Cosmetics purchasing behavior–An analysis using<br />
association reasoning neural networks,” Expert Systems<br />
with Applications, Vol.37, Issue 10, October 2010, pp.<br />
7219-7226.<br />
[7] Simon Haykin, Neural Networks: A Comprehensive Foundation,<br />
World publishing house, February 2004.<br />
[8] D. J. Willshaw, “How Patterned Neural Connections Can<br />
Be Set Up By Self-organization,” Proc. Roy. Soc. London<br />
B, 1976, 194: 431-445.<br />
[9] T. Kohonen, “Self-organized Formation of Topologically<br />
Correct Feature Maps,” Biological Cybernetics,<br />
1982, 43(1): 59-69.<br />
[10] FeiSi Science Research Center, Neural Network theory and<br />
Realization in Matlab 7, Beijing: Publishing House <strong>of</strong><br />
Industry Electronics, May 2005, pp. 165-178.<br />
[11] Z. H. Yang and Y. Yan, “Research and Development <strong>of</strong><br />
Self-organizing Maps Algorithm,” Computer Engineering,<br />
2006, 32 (16), pp. 201-228.<br />
[12] Mao Guojun, et al. Principle and Algorithm <strong>of</strong> Data<br />
Mining, Beijing: Tsinghua University Press, 2008.<br />
[13] Huang Lijuan. Yu Guoping. “Research on the Design for<br />
the National Unified E-marketing Platform <strong>of</strong> Chinese<br />
Book Supply Chain,” UESTC Press, 2006.<br />
[14] M.Christopher, “Logistics and Supply Chain<br />
Management,” London: Pitman Publishing House, 1992.<br />
Lijuan Huang was born in Jiangxi Province, China,<br />
in February 1971. She received her Ph.D. in<br />
Management Science and Engineering from Nanchang<br />
University. Her research interests include e-commerce<br />
and logistics and supply chain management.<br />
She is a postdoctoral researcher at Jiangxi<br />
University of Finance and Economics.<br />
JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1719<br />
Repair Method of Complex Network Based on<br />
Matthew Effect<br />
Minsheng Tan<br />
School of Computer Science and Technology, University of South China, Hengyang Hunan, 421001, China<br />
Email: tanminsheng65@163.com<br />
Qiang Cui, Lingfeng Zhu and Hui Zhao<br />
School of Computer Science and Technology, University of South China, Hengyang Hunan, 421001, China<br />
Email: {kiteblue@126.com, 407999562@qq.com, zhaohui.1006@yahoo.com.cn}<br />
Abstract — Repairing a complex network after a deliberate<br />
attack is extremely important. In this paper, a repair method<br />
for complex networks based on the Matthew Effect is proposed.<br />
A single-node selective attack algorithm and a multi-node<br />
cluster attack algorithm are given. For these two kinds of<br />
attack, a linear detection algorithm and a BA network generation<br />
algorithm are put forward to obtain the experimental data, and<br />
the corresponding repair experiments are carried out.<br />
Experimental results show that the repair rate of the method<br />
is above 95% on both the sampled Internet and the BA network.<br />
To characterize the repair of complex networks, the concept of<br />
stability and its mathematical description are introduced.<br />
Experiments show that a complex network can reach a steady<br />
topological state after a number of attack and repair steps.<br />
Index Terms—Complex Network, Repair Method, Power-law,<br />
Matthew Effect, Stability<br />
I. INTRODUCTION<br />
Complex network repair has emerged as a frontier topic in<br />
this field in recent years. Research on it is currently scarce<br />
domestically and still in its infancy abroad. Complex network<br />
repair has no uniform definition, and connectivity alone is<br />
used to judge whether a repair method is good or bad. Repair<br />
methods thus lack a unified treatment of considerations such<br />
as the cost of restoration, the stability of the network, and<br />
its resistance to attack after repair [1-2].<br />
Repair methods and attack methods are inseparable.<br />
Studying the attack efficiency, damage degree and attack<br />
principle of different attack strategies helps to find fast<br />
and efficient repair strategies. By repeatedly attacking and<br />
repairing a network, we can observe changes in its topology,<br />
as well as the attack resistance and ease of repair of<br />
different types of network topology. Currently, the measure of<br />
repair quality reflects only the connectivity of the network<br />
topology, not the performance of run-time communication<br />
services, which is precisely one of the greatest concerns of<br />
users and managers [3-5]. Exploring repair strategies for<br />
complex networks therefore has important theoretical<br />
significance and application value. This paper proposes a new<br />
repair method for power-law complex networks based on the<br />
Matthew Effect.<br />
Manuscript received Feb. 25, 2011; revised Mar. 5, 2011; accepted Apr. 2, 2011.<br />
Project number: 60572137, 10JJ9025, 2009GK3036, 10C1185.<br />
doi:10.4304/jnw.6.12.1719-1725<br />
II. MATTHEW EFFECT<br />
The Matthew Effect is the phenomenon in which the good get<br />
better and the bad get worse, the many get more and the few get<br />
less; its name comes from a parable in the Gospel of Matthew in<br />
the Bible [6]. In 1968, the American researcher in the history<br />
of science Robert K. Merton proposed the term to summarize a<br />
social-psychological phenomenon. Merton interpreted the<br />
Matthew Effect as follows: any individual, group or region<br />
that achieves success and progress in one respect (such as<br />
money, fame or status) gains a cumulative advantage and will<br />
have more opportunities to achieve even greater success and<br />
progress [7-9].<br />
The growth of the real Internet is a true portrayal of the<br />
Matthew Effect in practice: when a new node is added to the<br />
network, it tends to connect to nodes with larger degrees<br />
[10-11].<br />
The BA network model reflects the Matthew Effect well.<br />
Its generation process incorporates the following two<br />
characteristics [12]:<br />
• Growth: the network keeps growing larger.<br />
• Preferential attachment: new nodes tend to connect to<br />
"big" nodes, that is, nodes with high degree.<br />
Figure 1 shows the evolution of a BA network<br />
when m = m0 = 2.<br />
Figure 1 Formation process of BA network<br />
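The two characteristics above (growth and preferential attachment) can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function name `ba_network`, the star-shaped seed graph and the `stubs` list are assumptions chosen for brevity. The list holds every node once per incident edge, so drawing uniformly from it selects a node with probability proportional to its degree.<br />

```python
import random


def ba_network(n_nodes, m, seed=None):
    """Grow a BA-style graph; returns a set of undirected edges (u, v), u < v."""
    if m < 1 or n_nodes <= m:
        raise ValueError("need m >= 1 and n_nodes > m")
    rng = random.Random(seed)
    # Assumed seed graph: nodes 0..m-1 are each linked to node m (a small star).
    edges = {(i, m) for i in range(m)}
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks a node with probability proportional to its degree.
    stubs = [v for edge in edges for v in edge]
    for new in range(m + 1, n_nodes):        # growth: one new node per step
        targets = set()
        while len(targets) < m:              # m distinct existing nodes
            targets.add(rng.choice(stubs))
        for t in targets:                    # preferential attachment
            edges.add((t, new))
            stubs.extend((t, new))
    return edges


if __name__ == "__main__":
    print(len(ba_network(1028, 2, seed=0)), "edges")
```

Every node added during growth contributes exactly m edges, so the edge count is m + m(n_nodes - m - 1) regardless of the random draws.<br />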
III. REPAIR METHOD OF COMPLEX NETWORK<br />
BASED ON MATTHEW EFFECT<br />
This paper studies how the Matthew Effect can be applied to<br />
the repair of scale-free complex networks. We mainly consider<br />
two kinds of sustained attack: single-node selective attack<br />
(for example, deleting the node with the maximum degree) and<br />
multi-node cluster attack (for example, attacking 30% of the<br />
moderate-degree nodes at once).<br />
When one or more nodes are preferentially deleted at a ratio r,<br />
the network compensates by reconnecting: nodes that lose a<br />
neighbor reconnect to other nodes to replace the lost one, and<br />
each attacked node is reconnected to the network as a new node.<br />
Under sustained linear preferential attack, these compensation<br />
dynamics lead to a power-law degree distribution with an<br />
exponential cutoff (called the index sharp cut in Tables III<br />
and VI) that depends on the rate of preferential deletion.<br />
Thus, even when the node with the maximum degree is attacked,<br />
the compensation protocol still preserves the exponent of the<br />
power-law distribution. Even under a high rate of preferential<br />
attack, or under attacks on high-degree nodes, as long as each<br />
new node connects to the network randomly with m ≥ 2, the<br />
network maintains a large connected component, and loss of<br />
connectivity is no longer a consequence of the sustained<br />
attack. The repair methods considered here evolve over time<br />
and are described as follows:<br />
A. Repair Algorithm under Single-node Selective Attack<br />
For a given network topology, node degrees are first counted<br />
and nodes are then attacked selectively according to their<br />
degree. Because attacks and repairs change the node degrees,<br />
the degrees must be recounted in real time, so the repair<br />
evolves over time. The steps of the repair algorithm are:<br />
(1) Count the degree of every node in the network and store<br />
the result in d;<br />
(2) Based on the degrees from step (1), attack one node in the<br />
network (nodes are numbered from 1; if row i of the network<br />
matrix a satisfies a(i, 1) == M(k) or a(i, 2) == M(k), the<br />
corresponding edge (i, j), which is incident to the attacked<br />
node, is deleted);<br />
(3) Recount the node degrees;<br />
(4) Find the maximum degree in the network and count the nodes<br />
that have it: p ← max(d(i, 1));<br />
(5) Remove the node M;<br />
(6) Recount the node degrees, the maximum degree and the<br />
number of nodes with the maximum degree;<br />
(7) Generate the largest connected sub-graph f, recount the<br />
node degrees and count the nodes in f;<br />
(8) Repeat steps (1)-(7) until the number of nodes and the<br />
average degree level off, at which point the algorithm ends;<br />
(9) Calculate the repair rate.<br />
Input: a connected network with N nodes and a certain<br />
number of edges.<br />
Output: the connected network with N nodes, and the repair<br />
rate s(r).<br />
Notation: d is the matrix storing node degrees, n the number<br />
of network nodes, m the number of edges in the network, (i, j)<br />
an edge, a the network before each attack or repair, M(k) a<br />
node with degree k, with M ∈ (1 ... n), p the maximum node<br />
degree, and f the largest connected sub-graph after repair.<br />
In the attack and repair processes of both algorithms, the<br />
Matthew Effect governs linear preferential removal and repair.<br />
After each round, the node degrees are recounted so that the<br />
next attack and repair again follow the linear preferential<br />
rule. The time complexity is $O(n^3)$.<br />
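One round of the single-node selective attack and its compensation can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: the adjacency-set representation, the helper names, and the degree + 1 weighting (which keeps isolated nodes selectable) are all assumptions.<br />

```python
import random


def add_edge(adj, u, w):
    """Insert the undirected edge (u, w) into the adjacency sets."""
    adj[u].add(w)
    adj[w].add(u)


def preferential_pick(adj, exclude, rng):
    """Pick a node outside `exclude`, weighted by degree + 1 (assumed rule)."""
    candidates = [v for v in adj if v not in exclude]
    if not candidates:
        return None
    weights = [len(adj[v]) + 1 for v in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def attack_and_repair(adj, m, rng):
    """One round: attack the maximum-degree node, then compensate."""
    victim = max(adj, key=lambda v: len(adj[v]))
    lost = sorted(adj[victim])
    for u in lost:                       # attack: delete every edge of the victim
        adj[u].discard(victim)
    adj[victim] = set()
    for u in lost:                       # each former neighbour replaces its lost edge
        w = preferential_pick(adj, adj[u] | {u, victim}, rng)
        if w is not None:
            add_edge(adj, u, w)
    for _ in range(m):                   # the victim rejoins as a new node with m edges
        w = preferential_pick(adj, adj[victim] | {victim}, rng)
        if w is not None:
            add_edge(adj, victim, w)
    return victim
```

Calling `attack_and_repair` repeatedly, and recounting degrees between calls, mirrors the loop in steps (1)-(8); the stopping test on node count and average degree is left to the caller.<br />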
B. Repair Algorithm under Multi-node Cluster Attack<br />
According to their degree, network nodes are divided into<br />
central nodes, sub-central nodes, intermediate-value nodes and<br />
small-value nodes, and each attack removes a batch of nodes of<br />
one of these kinds. The specific algorithm is as follows:<br />
(1) A new node is added to the network and attached to m<br />
existing nodes by linear preferential attachment;<br />
(2) The nodes are classified by degree into central nodes,<br />
sub-central nodes, intermediate-value nodes and small-value<br />
nodes; n nodes w1, w2, w3, …, wn are then selected from these<br />
classes at different rates r and deleted, as follows:<br />
• Remove nodes w1, w2, w3, …, wn and all their edges; then<br />
each of w1, w2, w3, …, wn connects to m nodes of the network<br />
as a new node;<br />
• Each node that was connected to one of w1, w2, w3, …, wn<br />
has lost an edge, and adds a random edge to compensate;<br />
(3) Repeat steps (1) and (2) until the numbers of nodes and<br />
edges in the network level off.<br />
Input: a connected network with N nodes and a certain<br />
number of edges.<br />
Output: the connected network with N nodes, and the repair<br />
rate s(r).<br />
In the repair method based on the Matthew Effect, a node is<br />
first added by linear preferential attachment to m nodes in<br />
the network. This ensures that most connection information is<br />
carried by the new nodes, which attach preferentially and<br />
linearly to existing nodes and thereby preserve the power-law<br />
of the network. Because the attack and repair process resembles<br />
the natural growth process, the overall topology of the network<br />
is not expected to change strikingly after attack and<br />
reattachment.<br />
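The degree-based classification in step (2) can be sketched in Python as below. The class shares of 1%, 3% and 60% are taken from the experiment description later in the paper; the 36% share for intermediate nodes, the function names and the random choice within a class are assumptions made only for illustration.<br />

```python
import random


def classify_by_degree(degree, fractions=(0.01, 0.03, 0.36, 0.60)):
    """Split nodes into four degree classes, highest degrees first."""
    names = ("central", "sub-central", "intermediate", "small")
    ranked = sorted(degree, key=degree.get, reverse=True)
    classes, start = {}, 0
    for i, (name, frac) in enumerate(zip(names, fractions)):
        if i == len(names) - 1:
            end = len(ranked)            # last class takes every remaining node
        else:
            end = start + max(1, round(frac * len(ranked)))
        classes[name] = ranked[start:end]
        start = end
    return classes


def pick_targets(classes, name, r, rng):
    """Choose a share r (0 < r <= 1) of one class as the batch to attack."""
    pool = classes[name]
    if not pool:
        return []
    return rng.sample(pool, max(1, round(r * len(pool))))
```

For example, `pick_targets(classes, "sub-central", 0.5, random.Random(1))` returns half of the sub-central nodes, which would then be removed and reattached as in the two bullet points of step (2).<br />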
The repair rate in the process is $s(r) = \frac{N_r}{N}$, where $N_r$ is<br />
the number of nodes in the network after repair and $N$ is the<br />
number of nodes before.<br />
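A minimal Python sketch of the repair rate follows. It assumes, in line with step (7) of the single-node algorithm, that the node count after repair is taken as the size of the largest connected sub-graph; the function names and the adjacency-set input are hypothetical.<br />

```python
from collections import deque


def largest_component_size(adj):
    """Size of the largest connected component, found by breadth-first search."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def repair_rate(adj_after, n_before):
    """s(r) = N_r / N, with N_r taken as the largest connected component."""
    return largest_component_size(adj_after) / n_before
```

With N = 1028, a value such as s(r) = 0.955253 in Table I corresponds to 982 of the 1028 nodes remaining connected after repair.<br />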
IV. TWO KEY ALGORITHMS GETTING<br />
EXPERIMENT DATA<br />
Because Internet topology has an important influence on the<br />
Internet’s ability to resist destruction, the NSF (National<br />
Science Foundation of America) funds the National Laboratory<br />
for Applied Network Research to measure and analyze Internet<br />
topology. The original measurement results include the AS-level<br />
Internet topology, which faithfully reflects the state of<br />
Internet connections. Taking into account the fidelity of<br />
network simulation and the constraints of the experimental<br />
hardware, the experiments that validate the repair methods in<br />
this paper were run on the MATLAB simulation platform. To<br />
bring the simulation closer to the real Internet, we used<br />
statistics from the real network. The specific steps for<br />
obtaining the experimental data are:<br />
A. Algorithm <strong>of</strong> Sampling from Actual Network.<br />
% data: imported edge list, one edge (node1, node2) per row<br />
b = zeros(37447, 3);            % columns 1-2: endpoints; column 3: sampled flag<br />
b(:, 1:2) = data(1:37447, 1:2);<br />
a = zeros(2400, 2);             % sampled edge list<br />
k = 1;                          % index of the most recently sampled edge<br />
a(1, :) = b(4513, 1:2);         % seed edge<br />
b(4513, 3) = 1;                 % mark the seed so it is not sampled twice<br />
for m = 1:50<br />
    for i = m+1:37447<br />
        for j = 1:2<br />
            % take an unsampled edge that shares an endpoint with edge a(k,:)<br />
            if b(i,3) == 0 && (b(i,j) == a(k,1) || b(i,j) == a(k,2))<br />
                k = k + 1;<br />
                a(k, :) = b(i, 1:2);<br />
                b(i, 3) = 1;<br />
            end<br />
        end<br />
    end<br />
end<br />
a = a(1:k, :);                  % drop unused preallocated rows<br />
Input: a network with tens of thousands of nodes.<br />
Output: a network with about one thousand nodes.<br />
First, a matrix is imported from the measured data; starting<br />
from one node, linear detection extracts a connected sub-graph,<br />
which is stored in another matrix. Because the node numbers of<br />
this sub-graph are not contiguous, its nodes must be renumbered<br />
starting from 1.<br />
The actual data come from<br />
http://moat.nlanr.net/routing/rawda-ta (total number of edges<br />
37,448; total number of nodes 26,589, of which 13,010 have<br />
degree zero; maximum degree 2637; average degree 5.515576,<br />
that is, 2 × 37448 / (26589 − 13010)). The network was sampled<br />
by the detection method (after sampling: 2358 edges, 1028<br />
nodes, maximum degree 191, average degree 4.587549). Figure 2<br />
shows that the result obeys the degree distribution of the<br />
real network: nodes with degree 1 to 5 account for about 80%<br />
of the network nodes, and nodes with large degree are few.<br />
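The two average degrees quoted above can be reproduced from the edge and node counts, since every undirected edge contributes to the degree of two nodes. The short Python check below uses only figures from the text; the function name and the reading that the 13,010 zero-degree node numbers are excluded from the average are assumptions.<br />

```python
def average_degree(n_edges, n_nodes):
    """Mean degree of an undirected graph: each edge adds one to two nodes."""
    return 2 * n_edges / n_nodes


# Raw data: 37448 edges, 26589 node numbers of which 13010 have degree zero.
full = average_degree(37448, 26589 - 13010)
# Sampled sub-graph: 2358 edges over 1028 nodes.
sampled = average_degree(2358, 1028)

print(round(full, 6), round(sampled, 6))
```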
B. Generation of BA Scale-free Networks<br />
The steps are as follows:<br />
I) Initializing the network<br />
(1) nodes ← zeros (N)<br />
(2) cii ← zeros (1, N)<br />
(3) t ← zeros (1, N)<br />
(4) for i ←1: m<br />
(5) nodes (i,m+1) ←1<br />
(6) nodes (m+1,i) ←1<br />
(7) list (i) ←i<br />
(8) end<br />
(9) for i←m+1:2*m<br />
(10) list (i) ←m+1<br />
(11) end<br />
II) Adding nodes and edges to the network; at each time step t,<br />
2m entries are appended to the auxiliary vector list.<br />
(1) for n←m+2: N<br />
(2) t←2*m*(n-m-1)<br />
(3) for i←1: m<br />
(4) list (t+i) ←n<br />
(5) end<br />
(6) k←1<br />
(7) while k0&p(k)
Figure 2 Degree distribution of the sampled Internet (horizontal axis: degree, 0 to 200; vertical axis: 0 to 0.5)<br />
Figure 3 BA network degree distribution<br />
Figure 4 Sampling Internet degree and BA network degree<br />
distribution in Logarithmic Coordinates<br />
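The paper does not state how the power-law exponent k reported in the tables was estimated. One simple possibility, sketched here in Python purely as an assumption, is a least-squares line fitted to the degree distribution in logarithmic coordinates, as plotted in Figure 4; the function name is hypothetical, and this estimator is known to be rough for noisy tails.<br />

```python
import math
from collections import Counter


def power_law_exponent(degrees):
    """Least-squares slope of log P(d) against log d, returned as a positive k."""
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    total = sum(counts.values())
    xs = [math.log(d) for d in counts]
    ys = [math.log(c / total) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```

The input is simply the list of node degrees, for example the column sums of an adjacency matrix.<br />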
V. EXPERIMENT PROCESS AND RESULT<br />
ANALYSIS<br />
To validate the effectiveness of the repair methods in this<br />
paper, the experiments were run on the MATLAB simulation<br />
platform. To bring the simulation closer to the real Internet,<br />
we used statistics from the real network.<br />
A. Repair Process under Single-node Selective Attack<br />
Following the algorithm of the previous section, single-node<br />
selective attack and repair were applied to the sampled<br />
Internet and the BA network as follows:<br />
(1) A new node is added to the sampled Internet and to the BA<br />
network, and connected to the m (here m = 3) nodes with<br />
maximum degree in each network;<br />
(2) For the sampled Internet, nodes are selected at ratios<br />
r = 0.0125, 0.03, 0.2, 0.33, 0.5, 1: node 40 (degree 3), node<br />
1019 (degree 6), node 632 (degree 14), node 1015 (degree 21),<br />
node 599 (degree 25) and node 457 (degree 191); each time, one<br />
of them is attacked and all edges connected to it are removed.<br />
For the BA network, nodes are selected at ratios r = 0.0047,<br />
0.04, 0.17, 0.33, 1, 1: node 114 (degree 3), node 85 (degree 7),<br />
node 194 (degree 10), node 127 (degree 15), node 13 (degree 24)<br />
and node 3 (degree 69); each time, one of them is attacked and<br />
all edges connected to it are removed;<br />
(3) Each attacked node preferentially reconnects to m (m = 1)<br />
node, while each one loses one edge;<br />
(4) Steps (1), (2) and (3) are repeated until the number of<br />
nodes in the network stays at 1027, the average degree of the<br />
sampled Internet stays at about 4.5, and the average degree of<br />
the BA network settles at about 3.8;<br />
(5) The connectivity rate and power-law exponent of both<br />
networks and the exponential cutoff (index sharp cut) of the<br />
sampled Internet are then calculated. The connectivity rate<br />
s(r) of the sampled Internet and of the BA network equals the<br />
total number of nodes in the network after repair divided by<br />
1028. The results are shown in Table I, Table II and Table III.<br />
Table I RATE OF CONNECTIVITY OF THE SAMPLING NETWORK s (r) AND THE POWER-LAW k<br />
r 0.0125 0.03 0.2 0.33 0.5 1<br />
s(r) 1.0 0.999027 0.998054 0.997082 0.990272 0.955253<br />
dmax 192 192 192 192 192 122<br />
dave 4.585 4.578 4.555 4.520 4.525 4.354<br />
k 2.372 2.384 2.467 2.352 2.664 2.572<br />
Table II RATE OF CONNECTIVITY OF THE BA NETWORK s (r) AND THE POWER-LAW k<br />
r 0.0047 0.04 0.17 0.33 1 1<br />
s(r) 0.995136 0.995136 0.995136 0.996109 0.995136 0.995236<br />
dmax 55 55 55 55 54 55<br />
dave 3.858 3.848 3.836 3.81 3.767 3.767<br />
k 3 3 3 3 3 3<br />
Table III INDEX SHARP CUT OF THE SAMPLING INTERNET<br />
r 0.0 0.01 0.03 0.05 0.07<br />
Kc(r) 27 20 14 12 8<br />
B. Repair Process under Multi-node Cluster Attack<br />
Following the algorithm of the previous section, multi-node<br />
cluster attack and repair were applied to the sampled<br />
Internet and the BA network as follows:<br />
(1) A new node is added to the sampled Internet and to the BA<br />
network, and connected to the m (here m = 3) nodes of maximum<br />
degree;<br />
(2) In the sampled Internet, the following nodes are attacked:<br />
all central nodes (1% of the total number); 10% and 50% of the<br />
sub-central nodes (3% of the total); 10% and 50% of the<br />
intermediate-degree nodes; and 10% and 50% of the small-degree<br />
nodes (60% of the total). The same proportions of nodes were<br />
also tested in the BA model;<br />
(3) Each attacked node preferentially reconnects to m (m = 1)<br />
node in the network, while these m nodes lose edges;<br />
(4) Steps (1), (2) and (3) are repeated until the number of<br />
nodes in the network stays at 1027, the average degree of the<br />
sampled Internet stays at about 4.5, and the average degree of<br />
the BA network settles at about 3.8;<br />
(5) The connectivity rate and power-law exponent of both<br />
networks and the exponential cutoff (index sharp cut) of the<br />
sampled Internet are then calculated. The results are shown in<br />
Table IV, Table V and Table VI.<br />
C. Analysis <strong>of</strong> Experimental Results<br />
Table I and Table IV show that the repair method achieves a<br />
very high repair rate on the sampled real Internet: even if<br />
the maximum-degree nodes are attacked, or nodes are subjected<br />
to cluster attack, a simple repair keeps the network<br />
connectivity rate above 95%, and at about 98% or higher when<br />
nodes of ordinary degree are attacked. Table II and Table V<br />
show that, for different r, the repair rate of the BA network<br />
is about 99% or higher. Tables I, II, IV and V again indicate<br />
that constructing a BA network with the Matthew Effect<br />
generates a topology very close to that of the real Internet.<br />
However, although the real Internet follows the Matthew Effect<br />
as it is built, it is also affected by other factors, and<br />
studying these factors would benefit research on the real<br />
Internet. Tables I and II also show that the average degree of<br />
the network decreases at high rates r, which indicates that<br />
the repair method tends to remove redundant edges from the<br />
network.<br />
From Tables I, II, IV and V, with this method a power-law<br />
network maintains both its power-law and a high repair rate,<br />
and its original topological properties are also well<br />
preserved. Tables III and VI show that an exponential cutoff<br />
(index sharp cut) appears in the sampled Internet at different<br />
selection ratios, which again illustrates that the real<br />
network, in addition to following the Matthew Effect, is<br />
affected by other factors as it is built.<br />
Table IV RATE OF CONNECTIVITY OF THE SAMPLING NETWORK s (r) AND THE POWER-LAW k AFTER REPAIR<br />
Node class (node ratio r): Central node (50%); Next central node (10%, 50%); Intermediate value node (10%, 50%); Small scale value node (10%, 50%)<br />
s(r) 1.0 0.999027 0.988054 0.997082 0.980272 0.9955 0.9936<br />
dmax 167 178 192 192 192 192 192<br />
dave 4.585 4.578 4.555 4.520 4.525 4.354 4.237<br />
k 2.37 2.42 2.54 2.28 2.46 2.41 2.39<br />
Table V RATE OF CONNECTIVITY OF THE BA NETWORK s (r) AND THE POWER-LAW k AFTER REPAIR<br />
Node class (node ratio r): Central node (50%); Next central node (10%, 50%); Intermediate value node (10%, 50%); Small scale value node (10%, 50%)<br />
s(r) 0.998 0.999027 0.998054 0.997082 0.990272 0.9975 0.99<br />
dmax 173 189 192 192 192 192 0.99<br />
dave 4.585 4.578 4.555 4.520 4.525 4.354 4.37<br />
k 3 3 3 3 3 3 3<br />
Table VI INDEX SHARP CUT OF THE SAMPLING INTERNET AFTER REPAIR<br />
Node class (node ratio r): Central node (50%); Next central node (10%, 50%); Intermediate value node (10%, 50%); Small scale value node (10%, 50%)<br />
k 14 9 12 21 23 19 17<br />
VI. THE STABILITY OF COMPLEX NETWORK IN<br />
THE REPAIR PROCESS<br />
Using the complex network repair algorithm based on the<br />
Matthew Effect, this paper studied and compared random,<br />
scale-free and small-world networks and found that all of them<br />
can evolve into a state of equilibrium. The stability S(t) is<br />
therefore introduced to describe the extent to which the<br />
system has been repaired and how easily the network can be<br />
repaired.<br />
Current international and domestic studies on the destruction<br />
of complex networks are still limited to robustness, that is,<br />
the capacity of a complex network to withstand external<br />
damage. This is the first study of the characteristics of<br />
complex networks under both destruction and repair.<br />
In general, the size of the maximal connected sub-graph of the<br />
network tends to a stable value during the process of constant<br />
attack and repair; the network is then said to have reached a<br />
steady state.<br />
Let N(t) denote the size of this sub-graph as it changes over<br />
time, giving the sequence N(t0), N(t1), N(t2), N(t3), …, N(tn).<br />
The stability S(t) is then defined as:<br />
N(<br />
t0<br />
)<br />
S(<br />
t)<br />
=<br />
(1)<br />
N(<br />
tn<br />
) n→∞<br />
That is, the stability S(t) is the ratio of the size of the<br />
network to the size of the largest connected sub-graph of the<br />
final network under repeated attack and repair.<br />
Here N(tn) denotes the size of the largest connected sub-graph of<br />
the network and N(t0) denotes the size of the network. As the<br />
definition shows, S(t) is a step-wise increasing<br />
function with initial value 1, so its value is always<br />
greater than or equal to 1.<br />
S(t) reflects, to some extent, the stability of the network<br />
topology. The greater S(t) is, the more easily the system<br />
reaches a steady state after repair. Compared with the<br />
topology at other times, the topology at this time<br />
is easier to repair.<br />
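The definition in (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the graph is given as a plain node list and edge list, and S is evaluated at a single finite step rather than in the limit n → ∞.

```python
from collections import deque


def largest_component_size(nodes, edges):
    """Return the size of the largest connected component of an undirected graph."""
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        # Ignore edges that touch nodes removed by an attack.
        if u in adjacency and v in adjacency:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, best = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            u = queue.popleft()
            size += 1
            for w in adjacency[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def stability(initial_size, nodes, edges):
    """S = N(t0) / N(tn): network size over the current largest component size."""
    giant = largest_component_size(nodes, edges)
    if giant == 0:
        raise ValueError("the network has no nodes left")
    return initial_size / giant
```

For an intact connected network the function returns 1, and the value grows as the largest connected sub-graph shrinks, which matches the statement that S(t) is never smaller than 1.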
Figure 5 shows the evolution of the stability with<br />
the time step t for the sampled Internet of size N = 1028,<br />
with connection probability and repair probability Pr = PC =<br />
0.02.<br />
We find that S(t) grows very fast at the beginning; as the<br />
evolution proceeds, its growth gradually slows and it eventually<br />
reaches a balance value Sb. A gradually increasing stability S(t)<br />
means that the system becomes more balanced<br />
through a series of attacks and repairs, so that good<br />
restoration results are achieved more easily.<br />
One point is worth noting here: the stability S(t)<br />
finally reaches the balance value Sb. In equilibrium, the<br />
value of S(t) is the largest that the system attains. This<br />
implies that the system reaches a vulnerable state after<br />
thousands of evolution steps.<br />
Figure 5. Stability S(t) versus time t, where t is the number of repairs<br />
and S(t) is the stability.<br />
VII. CONCLUSION<br />
Single-node selective attacks and multi-node cluster<br />
attacks are the most difficult attacks on complex networks to deal<br />
with. For these two kinds of attack, this paper proposed a repair<br />
method for complex networks based on the Matthew Effect.<br />
Experimental results show that the repair rate of the proposed<br />
method can reach 95% or more under attacks on both the sampled<br />
Internet and the BA network. Applying the idea<br />
of building the BA network to the repair of power-law<br />
distribution networks not only yields a high repair rate<br />
but also optimizes the network topology. To measure the level of<br />
complex network repair, we also proposed the concept<br />
of stability and described it mathematically. Experimental<br />
results show that, after several steps of attack and repair,<br />
a complex network gradually evolves into a relatively<br />
stable state. In this state, the complex network is easily<br />
repaired.<br />
The Matthew Effect increases the efficiency of information<br />
exchange in a network, but it also brings problems for<br />
network security. If high-value network nodes are<br />
attacked, the probability increases that some nodes<br />
in the network can no longer connect with the others. We<br />
will consider this issue in future research.<br />
ACKNOWLEDGEMENT<br />
This work was supported by Project 60572137 of the<br />
National Science Foundation, Project 10JJ9025 of the<br />
Hunan Natural Science Foundation, Project 2009GK3036<br />
of the Hunan Science and Technology Plan and Project<br />
10C1185 of the Hunan Province of Science Research.<br />
REFERENCES<br />
[1] Wang Xiao-fang, Li Xiang, Chen Guan-rong. Complex<br />
network theory and application[M]. Beijing: Tsinghua<br />
University Press, 2006:11-14.<br />
[2] Carreras B A, Newman D E, Dobson I, et al. Evidence for<br />
self organized criticality in electric power system<br />
blackouts[C]. Thirty-fourth Hawaii International<br />
Conference on System Sciences. Maui, Hawaii,<br />
2001:705-709.<br />
[3] Wu Jun, Tan Yue-jin. Complex network anti-destroying<br />
ability estimation research[J]. System Engineering Journal,<br />
2005(2):128-131.<br />
[4] Chen Zhen-yi, Wang Xiao-fang. Congestion and control in<br />
scale-free network[J]. System Engineering Journal,<br />
2005,20(1):132-138.<br />
[5] Faloutsos M, Faloutsos P, Faloutsos C. On power-law<br />
relationship of the Internet topology[J]. ACM SIGCOMM<br />
Computer Communication Review, 1999,29(4):251-262.<br />
[6] Albert R, Barabási A L. Statistical mechanics of complex<br />
network[J]. Review of Modern Physics, 2002,74(1):47-97.<br />
[7] Barthelemy M, Amaral L A N. Small-world networks:<br />
Evidence for a crossover picture[J]. Phys. Rev. Lett.,<br />
1999,82:5180-5184.<br />
[8] Erdos P, Renyi A. On the evolution of random graph[J].<br />
Publ. Math. Inst. Hung. Acad. Sci., 1960,5:17-60.<br />
[9] Watts D J, Strogatz S H. Collective dynamics of<br />
small-world networks[J]. Nature, 1998,393(6684):440-442.<br />
[10] Holme P, Kim B J, Yoon C N, et al. Attack vulnerability of<br />
complex networks[J]. Phys. Rev. E, 2002,65(5):056109.<br />
[11] Xiao Zhong-zhe, Dong Zai-Wang. Improved GIB<br />
synchronization method for OFDM system[J]. IEEE<br />
Telecommunications, 2003,2(8):1417-1421.<br />
[12] Criado R, Flores J, Hernández-Bermejo B, et al. Effective<br />
measurement of network vulnerability under random and<br />
intentional attacks[J]. Journal of Mathematical Modelling<br />
and Algorithms, 2005,4(3):307-316.<br />
[13] Che Hong-an, Gu Ji-fa. Scale-free network and its system<br />
scientific significance[J]. System Engineering Theory and<br />
Practice, 2004(4):11-16.<br />
Minsheng Tan, from Hunan Province, China, was born in<br />
September 1965. He is a master's supervisor and<br />
graduated from the Department of Computer Science,<br />
Wuhan University. His research interests<br />
include computer networks and<br />
information security.<br />
He is a professor at the School of<br />
Computer Science and Technology,<br />
University of South China.<br />
Prof. Tan is a member of ACM, a senior member of the China<br />
Computer Society, a director of the Hunan Computer Society,<br />
an executive director of the Hunan Computer Committee of Higher<br />
Education Institute, and an executive director of the Hunan Computer Users<br />
Association.<br />
Qiang Cui, from Shandong Province, China, was born in<br />
November 1981. He holds a master's degree and<br />
graduated from the School of Computer<br />
Science and Technology, University of<br />
South China. His main research<br />
interest is complex networks.<br />
Lingfeng Zhu, from Hunan Province, China, was born in<br />
June 1984. He is working toward a<br />
master's degree in computer science at the University<br />
of South China. His main research<br />
interests include computer networks and<br />
information security.<br />
Hui Zhao, from Henan Province, China, was born in<br />
October 1986. He is working toward a<br />
master's degree in computer science at the University<br />
of South China. His main research<br />
interest is trusted networks.<br />
Study and Design an Anycast Routing Protocol for<br />
Wireless Sensor Networks<br />
Demin Gao<br />
Nanjing University of Science and Technology, Department of Computer Science and Engineering, Nanjing, China<br />
Email: gdmnj@163.com<br />
Huanyan Qian, Zheng Wang, Jiguang Chen<br />
Nanjing University of Science and Technology, Department of Computer Science and Engineering, Nanjing, China<br />
Email: ninanan@tom.com, wangzheng@163.com, chenjiguang@163.com<br />
Abstract—In wireless sensor networks, there is usually a sink<br />
which gathers data from the battery-powered sensor nodes.<br />
As sensor nodes around the sink consume their energy faster<br />
than the other nodes, several sinks have to be deployed to<br />
increase the network lifetime. Anycast is a mechanism by which<br />
the source node sends its data to the nearest sink node. This<br />
paper studies and designs an anycast service for deploying<br />
several sinks in a wireless sensor network. A novel tree-based<br />
anycast approach is proposed to minimize the path cost.<br />
The nodes form a tree with a sink node as the root, while<br />
the height of the tree integrates multiple metrics to calculate<br />
the path cost based on diverse selection criteria. The model<br />
is discussed and analyzed in detail. The experimental<br />
data support its validity and efficiency. Computer simulation<br />
shows that the proposed scheme effectively reduces and balances<br />
the energy consumption among the nodes, and so<br />
significantly extends the network lifetime compared with<br />
existing schemes.<br />
Key words: Wireless sensor networks; Anycast; Routing<br />
protocol<br />
I. INTRODUCTION<br />
Wireless sensor networks have attracted a lot of attention in<br />
recent years due to their promising techniques and wide-ranging<br />
applications. This kind of network consists<br />
of a large number of low-cost, low-power, small-size,<br />
multifunction sensor nodes which can sense and process<br />
data and communicate with other nodes over a short distance.<br />
In many applications of wireless sensor networks, a<br />
sink node and numerous tiny sensor nodes are usually deployed<br />
randomly in the monitoring area. As the scale of the<br />
network increases, nodes close to the sink<br />
consume their energy faster than nodes farther away.<br />
When all the nodes around the sink have exhausted their<br />
energy, the sink node can no longer receive any data<br />
from the sensors or stay connected to the network.<br />
When this happens, the whole network is<br />
considered to be down. In addition, sensor nodes may be<br />
deployed in a remote or dangerous area in which servicing<br />
a node may be impossible. A solution to these problems is<br />
to deploy several sinks and let each tiny sensor node<br />
send its data to the nearest sink node. If<br />
the traffic is balanced among the sinks, the network<br />
lifetime can be significantly increased since the energy<br />
consumption will be almost equal for all the nodes in the<br />
network.<br />
doi:10.4304/jnw.6.12.1726-1733<br />
Internet Protocol Version 6 (IPv6) specifically defines<br />
a new addressing scheme called an "Anycast address", which is<br />
an identifier for a set of interfaces [1, 2]. A data packet<br />
sent to an Anycast address is routed<br />
to the "nearest" interface. Routing protocols can be roughly<br />
classified into unicast, broadcast, multicast, and anycast [3].<br />
Anycast technology is now widely studied in<br />
wireless networks. Anycast communication<br />
becomes quite important in a network with multiple sinks.<br />
Anycast can be an important paradigm for a wireless<br />
sensor network in terms of resources, robustness and<br />
efficiency for replicated service applications. Assuming<br />
that the sources and the sinks are distributed uniformly in the<br />
network, a source that sends its data packets to<br />
the "nearest" sink around the area in which the events<br />
happen reduces the number of hops each packet travels, which<br />
saves energy, reduces the cost of routing table<br />
maintenance and extends the network lifetime.<br />
This simple strategy is assumed to balance the energy<br />
consumption. When a sensor node produces data, it has to<br />
send it to any available sink. A simple sink selection strategy is to<br />
choose a sink for each source arbitrarily.<br />
This paper addresses the sink discovery and routing<br />
problem in sensor networks. Generic routing protocols<br />
designed for wireless ad hoc networks fail in sensor<br />
networks primarily because they are designed<br />
for more powerful nodes with a higher transmission range<br />
and power than sensors. In addition, the<br />
packet structure, routing table sizes, implemented code<br />
size and many other states that are maintained cannot be<br />
ported directly to tiny sensors. This paper contains a<br />
description of a protocol implementing the anycast service.<br />
The protocol constructs an anycast tree that is rooted at the sink and<br />
contains many sensor nodes as leaves. The objective is to<br />
select a minimum-cost path for every sensor node. The<br />
paper is organized as follows. In Section II we present a<br />
number of existing Anycast solutions. In Section III we<br />
specify the network model and energy model used. In Section<br />
IV we present our anycast protocol. Section V contains<br />
experimental results. Conclusions are presented in Section<br />
VI.<br />
II. RELATED WORKS<br />
The concept of anycast has been studied in multiple<br />
contexts, including network type, communications model,<br />
and purpose of usage. For example, anycast has been studied<br />
extensively in TCP/IP networks, where it is used for directing<br />
DNS queries to the closest root name server [4]. It is also<br />
used for server selection in distributed systems [5]. Anycast<br />
gained more attention when it was used to access gateways which<br />
interconnect IPv6 with IPv4 networks. Though<br />
Anycast was originally designed for Internet services, it has<br />
been applied to routing protocol design for wireless ad hoc<br />
and sensor networks. In mobile networking, there are<br />
some Anycast routing protocols which are mainly derived from<br />
existing routing protocols that were extended to<br />
support Anycast service.<br />
In [6, 7], the AODV protocol is used to<br />
support Anycast service. AODV is an on-demand reactive<br />
routing protocol designed for ad hoc networks. When there<br />
are packets to transmit, the source node initiates<br />
the process of route establishment. It is suitable for<br />
networks with mobile nodes. In addition, tree-based Anycast routing<br />
protocols [8, 9, 10] build an Anycast tree according to<br />
the protocol, usually extending the tree in units of<br />
hop count, physical distance or time interval.<br />
A query is forwarded along the<br />
best-fitting Anycast tree. Routing and sink discovery<br />
protocols designed for ad hoc networks are not<br />
well suited to sensor networks.<br />
Low-Energy Adaptive Clustering Hierarchy (LEACH)<br />
[11] is one of the representative clustering schemes. In<br />
LEACH, sensors are organized into clusters, and one node<br />
in each cluster acts as cluster-head and takes the<br />
responsibility to collect data, aggregate data and finally<br />
transmit data to the distant sink. The lifetime of<br />
heterogeneous wireless sensor networks can be increased<br />
in networks with more than one data sink when access to<br />
the sinks is provided by an Anycast protocol [12]. Such a<br />
network consists of two types of devices: resource-rich<br />
devices (information sinks) and resource-constrained devices (sensors<br />
generating new data) [13]. A similar concept for improving<br />
the energy efficiency of WSNs has been proposed in the<br />
HAR [14] protocol. All the above anycast solutions<br />
differ from the one in this paper. In each of them, the set of<br />
attributes used as the anycast address is not a singleton.<br />
Usually, a node sends data to the nearest sink rather than<br />
to a specific one, which differs from TCP/IP<br />
networks and ad hoc networks. Another type of<br />
anycast found in the WSN environment is<br />
anycasting to a region. Solutions such as SPEED [15] and<br />
HLR [16] assume a situation where it is sufficient to deliver<br />
a packet to any node in a specified area. Algorithms for<br />
region-targeted anycast rely on the strong spatial<br />
correlation of the attributes used for addressing, which is<br />
not the case in this paper.<br />
Taking into account the characteristics of wireless<br />
sensor networks, and in order to improve the performance<br />
of Anycast routing, this paper puts forward an<br />
Anycast tree-based routing algorithm for<br />
wireless sensor networks. Some protocols are simplified to<br />
suit wireless sensor network applications. The algorithm<br />
establishes an Anycast tree for each sink node.<br />
Each sensor node joins the Anycast tree which is nearest<br />
to it. Applications require certain cost metric(s), such as<br />
energy consumption, to be minimized in order to optimize performance.<br />
Thus, applications require multiple metrics to be used for<br />
path cost calculation to guarantee the performance. Based<br />
on the multiple-metric path cost specified by the<br />
application requirement, the path with the minimum cost<br />
value is selected as the best route. This algorithm can<br />
balance the network load, extend the lifetime of the whole<br />
network and improve the performance of<br />
the Anycast routing algorithm.<br />
III. SYSTEM MODEL AND PATH SELECTION<br />
We first discuss the topology model, energy model and<br />
path selection metrics used in the proposed routing<br />
scheme.<br />
A. Topology Model<br />
Consider a static wireless network modelled as an<br />
undirected graph G = (V, A), where V is the set of sensor<br />
nodes and sink nodes and A is the set of links. A graph is<br />
simple if it has no loops and no two of its links join the<br />
same pair of vertices. An acyclic graph is one that contains<br />
no cycles. A tree is a connected acyclic graph. A sink tree<br />
is a tree with a sink node as tree root and sensor nodes as<br />
tree leaves. G consists of a finite nonempty vertex set V<br />
and an edge set A of pairs of distinct vertices of V. A<br />
leaf is a vertex of degree 1. Two nodes i and j are<br />
connected by a link if they can transmit a packet to each<br />
other with a transmission power less than the maximum<br />
transmission power at each node. Thus all links are<br />
assumed to be bi-directional. This assumption is not<br />
necessary for the convergence of the distributed<br />
algorithms, but it makes the presentation clearer.<br />
The set of nodes connected to node i by links is<br />
denoted as $N_i$. We assume that the network graph is<br />
connected, i.e., there always exists a path between any pair<br />
of nodes i and j in V.<br />
We consider a wireless sensor network that contains a number of<br />
sensor nodes and multiple sinks, distributed randomly<br />
in a given region. The sensor nodes<br />
transmit the information they have collected to a sink<br />
node. We make the following assumptions about the sensor nodes<br />
and the underlying network model:<br />
• All sensor nodes start with the same<br />
initial energy. The sink nodes have no<br />
energy constraint.<br />
• Every node is aware of its own location. A<br />
sensor node can compute the approximate distance<br />
to the source based on the received location<br />
information.<br />
• The transmitting power of a sensor node is<br />
controllable, which means the transmitting power<br />
can be adjusted according to the transmitting<br />
distance.<br />
• Sink and sensor nodes are static. All nodes are<br />
homogeneous and have the same capabilities.<br />
Each node except the sink nodes is assigned a unique<br />
identifier (ID); all sink nodes form an<br />
anycast group and share one ID.<br />
These assumptions are reasonable given the development<br />
and progress of wireless hardware technology and low-power<br />
computing technology.<br />
B. Energy Model<br />
The power consumption of a sensor node consists of<br />
four parts: sensing and generating data, idling, receiving,<br />
and transmitting. The power $e_g$ for generating one bit<br />
of data is assumed to be the same for all nodes. The idle<br />
power consumed by a node, denoted by $e_s$, is assumed to be<br />
the same for all nodes and independent of traffic. For<br />
power consumption in receiving and transmitting, the first-order<br />
radio model of [17-19] is adopted. Specifically, a<br />
node needs $\varepsilon_{elec} = 50$ nJ/bit for running the circuitry and<br />
$\varepsilon_{amp} = 100$ pJ/bit/m$^2$ for the transmitting amplifier.<br />
Therefore, the power consumption for receiving one bit of<br />
data is given by $e_r = \varepsilon_{elec}$. The power consumption for<br />
transmitting one bit of data to a neighbor node j is given<br />
by $e_{ij} = \varepsilon_{elec} + \varepsilon_{amp} \cdot d_{ij}^{n}$, where n is the path loss exponent,<br />
which typically ranges between 2 and 4 for free-space and<br />
short-to-medium-range radio communication. Let $E_i$<br />
denote the initial battery energy of node i and $w_i$ denote<br />
the power consumption of node i for one bit of data:<br />
$$w_i = e_s + e_g + e_r + e_{ij} \qquad (1)$$<br />
where the first term is the idling power consumption, the<br />
second term is the power for sensing, the third term is the<br />
power consumption for receiving and the last term is the<br />
power consumption for transmitting.<br />
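The energy model above can be written as a short Python sketch. The constants are those given in the text; the function names, the default path loss exponent of 2 and the choice of joules per bit as the unit are assumptions made for illustration only.

```python
E_ELEC = 50e-9    # circuitry energy, J/bit (50 nJ/bit)
E_AMP = 100e-12   # amplifier energy, J/bit/m^2 (100 pJ/bit/m^2)


def rx_energy_per_bit():
    """e_r: energy for receiving one bit."""
    return E_ELEC


def tx_energy_per_bit(distance_m, path_loss_exponent=2):
    """e_ij: energy for transmitting one bit over distance_m metres."""
    return E_ELEC + E_AMP * distance_m ** path_loss_exponent


def node_energy_per_bit(e_s, e_g, distance_m, path_loss_exponent=2):
    """w_i = e_s + e_g + e_r + e_ij, following equation (1) of this section."""
    return (e_s + e_g + rx_energy_per_bit()
            + tx_energy_per_bit(distance_m, path_loss_exponent))
```

With these constants the amplifier term equals the circuitry term at roughly 22 m for n = 2, so beyond that distance transmission cost is dominated by the distance-dependent part.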
C. Path Selection<br />
A simple linear combination of different routing<br />
metrics is used to determine the path cost, as shown in<br />
the following equation:<br />
$$\phi = \phi' + \sum_{i \in V} \alpha_i \cdot metric_i \qquad (2)$$<br />
where $\phi'$ is the accumulated cost of the previous nodes along<br />
the path, $metric_i$ is a value scaled to (0, 1) and $\alpha_i$ is the<br />
weight factor (or coefficient) of $metric_i$ used to<br />
calculate the cost. Depending on the application requirement,<br />
these weight factors can be varied flexibly to change the<br />
importance of the cost metrics during route discovery. Our<br />
protocol adopts four path cost metrics: hop count, energy<br />
cost, data delay, and remaining energy. Therefore, the path<br />
cost equation becomes:<br />
$$\phi = \phi' + \alpha_1 \cdot hop_i + \alpha_2 \cdot w_i + \alpha_3 \cdot delay + \alpha_4 \cdot E_i \qquad (3)$$<br />
Here $hop_i = 1$ is the hop count, the energy cost $w_i$<br />
denotes the normalized energy cost for the link from the<br />
previous hop to the current node, the data delay denotes the<br />
time for transmitting the data from one node to the next, and<br />
$E_i$ denotes the remaining energy. Different applications can<br />
define their requirements by choosing different sets of<br />
weight factors. For example, an application might only<br />
want to consider energy consumption, in which case<br />
$(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (0, 1, 0, 0)$. In order to demonstrate how different<br />
requirements and path cost metrics guide route<br />
discovery and resource consumption, simulations with<br />
three different network deployments are conducted. This<br />
model is used as the strategy for choosing the next node.<br />
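Equation (3) amounts to a weighted sum added to the cost accumulated so far. The following Python sketch shows one hop of that update; the function and parameter names are hypothetical, and all metric values are assumed to be already scaled to (0, 1) as the text requires.

```python
def path_cost(prev_cost, hop, energy_cost, delay, remaining_energy, weights):
    """One-hop update of the path cost, following equation (3).

    prev_cost is the accumulated cost of the previous nodes along the path;
    weights is the tuple (alpha1, alpha2, alpha3, alpha4).
    """
    a1, a2, a3, a4 = weights
    return (prev_cost
            + a1 * hop
            + a2 * energy_cost
            + a3 * delay
            + a4 * remaining_energy)


# Energy-only application from the text: (alpha1..alpha4) = (0, 1, 0, 0).
energy_only = path_cost(0.5, 1, 0.2, 0.1, 0.9, (0, 1, 0, 0))
```

Because the update is additive, a node only needs the accumulated cost carried in the received packet plus its own local metrics to compute the new value.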
D. Problem Definition<br />
The core of an anycast routing protocol for wireless<br />
sensor networks is to select a "nearest" sink as destination.<br />
The problem of optimal sink selection can be formulated<br />
as follows. Consider a case of n sources $\{s_1, s_2, \ldots, s_n\}$ and<br />
a group of k sink nodes, where $1 \le k \le n$. The problem is<br />
to assign the n sources to the k sink nodes so that the total<br />
path cost of the network is minimized. The problem can be<br />
formulated as a 0-1 integer programming problem as<br />
follows:<br />
$$\min \sum_{i=1}^{n} \sum_{j=1}^{k} \phi_{ij} \lambda_{ij} \qquad (4)$$<br />
subject to<br />
$$\sum_{j=1}^{k} \lambda_{ij} = 1 \quad (1 \le i \le n) \qquad (5)$$<br />
$$\lambda_{ij} \in \{0, 1\} \quad (1 \le i \le n,\ 1 \le j \le k) \qquad (6)$$<br />
where $\phi_{ij}$ is the path cost of the best route between<br />
node i and node j, and $\lambda_{ij}$ is a binary variable used for<br />
sink selection: if the best sink node chosen for node i is<br />
node j, then $\lambda_{ij} = 1$; otherwise $\lambda_{ij} = 0$. Constraint (5) states<br />
that node i transmits all its packets to exactly one sink.<br />
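Since constraints (5) and (6) only require each source to pick exactly one sink and place no capacity limit on the sinks, the objective (4) separates per source: the minimum is reached when every source chooses its own cheapest sink. A minimal Python sketch under that observation follows; the function name and the list-of-lists cost matrix are assumptions for illustration.

```python
def assign_sinks(cost):
    """Solve (4)-(6) when sinks have no capacity limit.

    cost[i][j] is the path cost of the best route from source i to sink j.
    Returns (assignment, total), where assignment[i] is the chosen sink index.
    Ties are resolved in favour of the lowest sink index.
    """
    assignment = []
    total = 0.0
    for row in cost:
        best_sink = min(range(len(row)), key=row.__getitem__)
        assignment.append(best_sink)
        total += row[best_sink]
    return assignment, total
```

For example, with the cost matrix `[[3, 1], [2, 5], [4, 4]]` the three sources select sinks 1, 0 and 0 for a total cost of 7. If balancing the load across sinks were added as a constraint, this separable shortcut would no longer apply.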
IV. ANYCAST ROUTING PROTOCOL<br />
The proposed anycast routing scheme employs<br />
a tree-based approach to distribute the<br />
energy load evenly among the sensors in the network and<br />
thus minimize data transfer time. An objective of our<br />
protocol is to establish a connection between sensor nodes<br />
and sinks which belong to an anycast group based on<br />
multiple path selection metrics. Thus, the selected sink can<br />
forward packets to the destination in the core network.<br />
A. Packet Format<br />
Four types of control packets are designed for our<br />
protocol, as explained in this section. The Hello packet<br />
(HELLO) is a special type of packet generated only by the<br />
sink nodes. It is broadcast periodically to all sensor<br />
nodes, for the benefit of sensor nodes that do not have any valid route<br />
to any member of the anycast group in their routing<br />
tables. The traditional Route Request (RREQ), Route Reply<br />
(RREP), and Route Error (RERR) packets are stripped of<br />
fields that are unnecessary for a WSN, such as the reserved fields,<br />
flags for multicast, prefix field, and lifetime field. In<br />
addition, a small HELLO packet is added for sink<br />
advertisement. A Hello message is transmitted<br />
periodically to advertise the presence of a sink. The<br />
transmission range of a mobile platform covers all<br />
sensor nodes, even those more than one hop away. Thus, there is no<br />
need for sensor nodes to retransmit the Hello message.<br />
Nodes receive the Hello packet and cache the information.<br />
The Route Request packet (RREQ) is generated to<br />
initiate route discovery. The RREQ differs from its<br />
counterpart in TCP/IP-based protocols such as AODV. The major<br />
differences are as follows. Instead of using a unicast address as<br />
the destination address, the packet has the anycast group ID as<br />
the destination address. Two more fields are added for<br />
adapting to application requirements and utilizing multiple<br />
metrics as path cost. In our protocol, the RREQ includes<br />
the CRQ (Child Request) and the PRQ (Parent Request). The CRQ is<br />
used to discover a child node and the PRQ to discover a parent<br />
node.<br />
The packet format in our protocol is defined as<br />
follows:<br />
(Type, Anycast group ID, Path cost φ, Next node’s<br />
ID, Node’s address)<br />
If Type = 1, the packet is a CRQ. If Type = 2, the packet is a<br />
PRQ. If Anycast group ID = 0, the packet comes<br />
from a sink; otherwise it comes from a sensor node. If the Next<br />
node’s ID is empty, the packet comes from a<br />
sensor node that has not yet discovered a route to a sink.<br />
A node does not need to remember its child nodes,<br />
because it does not transmit messages to its child<br />
nodes. The sink transmission range can cover all sensor<br />
nodes. Node’s address denotes the address of the node, such as<br />
the node ID and position.<br />
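The five-field packet and the interpretation rules above can be sketched as a small Python data class. The class, field and method names are hypothetical; representing an empty Next node's ID as `None` and the address as an (ID, position) pair are assumptions, not part of the protocol description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

CRQ = 1  # Child Request
PRQ = 2  # Parent Request


@dataclass
class ControlPacket:
    ptype: int                      # 1 = CRQ, 2 = PRQ
    anycast_group_id: int           # 0 marks a packet sent by a sink
    path_cost: float                # accumulated path cost (phi)
    next_node_id: Optional[int]     # None models an "empty" field
    node_address: Tuple[int, Tuple[float, float]]  # (node ID, (x, y))

    def kind(self) -> str:
        return {CRQ: "CRQ", PRQ: "PRQ"}.get(self.ptype, "UNKNOWN")

    def from_sink(self) -> bool:
        # Rule from the text: group ID 0 means the sender is a sink.
        return self.anycast_group_id == 0

    def sender_has_route(self) -> bool:
        # An empty Next node's ID means no route to a sink is known yet.
        return self.next_node_id is not None
```

A CRQ sent directly by a sink would then be built as `ControlPacket(CRQ, 0, 0.0, 0, (7, (1.0, 2.0)))`, with path cost zero and Next node's ID zero.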
The Route Reply packet (RREP) is generated by sinks or<br />
sensors in response to the corresponding RREQ packets. The<br />
destination anycast group ID represents the anycast group<br />
that the destination node belongs to. The accumulative<br />
path cost is the accumulated cost along the path from the<br />
destination node to the source node. The Route Error packet<br />
(RERR) is the same as that of the AODV protocol.<br />
B. Establishing an Anycast Tree<br />
The sensor nodes are distributed randomly in the monitoring<br />
area. There are multiple sink nodes and n<br />
sensor nodes. The anycast group, which contains all sink<br />
nodes, is assigned an identifier (ID = 0).<br />
Every sink node can construct an anycast<br />
tree with itself as the root. Sensor nodes obtain<br />
anycast services from the anycast tree. The protocol starts<br />
by creating a number of spanning trees. In this<br />
model, a sensor node that wants to become an Anycast<br />
member must first join an anycast tree. A sensor<br />
node joins an anycast tree through the following<br />
process:<br />
1) Every sink node broadcasts a CRQ query to its<br />
neighbor nodes within a small range. The CRQ contains the<br />
location information of the sink node, the ID of the<br />
Anycast group, and the path cost φ from the sink to the node that<br />
sent the CRQ. If the CRQ comes directly from a sink node,<br />
the value of φ is zero and the Next node’s ID is zero.<br />
2) A neighbor node that receives the CRQ and has not yet<br />
joined any anycast tree accepts the CRQ and<br />
checks, by means of the ID that identifies the anycast group,<br />
whether it comes from a tree node. If it does, the node appends the<br />
CRQ to its parent node table and records the parent node's<br />
relevant parameters, including its location information, the<br />
path cost φ and the ID of the anycast group it is asked to join. If the ID in<br />
the CRQ does not belong to any anycast tree node, the<br />
node discards the CRQ.<br />
3) After receiving a CRQ, the node sets a timer whose<br />
interval may be decided by the current network status.<br />
The node may receive more than one CRQ within this<br />
interval. After the timer expires, the node compares the<br />
path costs φ in the received CRQs, selects the neighbor node<br />
with the minimum path cost φ as its parent node,<br />
records the information about its parent node and<br />
returns an RREP to it. If several received CRQs carry the<br />
same minimum path cost φ, the node selects one of these<br />
neighbor nodes as its parent node at random.<br />
4) After the parent node receives the joining message,<br />
it returns an ACK message to the child node. Owing to<br />
the characteristics of the algorithm, each node only needs<br />
to retain the information about its parent node; the parent<br />
node does not need to record information about the<br />
child node. This differs from TCP/IP, which<br />
records the child’s IP. The child node replaces the Next node’s<br />
ID in the CRQ with the ID of its parent, recalculates the path<br />
cost φ from the sink to itself and replaces the path<br />
cost φ in the CRQ with the new φ. At the same time, it<br />
updates the relevant parameters (position, etc.) and<br />
broadcasts the CRQ to the next hop, until all nodes have joined an<br />
anycast tree, as shown in Fig. 1.<br />
Figure 1. An anycast tree is established from all sensors to a sink.<br />
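The outcome of the CRQ flood in steps 1) to 4) can be illustrated with a small sketch. It is not the authors' implementation: it assumes that the path cost φ is additive over hops, that timers are long enough for every node to hear its cheapest CRQ, and it breaks ties by arrival order rather than at random. The function name `build_anycast_trees` and the `links` structure are hypothetical.<br />

```python
import heapq


def build_anycast_trees(links, sinks):
    """Idealized result of the CRQ flood (a sketch, not the paper's code).

    links: dict node -> {neighbor: link_cost}; costs are assumed >= 0
           and additive along a path (an assumption of this sketch).
    sinks: iterable of sink node names.
    Returns (father, phi): each node's father and its path cost to a sink.
    """
    phi = {s: 0.0 for s in sinks}        # a sink advertises phi = 0
    father = {s: None for s in sinks}    # a sink is the root of its tree
    heap = [(0.0, s) for s in phi]
    heapq.heapify(heap)
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > phi[node]:
            continue                     # stale entry: a cheaper CRQ already won
        for nbr, w in links.get(node, {}).items():
            new_cost = cost + w          # recalculated path cost for the neighbor
            if new_cost < phi.get(nbr, float("inf")):
                phi[nbr] = new_cost
                father[nbr] = node       # a child remembers only its father
                heapq.heappush(heap, (new_cost, nbr))
    return father, phi


if __name__ == "__main__":
    demo_links = {
        "s1": {"a": 1.0, "b": 4.0},
        "s2": {"b": 1.5},
        "a": {"s1": 1.0, "b": 1.0},
        "b": {"s1": 4.0, "a": 1.0, "s2": 1.5, "c": 1.0},
        "c": {"b": 1.0},
    }
    print(build_anycast_trees(demo_links, ["s1", "s2"]))
```

Because every node keeps only a father pointer, the returned `father` map is all the state the trees need; two sinks yield two trees that together cover all reachable nodes.<br />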
C. A New Node Joins the Anycast Tree<br />
A sensor node that wants to use the anycast service must join an anycast tree. To do so, it broadcasts a joining message PRQ, which contains its location information, to its neighbor nodes. Any neighbor that receives the PRQ and has already joined an anycast tree accepts the PRQ and returns a CRQ. The CRQ contains the location information of the neighbor node, the ID of the anycast group, and the path cost φ from the sink node to this neighbor node.<br />
The joining node accepts each CRQ and appends it to its father node table together with the sender's relevant parameters, including the location information and the path cost φ from the sink node to the sender. It then sets a timer and waits for further CRQs within the time interval. When the timer expires, the node compares the path costs φ of the CRQs in its father node table and selects the node with the minimum φ as its father node. It recalculates the path cost φ from the sink to itself, replaces φ in the CRQ with the new value, and returns an RREP to its father node. If several CRQs share the minimum path cost φ, the node selects one of those neighbors at random. The father node receives the RREP and sends an ACK message to the child node, which completes the join.<br />
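The selection rule a joining node applies when its timer expires can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `choose_father`, the dictionary layout of the father node table, and the additive `link_cost` used to recalculate φ are all hypothetical, since the text does not say how the new φ is computed.<br />

```python
import random


def choose_father(crq_replies, link_cost, rng=random):
    """Pick a father from the CRQs collected before the timer expired.

    crq_replies: dict neighbor_id -> phi advertised in that neighbor's CRQ.
    link_cost:   dict neighbor_id -> cost of the hop to that neighbor
                 (hypothetical; the additive recalculation is an assumption).
    Returns (father_id, new_phi), or None if no CRQ was received.
    """
    if not crq_replies:
        return None
    best = min(crq_replies.values())
    # Several CRQs may share the minimum path cost: choose among them at random.
    tied = sorted(n for n, phi in crq_replies.items() if phi == best)
    father = rng.choice(tied)
    return father, best + link_cost[father]
```

For the situation of Fig. 2, `choose_father({"v3": 2.0, "v6": 3.5}, {"v3": 1.0, "v6": 0.5})` selects `v3`, because its advertised φ is the smaller one; the numeric values are made up for the example.<br />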
D. Node Departure or Failure<br />
Because sensor node power is constrained, some nodes exhaust their energy and fail. There are three cases when a node fails.<br />
1) The failed node is a leaf of the anycast tree and has no child. When the father node receives no information from the node within a preset time interval, the node is considered to have failed. This case is simple: as shown in Fig. 2, if v5 fails, v4 does not need to take any action or revise its own information.<br />
2) The failed node is an intermediate node with a child node. In Fig. 2, v2 is an intermediate node. If v2 fails, v4 and v5 are disconnected from v1, and the data they have collected cannot be transmitted to the sink node. In this case, v4 broadcasts a joining message PRQ to its neighbor nodes, such as v3 and v6. The process is the same as that of a new node joining an anycast tree, described in Section C. In Fig. 2, v4 receives CRQs from v3 and v6 and selects v3 as its father node because v3 has the smaller path cost (φ3 < φ6).<br />
3) The failed node is a sink node, that is, the root of an anycast tree. None of the data collected by the tree's nodes can be transmitted to the sink node, and the anycast tree becomes invalid. All nodes then restart the tree creation process described in Section B.<br />
Figure 2. When node v2 fails, node v4 is disconnected from the sink node s1 and must rebuild its connection through node v3.<br />
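The three failure cases can be told apart from the father pointers alone, as the following sketch shows. It is an illustration, not the authors' code: the function name `handle_failure` is hypothetical, and in case 3 it returns only the members of the failed sink's tree as the nodes that must rebuild, which is one reading of "all nodes" in the text.<br />

```python
def handle_failure(father, failed):
    """Classify a failure and list the nodes that must react.

    father: dict node -> father node (None for a sink, i.e. a tree root).
    Returns (case, nodes): case is "leaf", "intermediate" or "sink".
    """
    children = [n for n, f in father.items() if f == failed]
    if father[failed] is None:
        # Case 3: the root failed, so its whole tree must be rebuilt.
        members, frontier = [], children
        while frontier:
            members.extend(frontier)
            frontier = [n for n, f in father.items() if f in frontier]
        return "sink", sorted(members)
    if not children:
        # Case 1: a leaf failed; its father does not need to do anything.
        return "leaf", []
    # Case 2: each direct child broadcasts a PRQ and rejoins as in Section C.
    return "intermediate", sorted(children)
```

With a topology resembling Fig. 2 (the fathers of v3 and v6 are assumed here, since the text does not state them), a failure of v2 returns `("intermediate", ["v4"])`: only v4 rejoins, and v5 stays attached to v4.<br />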
E. Anycast Tree<br />
After tree construction is complete, every node has joined one of the anycast trees, as shown in Fig. 3. In this phase every node sends its collected data to its parent node. Every parent node receives data from its children, fuses the data with its own, and forwards the result to its own parent along the anycast tree. When the data from all member nodes in the anycast tree have been received, the sink node applies data fusion to the received data and then sends the fused data to the Internet or other devices. Sensor nodes do not know which sink node their data is being sent to during transmission, but the data is certain to reach one of the sinks.<br />
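The collection phase described above amounts to a bottom-up fusion along the father pointers. The sketch below illustrates that idea only; the fusion operator is not specified in the text, so `fuse` is a hypothetical parameter (here defaulting to `max`), and the function name `collect_at_sinks` is likewise an assumption.<br />

```python
from collections import defaultdict


def collect_at_sinks(father, readings, fuse=max):
    """Fuse sensor readings upward along each anycast tree.

    father:   dict node -> father node (None for a sink).
    readings: dict sensor node -> collected value.
    fuse:     hypothetical fusion function over a list of values.
    Returns dict sink -> fused value of its tree (None for an empty tree).
    """
    children = defaultdict(list)
    for node, f in father.items():
        if f is not None:
            children[f].append(node)

    def fused(node):
        # A parent fuses its children's results with its own reading.
        values = [v for v in (fused(c) for c in children[node]) if v is not None]
        if node in readings:
            values.append(readings[node])
        return fuse(values) if values else None

    return {n: fused(n) for n, f in father.items() if f is None}
```

Each sensor only ever addresses its father, which matches the observation that a node does not know which sink finally receives its data.<br />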
A notable feature of the proposed anycast routing protocol is that several trees are constructed instead of one, which allows more distributed operation among the nodes. Building the trees according to path cost strengthens this effect, which balances energy consumption and data delay among the nodes and extends network lifetime in the long run.<br />
Data collected by sensor nodes may contain redundant information because of spatiotemporal correlation, so it is desirable to aggregate the data at the sink to remove the redundancy. However, correlated data may be transmitted to different sinks. If the sinks forward such redundant information to the Internet or other devices, the frequent communication is vulnerable to wiretapping and the transmission interference becomes serious. Our protocol therefore takes data correlation into account: the data received by every sink is aggregated, all sinks form a tree, and one sink is selected at random as the root sink. The root sink gathers the data from all sinks and aggregates them. An example is shown in Fig. 3.<br />
Figure 3. Multiple anycast trees are established to cover all sensor nodes, and one sink is selected as the root sink. (Legend: source node, middle node, sink node, root sink, link, pseudo link.)<br />
V. SIMULATIONS<br />
In this section the performance of the anycast routing protocol is evaluated by computer simulation and compared with other schemes, namely AODV [6] and LEACH [11]. The network contains 100 nodes, 5 sink nodes and 95 sensor nodes, distributed randomly in a 100×100 region. The simulation parameters are given in Table 1. The transmission power of every node is adjustable, and nodes adjust it according to actual need when communicating with other nodes. Any two nodes within transmission range of each other can communicate directly.<br />
Fig. 4 shows the network topologies obtained by the different schemes. The topology of LEACH is shown in Fig. 4(a). Every sensor node is one hop from its sink, and there are no transmissions between sensor nodes: collected data is transmitted directly from the member nodes to the cluster head. Sensor nodes therefore consume more energy because of the many long-distance transmissions. The topology of our anycast routing protocol is shown in Fig. 4(b). Every sensor node joins an anycast tree according to path cost, and data is transmitted along the tree from the leaf sensors to the root sink. The proposed scheme displays a more balanced and distributed network pattern.<br />
TABLE 1. THE PARAMETERS USED IN THE SIMULATION<br />
Parameter | Value | Parameter | Value<br />
Size of target area | 100×100 | Data packet size | 512 byte<br />
Number of sink nodes | 5 | Metadata packet size | 25 byte<br />
Number of sensor nodes | 95 | Maximum radius, R | 20 m<br />
Initial energy | 10 J | α1 | 1<br />
ε_elec | 50 nJ/bit | α2 | 1<br />
ε_amp | 50 nJ/bit/m² | α3 | 1<br />
e_s | 100 nJ/s | α4 | 1<br />
Figure 4. The network topology with different protocols: (a) LEACH; (b) anycast routing protocol.<br />
For a network flow $f$, let $f_{ij}$ denote the rate of information flow from node $i$ to node $j$. The energy spent by node $i$ to transmit a unit of information directly to node $j$ is $e_{ij}$. Then the lifetime of node $i$ under flow $f_{ij}$ is given by<br />
\[ T = \frac{E_i}{w_i \cdot f_{ij}} = \frac{E_i}{(e_s + e_g + e_r + e_{ij}) \cdot f_{ij}} \]<br />
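As a worked illustration of the lifetime expression, the following sketch evaluates it for one node. The function name `node_lifetime` and all numeric values are hypothetical; the symbols e_s, e_g, and e_r are taken from the formula as given, without assuming more about their meaning than that they are per-unit energy terms in the same units as e_ij.<br />

```python
def node_lifetime(E_i, e_s, e_g, e_r, e_ij, f_ij):
    """Lifetime T = E_i / ((e_s + e_g + e_r + e_ij) * f_ij).

    All energies must share one unit (for example joules per bit for the
    e_* terms and joules for E_i); f_ij is the flow rate from i to j.
    """
    w_i = e_s + e_g + e_r + e_ij          # total energy spent per unit of flow
    if w_i <= 0 or f_ij <= 0:
        raise ValueError("energy per unit and flow rate must be positive")
    return E_i / (w_i * f_ij)


if __name__ == "__main__":
    # Made-up numbers: 10 J budget, 500 nJ per bit in total, 1000 bit/s.
    print(node_lifetime(10.0, 1e-7, 1e-7, 1e-7, 2e-7, 1000.0))
```

The expression makes the trade-off explicit: halving either the per-unit energy or the flow routed through a node doubles that node's lifetime.<br />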
Fig. 5 shows the network energy consumption and lifetime as the number of sensor nodes varies, with 2, 3, 4, 5, and 6 sinks deployed respectively. As the network size increases, the total energy consumption rate of the network rises and the network lifetime gradually shortens. As the number of sinks increases, the total energy consumption rate decreases and the network lifetime increases. At the same time, each newly added sink prolongs the lifetime by a smaller amount than the previous one. The explanation is as follows. With more nodes, the distance between nodes shortens, data correlation increases, and lower transmission power is needed, while the routing algorithm effectively balances the data traffic load among nodes; both effects favor a longer network lifetime. New sinks added to the network likewise reduce the distance between nodes, so the network lifetime is prolonged. However, an additional sink affects only the routes near it, so its impact on the network as a whole becomes smaller and the lifetime gain diminishes as more sinks are added.<br />
Figure 5. The network energy consumption and lifetime when the number of sensor nodes varies and 2, 3, 4, 5, and 6 sinks are deployed respectively: (a) energy consumption (nJ); (b) network lifetime (s).<br />
We define the delay time as the interval between the transmission of a packet by the source and the reception of the same packet by the sink. As Fig. 6(a) shows, the delay time of AODV is the longest. When sensor nodes have collected packets that need to be transmitted to a sink node, they must first initiate route establishment, and this time counts toward the delay. AODV also tries to create a route to a single sink and thus wastes more time than the other two methods. Our protocol and LEACH are proactive routing protocols: the route has been established before packets are transmitted, so packets reach a sink node in the shortest time. Our protocol is slightly better than LEACH. In LEACH, data is transmitted directly from the member nodes to the cluster head, so many long-distance transmissions are required within a cluster, and their number grows with the network size. In our protocol, by contrast, the node with the minimum path cost is selected as the father node, so our routes are better than those of the other two protocols. In particular, at high transmission rates our protocol performs about 5% better than LEACH.<br />
Fig. 6(b) compares energy consumption over time. As time passes, more and more packets are transmitted to the sink nodes and energy consumption increases. Compared with AODV and LEACH, our protocol consumes less energy, and its advantage grows over time. As expected, our protocol has the best performance. AODV transmits more packets than the others because it sends route requests and rebuilds the route every time newly collected packets need to be sent to a sink node, so it consumes more energy, and the gap widens as time passes. Both LEACH and our protocol perform better than AODV, and our protocol is slightly better than LEACH. The main reason is that the communication radius in LEACH may be very large, whereas our protocol considers multiple path cost metrics and selects the node with the minimum path cost as the father node, which minimizes energy consumption and reduces data delay. As discussed previously, anycast can exploit the wireless advantage when there is more than one sink node.<br />
VI. CONCLUSIONS<br />
In this paper, an anycast routing protocol based on an anycast tree scheme is proposed for energy-efficient data transfer and reduced average delay in wireless sensor networks. A tree is formed for each sink node, and every node sends its collected data to its parent node along the anycast tree. The structure of each anycast tree is determined by the path cost from the nodes to the sink. Some protocol mechanisms are simplified to suit wireless sensor network applications. Data packets are sent to the nearest sink node along the anycast tree. Multiple metrics guide route discovery and sink selection. The node with the minimum path cost is selected as the father node and forwards the packet, which minimizes the energy required for communication between the nodes and the sinks. Simulation results show that the proposed scheme reduces the delay time and balances energy consumption among the nodes, and thus significantly extends the network lifetime compared with existing schemes.<br />
Figure 6. Comparison of delay and energy consumption: (a) delay versus packet transfer rate; (b) energy consumption over time.<br />
REFERENCES<br />
[1] Weber S, Cheng L. A survey of Anycast in IPv6 networks. IEEE Communications Magazine, 2004, 42(1): 127-132.<br />
[2] Doi S, Ata S, Kitamura H. Protocol design for Anycast communication in IPv6 network. Proceedings of 2003 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM'03). New York, USA: IEEE Press, 2003: 470-473.<br />
[3] Jia W, Zhou W, Kaiser J. Efficient algorithm for mobile multicast using anycast group. IEEE Proc. Communications, 2001, 148(1): 14-18.<br />
[4] Abley, J.: Hierarchical Anycast for Global Service Distribution (2003)<br />
[5] Freedman, M.J., Lakshminarayanan, K., Mazieres, D.: OASIS: Anycast for any service. In: Proceedings of the 3rd Symposium on Networked Systems Design and Implementation, San Jose, CA (May 2006)<br />
[6] Subramanian Swaminathan, Jinye Huo, Fang Liu. An Anycast Routing Protocol for Ad-Hoc Networks. http://www.cs.ucsb.edu/ebelding/, 2003-03.<br />
[7] Jianxin Wang, Yuan Zheng, Weijia Jia. A-DSR: A DSR-based Anycast Protocol for IPv6 Flow in Mobile Ad Hoc Networks. IEEE Proc. VTC2003 [C]. 2003.<br />
[8] Thepvilojanapong N, Tobe Y, Sezaki K. HAR: hierarchy based anycast routing protocol for wireless sensor networks // Proceedings of Symposium on Applications and the Internet Workshops. 2005: 204-212.<br />
[9] WANG Xiao-nan et al. Routing protocol for wireless sensor networks based on Anycast. Application Research of Computers, 2009, 7(7): 2695-2697.<br />
[10] Michal Koziuk, Jaroslaw Domaszewicz. Tree-based anycast for wireless sensor/actuator networks. Lecture Notes in Computer Science, Proceedings of the 9th International Conference on Distributed Computing and Networking, Kolkata, India, Section: Sensor networks, 2008.<br />
[11] W.R. Heinzelman, A. Chandrakasan, and H. Balakrishnan, "Energy-Efficient Communication Protocol for Wireless Micro-sensor Networks", In Proceedings of the Hawaii International Conference on System Science, Maui, Hawaii, 2000.<br />
[12] Hu, W., Bulusu, N., Jha, S.: A communication paradigm for hybrid sensor/actuator networks (2004)<br />
[13] Hu, W., Chou, C.T., Jha, S., Bulusu, N.: Deploying long-lived and cost-effective hybrid sensor networks (2004)<br />
[14] Thepvilojanapong, N., Tobe, Y., Sezaki, K.: HAR: Hierarchy-based anycast routing protocol for wireless sensor networks. In: SAINT 2005: Proceedings of the 2005 Symposium on Applications and the Internet (SAINT 2005), pp. 204-212. IEEE Computer Society, Los Alamitos (2005)<br />
[15] He, T., Stankovic, J.A., Lu, C., Abdelzaher, T.F.: A spatiotemporal communication protocol for wireless sensor networks. IEEE Transactions on Parallel and Distributed Systems 16, 995-1006 (2005)<br />
[16] Bian, F., Govindan, R., Schenker, S., Li, X.: Using hierarchical location names for scalable routing and rendezvous in wireless sensor networks. In: SenSys 2004: Proceedings of the 2nd international conference on Embedded networked sensor systems, pp. 305-306. ACM Press, New York (2004)<br />
[17] W.R. Heinzelman, A. Chandrakasan, and H. Balakrishnan, "Energy-Efficient Communication Protocol for Wireless Micro-sensor Networks", In Proceedings of the Hawaii International Conference on System Science, Maui, Hawaii, 2000.<br />
[18] S. Lindsey, C.S. Raghavendra, "PEGASIS: Power-Efficient gathering in sensor information systems," in Proc. of the IEEE Aerospace Conf., Canada, March 2002, pp. 1-6.<br />
[19] S.S. Satapathy and N. Sarma, "TREEPSI: tree based energy efficient protocol for sensor information", Wireless and Optical Communications Networks 2006 IFIP International Conference, April 2006.<br />
Demin Gao, Shandong Province, China. Birthdate: September, 1980. He received the M.S. degree in computer application technology from Jingdezhen Ceramic Institute, Jingdezhen, Jiangxi, China, in 2008. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks and data aggregation in wireless sensor networks.<br />
Huanyan Qian, Jiangsu Province, China. Birthdate: October, 1950. He is currently a professor in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His current research interests include sensor networks, mobile communication, and wireless communication networks.<br />
Zheng Wang, Jiangsu Province, China. Birthdate: September, 1980. He received the M.S. degree in computer application technology from Nanjing University of Science and Technology, Nanjing, Jiangsu, China, in 2007. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks.<br />
Jiguang Chen, Henan Province, China. Birthdate: February, 1982. He received the M.S. degree in Education from Henan Normal University, Xinxiang, Henan, China, in 2008. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks.<br />
Management Model Research of Low-power Wireless Sensor Network<br />
LinGe Wang<br />
College of Software, Ningbo Dahongying University, Ningbo, 315175, China<br />
Email: Wanglingew@163.com<br />
YueDou Qi<br />
College of Software, Ningbo Dahongying University, Ningbo, 315175, China<br />
Email: yuedouqi@sohu.com<br />
Abstract—Most current wireless sensor network management models shorten network lifetime because nodes exchange management information with one another, which consumes energy too quickly. This paper presents a new model based on mobile agents for cluster management in wireless sensor networks. The model addresses this shortcoming of current wireless sensor network management architectures: existing models fail to consider that the information reports of every node consume a large amount of energy and thus reduce network lifetime. The mobile agent-based management model inherits the merits of traditional models and gives full consideration to the energy characteristics of the nodes. Analysis shows that the proposed model has advantages over traditional models in energy saving, data integration, topology control, and so on.<br />
Index Terms—wireless sensor network, mobile Agent, Low<br />
energy consumption<br />
I. INTRODUCTION<br />
With the rapid development and increasing sophistication of communication, embedded computing, and sensor technology, sensor networks composed of thousands of micro-sensors, each capable of sensing, computing, and communication, have attracted great attention. Such networks integrate sensor, embedded computing, networking, and wireless communication technologies into a new information acquisition and processing technology, which is widely used in national defense and the military, environmental monitoring, traffic management, medicine, and health care.<br />
Agent technology developed from artificial intelligence. An agent system is a loosely coordinated system that represents the trend of distributed software development and is more flexible and intelligent. As a new software technology, agent software has made considerable progress and is used in many areas such as Internet information retrieval, information collection, e-commerce, data mining, integrated manufacturing, and so on. The nodes of a wireless sensor network are usually powered by batteries, whose energy is limited; the communication, computation, and storage abilities of the nodes are limited as well, which raises challenges for the hardware and software design of wireless sensor networks.<br />
Manuscript received Mar. 25, 2011; revised Apr. 15, 2011; accepted Apr. 20, 2011.<br />
doi:10.4304/jnw.6.12.1734-1739<br />
II. WIRELESS SENSOR NETWORK ARCHITECTURE<br />
The composition of a wireless sensor network system is shown in Figure 1. A large number of sensor nodes are randomly distributed in the monitoring area. These nodes form a network in a self-organizing manner. Each node not only collects data but also performs routing. The collected data is transmitted over multiple hops to the focal point and passed to the Internet. The task manager node in the network manages, classifies, and processes the information, which is finally delivered to the users.<br />
Figure 1. Wireless Sensor Network<br />
In earlier communication systems for wireless sensor networks, network nodes collect raw data and send it directly to the central node, which carries out the signal processing tasks. This centralized approach wastes a lot of bandwidth. At the same time, because the nodes near the center forward a large amount of information, their energy is soon depleted.<br />
A sensor network is an integrated monitoring, control, and wireless communication network system. Its nodes are far more numerous (thousands) and more densely distributed. Because of environmental effects and energy depletion, nodes fail more easily, and environmental interference and node failures can easily change the network topology. Typically, most sensor nodes are stationary. In addition, the energy, processing, storage, and communication capacities of sensor nodes are very limited, so resource transfer, power management, computing, network topology discovery, and so on must be considered comprehensively in a wireless sensor network, energy consumption in particular. On the one hand, the energy consumption of sensor nodes should be minimized; on the other hand, when a node's energy is depleted, the network should be able to discover a new topology, isolate the dead node, and generate a new path to complete data collection and processing. This shows two important concerns of wireless sensor networks: topology discovery and reducing energy consumption. Taking reduced energy consumption as a precondition, this paper explores three aspects of wireless sensor networks in order to achieve route discovery while ensuring reliability:<br />
1) How to form clusters in a clustering model based on mobile agents.<br />
2) How to choose the cluster head node in the cluster model.<br />
3) Once the cluster head node is identified, how the mobile agent moves between nodes within the same cluster, that is, the routing problem.<br />
III. ANALYSIS OF EXISTING MANAGEMENT MODEL<br />
A. MANNA<br />
Advantages:<br />
MANNA is groundbreaking research that provides a complete network management system specific to sensor networks. Building on the SNMP management model, it summarizes the sensor network management architecture, including the organizational, functional, and information aspects.<br />
Disadvantages:<br />
Although some simulations were carried out, no detailed implementation scheme was given, and the research remains at an initial stage.<br />
B. MIADSN<br />
Advantages:<br />
The entire sensor network is divided into several subsystems with different functions, which is conducive to modularity. To conserve bandwidth, mobile agent technology is introduced together with statistical methods from fuzzy theory, so that data collection is achieved with only a few sensor nodes.<br />
Disadvantages:<br />
The main consideration is the negative effect of data fusion; management is discussed little, and specific methods are still under study.<br />
C. Other existing methods that rely on broadcast traffic<br />
Advantages:<br />
Low-level nodes are organized and managed by high-level nodes. Each task carried out by a mobile agent is assigned a certain policy, and these policies control how the mobile agent collects data from the wireless sensor nodes.<br />
Disadvantages:<br />
Because nodes must notify each other of management information, node energy consumption is excessive and network lifetime is reduced. Other approaches do not consider node energy at all.<br />
D. Comparison <strong>of</strong> common methods<br />
Because of the characteristics of wireless sensor networks, the network management models of CMIP, SNMP, and ANMP are no longer suited to wireless sensor network management, so researchers have put forward new network management solutions. For example, Linnyer B. Ruiz et al. described a management framework for wireless sensor networks from three aspects (management levels, management functions, and management functional areas) and designed the wireless sensor network architecture MANNA, through which a wireless sensor network can be configured and managed. Wang Feng et al. proposed a distributed sensor network management model based on mobile intelligent agents (Mobile Intelligent Agent-based DSN, called MIADSN), which uses a new model: data is retained locally and data fusion is performed remotely. A cluster-based hierarchical self-management model for wireless sensor networks has also been proposed, in which low-level nodes are organized and managed by high-level nodes; because nodes must notify each other of management information, node energy consumption is excessive and network lifetime is reduced. A policy-based management framework for wireless sensor networks using mobile agents has been proposed as well. In this framework, each task carried out by a mobile agent is assigned a certain policy according to management needs, and these policies control how the mobile agent collects data from the wireless sensor nodes. Other approaches do not consider node energy at all.<br />
Based on the above analysis, the most popular model for sensor networks is the clustering topology network model with multiple mobile agents. In this model, reducing energy consumption in wireless sensor networks becomes a question of how to form clusters, how to select cluster heads, and how to route within and between clusters. According to the literature, the topology currently in use is shown in Figure 2.<br />
To reduce the energy consumption of nodes in wireless sensor networks, this paper proposes a wireless sensor network management model based on clustering and multiple Mobile Agents.<br />
Within this model, we also improve the clustering algorithm and the agent routing algorithm. Simulation results show that the proposed network management structure and algorithms reduce the power consumption of sensor nodes.<br />
Figure 2. Clustering structure<br />
IV. WIRELESS SENSOR NETWORK MANAGEMENT MODEL<br />
A. Problems of traditional wireless sensor networks<br />
In a traditional wireless sensor network, each sensor node collects data and transfers it to the designated destination node, the sink. In this mode, the power of the network is consumed mainly in data transfer. Power is the most important resource of a wireless sensor network. The power consumed by communication among nodes is much larger than that consumed by computation and sensing, and it is spent in the sending, receiving and idle states. In a traditional wireless sensor network, the large amount of power consumed in data transfer leads to the rapid death of sensor nodes. Such a network is suitable for deployments that monitor a small amount of data, not for large-scale, long-term deployments.<br />
Because of these shortcomings, clustered management is popular at present. The whole wireless sensor network is divided into a number of regions, and a suitable node, called the cluster head, is chosen in each region. The cluster head performs basic processing on the data and then transfers it to the sink node. To a certain extent, this method reduces the energy consumption of sensor nodes.<br />
B. Wireless sensor network management model based on the LEACH method<br />
Suppose N sensor nodes are randomly and uniformly distributed in a two-dimensional square area A, and the sensor network has the following properties:<br />
The sensor nodes and the sink node are stationary; the sink node is unique and far away from the network area.<br />
The sensor nodes have the same initial energy, which cannot be replenished.<br />
The sink node has sufficient energy.<br />
A sensor node can calculate its distance to the cluster head or the sink node from the strength of a transmitted test signal.<br />
A node consumes the same amount of energy transmitting in every direction.<br />
In wireless sensor networks, differences in the energy model of signal transmission, which includes the receiver model and the transmitter model, affect the performance of routing protocols.<br />
The model assumed here is shown in Figure 3. It considers the energy consumed by the transmit electronics, by the power amplifier, and by the receive electronics when receiving signals.<br />
Figure 3. Energy consumption model for sensor networks<br />
According to the energy model, when the transmission distance is d and the data size is L bits, the energy consumption of the transmit electronics is:<br />
The energy consumption of the receive electronics is:<br />
$$E_{receive} = E_{Rx\text{-}elec}(L) = L \times E_{elec}$$<br />
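The transmit-side expression is not given above. A form often used with LEACH, assumed here purely for illustration, is the first-order radio model, in which transmitting L bits over distance d costs the electronics energy plus an amplifier term that grows with d squared. The constants `E_ELEC` and `EPS_AMP` below are illustrative values, not taken from this paper.<br />

```python
# Sketch of a first-order radio energy model.
# ASSUMPTION: the d^2 amplifier term and both constants are illustrative;
# the paper's own transmit formula is not reproduced here.
E_ELEC = 50e-9     # J/bit, energy of transmit/receive electronics (assumed)
EPS_AMP = 100e-12  # J/bit/m^2, energy of the power amplifier (assumed)


def transmit_energy(bits, distance):
    """Energy to send `bits` over `distance` metres: electronics + amplifier."""
    return bits * E_ELEC + bits * EPS_AMP * distance ** 2


def receive_energy(bits):
    """Energy to receive `bits`: L x E_elec, as in the receive formula above."""
    return bits * E_ELEC
```

Under these assumed constants, the amplifier term dominates once d exceeds roughly 22 m, which is why long direct links to the sink are expensive.<br />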
C. Shortcomings of the LEACH protocol<br />
The LEACH protocol is only suitable for homogeneous networks; it does not give good results in heterogeneous networks.<br />
Each node becomes a cluster head according to a probability, without considering the residual energy or the location of the node. In each cycle, every node must act as a cluster head.<br />
When data is transferred from the cluster heads to the base station, all cluster heads send directly to the base station.<br />
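The probability-only election criticized above can be made concrete with the threshold rule commonly given for LEACH, T(n) = p / (1 - p (r mod 1/p)) for nodes that have not yet served in the current cycle. The sketch below assumes that standard rule (it is not reproduced in this paper); the function and parameter names are hypothetical. Note that neither residual energy nor position appears anywhere in the decision.<br />

```python
import random


def leach_threshold(p, round_index, eligible):
    """Standard LEACH threshold (assumed form, not taken from this paper)."""
    if not eligible:          # node already served as head in this cycle
        return 0.0
    period = round(1 / p)     # number of rounds in one cycle
    return p / (1 - p * (round_index % period))


def elect_cluster_heads(node_ids, p, round_index, recent_heads, rng):
    """Each eligible node draws a random number and compares it to T(n)."""
    heads = []
    for node in node_ids:
        t = leach_threshold(p, round_index, node not in recent_heads)
        if rng.random() < t:
            heads.append(node)
    return heads
```

In the last round of a cycle the threshold reaches 1, so every node that has not yet served is forced to become a cluster head, whatever its remaining energy.<br />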
V. CLUSTER HEAD ELECTION<br />
Much simulation and analysis has been devoted to determining the optimal probability of cluster head election. These studies treat the optimal probability as a function of the spatial density of nodes uniformly distributed in the monitoring area. The best clustering achieves the optimal energy distribution in the network and thus the minimum total energy consumption.<br />
Suppose N nodes are uniformly distributed in a square area of side length 2a, and the node distribution follows a Poisson distribution with parameter $\lambda$. N is a random variable for the number of sensor nodes, with $E[N] = \lambda A$. p is the probability of cluster head election, and np is the resulting number of clusters. Assuming the base station is located at the center of the square area, the average distance from a cluster head to the base station is:<br />
$$E[D_i \mid N = n] = \int_A \sqrt{x_i^2 + y_i^2}\,\left(\frac{1}{4a^2}\right) dA = 0.765a$$<br />
Here $D_i$ is a random variable denoting the distance to the base station from the cluster head located at $(x_i, y_i)$. The network has np cluster heads, and their positions are independent of each other, so the total length from all cluster heads to the sink node is 0.765npa.<br />
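The constant 0.765 can be checked numerically. The sketch below averages the distance to the center over a square of side 2a with a simple midpoint rule; the function name and grid size are illustrative choices, not part of the paper.<br />

```python
import math


def mean_distance_to_center(a, n=400):
    """Midpoint-rule average of sqrt(x^2 + y^2) over the square [-a, a]^2."""
    h = 2 * a / n                      # width of one grid cell
    total = 0.0
    for i in range(n):
        x = -a + (i + 0.5) * h         # cell-center x coordinate
        for j in range(n):
            y = -a + (j + 0.5) * h     # cell-center y coordinate
            total += math.hypot(x, y)
    return total / (n * n)             # uniform density 1/(4a^2) over area 4a^2
```

The result scales linearly with a, matching the closed form $(a/3)(\sqrt{2} + \ln(1 + \sqrt{2})) \approx 0.765a$.<br />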
The cluster heads obey a Poisson point process pp0 with intensity $\lambda_1$ ($\lambda_1 = p\lambda$), and the cluster member nodes obey a Poisson point process pp1 with intensity $\lambda_0$ ($\lambda_0 = (1-p)\lambda$). Define $e_1$ as the energy consumed by the member nodes within one cluster to transfer data to their cluster head; then:<br />
$$E[e_1 \mid N = n] = \frac{E[L \mid N = n]}{r}$$<br />
$e_2$ is the energy consumed by all ordinary nodes in the network to transmit data to their respective cluster heads; then:<br />
$$E[e_2 \mid N = n] = np\,E[e_1 \mid N = n]$$<br />
$e_3$ is the energy consumed by the cluster heads to transfer data to the sink node; then:<br />
$$E[e_3 \mid N = n] = \frac{0.765npa}{r}$$<br />
e is the energy consumption of the entire network; then:<br />
$$E[e \mid N = n] = E[e_2 \mid N = n] + E[e_3 \mid N = n] = \frac{np}{r} \cdot \frac{1-p}{2 p^{3/2} \sqrt{\lambda}} + \frac{0.765npa}{r}$$<br />
Since $E[N] = \lambda A$, then:<br />
$$E[e] = E\big[E[e \mid N = n]\big] = E[N]\left[\frac{1-p}{2r\sqrt{p\lambda}} + \frac{0.765pa}{r}\right] = \lambda A\left[\frac{1-p}{2r\sqrt{p\lambda}} + \frac{0.765pa}{r}\right]$$<br />
Minimizing the above expression with respect to p gives the optimal value of p, which determines the cluster head election probability of the system:<br />
$$p = \left[\frac{1}{3c} + \frac{\sqrt[3]{2}}{3c\left(2 + 27c^2 + 3\sqrt{3}\,c\,\sqrt{27c^2 + 4}\right)^{1/3}} + \frac{\left(2 + 27c^2 + 3\sqrt{3}\,c\,\sqrt{27c^2 + 4}\right)^{1/3}}{3c\,\sqrt[3]{2}}\right]^2$$<br />
where $c = 3.06\,a\sqrt{\lambda}$. This p is the optimal cluster head election probability, at which the energy consumption of the whole network is minimal. Substituting p into the formula gives the distance threshold T(n) for each round, and from it the optimal number of cluster heads for each round.<br />
As the network operates, the network energy changes, the value of p changes with it, and the number of cluster heads in the network changes dynamically.<br />
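As a check on the optimal probability, setting the derivative of $(1-p)/(2r\sqrt{p\lambda}) + 0.765pa/r$ to zero gives $c\,p^{3/2} - p - 1 = 0$ with $c = 4 \times 0.765\,a\sqrt{\lambda} = 3.06\,a\sqrt{\lambda}$, a cubic in $\sqrt{p}$ whose real root is the closed form. The sketch below evaluates that root; the function names and the sample values of `lam`, `a` and `r` are hypothetical, and the area is taken as $A = 4a^2$.<br />

```python
import math


def expected_energy(p, lam, a, r):
    """E[e] = lam*A*[(1-p)/(2r*sqrt(p*lam)) + 0.765*p*a/r], with A = 4a^2."""
    area = 4 * a * a
    return lam * area * ((1 - p) / (2 * r * math.sqrt(p * lam))
                         + 0.765 * p * a / r)


def optimal_p(lam, a):
    """Real root of c*x^3 - x^2 - 1 = 0 (x = sqrt(p)) via Cardano's formula."""
    c = 3.06 * a * math.sqrt(lam)
    delta = 2 + 27 * c ** 2 + 3 * math.sqrt(3) * c * math.sqrt(27 * c ** 2 + 4)
    x = (1 / (3 * c)
         + 2 ** (1 / 3) / (3 * c * delta ** (1 / 3))
         + delta ** (1 / 3) / (3 * c * 2 ** (1 / 3)))
    return x ** 2
```

For example, `optimal_p(0.01, 50.0)` corresponds to about 100 nodes in a 100 × 100 area; the transmission range r scales the energy but does not affect the optimal p.<br />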
VI. MOBILE AGENT<br />
A mobile agent combines distributed technology with artificial intelligence; put simply, it is an intelligent agent with mobility. The main idea is to transfer the code of a computation module to each node, perform the computation on the node, and return the processed results to the target sink, thereby reducing the power that sensor nodes spend on transmitting data.<br />
Following the discussion above, we can establish the clusters and, at the same time, select the cluster head nodes. Using mobile agents, we can then transfer data at each cluster head node and finally return the results to the target sink node.<br />
Assume that the mobile agent migration process only improves the accuracy of the data information, that the agent size $S_{MA}$ is fixed, and that the energy consumption of idle monitoring is not considered, i.e. $E_{idle} = b_4 = 0$. The energy a node consumes to transmit the agent can then be defined as<br />
$$E_{tx}(d) = (b_1 + b_2 d^{a}) \cdot S_{MA},$$<br />
and the energy consumed to receive the agent can be defined as<br />
$$E_{rx} = b_3 \cdot S_{MA}.$$<br />
Here $b_1$, $b_2$, $b_3$ and $b_4$ are constants related to the wireless transceiver of a sensor node; d is the transmission distance between nodes; and a ($2 \le a \le 4$) is the attenuation factor of the signal transmission path, which measures energy consumption. The migration cost of the agent moving from $v(i)$ to $v(j)$ is:<br />
$$a_{ij} = \begin{cases} (b_1 + b_2 d_{ij}^{a}) \cdot S_{MA} + b_3 \cdot S_{MA}, & d_{ij} \le R_{max} \\ \infty, & d_{ij} > R_{max} \end{cases}$$<br />
where $d_{ij}$ is the distance between $v(i)$ and $v(j)$. Node $v(j)$ can be reached when $d_{ij}$ is no greater than $R_{max}$; otherwise it will not be visited. The process by which a node perceives the target is treated as the attenuation of the target signal, so the information gain of $v(j)$ is:<br />
$$SE(j) = \begin{cases} b_5\, d_{jo}^{-a'}, & d_{jo} \le D_{max} \\ 0, & d_{jo} > D_{max} \end{cases}$$<br />
where $d_{jo}$ is the distance between $v(j)$ and the target, and $a'$ is the attenuation factor of the signal arriving from the target. When $d_{jo}$ is no more than $D_{max}$, the node can perceive the target, and the information gain is inversely proportional to $d_{jo}^{a'}$.<br />
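The two piecewise definitions, a migration cost of $(b_1 + b_2 d_{ij}^{a}) S_{MA} + b_3 S_{MA}$ within range $R_{max}$ (infinite beyond it) and an information gain of $b_5 d_{jo}^{-a'}$ within range $D_{max}$ (zero beyond it), can be sketched as two small functions. The parameter names mirror the symbols in the text; any numeric values passed in are the caller's assumptions, since the paper does not list the constants.<br />

```python
import math


def migration_cost(d_ij, s_ma, b1, b2, b3, a, r_max):
    """Cost of moving an agent of size s_ma from v(i) to v(j)."""
    if d_ij > r_max:
        return math.inf                   # v(j) is out of radio range
    transmit = (b1 + b2 * d_ij ** a) * s_ma
    receive = b3 * s_ma
    return transmit + receive


def information_gain(d_jo, b5, a_prime, d_max):
    """Gain at v(j) for a target at distance d_jo (assumes d_jo > 0)."""
    if d_jo > d_max:
        return 0.0                        # target cannot be perceived
    return b5 * d_jo ** (-a_prime)
```

A routing algorithm for the agent can then trade the summed migration cost along a path against the summed information gain of the nodes visited.<br />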
VII. SIMULATION AND RESULTS<br />
The cluster head nodes are computed with the improved algorithm based on LEACH, and the routing optimization algorithm of the mobile agent is built on top of it. The simulation is carried out in NS-2. Sensor nodes are randomly distributed in an area of 1000 × 1000, and the number of nodes varies from 10 to 1000. We compare the energy consumption of the unoptimized algorithm with that of the improved mobile-agent-based algorithm; the simulation results are shown in Figure 4.<br />
Figure 4. Comparison of energy consumption<br />
Random application: ten test scenes are generated for each node scale, and each scene is tested 10 times, taking the average. The simulation results are shown in Table I.<br />
TABLE I. RESULTS OF PERFORMANCE COMPARISON<br />
Columns: Random scene; Unoptimized algorithm (Energy consumption, Information); Improved agent method (Energy consumption, Information)<br />
1 545612 1078.33 216532 948.76<br />
2 432125 1468.41 154334 1399.78<br />
3 409563 960.12 126697 842.64<br />
4 896521 321.46 301682 355.47<br />
5 502184 1823.68 192360 1253.14<br />
6 57620 1123.17 221302 987.17<br />
7 496172 989.88 182780 1132.54<br />
8 457890 1572.91 155076 1475.21<br />
9 870475 309.24 270817 333.69<br />
10 429761 1254.71 131721 982.06<br />
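The energy saving implied by Table I can be computed per scene. The sketch below copies the ten rows as printed and reports the relative reduction in energy consumption; the variable and function names are illustrative. Scene 6 is reproduced exactly as printed, where the unoptimized energy value has one digit fewer than in the other rows, so its reduction comes out negative.<br />

```python
# Rows of Table I as printed:
# (scene, unoptimized energy, unoptimized information,
#  improved energy, improved information)
TABLE_I = [
    (1, 545612, 1078.33, 216532, 948.76),
    (2, 432125, 1468.41, 154334, 1399.78),
    (3, 409563, 960.12, 126697, 842.64),
    (4, 896521, 321.46, 301682, 355.47),
    (5, 502184, 1823.68, 192360, 1253.14),
    (6, 57620, 1123.17, 221302, 987.17),   # as printed in the table
    (7, 496172, 989.88, 182780, 1132.54),
    (8, 457890, 1572.91, 155076, 1475.21),
    (9, 870475, 309.24, 270817, 333.69),
    (10, 429761, 1254.71, 131721, 982.06),
]


def energy_reduction(unoptimized, improved):
    """Fraction of energy saved by the improved method."""
    return 1 - improved / unoptimized


reductions = {scene: energy_reduction(e_old, e_new)
              for scene, e_old, _, e_new, _ in TABLE_I}
```

Printing `reductions` gives one fraction per scene, which makes it easy to see how consistent the saving is across scenes.<br />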
VIII. CONCLUSION<br />
Building on the traditional sensor network algorithm, this work optimizes the generation of cluster head nodes by means of an energy calculation used by improved routing algorithms. It first analyzes the advantages and disadvantages of the existing routing protocols. The classical LEACH (Low Energy Adaptive Clustering Hierarchy) protocol for hierarchical sensor networks is analyzed and discussed in detail, and a new energy-efficient routing protocol for WSNs is then presented: the Multi-Hierarchical Algorithm based on Clustering (MHAC). Simulation results show that MHAC can balance the energy load and prolong the lifetime of the WSN.<br />
The prominent feature of wireless sensor networks is limited energy. Current algorithms fail to pay attention to node energy: when performing topology discovery, they consume much energy and cannot guarantee the connectivity of the network. The mobile agent-based topology discovery algorithm improves on traditional algorithms in terms of energy. Simulation of the algorithm shows that it saves energy compared with the traditional algorithm while retaining some features of the traditional algorithm and achieving even better connectivity.<br />
ACKNOWLEDGMENT<br />
This work was supported by Project Y200804680 <strong>of</strong><br />
the Research planning issues.<br />
REFERENCES<br />
[1] Shaojun Yang, Haoshan Shi, Rui Huang.Spatial-Temporal<br />
Information Integration Framework Based on Mobile-<br />
Agent in Wireless Sensor <strong>Networks</strong>.In Proc.<strong>of</strong> 16th<br />
International Conference on Computer<br />
Communication(ICCC2004), 2004, beijing, China;1096-<br />
1100, (ISIP:000228632800198)<br />
[2] Li N, Hou J C, Sha L. Design and analysis of an MST-based topology control algorithm[A]. In: Proceedings 12th Joint Conf on IEEE Computer and Communications Societies (INFOCOM 2003)[C]. San Francisco, 2003. 1702-1712<br />
[3] I. S. Jacobs and C. P. Bean, “Fine particles, thin films and<br />
exchange anisotropy, ” in Magnetism, vol. III, G. T. Rado<br />
and H. Suhl, Eds. New York: Academic, 1963, pp. 271–<br />
350.<br />
[4] Tynan R, Marsh D, O'Kane D, O'Hare GMP. Agents for Wireless Sensor Network Power Management[A]. In: Proceedings of the 2005 International Conference on Parallel Processing Workshops[C]. June 2005. 413-418<br />
[5] Chen Min, Kwon T, Choi Y. Data Dissemination based on Mobile Agent in Wireless Sensor Networks[A]. In: Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05)[C]. IEEE Computer Society, Sydney, Australia, November 2005. 2-3<br />
[6] Kui Wu, Yong Gao, Fulu Li, Yang Xiao. Lightweight<br />
Deployment-Aware Scheduling for Wireless Sensor<br />
<strong>Networks</strong>[J]. Mobile <strong>Networks</strong> and Applications, 2005,<br />
10(6)<br />
[7] Zhang wenjuan, Zhu Xiangbin, Mobile Agent-based<br />
Clustering Data Fusion in WSN[J], Computer & Digital<br />
Engineering, March 2010.<br />
[8] Wang Jietai, Yang Shaojun, Yu Haixun, Application <strong>of</strong><br />
Mobile Agent in Wireless Sensor <strong>Networks</strong>, Computer<br />
Engineering, March 2008<br />
[9] Xiao Qing, Jiao Jian, Application for artificial bee<br />
algorithm in migration <strong>of</strong> mobile agent[J], Application<br />
Research <strong>of</strong> Computers, June.2010.<br />
[10] Li Ming, Fan Gaojuan, An EIW-DSR Route Algorithm<br />
Based on the Energy Integrated Weight in Ad Hoc<br />
<strong>Networks</strong>[J], Computer Engineering & Science, November<br />
2010<br />
[11] Fdd Zhang Sheng, He Qingquan, Improved ant colony<br />
algorithm to solve mobile agent in wireless sensor<br />
networks[J], Application Research <strong>of</strong> Computers,<br />
November 2010.<br />
[12] LinGe Wang, Management Model Research of Wireless Sensor Network Based on Mobile Agent, Intelligent Computation Technology and Automation (ICICTA), 2010, Shenzhen, China<br />
[13] FanGaoJuan, Management Model Research <strong>of</strong> Wireless<br />
Sensor Network Based on Mobile Agent, 2007, 5<br />
LinGe Wang, ZheJiang Province, China. Born January 1979. He holds a Master's degree in computer technology from Fudan University. His research interests include network engineering, information security and wireless sensor networks. He is a senior lecturer in the Dept. of Network, College of Software, Ningbo Dahongying University.<br />
YueDou Qi, ZheJiang Province, China. Born February 1964. He holds a Bachelor's degree in computer application from QiQiHaEr University. His research interests include data mining, complex networks and business intelligence. He is a professor in the Dept. of Network, College of Software, Ningbo Dahongying University.<br />
Covert Flow Graph Approach to Identifying<br />
Covert Channels<br />
XiangMei Song<br />
School <strong>of</strong> Computer Science and Telecommunication Engineering,<br />
Jiangsu University, Zhenjiang, 210013, China<br />
Email: jlsxm@ujs.edu.cn<br />
ShiGuang Ju<br />
School <strong>of</strong> Computer Science and Telecommunication Engineering,<br />
Jiangsu University, Zhenjiang, 210013, China<br />
Email: jushig@ujs.edu.cn<br />
Abstract—In this paper, the approach for identifying covert<br />
channels using a graph structure called Covert Flow Graph<br />
is introduced. First, the construction of the Covert Flow Graph, which captures the information flows of a system for covert channel detection, is proposed, and the search and judge algorithm used to identify covert channels in the Covert Flow Graph is given. Second, an example file system is analyzed using the Covert Flow Graph approach, and the result is compared with those of the Shared Resource Matrix and Covert Flow Tree methods. Finally, the comparison between the Covert Flow Graph approach and the other two methods is discussed. Unlike previous methods, the Covert Flow Graph approach provides deep insight into a system's information flows and gives an effective algorithm for covert channel identification.<br />
Index Terms—multilevel security, covert channels,<br />
information flows, covert flow graph, shared resource<br />
matrix, covert flow trees<br />
I. INTRODUCTION<br />
Multilevel secure computer systems are used to protect<br />
hierarchic information by enforcing both mandatory and<br />
discretionary access controls. They can restrict the flow<br />
<strong>of</strong> information through legitimate communication<br />
channels[1]. However, covert channels are usually<br />
beyond the scope <strong>of</strong> the security model. Covert channels<br />
usually signal information through system facilities not<br />
intended for data transfer. That is, the sending process<br />
alters some system attributes, and the receiving process<br />
monitors the alteration[2]. In order to decrease the threat<br />
<strong>of</strong> covert channels, several covert channel analysis<br />
techniques have been proposed and utilized in the past<br />
thirty years. Among these techniques are the Shared Resource Matrix methodology (SRM)[3], the Non-interference approach[4], the Information Flow analysis[5], the Covert Flow Trees methodology (CFT)[6], the Backward Tracking approach[7], and others.<br />
Manuscript received Mar. 25, 2011; revised Apr. 17, 2011; accepted Apr. 20, 2011.<br />
project number: 60773049, 61003288, 20093227110005, BK2010192, 07JDG014, 08KJD520015<br />
© 2011 ACADEMY PUBLISHER<br />
doi:10.4304/jnw.6.12.1740-1746<br />
The SRM method is one of the most successful approaches to covert channel identification. The method starts by identifying shared resources. A shared resource is any object or collection of objects that may be referenced or modified by more than one subject. All identified shared resources are enumerated in a matrix structure, and each resource is then carefully examined to determine whether it can be used to transfer information covertly from one subject to another. The matrix structure makes the SRM method simple and intuitive. However, the shared resource matrix offers no help in constructing covert communication scenarios, and the amount of manual analysis is enormous. Much research has been done to improve the SRM method. The CFT method is essentially a transformation of the SRM method. Furthermore, McHugh[8] made three extensions to the matrix structure, and Shen and Qing[9-10] optimized the SRM method.<br />
The CFT method uses a tree structure instead of the shared resource matrix. Because of its tree structure, the CFT method can record flow paths and helps to construct covert communication scenarios. However, covert flow trees are usually huge, and their construction can fall into an infinite loop. To resolve this problem, a constraint parameter named REPEAT has to be introduced, which may cause some potential covert channels to be missed.<br />
This paper presents a graph data structure that models system information flow from one shared resource attribute to another. This data structure is referred to as the Covert Flow Graph (CFG). Constructing a covert flow graph is easy, and the graph can include almost all information flows in a system. Searching for information flow paths yields operation sequences that support the analysis work of detecting covert channels. To demonstrate this technique, an example file system is analyzed, and the result is compared with those of other covert channel analysis methods.<br />
II. THE COVERT FLOW GRAPH APPROACH<br />
The goal of the Covert Flow Graph approach is to identify operation sequences that support potential communication channels exploited by two users. The Covert Flow Graph is a directed graph. Each node describes the information flow from one or more shared resource attributes to another attribute within an operation, while each directed edge denotes the dependency relationship between two operations that share the same attribute and generate information flows. Section Ⅱ-A presents the graph notation and semantics used in the construction of the Covert Flow Graph. Section Ⅱ-B explains how to construct a covert flow graph. Section Ⅱ-C discusses the reasons for pruning the Covert Flow Graph. Section Ⅱ-D introduces an algorithm for searching information flow paths and judging potential covert channels.<br />
A. Graph notation and semantics<br />
Let SA denote the collection of all shared resources (or shared attributes) in the system and OP denote the collection of all primitive operations. In particular, the set SA contains an attribute named output, whose value is returned by primitive operations. For $op_i \in OP$ ($1 \le i \le n$, where n is the number of primitive operations), let $SAR_i, SAM_i, SAO_i \subset SA$; $SAR_i$ contains all attributes referenced by $op_i$, $SAM_i$ contains all attributes modified by $op_i$, and $SAO_i$ contains all attributes whose values are output by $op_i$.<br />
The Covert Flow Graph is a directed graph. Fig. 1 shows the three kinds of nodes in Covert Flow Graphs. The triple $\langle SAR_i, op_i, v \rangle$ in Fig. 1(a), where $v \in SAM_i$, indicates that v is modified by $op_i$ according to the values of the shared attributes in $SAR_i$. The triple $\langle v, op_i, output \rangle$ in Fig. 1(b), where $v \in SAO_i$, indicates that v is returned by $op_i$. The OUTPUT node in Fig. 1(c) is the finish node, which appears only once in a covert flow graph. Its use will be discussed later.<br />
Figure 1. Nodes used in Covert Flow Graphs<br />
Definition 1. Covert Flow Graph (CFG): $CFG = \langle SA, SAO, OP, AR, OUTPUT, S, E \rangle$. SA is the set of all shared attributes. $SAO \subset SA$ is the set of all attributes returned by primitive operations. OP is the set of all primitive operations. $AR = \{SAR_i \mid i = 1, \ldots, n\}$. $S = S_1 \cup S_2 \cup S_3$ is the node set, with $S_1 \subseteq AR \times OP \times SA$, $S_2 \subseteq SAO \times OP \times \{output\}$ and $S_3 = \{OUTPUT\}$. $E = E_1 \cup E_2 \cup E_3$ is the edge set, where<br />
$$E_1 = \{\langle s_i, s_j \rangle \mid s_i, s_j \in S_1 \wedge s_i = \langle SAR_i, op_i, v \rangle \wedge s_j = \langle SAR_j, op_j, v_j \rangle \wedge v \in SAR_j\}$$<br />
$$E_2 = \{\langle s_i, s_j \rangle \mid s_i \in S_1 \wedge s_j \in S_2 \wedge s_i = \langle SAR_i, op_i, v \rangle \wedge s_j = \langle v, op_j, output \rangle \wedge v \in SAO_j\}$$<br />
$$E_3 = \{\langle s_i, s_j \rangle \mid s_i \in S_2 \wedge s_j \in S_3\}$$<br />
The directed edges in $E_1$ describe the dependency relationship between two operations such as $op_i$ and $op_j$: a shared attribute v is modified by $op_i$ and then referenced by $op_j$. The directed edges in $E_2$ indicate that a shared attribute v is modified by one operation $op_i$ and that its value is then returned by another operation $op_j$. A node of the form $\langle v, op_i, output \rangle$ must be connected to the finish node by a directed edge in $E_3$, which means that the value of attribute v is returned by $op_i$.<br />
B. Construction <strong>of</strong> Covert Flow Graph<br />
As in the CFT method, a reference list, a modify list and a return list are created for each primitive operation, and these lists are then used as input to construct the Covert Flow Graph. The information needed to create the three lists can be obtained from the system's description, formal specification or implementation code, so Covert Flow Graphs are applicable to any phase of the system life cycle.<br />
Constructing a covert flow graph has two main steps. The first step is to construct the nodes. For every $v_{ij} \in SAM_i$ ($1 \le j \le m_i$, where $m_i$ is the number of attributes modified by $op_i$), generate the triple $\langle SAR_i, op_i, v_{ij} \rangle$; and for every $v_{ik} \in SAO_i$ ($1 \le k \le o_i$, where $o_i$ is the number of attributes whose values are returned by $op_i$), generate the triple $\langle v_{ik}, op_i, output \rangle$. The finish node OUTPUT is also generated. The second step is to generate directed edges among the nodes. For all $op_i, op_j \in OP$ ($1 \le i, j \le n$), if there is an attribute v that is modified by $op_i$ and referenced or returned by $op_j$, then $op_j$ depends on $op_i$ with respect to v. In other words, for every two distinct operations $op_i$ and $op_j$, if the first step has constructed the node $\langle SAR_i, op_i, v_1 \rangle$ and either the node $\langle SAR_j, op_j, v_2 \rangle$ with $v_1 \in SAR_j$ or the node $\langle v_1, op_j, output \rangle$, then a directed edge is generated from the former node to the latter. In addition, every node $\langle v, op_j, output \rangle$ is connected to the finish node.<br />
Fig 2 illustrates an example graph. The operation lists<br />
used to build this example are the same as those used in<br />
[6] and are given in Table I.<br />
1742 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />
Figure 2. Example CFG representing information flows<br />
TABLE I. EXAMPLE OPERATION LISTS<br />
Operation | Reference List | Modify List | Return List<br />
Operation 1 | D | A, B | Null<br />
Operation 2 | A | B | Null<br />
Operation 3 | B | Null | B<br />
Operation 4 | A | B, C | A<br />
Attributes: A, B, C, D<br />
According to the method above, for the operation op4<br />
in Table I, SAR4 is {A} and SAM4 is {B, C}, so the triples<br />
< {A}, op4, B > and < {A}, op4, C > are generated;<br />
SAO4 is {A}, so < A, op4, output > is created.<br />
Therefore three nodes marked with gray color have been<br />
constructed in Fig 2. The bold edge from<br />
< {D}, op1, A > to < A, op4, output > is generated<br />
because A is modified by op1 and referenced by op4.<br />
And as A’s value is returned by op4, the node<br />
< A, op4, output > is connected to the finish node.<br />
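The two construction steps can be sketched in Python as follows. This is only an illustrative reading of the description above, not the authors' implementation: the function name build_cfg, the tuple encoding of the triples, and the assumption that no attribute is itself named "output" are all hypothetical choices made for the sketch. The input is the operation lists of Table I.<br />

```python
OUTPUT = "OUTPUT"      # the finish node
RETURNED = "output"    # marker used in <v, op, output> triples (assumed not an attribute name)

def build_cfg(ops):
    """Build CFG nodes and edges from {op: (referenced, modified, returned)}."""
    nodes = set()
    # Step 1: one node per modified attribute and one per returned attribute.
    for op, (ref, mod, ret) in ops.items():
        for v in mod:
            nodes.add((frozenset(ref), op, v))      # <SAR_i, op_i, v>
        for v in ret:
            nodes.add((v, op, RETURNED))            # <v, op_i, output>
    # Step 2: directed edges between nodes of distinct operations.
    edges = set()
    for src in nodes:
        if src[2] == RETURNED:
            edges.add((src, OUTPUT))                # returned value reaches the finish node
            continue
        _, op_i, v = src
        for dst in nodes:
            first, op_j, third = dst
            if op_j == op_i:
                continue
            if third == RETURNED:
                if first == v:                      # v is returned by op_j
                    edges.add((src, dst))
            elif v in first:                        # v is referenced by op_j
                edges.add((src, dst))
    return nodes | {OUTPUT}, edges

# Operation lists of Table I: (Reference List, Modify List, Return List).
TABLE_I = {
    "op1": ({"D"}, {"A", "B"}, set()),
    "op2": ({"A"}, {"B"}, set()),
    "op3": ({"B"}, set(), {"B"}),
    "op4": ({"A"}, {"B", "C"}, {"A"}),
}
nodes, edges = build_cfg(TABLE_I)
```

Under this reading, the edge from < {D}, op1, A > to < A, op4, output > discussed above is among the generated edges; whether the remaining edges match Fig 2 exactly depends on details of the figure that are not reproduced here.<br />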
C. Pruning <strong>of</strong> Covert Flow Graph<br />
When a covert flow graph has been constructed, it can<br />
be pruned before the analysis. Pruning is a two-step<br />
process. First, remove every node that has incoming edges<br />
but no outgoing edges (except for the finish node), together<br />
with the edges connected to it, because only those paths that end<br />
with the finish node in the covert flow graph may be<br />
potential covert storage channels. Second, identify and<br />
remove the starting nodes whose paths<br />
cannot occur in practice. A system<br />
often contains pairs of operations in which one operation,<br />
called the post-executed operation, must be executed<br />
after the other operation,<br />
called the pre-executed operation, and the consecutive<br />
executions of the two operations nullify each other’s<br />
effects, such as Lock_File and Unlock_File¹. Almost no<br />
operation sequence starts with a post-executed<br />
operation at run time. When<br />
analyzing the primitive operations of a system, such pairs<br />
of consecutive operations should be identified. In<br />
this step, if a starting node represents a post-executed<br />
operation, then remove the node and the edges leaving<br />
it.<br />
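A minimal Python sketch of the two pruning steps is given below. It assumes that nodes are tuples whose second element is the operation name and that the analyst supplies the set of post-executed operations; the function name prune_cfg and the small example graph are hypothetical. Repeating the first step until no dead-end node remains is also an assumption, since the text states the rule only once.<br />

```python
def prune_cfg(nodes, edges, finish, post_executed):
    """Return a pruned copy of (nodes, edges); node[1] is the operation name."""
    nodes, edges = set(nodes), set(edges)
    # Step 1: drop nodes that have incoming but no outgoing edges (not the finish node).
    changed = True
    while changed:
        changed = False
        for n in list(nodes):
            if n == finish:
                continue
            has_in = any(d == n for _, d in edges)
            has_out = any(s == n for s, _ in edges)
            if has_in and not has_out:
                nodes.discard(n)
                edges = {(s, d) for s, d in edges if d != n}
                changed = True
    # Step 2: drop starting nodes that represent a post-executed operation.
    starts = [n for n in nodes
              if n != finish
              and not any(d == n for _, d in edges)
              and n[1] in post_executed]
    for n in starts:
        nodes.discard(n)
        edges = {(s, d) for s, d in edges if s != n}
    return nodes, edges

# Hypothetical four-node example (the first tuple element is a placeholder).
a = ("ref", "Open_File", "In_Use")
b = ("ref", "Close_File", "In_Use")
c = ("In_Use", "File_Opened", "output")
d = ("ref", "Lock_File", "Locked_By")          # dead end: never reaches the finish node
example_nodes = {a, b, c, d, "OUTPUT"}
example_edges = {(a, c), (b, c), (c, "OUTPUT"), (a, d)}
pruned_nodes, pruned_edges = prune_cfg(example_nodes, example_edges,
                                       "OUTPUT", {"Close_File"})
```

In this example the dead-end node d is removed in the first step and the Close_File starting node b in the second, leaving only the path from Open_File through File_Opened to the finish node.<br />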
¹ The operations Lock_File and Unlock_File come from the file<br />
system used in [3], which is described in Section III.<br />
© 2011 ACADEMY PUBLISHER<br />
D. Search for information flow paths and identification<br />
of covert channels<br />
A pruned covert flow graph includes all information<br />
flow paths in the system, but not all of them are covert<br />
storage channels. The next task is to search for<br />
information flow paths and identify covert storage<br />
channels. According to the minimum criteria that a covert<br />
storage channel must satisfy [3], using a covert<br />
channel to communicate between two users has three<br />
characteristics:<br />
(1) The sending process (or user) must be able to<br />
modify some shared attribute’s value.<br />
(2) The receiving process must be able to detect the<br />
attribute change. Namely, the attribute’s value<br />
should be returned by an operation invoked by the<br />
receiving process.<br />
(3) The security class of the sending process must<br />
dominate, or be incomparable to, that of the receiving<br />
process.<br />
These characteristics constrain the operation<br />
sequences of covert channels. Because the operation<br />
names are included in the nodes of covert flow graphs, the<br />
operation sequences can be obtained when searching for<br />
information flow paths in covert flow graphs. From<br />
characteristics (1)-(3) above, the following rules for covert<br />
channel identification based on Covert Flow Graphs can<br />
be derived.<br />
Regulation 1. If the start operation in an operation<br />
sequence for covert communication is dependent on another<br />
operation, the operation sequence cannot form a covert channel.<br />
In a system, whether the reading or<br />
writing action of some operation can execute may be decided by<br />
the execution results of another operation. The former operation is<br />
dependent on the latter one. According to characteristic (1),<br />
the sending process must modify the shared attribute’s value<br />
independently, so this kind of dependent operation cannot<br />
be exploited by the sending process.<br />
Regulation 2. If an operation sequence can form a<br />
covert channel, its corresponding information flow path<br />
must end with the finish node in the covert flow graph.<br />
According to characteristic (2), the receiver has to invoke<br />
an operation that finally outputs the attribute’s value.<br />
Because the nodes representing a returned attribute value<br />
must be connected to the finish node, Regulation<br />
2 is valid.<br />
Regulation 3. If an operation sequence can form a<br />
covert channel, the user authority that the start<br />
operation needs should not be dominated by that of the<br />
end operation in the operation sequence.<br />
If Regulation 3 is not satisfied, then the<br />
communication is a legal channel between the<br />
sender and receiver because the sender’s security class is<br />
lower than or equal to the receiver’s.<br />
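The dominance test behind Regulation 3 can be illustrated with a small Python sketch. The paper does not specify how security classes are represented; the sketch assumes the common lattice encoding of a numeric level plus a set of categories, and the names SecurityClass, dominates and flow_is_legal are hypothetical.<br />

```python
from typing import FrozenSet, NamedTuple

class SecurityClass(NamedTuple):
    level: int                   # higher number = more sensitive (assumed encoding)
    categories: FrozenSet[str]   # compartments attached to the class (assumed encoding)

def dominates(a: SecurityClass, b: SecurityClass) -> bool:
    """a dominates b if its level is at least b's and it holds all of b's categories."""
    return a.level >= b.level and a.categories >= b.categories

def flow_is_legal(sender: SecurityClass, receiver: SecurityClass) -> bool:
    """A flow is legal when the receiver's class dominates the sender's."""
    return dominates(receiver, sender)

high = SecurityClass(2, frozenset({"crypto"}))
low = SecurityClass(1, frozenset())
other = SecurityClass(1, frozenset({"nuclear"}))   # incomparable with high
```

With these example classes, a flow from low to high is legal, while flows from high to low, or from high to the incomparable class other, are the cases in which a covert channel would violate the policy.<br />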
According to Regulation 2, only those paths that end<br />
with the finish node in the covert flow graph may be<br />
potential covert channels. So, the search method for<br />
information flow paths consists of the following steps.<br />
First, get the converse digraph of the covert flow graph,<br />
named CFG⁻¹. In CFG⁻¹, the finish node is the only node<br />
without incoming edges. Second, use depth-first search<br />
to find all the paths that begin with the<br />
finish node. While searching, determine whether each path can be<br />
exploited as a covert channel, using<br />
Regulations 2 and 3 as the criteria. To avoid endless cycles<br />
while searching, a directed edge may appear only once<br />
in a path. The search and judge algorithm is as follows:<br />
Algorithm 1: the search and judge algorithm<br />
Procedure PathSearching(CFG⁻¹)<br />
Input: CFG⁻¹<br />
Output: PATH // information flow paths<br />
Begin<br />
  initial stack1 // stack1 is used to backtrack<br />
  initial PATH<br />
  T := Φ // T is the set of direct edges<br />
  push(stack1, OUTPUT) // push the OUTPUT node into the stack<br />
  while stack1 != NULL do<br />
    pop(stack1, v)<br />
    push(stack1, NULL) // NULL presents the null node<br />
    flag := FALSE<br />
    while (∃ v→w ∈ CFG⁻¹) ∧ (v→w ∉ T) do<br />
      push(stack1, w)<br />
      T := T ∪ {v→w}<br />
      flag := TRUE<br />
    od<br />
    if flag = TRUE then<br />
      push(stack2, (v, w))<br />
    else<br />
      JudgeCovertChannel()<br />
      pop(stack1, v)<br />
      while v = NULL do<br />
        pop(stack2, (v, w))<br />
        T := T − {v→w}<br />
        pop(stack1, v)<br />
      od<br />
      push(stack1, v)<br />
    fi<br />
  od<br />
End<br />
Procedure JudgeCovertChannel()<br />
Output: CC // potential covert channels<br />
Begin<br />
  i := 0<br />
  j := 0<br />
  while PATH != NULL do<br />
    pop(PATH, (v, w))<br />
    Array[i++] := w<br />
    if the operation in w is independent<br />
    then lab[j++] := i<br />
  od<br />
  v := Array[i-1]<br />
  for k := 0 to j-1 do<br />
    w := Array[lab[k]]<br />
    if sl(w) ≮ sl(v) then // sl(x) presents node’s security class<br />
      CC ← Array[lab[k]..i-1]<br />
      output CC<br />
    fi<br />
  od<br />
End<br />
In Algorithm 1, the procedure PathSearching performs a<br />
depth-first search for paths in CFG⁻¹. To find<br />
all paths, a stack structure, named stack1, is needed for<br />
backtracking. While backtracking, the number of steps is<br />
determined by the number of NULL nodes popped from stack1.<br />
stack2 is another stack structure, used to record the<br />
nodes in a path. During the search, whenever a<br />
directed edge is found, the pair of nodes<br />
corresponding to the edge is pushed onto stack2.<br />
Furthermore, a set T records the edges already used in the<br />
current path, so that every edge appears only once in a path<br />
and endless cycles are avoided. Once a path has been found, the<br />
procedure JudgeCovertChannel is invoked to judge whether<br />
covert channels exist in the path.<br />
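The search and judge idea can also be sketched in Python. The sketch below is not a transcription of Algorithm 1: it uses recursion instead of the two explicit stacks, walks predecessor lists instead of building CFG⁻¹ explicitly, and takes the set of independent operations and the dominance test as inputs supplied by the analyst. The names maximal_paths and candidates, the example graph, and the numeric security levels are hypothetical.<br />

```python
def maximal_paths(edges, finish):
    """Enumerate maximal paths that end at finish, searching backwards depth-first.

    Each directed edge is used at most once per path, which bounds the search
    even when the graph contains cycles.
    """
    preds = {}
    for s, d in edges:
        preds.setdefault(d, []).append(s)
    paths = []

    def dfs(node, path, used):
        extended = False
        for p in preds.get(node, []):
            edge = (p, node)
            if edge in used:
                continue
            extended = True
            used.add(edge)
            dfs(p, [p] + path, used)
            used.remove(edge)          # backtrack
        if not extended:
            paths.append(path)         # nothing more to prepend: path is maximal

    dfs(finish, [finish], set())
    return paths

def candidates(path, independent, dominated):
    """Subpaths of path whose start operation passes Regulations 1 and 3."""
    if len(path) < 2:
        return []
    end_op = path[-2][1]               # operation of the node just before the finish node
    result = []
    for i, node in enumerate(path[:-1]):
        op = node[1]
        if op in independent and not dominated(op, end_op):
            result.append(path[i:])
    return result

# Hypothetical example: node[1] is the operation name, levels are made up.
a = ("ref", "Open_File", "In_Use")
b = ("ref", "Close_File", "In_Use")
c = ("In_Use", "File_Opened", "output")
example_edges = [(a, c), (b, c), (c, "OUTPUT")]
levels = {"Open_File": 2, "Close_File": 2, "File_Opened": 1}
paths = maximal_paths(example_edges, "OUTPUT")
found = [sub for p in paths
         for sub in candidates(p, {"Open_File"},
                               lambda start, end: levels[start] <= levels[end])]
```

Every path returned by maximal_paths ends at the finish node by construction, so Regulation 2 holds automatically and only Regulations 1 and 3 need an explicit check in candidates.<br />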
III. EXAMPLE FILE SYSTEM ANALYSES USING COVERT<br />
FLOW GRAPH<br />
This section presents the results of an example<br />
covert channel analysis using the CFG approach<br />
described in Section II. A brief description of the example<br />
system, which is taken from [6], is included in order to<br />
provide an overview of the basic functions of the<br />
primitive operations and attributes. For a more detailed<br />
description of the system, the reader is referred to [3], the<br />
paper from which the specification was taken. The<br />
operation description lists used in the construction of the<br />
covert flow graph are also taken from [6].<br />
A. A brief description <strong>of</strong> the file system example<br />
The attributes of the system include six file attributes,<br />
three process attributes, and one attribute associated with<br />
the global state of the system: Current_Process.<br />
Current_Process contains the ID of the process currently<br />
running on the CPU. The file attributes of the system are<br />
File_ID, Locked, Locked_By, Value, Security_Class, and<br />
In_Use. The three process attributes are Process_ID,<br />
Access_Rights and Buffer. The operations are discussed<br />
in more detail in the following paragraphs.<br />
The Write_File operation is used by a process to<br />
change the contents <strong>of</strong> a file. The file is locked by the<br />
current process. The value <strong>of</strong> the file is modified to<br />
contain the contents <strong>of</strong> the current process's buffer.<br />
The Read_File operation is used by a process to<br />
interrogate the contents of a file. If the current process is<br />
included in the in-use set for the file specified, the value<br />
of the file is copied to the current process's buffer.<br />
The Lock_File operation is used by a process that intends to<br />
modify the contents of a file. A process must lock a file<br />
before modifying it and must unlock the file after the<br />
modification is complete. If the current process has write<br />
access for the specified file, if the file specified is<br />
unlocked, and if its in-use set is empty, then the file is<br />
locked, and its Locked_By attribute is set to the ID of the<br />
current process.<br />
The Unlock_File operation makes a file accessible<br />
when a process is done modifying its contents. If the<br />
specified file's locked by attribute is the current process,<br />
the file is unlocked.
The Open_File operation is used by a process to<br />
initiate retrieval of the contents of a file. This primitive<br />
guarantees that no other process is modifying the contents<br />
of the file being interrogated. If the current process has<br />
read access for the specified file and the file is not locked,<br />
the current process's ID is added to the in-use set for this<br />
file.<br />
The Close_File operation is used when a process has<br />
completed interrogation <strong>of</strong> a file and wants to release it so<br />
that it can be modified. If the current process's id is an<br />
element <strong>of</strong> the in-use set for the specified file, then it is<br />
removed from that set.<br />
The File_Locked operation is used by a process to<br />
determine whether a file is locked. If the current process has<br />
write access for the specified file, then, if the file is<br />
locked, a value of true is returned. If the file is unlocked,<br />
the value false is returned. If the current process lacks<br />
write access for the specified file, the result is undefined.<br />
The File_Opened operation is used by a process to<br />
determine whether a file is open for reading. If the current<br />
process has write access for the specified file, then, if the<br />
file's in-use set is nonempty, a value <strong>of</strong> true is returned. If<br />
it is empty the value false is returned. If the current<br />
process does not have write access for the specified file,<br />
the result is undefined.<br />
The View_Buf operation is introduced to explicitly<br />
state how a process is allowed to view its buffer attribute.<br />
The lists constructed from the operation descriptions<br />
are given in Table II.<br />
TABLE II. FILE SYSTEM OPERATION DESCRIPTION LISTS<br />
Operation | Reference List | Modify List | Return List<br />
Write_File | Buffer, Current_Process, Locked_By, Locked | Value | Null<br />
Read_File | Value, Current_Process, In_Use | Buffer | Null<br />
View_Buf | Buffer | Null | Buffer<br />
Lock_File | Current_Process, Access_Rights, Locked, In_Use, Security_Class | Locked, Locked_By | Null<br />
Unlock_File | Locked, Locked_By, Current_Process | Locked | Null<br />
Open_File | Current_Process, Access_Rights, Security_Class, Locked | In_Use | Null<br />
Close_File | Current_Process, In_Use | In_Use | Null<br />
File_Opened | Access_Rights, Security_Class, In_Use | Null | In_Use<br />
File_Locked | Access_Rights, Security_Class, Locked | Null | Locked<br />
B. Example covert flow graph and scenario list for file<br />
system example<br />
Fig.3 is the CFG constructed for the file system<br />
example. To make the analysis easier, the two nodes marked<br />
with dark grey color are considered first.<br />
The triple describes the information flows by executing the<br />
operation Close_File. The node is the only one that connects to the marked<br />
node, which describes the information flows by executing<br />
the operation Open_File, shown in Fig.3. Because<br />
Open_File and Close_File are a pair of consecutive<br />
operations, they nullify each other’s effects when<br />
they occur together in an operation sequence. Therefore they can be<br />
removed from the operation sequence. And in the CFG,<br />
the dotted edge from to<br />
should<br />
be deleted . This results in the node has no indegree. Because<br />
Close_File is post-executed operation, the node<br />
and<br />
edges from it can also be deleted in the CFG.<br />
Similarly, the Lock_File and Unlock_File are the pair<br />
<strong>of</strong> consecutive operations. But they can nullify each<br />
other’s effects only on the Locked attribute. So the dotted<br />
edge from <br />
to should be deleted, however the<br />
edge from to should not be<br />
deleted.<br />
In Fig 3, the path with bold black lines is one of the<br />
information flow paths found by Algorithm<br />
1. Each subpath to the finish node in the path may be a<br />
potential covert channel. Because Read_File, Write_File<br />
and Unlock_File are dependent on other operations and<br />
Lock_File’s security class does not dominate View_Buf’s,<br />
the subpaths that start with these four operations cannot<br />
be used as covert channels. Only the subpaths starting<br />
with Open_File can be exploited. The corresponding<br />
operation sequences are shown in Fig 4.<br />
Figure 3. The covert flow graph <strong>of</strong> example system<br />
Figure 4. Potential covert communication sequences starting with<br />
Open_File
The two sequences in Fig 4 need further analysis by<br />
constructing covert scenarios to determine whether they<br />
are covert channels. The result is that only sequence (2) is<br />
a covert channel. In this covert channel, the user with the high<br />
security class can choose whether to invoke Open_File, while<br />
another user with a low security class can infer the former<br />
user’s action by invoking a series of operations.<br />
Fig 5 shows the covert communication sequences found<br />
in the example system by the Covert Flow Graph method.<br />
The method finds the six covert channels that were<br />
identified by CFT in reference [6], shown as sequences (a)-(f) in<br />
Fig 5. In addition, sequence (g) represents a new covert<br />
channel that was not found by CFT. The corresponding<br />
covert scenario can be constructed as follows: the<br />
sender affects the receiver’s observation by choosing<br />
whether or not to invoke Open_File. If the sender invokes<br />
Open_File to open a file, then the receiver cannot lock<br />
the same file. The following Open_File operation<br />
invoked by the receiver will be successful and File_Opened<br />
will return TRUE to the receiver. Otherwise, the receiver will<br />
get FALSE from File_Opened. Therefore, the receiver<br />
can detect whether the sender has opened the given file.<br />
Table III enumerates the covert channel analysis<br />
results for the above file system with the Shared Resource<br />
Matrix, Covert Flow Tree and Covert Flow<br />
Graph methods. With the SRM method, only the exploited shared<br />
resource attribute can be detected, while both the CFT and<br />
CFG approaches provide detailed covert<br />
communication sequences.<br />
Figure 5. Potential covert communication sequences existing in<br />
example system<br />
TABLE III. CORRESPONDENCE BETWEEN CHANNEL ANALYSIS TECHNIQUES FOR THE FILE SYSTEM EXAMPLE<br />
SRM | CFT | CFG<br />
Covert channel using File_Locked to sense changes in Locked | Covert communication sequences A | Covert communication sequences A<br />
Covert channel using Lock_File to sense changes in Locked | Covert communication sequences B | Covert communication sequences B<br />
Covert channel using Lock_File to sense changes in In_Use | Covert communication sequences C, D, E | Covert communication sequences C, D, E, G<br />
Covert channel using File_Opened to sense changes in In_Use | Covert communication sequences F | Covert communication sequences F<br />
IV. COMPARISON AMONG SRM, CFT AND CFG<br />
The Shared Resource Matrix approach has worked well<br />
since it was introduced. Its major problem may be<br />
that it cannot provide the operation sequences that<br />
help the analysis of covert channels. In contrast, the CFT<br />
approach can present the operation sequences in a<br />
tree structure. Compared with CFT, the CFG may have<br />
two advantages, as follows:<br />
(1) The CFG can provide almost complete<br />
information flows of a system in one graph, while<br />
the CFT approach has to construct a tree for every shared<br />
attribute that can be modified by operations.<br />
Usually the size of the tree structure is quite<br />
large. For example, the CFT representing the<br />
information flow via attribute In_Use for the file<br />
system example used 136 nodes. The CFG in<br />
Fig.6 uses only 11 nodes for all attributes that can<br />
be exploited for covert communication.<br />
(2) The CFT construction algorithm depends on a<br />
parameter, called REPEAT, which is used to<br />
keep the construction from producing infinite tree<br />
paths. The parameter defines the number of times<br />
any attribute may be repeated in an inference<br />
path, thus providing the analyst with a way to<br />
avoid CPU or memory exhaustion by controlling<br />
the depth of the CFT paths. But an unsuitable value<br />
of REPEAT may result in missing covert<br />
channels. For example, when REPEAT is set to 0,<br />
scenarios D and E would not be discovered.<br />
The CFG avoids this problem.<br />
Notwithstanding these advantages, CFG encounters<br />
problems similar to the CFT approach. In the CFG,<br />
pseudo communication paths still exist. One way to<br />
reduce pseudo communication paths is to consider a finer<br />
relationship between the referenced and modified<br />
attributes and to consider conditional modifies and<br />
references, on which our research group and others are<br />
working.<br />
V. CONCLUSION AND FUTURE WORK<br />
This paper introduces a technique for detecting covert<br />
channels. The approach uses covert flow graphs, which<br />
can present the information flow paths and operation<br />
sequences. An algorithm for searching information flow<br />
paths and judging potential covert channels is introduced.<br />
To illustrate the approach, an example file system is<br />
analyzed and the result is compared to a previous channel<br />
analysis of the same system using the CFT approach.<br />
Compared with the SRM method, the Covert Flow Graph<br />
approach can provide operation sequences. At the same<br />
time, the Covert Flow Graph approach avoids the<br />
difficult problem that the CFT method has encountered.<br />
In future work, other example systems should be<br />
analyzed with the Covert Flow Graph approach. The emphasis<br />
will be put on an automated tool for the construction of<br />
covert flow graphs.<br />
ACKNOWLEDGMENT<br />
This work was supported by the National Natural<br />
Science Foundation <strong>of</strong> China (Grant Nos. 60773049,<br />
61003288), the Ph.D. Programs Foundation <strong>of</strong> Ministry<br />
<strong>of</strong> Education <strong>of</strong> China (Grant Nos. 20093227110005), the<br />
Natural Science Foundation <strong>of</strong> Jiangsu Province (Grant
Nos. BK2010192), the People with Ability Foundation <strong>of</strong><br />
Jiangsu University(Grant Nos.07JDG014), the<br />
Fundamental Research Project <strong>of</strong> the Natural Science in<br />
Colleges <strong>of</strong> Jiangsu Province (Grant Nos.<br />
08KJD520015).<br />
REFERENCES<br />
[1] D. E. Bell, L. J. LaPadula, “Secure Computer Systems:<br />
Unified Exposition and Multics Interpretation,” Mitre<br />
Corp., Bedford, MA, Tech. Rep. ESD-TR-75-306, 1975.<br />
[2] R. A. Kemmerer, P. A. Porras, “Covert Flow Trees: A<br />
Visual Approach to Analyzing Covert Storage Channels,”<br />
IEEE Transactions on Software Engineering, vol. 17, no.<br />
11, pp. 1166-1185, Nov. 1991.<br />
[3] R. A. Kemmerer, “Shared Resource Matrix Methodology:<br />
An Approach to Identifying Storage and Timing Channels,”<br />
ACM Transactions on Computer Systems, vol. 1, no. 3, pp.<br />
256-277, Aug. 1983.<br />
[4] J. Goguen, J. Meseguer, “Security Policies and Security<br />
Models,” in Proc. 1982 Symposium on Security and<br />
Privacy, pp. 11-20, IEEE Press, New York, 1982.<br />
[5] D. E. Denning, “A Lattice Model of Secure Information<br />
Flow,” Communications of the ACM, vol. 19, no. 5, pp.<br />
236-243, May 1976.<br />
[6] P. A. Porras, R. A. Kemmerer, “Covert Flow Tree Analysis<br />
Approach to Covert Storage Channel Identification,”<br />
Comput. Sci. Dept., Univ. California, Santa Barbara, Tech.<br />
Rep. TRCS 90-26, Dec. 1990.<br />
[7] S.H. Qing, J.F. Zhu, “Covert Channel Analysis on<br />
ANSHENG Secure Operating System,” Journal of<br />
Software, vol. 15, no. 9, pp. 1385-1392, 2004.<br />
[8] J. McHugh, “Handbook for the Computer Security<br />
Certification of Trusted Systems - Covert Channel<br />
Analysis,” Technical Report, Naval Research Laboratory,<br />
Feb. 1996.<br />
[9] J.J. Shen, S.H. Qing, Q.N. Shen, L.P. Li, “Covert Channel<br />
Identification Founded on Information Flow Analysis,”<br />
Lecture Notes in Computer Science, vol. 3802, pp. 381-<br />
387, 2005.<br />
[10] J.J. Shen, S.H. Qing, Q.N. Shen, L.P. Li, “Optimization of<br />
Covert Channel Identification,” in Proceedings of the Third<br />
IEEE International Security in Storage Workshop<br />
(SISW’05), 13 Dec. 2005.<br />
[11] J. Zeng, S.G. Ju, X.M. Song, “Construct Information Flow<br />
Graph Based on PDG,” Computer Science and<br />
Computational Technology, vol. 1, pp. 756-759, 20-22<br />
Dec. 2008.<br />
[12] Y.J. Wang, J.Z. Wu, H.T. Zeng, L.P. Ding, X.F. Liao,<br />
“Covert Channel Research,” Journal of Software, vol. 21,<br />
no. 9, pp. 2262-2288, Sep. 2010.<br />
XiangMei Song (Jilin Province, China; born November 1979)<br />
is a doctoral student in Computer Science at the<br />
School of Computer Science and<br />
Telecommunication Engineering, Jiangsu<br />
University. Her research interest is<br />
information security.<br />
She is a senior lecturer in the Department of<br />
Information Security, School of Computer<br />
Science and Telecommunication Engineering, Jiangsu<br />
University.<br />
ShiGuang Ju (Jiangsu Province, China; born May 1955)<br />
holds a Ph.D. in Computer Science from the National<br />
Polytechnic Institute (Mexico). His<br />
research interests are information security<br />
and databases.<br />
He is a professor and Ph.D. supervisor<br />
at the School of Computer Science and<br />
Telecommunication Engineering, Jiangsu<br />
University.
A Novel HAVE Message <strong>of</strong> Peer-to-peer<br />
Protocol in BitTorrent Systems<br />
Jianyong Li<br />
School <strong>of</strong> Computer & Communication Engineering, Zhengzhou University <strong>of</strong> Light Industry, Zhengzhou, China<br />
Email: lijianyong@zzuli.edu.cn<br />
Jianchun Li, Daoying Huang and Qiang Wei<br />
School <strong>of</strong> Computer & Communication Engineering, Zhengzhou University <strong>of</strong> Light Industry, Zhengzhou, China<br />
Email: lijianchun@zzuli.edu.cn, dyhuang@zzuli.edu.cn, weiqiang200456@163.com<br />
Abstract—In BitTorrent systems, there are eleven types <strong>of</strong><br />
messages for data communication between peers, among<br />
which HAVE, REQUEST and PIECE messages are the<br />
three main types in terms of quantity and flow.<br />
To improve the efficiency of network transmission<br />
and decrease the management costs of file delivery, this<br />
paper investigates the HAVE message mechanism of<br />
BitTorrent systems and proposes a novel MultiHAVE<br />
message scheme, which combines several HAVE messages<br />
by means of a properly set timer. Experimental results show that in<br />
an environment of high bandwidth and consistent peers,<br />
with the assistance of the timer, the flow ratio of<br />
MultiHAVE messages to HAVE messages can be reduced to<br />
11%, so the MultiHAVE message can decrease the flow of<br />
messages and prevent HAVE message storms efficiently.<br />
Furthermore, the MultiHAVE message can adapt itself to<br />
various BT systems with various bandwidths. If the behavior<br />
of network peers is inconsistent, it degenerates to the<br />
original HAVE message and keeps the high performance of<br />
BitTorrent systems.<br />
Index Terms—BitTorrent, protocol, Peer-to-peer networks,<br />
MultiHAVE message, performance analysis<br />
I. INTRODUCTION<br />
BitTorrent (BT) is a Peer-to-peer (P2P) protocol<br />
designed to distribute and replicate data quickly,<br />
efficiently and fairly [1-2]. Its technological principle is<br />
similar to that of other P2P downloading<br />
software. In a BT system, each peer is a client as well as a<br />
server, so the more people download a file, the faster<br />
the download is. Numerous practical results have verified the<br />
flexibility, efficiency and reliability of BT systems [3].<br />
However, the wide usage of BT systems may result<br />
in message storms and decrease the communication<br />
efficiency. Recent studies showed that the proportion of<br />
P2P traffic on backbone links has increased from 10%<br />
to 80% [4-6] and that BitTorrent traffic increased from<br />
26% to 52% of the total P2P traffic during the first half of<br />
2004, and even amounted to 60% in 2005, according to the<br />
report of CacheLogic [4]. Due to the extensive use of BT<br />
systems and the congestion of local networks, many ISPs<br />
began to constrain the application of BT systems.<br />
However, some of the original file-distributing services<br />
doi:10.4304/jnw.6.12.1747-1753<br />
based on the central servers need to invoke the support <strong>of</strong><br />
BT systems.<br />
In order to improve the performance of BT systems,<br />
much research has been carried out to modify the<br />
existing BitTorrent mechanisms. Qureshi [7] suggested<br />
using proximity in the BitTorrent overlay network, so that<br />
peers that are close by in the real world are<br />
close by in the overlay network. Bindal et al. [8] proposed<br />
a new algorithm based on biased neighbor selection for<br />
the cross-ISP problem. In [9], Yamazaki et al. put forward<br />
the so-called Cost-Aware BitTorrent strategies to reduce<br />
ISP costs. To improve the piece exchange mechanism,<br />
Garbacki et al. [10] proposed a protocol named 2Fast,<br />
which extended the bartering model of BitTorrent, and<br />
Garbacki et al. [11] extended it by proposing a novel<br />
mechanism in which incentives are built around<br />
bandwidth rather than content. Noting that a free-rider<br />
is a node that downloads pieces from other peers but does<br />
not upload any pieces to others, Sirivianos et al. [12]<br />
presented a new free-riding technique named the large<br />
view exploit and suggested a modification to the<br />
BitTorrent tracker and clients to address the problem.<br />
In this paper, we investigate the performance<br />
enhancement of BitTorrent systems by reducing the<br />
management costs. In BitTorrent systems,<br />
there are eleven types of messages for data<br />
communication between peers, and the management costs<br />
depend mainly on the HAVE, REQUEST<br />
and PIECE messages. In some specific<br />
applications, management costs even reach 23% [13]. A<br />
HAVE message is sent once the peer has received an<br />
entire piece and verified it against the corresponding hash<br />
value in the torrent file. The purpose of the message is to inform<br />
all the connected peers so that they can update the<br />
downloaded-piece information that was announced by the<br />
BITFIELD message in the HANDSHAKE stage. In a BT<br />
system, the number of peers that the tracker returns can reach<br />
50 because of the numerous peers joining the system.<br />
Correspondingly, the number and flow of<br />
HAVE messages increase quickly and may result in a<br />
HAVE message storm.<br />
Actually, sending HAVE message to all peers in a high<br />
frequency cannot improve other peers’ downloading rate.
1748 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />
Reducing the frequency of HAVE messages therefore not only relieves the burden on peers of sending and receiving HAVE messages but also reduces the network bandwidth costs. Based on this consideration, we propose a novel HAVE message mechanism, the MultiHAVE message, to improve the efficiency of network transmission and decrease the management costs of file delivery. A MultiHAVE message aggregates several HAVE messages under the control of a suitably set timer. The regular sending scheme of the MultiHAVE message is analyzed as well. To show the effectiveness of the proposed mechanism, we compare the performance of the MultiHAVE message and the HAVE message. Experimental results show that, in an environment of high bandwidth and consistent peers, the flow of MultiHAVE messages is reduced to 11% of the flow of HAVE messages. The proposed MultiHAVE message can thus effectively decrease the amount of HAVE messages and reduce the management costs of a BT system.<br />
The rest of this paper is organized as follows. In Section 2, we propose the structure of the MultiHAVE message and describe its regular sending scheme. In Section 3, we compare the performance of the MultiHAVE message and the HAVE message. Experimental results are given in Section 4 to verify the efficiency of the proposed scheme. Section 5 summarizes the paper and draws the conclusion.<br />
II. STRUCTURE OF MULTIHAVE MESSAGE AND REGULAR<br />
SENDING SCHEME OF MULTIHAVE MESSAGE<br />
A. Structure of MultiHAVE Message<br />
The purpose of the HAVE message is to inform all the connected peers so that they can update the downloaded-piece information first announced by the BITFIELD message in the HANDSHAKE stage. Sending HAVE messages to all peers at a high frequency does not improve other peers’ downloading rate. In this subsection we propose a new HAVE message mechanism, which aggregates several HAVE messages under the control of a suitably set timer.<br />
Noting that, apart from HANDSHAKE and KEEP ALIVE, the other 9 messages have a 4B message prefix and a 1B message ID, the structure of the MultiHAVE message is defined as follows:<br />
(1) 4B message prefix. The prefix gives the size in bytes of the message ID plus the payload of the MultiHAVE message. Its value is n × 4 + 1, where n is the number of piece indices in the payload.<br />
(2) 1B message ID. The largest message ID in the current BT system is 8. Here the value is declared as 9.<br />
(3) Payload. The length is n × 4 B, where n is the number of pieces. Each 4B field represents the index of a piece.<br />
The comparison between the MultiHAVE message and the HAVE message is shown in TABLE I.<br />
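The three-field layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and big-endian byte order is assumed for MultiHAVE because that is how the standard BitTorrent wire messages encode integers; the paper itself does not specify it.<br />

```python
import struct

HAVE_ID = 4        # message ID of the standard HAVE message (TABLE I)
MULTIHAVE_ID = 9   # message ID the paper declares for MultiHAVE

def encode_have(index: int) -> bytes:
    """Standard HAVE: length prefix 5, ID 4, one 4-byte piece index."""
    return struct.pack(">IBI", 5, HAVE_ID, index)

def encode_multihave(indices) -> bytes:
    """MultiHAVE: length prefix 4n + 1, ID 9, then n 4-byte piece indices."""
    n = len(indices)
    return struct.pack(f">IB{n}I", 4 * n + 1, MULTIHAVE_ID, *indices)

def decode_multihave(data: bytes) -> list:
    """Inverse of encode_multihave; raises ValueError on a malformed message."""
    length, msg_id = struct.unpack_from(">IB", data)
    if msg_id != MULTIHAVE_ID or (length - 1) % 4 != 0:
        raise ValueError("not a MultiHAVE message")
    n = (length - 1) // 4
    return list(struct.unpack_from(f">{n}I", data, 5))

if __name__ == "__main__":
    msg = encode_multihave([1, 2, 3])
    print(len(msg), decode_multihave(msg))
```

With n = 1 the payload is a single index, so the message carries the same information as an ordinary HAVE message and differs only in its ID.<br />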
B. Regular Sending Scheme of MultiHAVE Message<br />
The purpose of sending a HAVE message is to notify other peers of the local peer’s downloaded-piece state. It also updates the downloaded-piece information that was announced by the BITFIELD message in the HANDSHAKE stage.<br />
© 2011 ACADEMY PUBLISHER<br />
TABLE I.<br />
THE COMPARISON OF MULTIHAVE MESSAGE AND HAVE MESSAGE<br />
Message name | Length prefix | Message ID | Payload<br />
HAVE | 0005 | 4 | Integer/4B<br />
MultiHAVE | Payload + 1 | 9 | Variable length<br />
Conventionally, when a peer receives a piece, it sends a HAVE message to tell all the connected peers that it has the piece. As the number of connected peers increases, the number of HAVE messages can grow as fast as $O(n^2)$, where n is the number of connected peers. In particular, in a high-bandwidth network environment, within a choking cycle (10s) or an optimistic unchoking cycle (30s), a high-speed peer may receive hundreds of MB of data. With a typical piece size of 256KB, a peer that receives 100MB in 10s or 300MB in 30s completes 400 or 1200 pieces and sends a HAVE message for each of them to all its connected peers. If the default number of connections is 50, the peer sends a total of 20000 or 60000 HAVE messages, 2000 per second on average. Under this circumstance, a serious HAVE message storm appears at the receiving peer. The above calculation counts only the HAVE messages that one receiving peer sends. In fact, every peer behaves similarly because the relationship between them is symmetrical. If each peer sends and receives equal amounts of data in a period, the entire bandwidth is shared between uploading and downloading. The data that each peer receives are then halved, and the frequency at which a peer sends HAVE messages drops to 1000 per second. It should be noted that, because of the symmetry of peer actions (called peer action consistency), in this period each peer also receives a total of 1000 HAVE messages per second from the other 50 peers, which amounts to 51000 HAVE messages per second among 51 peers. Clearly, such dense HAVE message transmission seriously affects the performance of the entire network.<br />
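The arithmetic behind these figures can be checked with a short sketch. The download rate of 10MB/s is an assumption chosen so that a peer receives 100MB per 10s choking cycle (the text only says "hundreds of MB"), and the function name is hypothetical.<br />

```python
MB = 1024 * 1024
PIECE_SIZE = 256 * 1024   # typical piece size used in the text (bytes)
CONNECTIONS = 50          # default number of connected peers

def have_messages(bytes_received: int, connections: int = CONNECTIONS) -> int:
    """HAVE messages sent after receiving `bytes_received` bytes:
    one per completed piece and per connected peer."""
    return (bytes_received // PIECE_SIZE) * connections

RATE = 10 * MB            # assumed download rate in bytes/s (not from the paper)

per_choke_cycle = have_messages(RATE * 10)   # 10 s choking cycle
per_optimistic = have_messages(RATE * 30)    # 30 s optimistic unchoking cycle
per_second = per_choke_cycle // 10

# With bandwidth shared equally between upload and download, the received
# data and therefore the HAVE rate are halved.
per_second_shared = have_messages((RATE // 2) * 10) // 10

if __name__ == "__main__":
    print(per_choke_cycle, per_optimistic, per_second, per_second_shared)
    # Total among 51 symmetric peers, each sending at the shared rate.
    print(51 * per_second_shared)
```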
When a small number of low-bandwidth peers and a large number of high-bandwidth peers coexist in a BT system, the high-bandwidth peers may send a mass of HAVE messages in a period. Low-bandwidth peers must receive and handle every HAVE message. The large amount of HAVE messages occupies their scarce bandwidth and blocks the PIECE messages that carry real data. In serious cases, the low-bandwidth peers may be unable to download any data for a long time. In other words, in a network that large numbers of high-bandwidth peers are constantly joining, the low-bandwidth peers are likely to be overwhelmed by the HAVE message storm.<br />
To avoid creating a new MultiHAVE message storm, the sending frequency of MultiHAVE messages should be taken into consideration when deciding the payload of a MultiHAVE message. In practice, this can be managed by a timer. When the timer times out, the peer aggregates all the HAVE messages produced by the newly received pieces, composes them into one MultiHAVE message and sends it. At the same time the timer restarts for the next round.<br />
The timer interval must be chosen with the 10s choking algorithm cycle and the 30s optimistic unchoking cycle in mind. If a long MultiHAVE cycle (such as 30s) is chosen, two connected high-bandwidth peers may send NOT INTERESTED messages and choke each other, because neither learns of the other’s newly received pieces within the 10s cycle. A low-bandwidth peer (56k modem) cannot obtain a complete 256KB piece within this cycle; by the time it has completed a piece, the timer has timed out and a MultiHAVE message is sent. Given the length of the interval, a MultiHAVE message sent by a low-bandwidth peer therefore always carries only one piece index in its payload. In this case the MultiHAVE message degenerates to the original HAVE message and does not affect the peer’s downloading performance.<br />
In summary, the principles for choosing the timer value are as follows:<br />
(1) It must prevent a new MultiHAVE message storm;<br />
(2) It must not exceed the choking algorithm cycle.<br />
Based on these two principles, a 5s interval (less than 10s) is chosen for the MultiHAVE scheme.<br />
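The timer-driven aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, and the clock is passed in explicitly so that the behaviour is deterministic.<br />

```python
class MultiHaveBatcher:
    """Collects indices of newly completed pieces and releases them as one
    batch (one MultiHAVE payload) each time the timer interval elapses."""

    def __init__(self, interval: float = 5.0, start: float = 0.0):
        self.interval = interval            # 5 s, below the 10 s choking cycle
        self.deadline = start + interval    # next time-out of the timer
        self.pending = []                   # indices not yet announced

    def piece_completed(self, index: int) -> None:
        """Record a verified piece instead of sending a HAVE immediately."""
        self.pending.append(index)

    def tick(self, now: float):
        """Call periodically. Returns the list of indices to send as one
        MultiHAVE message when the timer has expired, otherwise None."""
        if now < self.deadline:
            return None
        self.deadline = now + self.interval   # restart the timer
        if not self.pending:
            return None                       # nothing new to announce
        batch, self.pending = self.pending, []
        return batch

if __name__ == "__main__":
    batcher = MultiHaveBatcher(interval=5.0)
    batcher.piece_completed(3)
    batcher.piece_completed(8)
    print(batcher.tick(4.9))   # timer still running
    print(batcher.tick(5.0))   # both indices released together
```

A slow peer that completes at most one piece per interval produces single-index batches, which is the degenerate case in which MultiHAVE behaves like the original HAVE message.<br />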
III. PERFORMANCE COMPARISON OF MULTIHAVE<br />
MESSAGE AND HAVE MESSAGE<br />
In this section, we calculate the frequency and flow of MultiHAVE and HAVE messages and compare their performance.<br />
First, let n be the number of peers connected with peer A. The frequency of MultiHAVE messages sent by peer A is<br />
$$f_{MS}(n) = \frac{1}{T_I}\, n, \qquad (1)$$<br />
where $T_I$ is the timer interval of the MultiHAVE message. The frequency of MultiHAVE messages received by peer A can be formulated as<br />
$$f_{MR}(n) = \sum_{i=1}^{n} F_{M_i}, \qquad (2)$$<br />
where $F_{M_i}$ is the frequency at which the i-th peer connected with peer A sends MultiHAVE messages to peer A. The flow of MultiHAVE messages sent by peer A is<br />
$$fl_{MS}(n) = \frac{4 B_d}{S_p}\, n + \frac{n}{T_I}\,(O_P + 5), \qquad (3)$$<br />
with $B_d$ being the download bandwidth of peer A, $S_p$ being the size of the piece, and $O_P$ being the size of the TCP/IP header, which is 40B.<br />
The flow of MultiHAVE messages received by peer A is<br />
$$fl_{MR}(n) = \sum_{i=1}^{n} FL_{M_i}, \qquad (4)$$<br />
where $FL_{M_i}$ is the flow of MultiHAVE messages that the i-th peer connected with peer A sends to peer A.<br />
Similarly, the frequency of HAVE messages sent by peer A is<br />
$$f_{HS}(n) = \frac{B_d}{S_p}\, n. \qquad (5)$$<br />
The frequency of HAVE messages received by peer A is<br />
$$f_{HR}(n) = \sum_{i=1}^{n} F_{H_i}, \qquad (6)$$<br />
where $F_{H_i}$ is the frequency at which the i-th peer connected with peer A sends HAVE messages to peer A.<br />
The flow of HAVE messages sent by peer A is<br />
$$fl_{HS}(n) = f_{HS}(n) \times (4 + O_P + 5). \qquad (7)$$<br />
The flow of HAVE messages received by peer A can be presented as<br />
$$fl_{HR}(n) = \sum_{i=1}^{n} FL_{H_i}, \qquad (8)$$<br />
where $FL_{H_i}$ is the flow of HAVE messages that the i-th peer connected with peer A sends to it.<br />
Supposing that the peer actions are consistent, we have<br />
$$f_{MS}(n) = f_{MR}(n) = \frac{1}{T_I}\, n, \qquad (9)$$<br />
$$fl_{MS}(n) = fl_{MR}(n) = \frac{4 B_d}{S_p}\, n + \frac{n}{T_I}\,(O_P + 5), \qquad (10)$$<br />
$$f_{HS}(n) = f_{HR}(n) = \frac{B_d\, n}{S_p}, \qquad (11)$$<br />
$$fl_{HS}(n) = fl_{HR}(n) = f_{HS}(n) \times (4 + O_P + 5). \qquad (12)$$<br />
The ratio of the frequency of sending HAVE messages to the frequency of sending MultiHAVE messages is as follows:<br />
$$\frac{f_{HS}(n)}{f_{MS}(n)} = \frac{B_d\, n / S_p}{n / T_I} = \frac{B_d T_I}{S_p}. \qquad (13)$$<br />
The ratio of the flow of sending HAVE messages to the flow of sending MultiHAVE messages is as follows:<br />
$$\frac{fl_{HS}(n)}{fl_{MS}(n)} = \frac{f_{HS}(n) \times (4 + O_P + 5)}{\dfrac{n}{T_I}\left(\dfrac{4 T_I B_d}{S_p} + O_P + 5\right)} = \frac{\dfrac{B_d\, n}{S_p} \times (O_P + 9)}{\dfrac{n}{T_I}\left(\dfrac{4 T_I B_d}{S_p} + O_P + 5\right)} = \frac{T_I B_d\,(O_P + 9)}{4 T_I B_d + S_p\,(O_P + 5)}. \qquad (14)$$<br />
According to (13) and (14), if the peers’ actions are consistent, the improvement of the MultiHAVE message over the HAVE message depends on the download bandwidth $B_d$, the piece size $S_p$ and the timer interval $T_I$ of the MultiHAVE message. Once these parameters are fixed, the ratio of the frequency of sending HAVE messages to the frequency of sending MultiHAVE messages is constant. The ratio of the flow of sending HAVE messages to the flow of sending MultiHAVE messages is constant, too.<br />
For example, suppose each peer has a maximum upload and download speed of 5MB/s, the piece size is 256KB, the timer interval is 5s and the downloaded files are big enough. Further suppose that the peers’ actions are consistent, and ignore the seed peers. Then the frequency of sending and receiving MultiHAVE and HAVE messages and the flow of MultiHAVE and HAVE messages are as shown in Figure 1 and Figure 2, respectively.<br />
As can be seen from Figure 1 and Figure 2, when the download bandwidth $B_d$ = 5MB/s, the piece size $S_p$ = 256KB and the MultiHAVE timer interval $T_I$ = 5s, the ratio of the frequency of sending HAVE messages to that of sending MultiHAVE messages is 100, and the ratio of the flow of sending HAVE messages to that of sending MultiHAVE messages is 11.01. Both the frequency and the flow of MultiHAVE messages are much lower than those of HAVE messages. The improvement in frequency comes mainly from the MultiHAVE message sending the payloads of several HAVE messages in aggregate, and the improvement in flow comes from no longer repeating the 40B TCP/IP header overhead for every HAVE message.<br />
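The two ratios quoted above follow directly from the model when the peers are consistent. The sketch below evaluates the frequency and flow expressions for the stated parameters; the function names are hypothetical, and 1MB = 1024KB is assumed.<br />

```python
O_P = 40  # TCP/IP header size in bytes, as stated in the text

def have_frequency(n, b_d, s_p):
    """HAVE messages per second sent by a peer with n connections."""
    return b_d * n / s_p

def multihave_frequency(n, t_i):
    """MultiHAVE messages per second sent by a peer with n connections."""
    return n / t_i

def have_flow(n, b_d, s_p, o_p=O_P):
    """Bytes per second of HAVE traffic sent by the peer."""
    return have_frequency(n, b_d, s_p) * (4 + o_p + 5)

def multihave_flow(n, b_d, s_p, t_i, o_p=O_P):
    """Bytes per second of MultiHAVE traffic sent by the peer."""
    return 4 * b_d * n / s_p + (n / t_i) * (o_p + 5)

B_D = 5 * 1024 * 1024   # download bandwidth: 5 MB/s
S_P = 256 * 1024        # piece size: 256 KB
T_I = 5                 # timer interval: 5 s

if __name__ == "__main__":
    for n in (6, 10, 15):
        f_ratio = have_frequency(n, B_D, S_P) / multihave_frequency(n, T_I)
        fl_ratio = have_flow(n, B_D, S_P) / multihave_flow(n, B_D, S_P, T_I)
        # Both ratios are independent of n.
        print(n, round(f_ratio, 2), round(fl_ratio, 2))
```

For n = 10, for instance, the model gives 9800 B/s of HAVE traffic against 890 B/s of MultiHAVE traffic.<br />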
It needs to be pointed out that these conclusions rest on the assumption that the peers are high-bandwidth peers and that their actions are consistent. In an actual network environment, peers often have different bandwidths, that is, high-bandwidth peers and low-bandwidth peers coexist, and the times at which peers join the BT system also differ. In such cases, the peers lose coherence and behave diversely. For high-bandwidth peers, because each peer joins the BT system at a different time, the download may not run at full rate. This reduces the payload of the MultiHAVE message, so the ratio of the frequency of sending HAVE messages to that of MultiHAVE messages drops considerably, and the ratio of the flows drops as well. When the two ratios fall to 1, the MultiHAVE message degenerates to the HAVE message. In addition, for low-bandwidth peers the time to download a piece is often longer than the timer interval of the MultiHAVE message, so the MultiHAVE message again degenerates to the HAVE message. Whatever the circumstances, the HAVE message storm in the BT system is prevented. In fact, with the continuous improvement of the network environment, more and more peers will have high bandwidth, so the MultiHAVE message scheme will play an increasingly effective role.<br />
Figure 1. Frequency of sending and receiving (FSR, times/s, logarithmic scale) of MultiHAVE and HAVE messages versus n (n = 6 to 15)<br />
Figure 2. Flow (B/s, logarithmic scale) of MultiHAVE and HAVE messages versus n (n = 6 to 15)<br />
IV. EXPERIMENT<br />
In this section, experiments are carried out to illustrate the effectiveness of the proposed MultiHAVE message scheme. For the experiment parameters we refer to the first BitTorrent client developed by Bram Cohen, the inventor of the protocol [2]. The main parameters and their default values are as follows:<br />
(1) The maximum upload rate: no limitation;<br />
(2) The minimum number of peers in the peer set before requesting more peers from the tracker: default 20;<br />
(3) The maximum number of connections the local peer can initiate: default 50;<br />
(4) The maximum number of peers in the peer set: default 80;<br />
(5) The number of peers in the active peer set, including the optimistic unchokes: default 4;<br />
(6) The block size: set to 2MB;<br />
(7) The number of pieces downloaded before switching from random to rarest-first piece selection: default 4.<br />
In addition, the downloaded file is 2.15GB in size, the torrent file is 43.1KB, and the downloaded file is divided into 2205 pieces.<br />
The experimental evaluation of the BitTorrent protocol is very complex, and each experiment is not reproducible, as it heavily depends on the behavior of peers, the number of seeds and leechers in the torrent, and the subset of peers randomly returned by the tracker. However, by choosing a large variety of peers and designing the experiment process deliberately, we can identify the fundamental behaviors of the BitTorrent protocol.<br />
During the experiment, ten kinds of messages are exchanged among the peers of the BT system. All the messages are carried over TCP. The size of each message is given including the TCP/IP header overhead of 40B. The details of each message are shown in TABLE II.<br />
TABLE II.<br />
THE COMPARISON OF MULTIHAVE MESSAGE AND HAVE MESSAGE<br />
Message name | Message size/B | Function<br />
HANDSHAKE | 108 | Initiate a connection<br />
CHOKE | 45 | Choke the remote peer<br />
UNCHOKE | 45 | Unchoke the remote peer<br />
INTERESTED | 45 | Interested in the remote peer<br />
NOT INTERESTED | 45 | Not interested in the remote peer<br />
HAVE | 49 | Announce to each remote peer when the local peer has received a new piece<br />
BITFIELD | ⌈Number_piece / 8⌉ + 45 | Notify the remote peer of the pieces the local peer already has<br />
REQUEST | 47 | Request data from the remote peer<br />
PIECE | Length_piece + 53 | Send data to the remote peer<br />
CANCEL | 47 | Cancel request message<br />
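Several entries of TABLE II can be reproduced by adding the 40B TCP/IP header, the 4B length prefix and the 1B message ID to the payload length. The sketch below does this for CHOKE, HAVE, BITFIELD and PIECE. The helper names are hypothetical, the decomposition is an interpretation of the table rather than something the paper spells out, and the MultiHAVE size is derived from the structure given in Section II, not taken from the table.<br />

```python
import math

TCP_IP_HEADER = 40   # bytes of TCP/IP overhead counted per message
PREFIX_AND_ID = 5    # 4-byte length prefix plus 1-byte message ID

def message_size(payload_len: int) -> int:
    """On-the-wire size of a message with the given payload length."""
    return TCP_IP_HEADER + PREFIX_AND_ID + payload_len

def bitfield_size(num_pieces: int) -> int:
    """BITFIELD carries one bit per piece, rounded up to whole bytes."""
    return message_size(math.ceil(num_pieces / 8))

def piece_size(block_len: int) -> int:
    """PIECE carries two 4-byte fields (index, offset) before the data."""
    return message_size(8 + block_len)

def multihave_size(n: int) -> int:
    """MultiHAVE carrying n 4-byte piece indices (Section II layout)."""
    return message_size(4 * n)

if __name__ == "__main__":
    print(message_size(0))        # CHOKE and the other payload-free messages
    print(message_size(4))        # HAVE
    print(bitfield_size(2205))    # BITFIELD for the 2205-piece test file
    # 100 separate HAVE messages versus one MultiHAVE with 100 indices.
    print(100 * message_size(4), multihave_size(100))
```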
The BT systems used in the experiment consist of 1 seed and 5 downloaders, 1 seed and 10 downloaders, and 1 seed and 20 downloaders; each is run with the classical HAVE message and with the MultiHAVE message proposed in this paper. Experimental results are shown in Figure 3 to Figure 5, where the “Before Extension” and “After Extension” columns describe the message flow of the BT system with the conventional HAVE message and with the MultiHAVE message, respectively.<br />
Figure 3. Bytes per type of message with 1 seed and 5 downloaders (logarithmic scale, 1E+00 to 1E+10; series: After Extension, Before Extension; message types: HS, C, UC, I, NI, H, BF, R, P, CA)<br />
Figure 4. Bytes per type of message with 1 seed and 10 downloaders (same axes and series as Figure 3)<br />
Figure 5. Bytes per type of message with 1 seed and 20 downloaders (same axes and series as Figure 3)<br />
It can be seen that in each case, although the flow of NOT INTERESTED, BITFIELD and other messages changes little in the BT systems with the proposed MultiHAVE message, the flow of HAVE messages is reduced by approximately 89%. Furthermore, the flow of BITFIELD messages is reduced to about half of that in the BT systems with the original HAVE message. The proposed MultiHAVE message scheme can therefore reduce the total message amount in the BT systems and hence decrease the management costs.<br />
It should be pointed out that the above experiments are carried out in BT systems with high bandwidth and consistent peers, and that timers are used to assemble the MultiHAVE messages. In a real network environment, the peers often have various bandwidths, that is, high-bandwidth peers and low-bandwidth peers coexist in the system. Furthermore, we cannot demand that all the peers in the network join the BT system at the same time. They actually join the system stochastically. In such cases, the peers lose coherence. For high-bandwidth peers, because each peer joins the BT system at a different time, the download may not run at full rate. This reduces the payload of the MultiHAVE message, so the ratio of the frequency of sending HAVE messages to that of MultiHAVE messages drops considerably, and the ratio of the flows drops as well. When the two ratios fall to 1, the MultiHAVE message degenerates to the original HAVE message. Furthermore, for low-bandwidth peers the time to download each piece is often longer than the timer interval of the MultiHAVE message, so the MultiHAVE message also degenerates to the HAVE message. Whatever the circumstances, the HAVE message storm in the BT system is therefore considerably mitigated.<br />
In fact, with the continuous improvement of the network environment, more and more peers will have high bandwidth, so the MultiHAVE message scheme can work effectively to prevent the HAVE message storm in BT systems.<br />
V. CONCLUSION AND FURTHER WORK<br />
In this paper we propose a novel HAVE message scheme, the MultiHAVE message, to prevent the possible message storm in BT systems. A MultiHAVE message aggregates several HAVE messages under the control of a suitably set timer. By adjusting the timer interval, we can change the size of the MultiHAVE message. We compare the performance of the proposed MultiHAVE message and the conventional HAVE message to illustrate the effectiveness of the MultiHAVE message. Experiments on BT systems with high-bandwidth, consistent peers show that the proposed MultiHAVE message scheme can significantly reduce the flow of HAVE messages, thus reducing the management costs in the BT system and effectively preventing the HAVE message storm. When the actions of network peers are diverse, or for low-bandwidth peers, the MultiHAVE message degenerates to the original HAVE message, thus preserving the performance of the BT system.<br />
Further work remains to be carried out. For instance, when a BT client that is compatible with the MultiHAVE message communicates with a BT client that is not, how to match them intelligently is an unsolved problem.<br />
ACKNOWLEDGMENT<br />
This work was supported by the National Natural Science Foundation of China under Grant 60974005, the Specialized Research Fund for the Doctoral Program of Higher Education under Grant 20094101120008, the Natural Science Foundation of Henan Province under Grant 092300410201, the Zhengzhou Science and Technology Research Program under Grant 0910SGYN12301-6 and the Science Fund for Distinguished Young Scholars of Henan Province under Grant 0612000600. The authors would like to thank Dr. Yanhong Liu for her invaluable suggestions.<br />
REFERENCES<br />
[1] “BitTorrent,” http://www.bittorrent.com/.<br />
[2] B. Cohen, “Incentives build robustness in BitTorrent,” in First Workshop on Economics of Peer-to-Peer Systems, Berkeley, USA, June 2003.<br />
[3] R. L. Xia and J. K. Muppala, “A survey of BitTorrent performance,” IEEE Communications Surveys &amp; Tutorials, vol. 12, no. 2, pp. 140-158, 2010.<br />
[4] A. Parker, “The true picture of peer-to-peer filesharing,” http://www.cachelogic.com/research/slide9.php, May 2005.<br />
[5] T. Karagiannis, A. Broido, M. Faloutsos, and K. C. Claffy, “Transport layer identification of P2P traffic,” in Proceedings of ACM IMC, Taormina, Sicily, Italy, October 2004.<br />
[6] T. Karagiannis, A. Broido, N. Brownlee, and K. C. Claffy, “Is P2P dying or just hiding?,” in Proceedings of IEEE GLOBECOM, Dallas, Texas, USA, Nov. 29 - Dec. 3, 2004.<br />
[7] A. Qureshi, “Exploring proximity based peer selection in a BitTorrent-like protocol,” MIT 6.824 student project, 2004.<br />
[8] R. Bindal, P. Cao, W. Chan, J. Medved, G. Suwala, T. Bates, and A. Zhang, “Improving traffic locality in BitTorrent via biased neighbor selection,” in ICDCS ’06: Proc. 26th IEEE International Conference on Distributed Computing Systems. Washington, DC, USA: IEEE Computer Society, 2006, p. 66.<br />
[9] S. Yamazaki, H. Tode, and K. Murakami, “CAT: A cost-aware BitTorrent,” in 32nd IEEE Conference on Local Computer Networks (LCN 2007), Oct. 2007, pp. 226–227.<br />
[10] P. Garbacki, A. Iosup, D. Epema, and M. van Steen, “2fast: Collaborative downloads in p2p networks,” in P2P ’06: Proc. Sixth IEEE International Conference on Peer-to-Peer Computing. Washington, DC, USA: IEEE Computer Society, 2006, pp. 23–30.<br />
[11] P. Garbacki, D. Epema, and M. van Steen, “An amortized tit-for-tat protocol for exchanging bandwidth instead of content in p2p networks,” in SASO ’07: First International Conference on Self-Adaptive and Self-Organizing Systems, July 2007, pp. 119–128.<br />
[12] M. Sirivianos, J. Park, R. Chen, and X. Yang, “Free-riding in BitTorrent networks with the large view exploit,” in IPTPS ’07, 2007.<br />
[13] A. Legout, G. Urvoy-Keller, and P. Michiardi, “Understanding BitTorrent: An experimental perspective,” Technical Report, INRIA, Sophia Antipolis, July 2005.<br />
Jianyong Li received his master’s degree from the Department of Computer, Huazhong University of Science and Technology in 2001. He is currently an associate professor with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interest covers peer-to-peer networks and network control.<br />
Jianchun Li received his master’s degree from the Department of Computer, Zhengzhou University in 2005. He is currently a lecturer with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interest covers computer networks and distributed computing systems.<br />
Daoying Huang received his Ph.D. degree from the PLA Information Engineering University in 2001. Since 2006, he has been a professor with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interest covers computer networks and distributed computational systems. Corresponding author of this paper.<br />
Qiang Wei is currently a master’s candidate with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interest covers computer networks.<br />
Image-based Position Estimation and Adaptive<br />
Modulation Coding in Vehicular Communication<br />
Hao Yang 1 , Qingmin Meng 1,2 , Xiong Gu 1 , and Baoyu Zheng 1<br />
1 School of Geography and Biological Information,<br />
Key Lab of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education)<br />
Nanjing University of Posts and Telecommunications, Nanjing, 210003, China<br />
2 National Mobile Communications Research Lab, Southeast University, Nanjing, 210096, China<br />
Email: {yanghao, mengqm, zby}@njupt.edu.cn, guxiong108@gmail.com<br />
Abstract—Vehicle position estimation is a key technology for<br />
Inter-Vehicle Communications, while template matching<br />
can be used to obtain information about vehicular position. In this<br />
paper, a simplified template matching method, namely area-based<br />
template matching, is considered. A vehicular communication<br />
system designed for wireless data applications is proposed,<br />
in which a camera is fixed in a vehicle that serves as a<br />
base station. By comparing the outline<br />
area of the vehicular image with reference templates, the base<br />
station can obtain a position estimate of the vehicle. The<br />
reference templates can be pre-calculated from a group of<br />
field experiment data. Based on supervised learning, we<br />
develop an image-based vehicle position estimation method<br />
and evaluate its effect on an adaptive modulation and coding<br />
scheme. Computer simulation results show that, in a<br />
wireless fading channel with the OFDM physical model,<br />
the studied adaptive modulation and coding (AMC) scheme,<br />
which takes the position estimate into account, achieves greater<br />
throughput than a fixed modulation and coding scheme.<br />
Index Terms—Inter-Vehicle Communications, supervised<br />
learning, template matching, OFDM, adaptive modulation<br />
and coding<br />
I. INTRODUCTION<br />
In recent years, Inter-Vehicle Communications (IVC) has become a<br />
focus of research and application. It is emerging as a<br />
key part of Intelligent Transportation Systems (ITS),<br />
enabling ITS to realize short-distance<br />
wideband wireless communication without expensive<br />
infrastructure. IVC has attracted research attention from<br />
both academia and industry, notably in the US, the EU, and<br />
Japan [1]. According to [2], IVC applications can be broadly<br />
divided into two categories: Safety Applications, which mainly<br />
address traffic safety, and User Applications, which<br />
provide value-added services such as<br />
meeting passengers’ needs for business, entertainment and<br />
information functions in the car.<br />
In other words, IVC can provide various road traffic<br />
applications ranging from traffic safety to pleasant<br />
driving. In [3], IVC is simplified into a three-layer model<br />
consisting of a physical layer, a data link layer and an<br />
application layer. Reference [4] gives the specification of<br />
Dedicated Short Range Communications (DSRC), a type<br />
of high-speed mobile broadband. Recently, many<br />
automobile manufacturers have come to regard DSRC as a vehicle<br />
communication platform, called the DSRC Vehicle Ad Hoc<br />
Network (VANET). In particular, IEEE 802.11 adds<br />
Wireless Access in Vehicular Environments (WAVE) [5] to<br />
form IEEE 802.11p, which is very closely<br />
related to the IEEE 802.11a standard [6].<br />
Manuscript received March 1, 2011; revised April 10, 2011;<br />
accepted April 20, 2011.<br />
Project number: 2010ZX03003-003-02, 60972039, 61001077,<br />
20090451239.<br />
© 2011 ACADEMY PUBLISHER<br />
doi:10.4304/jnw.6.12.1754-1759<br />
Orthogonal Frequency Division Multiplexing (OFDM)<br />
is a multiplexing technique that divides a channel with a<br />
higher data rate into multiple orthogonal sub-channels<br />
with a lower data rate. OFDM has been adopted in<br />
several wireless standards such as digital audio<br />
broadcasting (DAB), digital video broadcasting (DVB-T),<br />
the IEEE 802.11a local area network (LAN) standard<br />
and the IEEE 802.16a metropolitan area network (MAN)<br />
standard [7]. OFDM is also being pursued for the abovementioned<br />
DSRC for road side to vehicle<br />
communications.<br />
The contribution of the paper is an image-based<br />
IVC design. To improve the performance<br />
of AMC in OFDM transmission, we use<br />
supervised machine learning to estimate the<br />
position of the vehicle.<br />
The remainder of the paper is organized as follows:<br />
Section II reviews related work on<br />
image processing; Section III describes the system model<br />
and vehicle position estimation; Section IV gives the signal<br />
model and the AMC selection; Section V presents<br />
the simulation and results; and Section VI<br />
concludes the paper.<br />
II. RELATED WORK OF IMAGE PROCESSING<br />
Digital image processing refers to handling digital<br />
images or video frames by means <strong>of</strong> a digital computer.<br />
The results <strong>of</strong> digital image processing are generally<br />
images or a set <strong>of</strong> characteristics and parameters related<br />
to the images [8].<br />
Image processing techniques can be used to measure<br />
distance. In [9], Lu et al. proposed a novel measuring<br />
system using a scan-counter method via a CCD camera.<br />
The system can be used to measure the distance between
a CCD camera and an object. Set on either side <strong>of</strong> a CCD<br />
camera, two laser projectors in the system produced two<br />
parallel rays that projected two bright spots on the object<br />
and the CCD. The interval between the two bright spots<br />
in the video image was calculated. As there is a linear<br />
relationship between the actual distance and the interval<br />
<strong>of</strong> the two bright spots, the actual distance from the CCD<br />
camera to the object can be obtained from a simple<br />
formula. Later, Hsu et al. [10] brought forward a new<br />
method for calculating the distance. The proposed<br />
scheme counted pixel number variation <strong>of</strong> reference<br />
points in the images to acquire the displacement <strong>of</strong> the<br />
camera movement along the photographing direction.<br />
In [11], Chang et al. proposed a method to use images<br />
to measure the relative distance between vehicles. The<br />
procedures <strong>of</strong> the method were divided into two parts.<br />
First, the location <strong>of</strong> the license plate in the image was<br />
found by several image processing techniques. Second,<br />
the image size <strong>of</strong> the plate was obtained by the region<br />
growing technique, then the relative distance was<br />
computed by using the geometric relation.<br />
In [12], Lü et al. put forward an efficient method for measuring<br />
the area of a live plant leaf. The proposed method<br />
consists of four steps. First, geometric<br />
distortions in the image are corrected using a mapping function.<br />
Then, the image is segmented with a threshold<br />
method to obtain the leaf region. Next, the leaf contour<br />
is extracted and the contour region is filled. Finally, the leaf<br />
area is calculated by counting pixels.<br />
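The threshold-and-count idea behind such area measurements can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the procedure of [12]: distortion correction and contour filling are skipped, and the image values, the threshold and the function name `region_area` are hypothetical.<br />

```python
def region_area(gray, threshold):
    """Count the pixels brighter than `threshold` (the segmented region)."""
    return sum(1 for row in gray for value in row if value > threshold)


# Hypothetical 4x5 grayscale image: a bright object on a dark background.
image = [
    [10, 12, 11, 9, 10],
    [10, 200, 210, 12, 10],
    [11, 205, 220, 215, 9],
    [10, 12, 11, 10, 10],
]

# Area of the object in pixels, using an assumed threshold of 128.
area = region_area(image, threshold=128)
```

The resulting pixel count plays the same role as the outline area that is later used as the single input feature for distance estimation.<br />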
The size of an object in an image can be obtained from the<br />
result of contour extraction, a topic addressed by many papers.<br />
The active contour model, known as<br />
“snakes”, is a framework for delineating an object outline<br />
from a noisy image [13]. Snakes have been successfully<br />
used in segmentation, matching and tracking of<br />
targets of interest. In [14], Dubuisson and Jain proposed a<br />
method for extracting the contour of a moving object.<br />
The method is based on the fusion of a motion<br />
segmentation technique, which uses image subtraction,<br />
a color segmentation based on the split-and-merge<br />
paradigm, and edge information. The edge information<br />
can be obtained by using the Canny edge detector. The<br />
authors also applied object matching in an intelligent<br />
vehicle/highway system.<br />
III. SYSTEM MODEL AND VEHICLE POSITION<br />
ESTIMATION<br />
The scene <strong>of</strong> IVC in the paper is shown in Fig.1. The<br />
three vehicles form a linear topology <strong>of</strong> the Ad Hoc<br />
Network and each vehicle is regarded as a<br />
communication node. Each vehicle is considered to be<br />
equipped with a whole communication system, which<br />
consists <strong>of</strong> three main components: the wireless<br />
transceiver, the microcomputer and the camera. The main<br />
function <strong>of</strong> each part is as follows:<br />
1) The Wireless Transceiver: It is used for sending<br />
and receiving information between vehicles over<br />
short distances.<br />
2) The Microcomputer: On the one hand, it receives the<br />
image information from the camera through a specific<br />
interface, processes the information and displays<br />
the results on the screen; on the other hand, it<br />
communicates with the wireless<br />
transceiver through a specific interface.<br />
3) The Camera: It is the main sensing component of<br />
the system and is used for capturing the surrounding<br />
environment. In the paper, it is used for capturing<br />
snapshots of the vehicle so as to track its position.<br />
Figure 1. The scene <strong>of</strong> the IVC<br />
A. Assumptions <strong>of</strong> the model<br />
Before introducing the specific design, for the sake <strong>of</strong><br />
simplicity, we make the following assumptions <strong>of</strong> the<br />
system.<br />
1) In order to facilitate the camera to capture the<br />
snapshots <strong>of</strong> the vehicle, we assume vehicles are traveling<br />
in the queue, that is to say, they are traveling in a straight<br />
line.<br />
2) Taking into account the driver has good vision in<br />
front <strong>of</strong> the vehicle, we assume that the camera is<br />
installed at the tail <strong>of</strong> the vehicle and only captures the<br />
snapshots <strong>of</strong> the following vehicle.<br />
3) As the communication and control<br />
processes of a vehicle with its preceding vehicle and with its<br />
following vehicle are similar, we consider only the<br />
communication of a vehicle with its following vehicle. Hereinafter,<br />
a vehicle that transmits data is called an<br />
active vehicle; otherwise it is called an inactive vehicle.<br />
4) We assume that the camera has an ordinary optical lens with a<br />
fixed focal length.<br />
5) We assume that all vehicles are of<br />
the same type and size.<br />
B. Vehicle Position Estimation<br />
Because the position of a vehicle changes rapidly<br />
in IVC, exact matching of the<br />
vehicle is difficult to achieve. The paper performs a fast matching based on the<br />
contour area of the vehicle. The active vehicle selects the<br />
appropriate modulation and coding scheme for OFDM<br />
transmission under the assumption of ideal OFDM<br />
channel estimation. The key part of the process is to<br />
determine the distance between vehicles from<br />
snapshots of the vehicle, for which we use machine<br />
learning algorithms. Machine learning is generally<br />
divided into supervised learning, unsupervised learning<br />
and reinforcement learning [15]. For our scenario,<br />
supervised learning is adopted. In this method of<br />
learning, a training set is given, and a learning algorithm<br />
identifies the relationship between input and output,<br />
yielding a function<br />
h, called a hypothesis. When a new input x is given, we<br />
can get the predicted output y through the function h.<br />
The process is shown in Fig.2. Supervised learning<br />
comprises two main problem types, namely regression and<br />
classification. The difference between them is whether<br />
the predicted output is continuous or discrete. If the<br />
predicted output is continuous, it is a regression<br />
problem; otherwise it is a classification problem. As the<br />
output in the paper is continuous, we consider<br />
regression.<br />
Figure 2. The supervised learning process<br />
In the paper we consider an n-dimensional linear<br />
regression, in which the relationship between the input<br />
features $x$ and the predicted output $y$ is linear, as<br />
illustrated in equation (1),<br />
$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^{T} x \tag{1}$$
where $\theta$ and $x$ are both vectors, and we set $x_0 = 1$.<br />
In order to work out the value of $\theta$, we first introduce the<br />
cost function,<br />
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right)^{2} = \frac{1}{2} (X\theta - Y)^{T} (X\theta - Y) \tag{2}$$
which indicates the difference between the predicted output<br />
and the practical output. Here $x^{i} = [x_0^{i}, x_1^{i}, \ldots, x_n^{i}]^{T}$ is the<br />
$i$-th input feature vector, $y^{i}$ refers to the<br />
corresponding output, and $m$ is the number of<br />
training data. $X = [x^{1}, x^{2}, \ldots, x^{m}]^{T}$ is the matrix of the<br />
whole input features, and $Y = [y^{1}, y^{2}, \ldots, y^{m}]^{T}$ is the<br />
whole corresponding output.<br />
After defining the cost function, all we need to do is to<br />
choose an appropriate $\theta$ so as to minimize $J(\theta)$. The<br />
intuitive approach is to differentiate with respect to each $\theta_i$, as<br />
illustrated in formula (3),<br />
$$\nabla_\theta J(\theta) = \nabla_\theta \left\{ \frac{1}{2} (X\theta - Y)^{T} (X\theta - Y) \right\} = X^{T} X \theta - X^{T} Y \tag{3}$$
in which $\nabla_\theta J(\theta)$ denotes the derivative of $J(\theta)$ with respect to<br />
$\theta$ in matrix form. Setting the derivatives to zero gives<br />
the following standard expression.<br />
$$X^{T} X \theta = X^{T} Y \tag{4}$$
Solving the above equation, we get the $\theta$<br />
that minimizes $J(\theta)$. The final expression is<br />
$$\theta = (X^{T} X)^{-1} X^{T} Y \tag{5}$$
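As a rough illustration of equations (4) and (5), the sketch below builds the normal equations and solves them by Gaussian elimination instead of forming the inverse explicitly. It is a minimal example under stated assumptions, not the authors' implementation: the function names and the toy training set (generated from y = 1 + 2x) are hypothetical.<br />

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    aug = [list(map(float, a[r])) + [float(b[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_normal_equation(features, targets):
    """Return theta solving X^T X theta = X^T Y, with x_0 = 1 prepended."""
    X = [[1.0] + list(row) for row in features]
    cols = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(cols)]
           for i in range(cols)]
    xty = [sum(row[i] * y for row, y in zip(X, targets)) for i in range(cols)]
    return solve_linear_system(xtx, xty)


def predict(theta, x):
    """Evaluate h(x) = theta^T x with the same x_0 = 1 convention."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x))


# Toy training set generated from y = 1 + 2x (illustrative values only).
theta = fit_normal_equation([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
```

Solving the linear system directly is usually preferred to computing the matrix inverse, since it gives the same $\theta$ with less numerical error.<br />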
IV. SIGNAL MODEL AND ADAPTIVE MODULATION AND<br />
CODING SELECTION<br />
A. Signal model<br />
Following [16-19], we consider a wireless channel with path loss,<br />
shadow fading, small-scale fading and additive<br />
background noise. The channel impulse<br />
response for the small-scale fading can be modelled as<br />
$$h_{ij}(t) = \sum_{l=0}^{L_p - 1} \alpha_{l,ij}\, \delta(t - \tau_l) \tag{6}$$
where $\alpha_{l,ij}$ represents the discrete time-domain channel<br />
coefficient, which is an independent and identically<br />
distributed (i.i.d.) complex Gaussian variable, $L_p$ denotes<br />
the number of paths in a frequency-selective fading<br />
channel and $\tau_l$ denotes the path delay.<br />
The transmission parameters of OFDM are the total<br />
subcarrier number $N$, the subchannel number $K$, the<br />
subcarrier spacing $B$ (kHz) and the channel spacing $W$<br />
(MHz). The signal between transmit node $i$ and receive<br />
node $j$ can be represented as<br />
$$r_{ij}(t) = P_t\, d_{ij}^{-\alpha}\, \beta_i \sum_{l=0}^{L_p - 1} \alpha_{l,ij}\, s_i(t - \tau_l) + n_{ij}(t) \tag{7}$$
In equation (7), $P_t$ is the transmission power and $d_{ij}$<br />
indicates the distance between nodes $i$ and $j$. $\alpha$ denotes the<br />
path loss exponent and $\beta_i$ refers to the log-normal shadowing<br />
term, i.e., $10\log_{10}\beta_i \sim N(0, \sigma_{dB}^{2})$. $s_i(t)$ is the<br />
transmission signal from node $i$ and $n_{ij}(t)$ denotes<br />
additive white Gaussian noise with zero mean and power<br />
spectral density $N_0$.<br />
In the studied OFDM transmission scheme, different<br />
Quadrature Amplitude Modulation (QAM) schemes are<br />
used according to the spacing between the<br />
two communicating vehicles. The cyclic<br />
prefix is used in OFDM signals as a guard interval whose<br />
length needs to be larger than the maximum excess delay<br />
to mitigate the effect of Intersymbol Interference (ISI)<br />
due to multipath propagation. The cyclic prefix is<br />
added after the IFFT at the transmitter and is removed at the<br />
receiver to recover the original signal.<br />
The average frequency response of subchannel $k$ in<br />
an OFDM receiver is $H_{ij}(k)$. For simplicity, we<br />
ignore the subchannel index and define the gain of the<br />
subchannel as $G_i = d_{ij}^{-\alpha} \beta_i |H_{ij}|^{2}$; therefore the signal<br />
to noise ratio (SNR) is<br />
$$\gamma_i = \frac{P_i \cdot G_i}{N_0 \cdot B} \tag{8}$$
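To make the dependence of the SNR on vehicle spacing concrete, the following sketch evaluates the subchannel gain and the SNR expression above. All numeric values (transmit power, path loss exponent, shadowing term, channel response and noise density) are illustrative assumptions and are not taken from the paper; the function names are hypothetical.<br />

```python
import math


def subchannel_snr(p_tx, distance, alpha, beta, h_mag_sq, n0, bandwidth):
    """SNR = P * G / (N0 * B), with gain G = d^(-alpha) * beta * |H|^2."""
    gain = distance ** (-alpha) * beta * h_mag_sq
    return p_tx * gain / (n0 * bandwidth)


def to_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


# Assumed values for illustration only; bandwidth is the 312.5 kHz spacing.
params = dict(p_tx=0.1, alpha=3.0, beta=1.0, h_mag_sq=1.0,
              n0=4e-21, bandwidth=312.5e3)
snr_15m = subchannel_snr(distance=15.0, **params)
snr_30m = subchannel_snr(distance=30.0, **params)

# Doubling the distance costs 10 * alpha * log10(2) dB of SNR.
drop_db = to_db(snr_15m) - to_db(snr_30m)
```

With the assumed exponent of 3, doubling the spacing lowers the SNR by about 9 dB, which is why a distance estimate is useful for choosing the modulation and coding scheme.<br />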
B. Selection <strong>of</strong> adaptive modulation and coding scheme<br />
The selection principle of AMC is to choose the<br />
scheme that maximizes the throughput of the<br />
vehicle transmission in OFDM transmission.<br />
Considering M-ary quadrature amplitude modulation<br />
(M-QAM), the modulation level and coding rate of node<br />
$i$ are $M_i$ and $C_i$, respectively. As seen in [20], practical<br />
modulation and coding schemes (MCS)<br />
incur an SNR loss when the bit error rate $p_b$ is<br />
considered. We then consider the rate formula<br />
$$b_i = \log_2(1 + \phi \gamma_i), \quad \phi = -1.5 / \ln(5 p_b) \tag{9}$$
Let the length of the packet be $L$, the<br />
throughput of node $i$ be $R_f$, the data rate be $R_m$ and the<br />
packet error rate (PER) be $P_e$; then we have:<br />
$$R_f = R_m (1 - P_e) \tag{10}$$
$$R_m = N \cdot B \cdot C_i \cdot \log_2 M_i \tag{11}$$
$$P_e = 1 - (1 - p_b)^{L} \tag{12}$$
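The rate, packet error and throughput formulas above translate directly into small helper functions. The sketch below is illustrative only: the function names are hypothetical, and the example values (16-QAM, coding rate 3/4, a bit error rate of 1e-5) are assumptions chosen for the demonstration, combined with N = 52, B = 312.5 kHz and L = 8000 bits from the simulation section.<br />

```python
import math


def snr_gap(p_b):
    """phi = -1.5 / ln(5 * p_b); meaningful for 0 < p_b < 0.2."""
    return -1.5 / math.log(5.0 * p_b)


def bits_per_symbol(snr, p_b):
    """b = log2(1 + phi * snr); snr is a linear ratio, not dB."""
    return math.log2(1.0 + snr_gap(p_b) * snr)


def packet_error_rate(p_b, length_bits):
    """P_e = 1 - (1 - p_b)^L, computed stably for very small p_b."""
    return -math.expm1(length_bits * math.log1p(-p_b))


def data_rate(n_subcarriers, spacing_hz, coding_rate, mod_order):
    """R_m = N * B * C * log2(M), in bit/s."""
    return n_subcarriers * spacing_hz * coding_rate * math.log2(mod_order)


def throughput(rate, per):
    """R_f = R_m * (1 - P_e)."""
    return rate * (1.0 - per)


# Assumed example: 16-QAM, rate 3/4, p_b = 1e-5, 8000-bit packets.
rate = data_rate(52, 312.5e3, 0.75, 16)
per = packet_error_rate(1e-5, 8000)
r_f = throughput(rate, per)
```

Using `log1p` and `expm1` avoids the loss of precision that a literal `1 - (1 - p_b) ** L` suffers when the bit error rate is very small.<br />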
V. SIMULATION AND RESULTS<br />
The training set for the simulation is obtained<br />
from practical measurement. First we fix a vehicle V1<br />
(base station) and make another vehicle V2 drive in a<br />
straight line towards V1. Then we use the camera<br />
installed in V1 to capture snapshots of V2 and<br />
estimate the distance $d$ between them. Finally, in order<br />
to reduce the input to a single feature, we<br />
take the area of the vehicle in the image as the input feature and the<br />
corresponding distance between vehicles as the training<br />
set output. The vehicle is a Peugeot 307, the<br />
camera uses the PAL format, the lens focal length is 12<br />
millimeters and the image resolution is 720*576 pixels.<br />
When the distance between vehicles is too large, the vehicle<br />
in the image is too small for the camera to capture.<br />
On the other hand, when the distance is too small, the<br />
vehicle in the image is too large and occupies the<br />
whole image. Taking both into consideration, we choose<br />
distances ranging from 15 meters to 70 meters. The<br />
measured data are shown in Table I. As observed<br />
from Table I, when the vehicle spacing is small, the<br />
outline of V2 is large; as the vehicle<br />
spacing gradually increases, the contour<br />
gradually becomes smaller. After getting the training set,<br />
curve fitting can be performed by using the<br />
linear regression method of Section III.<br />
For plane curve fitting, n points on the plane<br />
can in general be fitted exactly by a polynomial of<br />
order n-1. However, even though the fitted<br />
curve passes through the points perfectly, we cannot<br />
conclude that the curve is the best predictor. In the<br />
studied process, the prediction is the vehicle spacing for<br />
different outline areas of the vehicle. Two major issues<br />
in curve fitting are over-fitting and under-fitting. In<br />
general, under-fitting occurs when the order is lower<br />
than that of the actual model, so that<br />
most of the data are poorly fitted, as shown in Fig.3,<br />
while over-fitting occurs when the order is higher than that of the<br />
actual model, so that the training data are<br />
fitted closely but the curve generalizes poorly, as shown in Fig.4. The selection of the order<br />
therefore plays a decisive role in curve fitting. We employed a<br />
3rd-order fitting, and the fitting result for the area is<br />
shown in Fig.5.<br />
TABLE I. THE TRAINING SET OF LINEAR REGRESSION<br />
| Distance (m) | Area (pixels) |
|---|---|
| 15 | 37395 |
| 16 | 30820 |
| 18 | 25314 |
| 20 | 19725 |
| 25 | 13150 |
| 30 | 9780 |
| 35 | 6903 |
| 40 | 4931 |
| 45 | 4068 |
| 50 | 2958 |
| 55 | 2588 |
| 60 | 2301 |
| 70 | 1725 |
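One way to compare candidate polynomial orders on the Table I measurements is sketched below. It is a hedged illustration rather than the authors' procedure: the fit uses the normal equations in exact rational arithmetic (to sidestep the poor conditioning of high powers of pixel counts), and leave-one-out error is an assumed, commonly used check for over-fitting that the paper does not mention. All function names are hypothetical, and no claim is made here about which order scores best on these data.<br />

```python
from fractions import Fraction

# Training set copied from Table I: (area in pixels, distance in metres).
TABLE_I = [
    (37395, 15), (30820, 16), (25314, 18), (19725, 20), (13150, 25),
    (9780, 30), (6903, 35), (4931, 40), (4068, 45), (2958, 50),
    (2588, 55), (2301, 60), (1725, 70),
]


def solve_exact(a, b):
    """Solve a x = b exactly over the rationals by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def fit_polynomial(samples, order):
    """Least-squares polynomial via the normal equations X^T X theta = X^T Y."""
    rows = [[Fraction(x) ** k for k in range(order + 1)] for x, _ in samples]
    ys = [Fraction(y) for _, y in samples]
    size = order + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve_exact(xtx, xty)


def evaluate(theta, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * Fraction(x) ** k for k, c in enumerate(theta))


def training_sse(samples, order):
    """Sum of squared residuals on the training set itself."""
    theta = fit_polynomial(samples, order)
    return sum((evaluate(theta, x) - y) ** 2 for x, y in samples)


def leave_one_out_sse(samples, order):
    """Squared error when each point is predicted without being trained on."""
    total = Fraction(0)
    for i, (x, y) in enumerate(samples):
        theta = fit_polynomial(samples[:i] + samples[i + 1:], order)
        total += (evaluate(theta, x) - y) ** 2
    return total


for order in (2, 3, 7):
    print(order, float(training_sse(TABLE_I, order)),
          float(leave_one_out_sse(TABLE_I, order)))
```

The training error can only decrease as the order grows, so it cannot reveal over-fitting by itself; the held-out error is the quantity to watch when choosing the order.<br />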
Figure 3. The under-fitting for the training set with 2nd order (plot of distance between two vehicles (m) against area of the vehicle in the image (pixels))<br />
Figure 4. The over-fitting for the training set with 7th order (plot of distance between two vehicles (m) against area of the vehicle in the image (pixels))<br />
System simulation parameters are partially based on<br />
IEEE 802.11a. The number of subcarriers, the number of<br />
subchannels, the subcarrier spacing and the channel spacing are<br />
$N = 52$, $K = 4$, $B = 312.5$ [kHz] and $W = 20$<br />
[MHz], respectively. The carrier frequency ranges from<br />
5.850 GHz to 5.925 GHz. A quasi-static six-path fading<br />
channel model is considered, whose Rician coefficient is<br />
4 and whose log-normal shadowing standard deviation is 8 dB.<br />
The power gain of each tap is defined as [0.8084, 0.462,<br />
0.253, 0.259, 0.0447, 0.01] and the delays of the T=1/W<br />
spaced taps are given as [0, 2, 4, 6, 9, 13]. In the simplified<br />
path-loss model [19], a reference distance, $d_0 = 15$ [m],<br />
is defined and the corresponding normalized distance is<br />
defined as $d / d_0$. To keep the results simple, five<br />
MCS are considered, i.e., QPSK-1/2, QPSK-3/4,<br />
16QAM-1/2, 16QAM-3/4 and 64QAM-3/4. We assume that<br />
all subcarriers are treated equally and that all<br />
subcarriers use a single QAM modulation scheme within<br />
one fading block. When the packet length is 1000<br />
bytes, we have $L = 8000$ [bits]. According to Eq. (10) we<br />
calculate the objective function and select the MCS that<br />
maximizes its value. The<br />
throughput performance comparison under different SNR<br />
values is shown in Fig. 6. One curve represents the nearly<br />
constant performance with a fixed modulation mode of<br />
QPSK-1/2; the other represents the performance with the<br />
AMC mode that takes the position information into account.<br />
With the proposed AMC, the system<br />
throughput is markedly improved.<br />
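The selection rule described above can be sketched as a search over the five MCS for the one with the largest throughput. This is a simplified illustration, not the simulation used in the paper. In particular, the bit error rate model is an assumption: it is the rate formula of Section IV solved for $p_b$, it ignores coding gain, and therefore the coding rate affects only the data rate here. Fading and shadowing are also left out, and the function names are hypothetical.<br />

```python
import math

N_SUBCARRIERS = 52      # N
SPACING_HZ = 312.5e3    # B
PACKET_BITS = 8000      # L, a 1000-byte packet

# (name, modulation order M, coding rate C) for the five schemes in the text.
MCS_TABLE = [
    ("QPSK-1/2", 4, 0.5),
    ("QPSK-3/4", 4, 0.75),
    ("16QAM-1/2", 16, 0.5),
    ("16QAM-3/4", 16, 0.75),
    ("64QAM-3/4", 64, 0.75),
]


def bit_error_rate(snr, mod_order):
    """ASSUMPTION: p_b = 0.2 * exp(-1.5 * snr / (M - 1)), no coding gain."""
    return 0.2 * math.exp(-1.5 * snr / (mod_order - 1))


def throughput(snr, mod_order, coding_rate):
    """R_f = R_m * (1 - P_e) with P_e = 1 - (1 - p_b)^L."""
    rate = N_SUBCARRIERS * SPACING_HZ * coding_rate * math.log2(mod_order)
    p_b = bit_error_rate(snr, mod_order)
    success = math.exp(PACKET_BITS * math.log1p(-p_b))  # (1 - p_b)^L
    return rate * success


def select_mcs(snr_db):
    """Return (name, throughput in bit/s) of the scheme that maximises R_f."""
    snr = 10.0 ** (snr_db / 10.0)
    candidates = ((name, throughput(snr, m, c)) for name, m, c in MCS_TABLE)
    return max(candidates, key=lambda item: item[1])


for snr_db in (10, 20, 30, 45):
    print(snr_db, select_mcs(snr_db))
```

With these parameters the peak data rate of 64QAM-3/4 is 52 x 312.5 kHz x 0.75 x 6 = 73.125 Mbit/s, while QPSK-1/2 is limited to 16.25 Mbit/s, so an adaptive scheme has room to gain whenever the SNR supports a higher-order modulation.<br />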
VI. CONCLUSIONS<br />
Position estimation has a significant effect on the<br />
choice of transmission parameters in wireless vehicular<br />
communications. However, little research has been done in this<br />
field. This paper presents a preliminary<br />
design, namely vehicle-location-aware OFDM<br />
transmission. By using supervised<br />
machine learning, the base station can first perform<br />
identification and area matching and then predict the<br />
spacing between the two<br />
communicating vehicles. The spacing information<br />
can then be used in the subsequent selection of the modulation and<br />
coding scheme. As a result, the simulated throughput of<br />
the vehicle communication system is significantly<br />
improved.<br />
ACKNOWLEDGMENT<br />
The authors wish to thank the National Mobile<br />
Communications Research Laboratory, Southeast<br />
University. This work was supported by the National Science<br />
and Technology Major Project (2010ZX03003-003-02),<br />
the National Natural Science Foundation<br />
of China (60972039 and 61001077) and the national postdoctoral<br />
research funding (20090451239).<br />
Figure 5. Third-order fitting for the training set (plot of distance between two vehicles (m) against area of the vehicle in the image (pixels))<br />
Figure 6. The throughput comparison between fixed MCS<br />
and adaptive MCS (plot of throughput (bit/s) against SNR (dB); curves: Adaptive MCS, Fixed MCS)<br />
REFERENCES<br />
[1] J. Luo, and J. P. Hubaux, “A Survey <strong>of</strong> Inter-Vehicle<br />
Communication,” Tech. Rep, 2004.<br />
[2] M. Rudack, M. Meincke, K. Jobmann, and M. Lott, “On<br />
traffic dynamical aspects inter vehicle communication<br />
(IVC),” In Proc. <strong>of</strong> the 57th IEEE Semiannual Vehicular<br />
Technology Conference (VTC’03 Spring), 2003.<br />
[3] Ugur Keskin, “In-Vehicle Communication <strong>Networks</strong>: A<br />
Literature Survey,” July 28, 2009.<br />
[4] ASTM International. ASTM E2213-03 Standard<br />
Specification for Telecommunications and Exchange<br />
Between Roadside and Vehicle Systems - 5GHz Band<br />
Dedicated Short Range Communications (DSRC) Medium<br />
Access Control (MAC) and Physical Layer (PHY)<br />
Specifications, 2003.<br />
[5] IEEE 1609 - Family <strong>of</strong> Standards for Wireless Access in<br />
Vehicular Environments (WAVE), U.S. Department <strong>of</strong><br />
Transportation, January 9, 2006.<br />
[6] IEEE Standard 802.11a-1999, Part 11: Wireless LAN<br />
Medium Access Control (MAC) and Physical Layer (PHY)<br />
specifications: High-speed Physical Layer in the 5 GHz<br />
Band.<br />
[7] IEEE Standard IEEE 802.16a, for Local and Metropolitan<br />
Area <strong>Networks</strong> Part 16, Air Interface for Fixed Broadband<br />
Wireless Access Systems:<br />
http://grouper.ieee.org/groups/802/16/.
[8] Rafael C. Gonzalez and Richard E. Woods, “Digital image<br />
processing,” Second Edition, Beijing, Publishing House <strong>of</strong><br />
Electronics Industry, September, 2007.<br />
[9] Ming-Chih Lu, Wei-Yen Wang and Chun-Yen Chu,<br />
“Image-Based Distance and Area Measuring Systems,”<br />
IEEE Sensors <strong>Journal</strong>, Vol. 6, No.2, April 2006, pp495-<br />
503.<br />
[10] Chen-Chien Hsu, Ming-Chih Lu, Wei-Yen Wang and Yin-<br />
Yu Lu, “Distance measurement based on pixel variation <strong>of</strong><br />
CCD images,” ISA Transactions, Vol. 48, No. 4, October<br />
2009, pp389-395.<br />
[11] Tang-Hsien Chang, Chun-hung Lin, Chih-sheng Hsu, and<br />
Yao-jan Wu, “A Vision-Based Vehicle Behavior<br />
Monitoring and Warning System,” In Proc. <strong>of</strong> Intelligent<br />
Transportation Systems, 2003.<br />
[12] Chaohui Lü, Hui Ren, Yibin Zhang, and Yinhua Shen,<br />
“Leaf Area Measurement Based on Image Processing,” In<br />
2010 International Conference on Measuring<br />
Technology and Mechatronics Automation.<br />
[13] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active<br />
contour models,” Internat. J. Comput. Vision 1 (1987)<br />
pp321–331.<br />
[14] Marie-Pierre Dubuisson and Jain. A. K, “Object Contour<br />
Extraction using Color and Motion,” in Computer Vision<br />
and Pattern Recognition, Proceedings CVPR '93, IEEE<br />
Computer Society Conference, 1993.<br />
[15] CS 229: Machine Learning. http://www.stanford.edu/class/<br />
cs229/. Autumn 2010.<br />
[16] R. C. Daniels, C.Caramanis, and R.W.Heath, “A<br />
Supervised Learning Approach to Adaptation in Practical<br />
MIMO-OFDM Wireless Systems,” in Global<br />
Telecommunications Conference, New Orleans, LA, Nov.<br />
2008, pp1-5.<br />
[17] Qingmin Meng, Xiong Gu, Feng Tian, Baoyu Zheng, “ k-<br />
NN Based MCS Selection in Distributed OFDM Wireless<br />
<strong>Networks</strong>,” In 2011 international conference on<br />
Automation, Communication, Architectonics, and<br />
Materials (ACAM2011), to be published, June 18-19,<br />
Wuhan, China.<br />
[18] T. S. Rappaport, Wireless Communications: Principles and<br />
Practice, 2nd ed. NJ: Prentice-Hall, 2001.<br />
[19] Andrea Goldsmith. Wireless Communications. Cambridge<br />
University Press, 2005.<br />
[20] Koji Yamamoto, “Trade<strong>of</strong>f between Area Spectral<br />
Efficiency and End-to-End Throughput in Rate-Adaptive<br />
Multihop Radio <strong>Networks</strong>,” IEICE Trans. Commu., Vol.<br />
E88-B, No.9, 2005.<br />
Hao Yang Jiangsu Province, China.<br />
Birthdate: November, 1969. He received his Ph.D. in Signal<br />
and Information Processing from<br />
the School of Information<br />
Science and Engineering, Southeast<br />
University. His research interests are in<br />
image processing.<br />
He is a senior lecturer at the School of<br />
Geography and Biological Information,<br />
Nanjing University of Posts and Telecommunications.<br />
Qingmin Meng Jiangsu Province, China.<br />
Birthdate: September, 1965. He received<br />
Ph.D. degree in radio engineering from<br />
Southeast University, Nanjing, China, in<br />
2007. Then he joined the Faculty <strong>of</strong><br />
School <strong>of</strong> Telecommunications and<br />
Information Engineering, Nanjing<br />
University <strong>of</strong> Posts and<br />
Telecommunications. His current research<br />
interests include multihop relaying in next<br />
generation broadband wireless communication, the application<br />
<strong>of</strong> machine learning to resource allocation in cognitive radio<br />
networks and vehicular opportunistic communication.<br />
Xiong Gu Hubei Province, China.<br />
Birthdate: May, 1988. He is working toward a<br />
master's degree at the School of<br />
Telecommunications and Information<br />
Engineering, Nanjing University <strong>of</strong> Posts<br />
and Telecommunications. His current<br />
research interests include machine<br />
learning and its application in resource<br />
allocation in cognitive radio networks<br />
and vehicular opportunistic communication.
1760 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />
A Request Distribution Algorithm for Web<br />
Server Cluster<br />
Wei Zhang<br />
School <strong>of</strong> Computer Science and Engineering, Beihang University, Beijing, 100191, China<br />
State Key Laboratory <strong>of</strong> Rail Traffic Control and Safety,Beijing Jiaotong University,Beijing 100044, China<br />
zhangwqh@cse.buaa.edu.cn<br />
Huan Wang, Binbin Yu, Wei Xu, Mingfa Zhu, Limin Xiao, Li Ruan<br />
School <strong>of</strong> Computer Science and Engineering, Beihang University, Beijing, 100191, China<br />
{zhumf, xiaolm,ruanli}@buaa.edu.cn<br />
Abstract—With the explosive growth of web-based<br />
application workloads, web server clusters face a<br />
challenge in request response time. Request distribution<br />
among the servers in a web server cluster is the key to addressing<br />
this challenge, especially under heavy workloads. In this<br />
paper, we propose a new request distribution algorithm<br />
named llac (least load active cache) for the load balancing<br />
switch in a web server cluster. The goal of llac is to improve<br />
the cache hit rate and reduce response time. Packets are<br />
parsed at the IP level, and back-end servers are notified to<br />
cache hot files using link change technology, without<br />
changing URL information or modifying the service<br />
program. This avoids the switching overhead between user<br />
mode and kernel mode. The load balancing switch directly<br />
creates a connection with the selected server, avoiding<br />
connection migration overhead. The policy estimates the<br />
current composite load of each server and selects the<br />
server with the least load to serve the request. It also<br />
improves the resource utilization of web servers.<br />
Experimental results show that llac achieves better<br />
performance for web applications than wrr (weighted round<br />
robin), a popular request distribution algorithm.<br />
Index Terms—Web Cluster, Request Distribution, LLAC<br />
I. INTRODUCTION<br />
The enormous growth of the Internet industry has made<br />
web-based applications popular and demanding<br />
programs. Users increasingly rely on the<br />
web for daily activities such as electronic commerce,<br />
on-line banking, stock trading, reservations and product<br />
merchandising. The performance of a web server<br />
system therefore plays an important role in the success of many<br />
Internet-related companies. A single server machine<br />
can handle only a limited number of requests and cannot<br />
scale with demand. A better way to cope with<br />
growing processing demands on web servers is to add<br />
more hardware resources rather than to replace<br />
one server with a faster one [1]. More and more web sites<br />
use a web cluster, composed of a front-end request<br />
dispatching server, also called a load balancing switch, and<br />
several back-end servers that handle requests. By distributing<br />
client requests to separate servers for load balancing<br />
or load sharing, web clusters have proved to be a better<br />
solution than an overloaded single server. Given the<br />
various technical issues in managing a<br />
web server cluster, request distribution algorithms (which<br />
are implemented in the load balancing switch) are<br />
particularly important for the performance of cluster<br />
web servers [2]. The ratio of peak load to light load for<br />
Internet applications is usually on the order of 300% [19].<br />
According to J.C.Mongul [3], web sites mostly collapse<br />
under access to popular, hot events. In a famous<br />
example, the normally well-provisioned Amazon.com site<br />
suffered forty minutes of downtime due to an overload<br />
during the popular holiday season in November 2000.<br />
Popular web sites often have to deal with<br />
a huge number of requests in a short time. This paper<br />
addresses the problem of request distribution so that a web<br />
server cluster can serve its peak workload demand. We<br />
use client-side and server-side information together<br />
to select a server, and avoid the switching overhead between<br />
user mode and kernel mode. We also avoid connection<br />
migration overhead. We present a new request<br />
distribution algorithm with the following<br />
contributions. First, we design a combined load model based on a<br />
collection of typical load information. This model is used<br />
with online measurements of load information to estimate<br />
the processing capacity of the web servers. It gives us<br />
reliable load descriptors for the web servers, which are used in<br />
the decision making process of the request distribution<br />
algorithm. Second, to speed up<br />
access to popular or hot files, our approach resorts to<br />
active caching technology. Packets are captured and<br />
analyzed at the IP level using the netfilter mechanism, avoiding<br />
both the switching overhead between user mode and kernel mode<br />
and connection migration overhead. The active caching<br />
technology does not modify URLs or the server program;<br />
it uses link change technology to put hot files in a<br />
memory file system. Finally, we propose and implement a<br />
novel request distribution algorithm that works on the<br />
basis of the composite load and file access frequency.<br />
We call this request distribution algorithm<br />
llac (least load and active caching) for short.<br />
The rest of the paper is organized as follows. Section II<br />
discusses related work. In Section III we propose the<br />
request distribution architecture and algorithm, and then<br />
discuss each module separately. Section IV describes<br />
the experimental results of a web cluster prototype using<br />
llac. Section V concludes the paper.<br />
© 2011 ACADEMY PUBLISHER<br />
doi:10.4304/jnw.6.12.1760-1766<br />
II. RELATED WORKS<br />
Numerous dispatching algorithms have been proposed for web<br />
server clusters. They can be classified as<br />
layer-4 and layer-7 algorithms.<br />
A layer-4 algorithm considers only server-side<br />
information and does not use client-side information. In<br />
this approach, clients directly create a connection with the<br />
selected server. Such an algorithm is easy to implement, but<br />
it cannot match server resources to the content of<br />
the client’s request. Examples include the random policy [4],<br />
round-robin policy [4], weighted round-robin (wrr) policy [5],<br />
least connection policy [6], fast response time policy [6] and so<br />
on. Random and round-robin are easy to implement, but<br />
they do not consider server capacity, which can easily lead<br />
to imbalance. Wrr associates with<br />
each server node in the cluster a weight that is proportional to the<br />
server’s capacity. The initial weight is set by the administrator<br />
and is therefore subject to human error. Least connection does not<br />
consider that requests may differ in response<br />
time and in resource demand. Fast response<br />
time is influenced by the network environment, so it cannot<br />
evaluate the performance of a web server effectively.<br />
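The weighted round-robin idea described above can be sketched as follows. This is a minimal illustration that assumes static integer weights. The function name and the server names are hypothetical, and it is not the scheduler used by IPVS, which interleaves servers more evenly.

```python
from itertools import islice


def weighted_round_robin(weights):
    """Yield server names forever, each in proportion to its integer weight.

    `weights` maps a server name to a non-negative integer weight
    (a stand-in for the administrator-assigned capacity).
    """
    # Expand every server into `weight` consecutive slots.
    slots = [name for name, w in weights.items() for _ in range(w)]
    if not slots:
        raise ValueError("at least one server needs a positive weight")
    while True:
        for name in slots:
            yield name


# A server with weight 3 receives three requests for every one
# sent to a server with weight 1.
picks = list(islice(weighted_round_robin({"s1": 3, "s2": 1}), 8))
```

Because the weights are fixed in advance, the schedule cannot react to the load a server actually experiences, which is the weakness noted above.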
A layer-7 algorithm considers not only server<br />
information but also the client’s user-level<br />
information, such as session identifiers, type of URL,<br />
cookies and so on. However, clients must first create a TCP<br />
connection with the load balancing switch so that it can<br />
analyze the request, which involves two<br />
copies of each packet between user space and kernel space. Because<br />
clients first establish connections with the load<br />
balancing switch, the connections then need to be migrated<br />
to the selected server. Connection migration is very<br />
time-consuming and consumes large amounts of system<br />
resources. Layer-7 algorithms can take more<br />
information into account when selecting the server that responds to<br />
a request and make good use of server resources, in<br />
particular cache resources. However, connection<br />
migration and packet copies between user<br />
space and kernel space cause some loss of<br />
performance. Examples of layer-7 algorithms include LARD<br />
(locality-aware request distribution) [7], WARD<br />
(workload-aware request distribution) [8], CAP (client-aware<br />
policy) [9] and so on. LARD is a well-known<br />
dispatching policy that aims to improve the cache hit rate of<br />
the web servers. Under LARD, the load balancing switch<br />
dispatches requests for the same web object to the same<br />
back-end web server. However, LARD may lead to load<br />
imbalance because web pages differ in popularity.<br />
WARD is a static partitioning that assigns dedicated servers<br />
to specific groups of requests. Although this policy is<br />
useful from the system management point of view and<br />
achieves a higher cache hit rate, it has poor server<br />
utilization, because dedicated<br />
resources that are not utilized cannot be shared among<br />
all of the clients. The main goal of CAP is to improve load<br />
sharing in web clusters that provide multiple types of<br />
services. The load balancing switch classifies requests<br />
from clients into four classes: normal, CPU bound, disk<br />
bound, and CPU and disk bound. However, requests of<br />
the same type might consume different amounts of<br />
resources.<br />
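To make the imbalance concern concrete, the sketch below dispatches purely by locality: the same URL always goes to the same server. It is a deliberately simplified stand-in, not LARD itself, since it uses a hash instead of LARD's load-aware target assignment. The function names, server names and the synthetic request mix are all assumptions for illustration.

```python
import zlib
from collections import Counter


def locality_dispatch(url, servers):
    """Map a URL to a fixed server so repeated requests hit the same cache."""
    return servers[zlib.crc32(url.encode()) % len(servers)]


def load_per_server(requests, servers):
    """Count how many requests each server receives under locality dispatch."""
    counts = Counter({s: 0 for s in servers})
    for url in requests:
        counts[locality_dispatch(url, servers)] += 1
    return counts


servers = ["s1", "s2", "s3", "s4"]
# Synthetic workload: one hot page draws 90% of the traffic.
requests = ["/hot.html"] * 90 + [f"/page{i}.html" for i in range(10)]
counts = load_per_server(requests, servers)
```

Whichever server owns `/hot.html` absorbs at least 90 of the 100 requests, while the remaining servers share the rest. This is the popularity-driven skew that a load-aware policy tries to avoid.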
Considering the shortcomings of the above methods,<br />
we propose llac. Using the netfilter mechanism to intercept<br />
packets and analyze the URL information at the IP level,<br />
the load balancing switch notifies the back-end servers to cache<br />
hot files and selects the least-loaded server to process each request. We<br />
use both client-side and server-side information while avoiding<br />
both user/kernel mode switching and TCP connection<br />
migration overhead. Hot files are served from cache to<br />
improve response time.<br />
III. LLAC FOR WEB CLUSTER<br />
Most web site crashes occur during bursts of access to hot<br />
content. Speeding up access to hot files therefore eases<br />
the pressure on a site to a large extent. The main<br />
objective in designing the algorithm is to minimize<br />
the average response time of popular or hot requests<br />
(by proactively caching hot files and reading them from a memory file<br />
system, which reduces disk accesses), to make good use of<br />
server resources in the cluster, and to increase the utilization<br />
and throughput of the cluster web servers. To this end, we<br />
use a linear model to compute each server’s composite<br />
load, which helps to decide which server should be chosen to<br />
serve a request. The llac module uses information from both<br />
clients and servers to distribute requests. The load<br />
balancing switch parses packets at the IP layer using the netfilter<br />
mechanism, records access frequency and informs the<br />
web servers to cache popular or hot files. Fig. 1 shows the<br />
system components we designed.<br />
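The dispatch decision described above can be sketched in user-space Python as follows. This is only an illustration of the logic, assuming a fixed hot threshold. The class name, the `hot_threshold` parameter and the `cache_notices` list are hypothetical, and the paper's implementation lives in the kernel alongside IPVS.

```python
from collections import defaultdict


class LlacDispatcher:
    """Pick the least-loaded server and flag files that become hot."""

    def __init__(self, servers, hot_threshold=100):
        self.load = {s: 0.0 for s in servers}  # latest composite load per server
        self.freq = defaultdict(int)           # access count per requested file
        self.hot_threshold = hot_threshold     # hypothetical cut-off for "hot"
        self.cache_notices = []                # files servers were asked to cache

    def report_load(self, server, load):
        """Store the composite load pushed by a back-end server."""
        self.load[server] = load

    def dispatch(self, path):
        """Record the access, notify caching once, return the chosen server."""
        self.freq[path] += 1
        if self.freq[path] == self.hot_threshold:
            # Stand-in for notifying the back-end servers to cache the file.
            self.cache_notices.append(path)
        return min(self.load, key=self.load.get)
```

For example, after `report_load("a", 0.7)` and `report_load("b", 0.2)`, a call to `dispatch("/x")` returns `"b"`.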
A. Load Collection<br />
The load collection module is responsible for tracking the<br />
processor, memory, network and disk usage of the web server. We<br />
gather resource utilization traces by running a set of<br />
microbenchmarks. The full list of metrics is shown in<br />
TABLE I. These statistics can all be gathered easily in<br />
Linux with the sysstat monitoring package [10]. We focus<br />
on this set of resource measurements since they can be<br />
gathered with low overhead and are representative<br />
indicators of web server performance [11]. Since<br />
these traces must also be gathered from the live<br />
application, it is crucial that a lightweight monitoring<br />
system can be used to gather the data. To improve<br />
performance, we create one thread per load indicator so that the<br />
indicators are sampled in parallel. The load collection module tracks the usage of<br />
each resource over a measurement interval and reports<br />
these statistics to the load calculation module at the end of each<br />
interval.<br />
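As an illustration of interval-based collection, the sketch below derives the three CPU metrics of TABLE I from two samples of the aggregate `cpu` line of Linux `/proc/stat`. Reading `/proc/stat` directly is an assumption made here for brevity, since the paper gathers its statistics with sysstat. The function names and the way `nice`, `irq` and `softirq` time are grouped are likewise illustrative choices.

```python
def parse_cpu_line(line):
    """Return the jiffy counters of the aggregate 'cpu' line as a dict."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]
    return dict(zip(names, map(int, fields[1:8])))


def cpu_utilization(before, after):
    """Percentages of the interval spent in user, kernel and I/O wait."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {
        "user_pct": 100.0 * (delta["user"] + delta["nice"]) / total,
        "kernel_pct": 100.0
        * (delta["system"] + delta["irq"] + delta["softirq"]) / total,
        "iowait_pct": 100.0 * delta["iowait"] / total,
    }


# Two samples taken at the start and end of a measurement interval.
start = parse_cpu_line("cpu  100 0 50 800 50 0 0 0")
end = parse_cpu_line("cpu  160 0 70 900 70 0 0 0")
usage = cpu_utilization(start, end)
```

On a live system the two lines would come from reading the first line of `/proc/stat` at the beginning and end of each interval.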
B. Load Calculation<br />
This section describes how to create a model that<br />
characterizes the relationship between a set of resource<br />
utilization metrics gathered from an application running<br />
on the web server and the composite load. The model<br />
is a linear equation.<br />
Using values from the load collection module, we form<br />
an equation that calculates the total load as a linear<br />
combination of the different metrics:<br />
$$T = \alpha_0 + \alpha_1 U_1 + \alpha_2 U_2 + \cdots + \alpha_9 U_9 \qquad (1)$$<br />
where<br />
• $U_i$ is the value of a metric collected for a benchmark<br />
executed on the web server;<br />
• the set of coefficients $\alpha_0, \alpha_1, \ldots, \alpha_n$ is the<br />
model that describes the relationship between the total<br />
load and the resource utilization metrics. Unfortunately,<br />
finding a good set of parameters is a rather empirical job,<br />
with very little support from theory. The main objective<br />
is to tune the parameters to achieve good system<br />
performance, without asking too many questions about<br />
why it works well. Often, it is just a matter of “let’s try<br />
this approach and see what happens”;<br />
• $T$ is the total load of the web server.<br />
Figure 1. System Architecture<br />
The load calculation module passes its result to the load<br />
management module located in the load balancing switch.<br />
Each web server actively pushes its<br />
composite load. Compared with polling by the switch, pushing<br />
further reduces the burden on the load balancing switch.<br />
In addition, transmitting the data by UDP unicast reduces<br />
the burden on network bandwidth.<br />
TABLE I. RESOURCE UTILIZATION METRICS<br />
CPU | Memory | Network | Disk<br />
User Space % | Memory Used % | Rx packets/sec | Read req/sec<br />
Kernel Space % | Swapper Used % | Tx packets/sec | Write req/sec<br />
IO Wait % | | Rx bytes/sec | Read blocks/sec<br />
 | | Tx bytes/sec | Write blocks/sec<br />
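Equation (1) amounts to a single weighted sum. The sketch below evaluates it for an arbitrary number of metrics. The function name is hypothetical and the coefficient values in the example are made up purely for illustration, since the paper tunes its coefficients empirically and does not list them.

```python
def composite_load(alpha, metrics):
    """Evaluate T = alpha[0] + sum(alpha[i] * U_i) as in Equation (1).

    `alpha` holds the intercept followed by one coefficient per metric;
    `metrics` holds the measured utilization values U_1..U_n.
    """
    if len(alpha) != len(metrics) + 1:
        raise ValueError("need one coefficient per metric plus an intercept")
    return alpha[0] + sum(a * u for a, u in zip(alpha[1:], metrics))


# Illustrative values only: intercept 0.1, two metrics weighted 0.5 and 0.2.
load = composite_load([0.1, 0.5, 0.2], [0.4, 0.5])
```

With these example numbers the result is 0.1 + 0.5 × 0.4 + 0.2 × 0.5 = 0.4.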
C. Load Management<br />
The load management module listens for load reports. When it receives a server’s<br />
composite load, it creates a child thread that<br />
passes the current load value to the llac module located in<br />
kernel space using the IPVSadm management tool.<br />
The frequency statistics module is mounted on the<br />
IP_LOCAL_IN hook of Netfilter [12] to parse<br />
request packets and count access frequencies.<br />
The priority of the frequency statistics function must be<br />
higher than that of IPVS; otherwise the request would be<br />
forwarded before it reaches the<br />
frequency statistics function. We use hash lists<br />
to speed up access and lookup.<br />
The entries are linked together through the generic list head pointer<br />
inside the structure. We sort the files in descending order of<br />
access frequency, with sorted_list<br />
pointing to the sorted list (Fig. 2), so that we can easily obtain<br />
hot file information.<br />
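The frequency bookkeeping can be sketched in user space as follows. A Python dict stands in for the kernel hash list, and the sorted view is rebuilt on demand rather than maintained through list pointers. The class and method names other than `sorted_list` are assumptions for illustration.

```python
class FrequencyTable:
    """Count accesses per file and expose them in descending order."""

    def __init__(self):
        self._count = {}  # hash table: requested path -> access count

    def record(self, path):
        """Register one access to `path`."""
        self._count[path] = self._count.get(path, 0) + 1

    def sorted_list(self):
        """Return (path, count) pairs, most frequently accessed first."""
        return sorted(self._count.items(), key=lambda kv: kv[1], reverse=True)

    def hot_files(self, n):
        """Return the `n` most frequently accessed paths."""
        return [path for path, _ in self.sorted_list()[:n]]


table = FrequencyTable()
for path in ["/a", "/b", "/b", "/c", "/b", "/a"]:
    table.record(path)
```

Re-sorting on every query is adequate for a sketch. A kernel module on the packet path would more plausibly keep the list ordered incrementally.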
D. Cache Replacement<br />
We use a memory file system, carved out of memory<br />
space, as the hot file cache. Our novelty lies in using link<br />
change technology: the file’s location on disk is<br />
changed to a symbolic link that points to the copy cached in the<br />
memory file system. This brings several<br />
benefits. For example, we do not need to modify the URL<br />
information. The service program also requires no<br />
modification, yet it reads the cached file from<br />
memory. Whether the file is on disk or in the memory<br />
file system is transparent to the user and to the service program.<br />
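A minimal sketch of the link change idea is given below. It assumes the memory file system is mounted at a directory passed in as `mem_dir` (for example a tmpfs mount), and it keeps the on-disk original under a hypothetical `.disk` suffix so the file can be restored on eviction. The function names and suffixes are illustrative and are not taken from the paper.

```python
import os
import shutil


def cache_file(disk_path, mem_dir):
    """Copy a file into the memory file system and swap in a symlink."""
    disk_path = os.path.abspath(disk_path)
    mem_path = os.path.join(os.path.abspath(mem_dir), os.path.basename(disk_path))
    backup = disk_path + ".disk"    # hypothetical name for the on-disk original
    tmp_link = disk_path + ".tmp"   # temporary name for the new symlink
    shutil.copy2(disk_path, mem_path)
    os.link(disk_path, backup)       # keep the disk data under a second name
    os.symlink(mem_path, tmp_link)
    os.replace(tmp_link, disk_path)  # atomic swap: the path now points to memory
    return mem_path


def evict_file(disk_path):
    """Restore the on-disk original and drop the cached copy."""
    disk_path = os.path.abspath(disk_path)
    mem_path = os.readlink(disk_path)
    os.replace(disk_path + ".disk", disk_path)  # atomic swap back to the file
    os.remove(mem_path)
```

Because the URL still resolves to the same path, the web server opens the file exactly as before and follows the symbolic link without any change to its configuration. Two files with the same base name would collide in this sketch, so a real deployment would need a collision-free naming scheme.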
Because RAM is limited, when the<br />
memory file system does not have enough space to accommodate<br />
a file that needs caching, cached files must be evicted<br />
according to a replacement policy. We use the<br />
following cache replacement strategy. We sort the files<br />
by the ratio of access frequency to file size.<br />
When a replacement is needed, files with the smallest ratio are evicted first.<br />
In this way, files with low access frequency and<br />
large size are replaced first. To a certain<br />
extent, this improves both the cache hit rate and<br />
cache utilization.<br />
Figure 2. File Frequency Statistics<br />
IV. EXPERIMENTAL RESULTS<br />
To analyze the proposed dispatching algorithm, we<br />
implemented it on a web server cluster. The<br />
experimental testbed uses the hardware and software<br />
configurations described below.<br />
A. Hardware setup<br />
The web server cluster consists of a load balancing<br />
switch node connected to the web server nodes. All<br />
nodes are connected through a high-speed gigabit LAN<br />
switch. The distributed architecture of the cluster is<br />
hidden from the clients behind a single virtual IP address.<br />
The hardware environment is shown in TABLE II.<br />
TABLE II. HARDWARE ENVIRONMENT<br />
Node | CPU | Memory (GB) | HD | NIC<br />
Front-end | Intel(R) E5345 2.33GHz 8cores | DDR2 4 | 60GB | 80003ES2LAN<br />
Back-ends (1-4) | Intel(R) E5345 2.33GHz 8cores | DDR2 4 | 60GB | 80003ES2LAN<br />
B. Software Setup<br />
TABLE III. SOFTWARE ENVIRONMENT<br />
Node | OS | Kernel | IPVS | Web server | Benchmarks<br />
LB switch | Red Hat Linux5.0 | 2.6.18 | 1.0.4 | ————— | —————<br />
Web server | Red Hat Linux5.0 | 2.6.18 | —— | Apache 2.0.40 | —————<br />
Client | Red Hat Linux5.0 | 2.6.18 | —— | ————— | WebBench 5.0<br />
TABLE III shows the experimental software<br />
environment. All the machines in the cluster run Linux<br />
kernel 2.6.18, and the load<br />
balancing switch uses IPVS for request dispatching. We<br />
use Apache 2.0.40 as the web<br />
server for HTTP service, with HTTP/1.1 connections. In addition, all<br />
clients have WebBench [13] installed for issuing<br />
requests.<br />
WebBench is performance testing software for web<br />
servers, consisting of a controller and clients. The<br />
controller directs the clients to issue requests,<br />
records and summarizes the experimental data, and then<br />
outputs the experimental results. In addition, WebBench<br />
can control the mix of request types sent<br />
by the clients through a programmable workload.<br />
We perform all experiments to analyze the system<br />
performance under different ratios of request types (e.g.<br />
different localities of hot Web pages). We also create a<br />
workload generator to generate a synthetic workload for<br />
various ratios of request types. The performance metrics<br />
we use are requests per second (req/s), megabits per<br />
second (Mbps) and the number of successful requests, which<br />
are summarized and reported by<br />
WebBench.<br />
C. Experimental evaluation<br />
In this section, we present a performance evaluation of<br />
the proposed llac request distribution algorithm. In this<br />
test, WebBench is used and hot Web pages are built from<br />
the requested web pages of the default workload.<br />
We set the click-through rate (CTR)<br />
to 20%, 40%, 60% and 80%, and change the percentage of<br />
hot web pages among the requested web pages to 10%, 20%,<br />
30% and 40%. We compare the experimental results with<br />
those of wrr. We also<br />
compare llac with using only ll (least load) or only ac (active caching). Fig. 3, 4, and 5<br />
show that llac outperforms wrr, ll, and ac.<br />
The reason that the llac policy performs better than wrr and ll<br />
is that it uses a frequency-based<br />
mechanism to achieve high cache hit rates on the servers. The<br />
reason that the llac policy performs better than ac is that<br />
it considers each server’s current load, assessing<br />
that load and selecting an appropriate node to<br />
respond to the request.<br />
Experimental results demonstrate that when the web<br />
server cluster is under heavy load, the llac policy can<br />
handle more requests and shows better performance.<br />
Figure 3. Number of requests per minute<br />
Figure 4. Data transfer per second<br />
Figure 5. Number of successful requests<br />
V. SUMMARY<br />
This paper presents a new request distribution<br />
algorithm for web server clusters, called llac. The<br />
work focuses on reducing access time for hot files and<br />
on using the resources of the web servers more efficiently. Our<br />
experimental results show that the proposed llac policy<br />
achieves better performance than wrr, ll, and ac under<br />
heavy load. The policy reduces response time,<br />
especially for hot files, because hot files are retrieved<br />
from memory. The node with the least load is selected to<br />
serve each request, which improves resource<br />
utilization. In future work we plan to experiment<br />
with more benchmarks to further verify the effectiveness of<br />
llac.<br />
ACKNOWLEDGMENT<br />
This study is sponsored by the fund of the State Key<br />
Laboratory of Software Development Environment under<br />
Grant No. SKLSDE-2009ZX-01, the National Natural<br />
Science Foundation of China under Grant No. 61003015,<br />
the Doctoral Fund of Ministry of Education of China<br />
under Grant No. 20101102110018 and the Fundamental<br />
Research for the Central Universities under Grant No.<br />
YWF-10-02-058.
REFERENCES<br />
[1] A. Chandra, P. Pradhan, R. Tewari, S. Sahu, P. Shenoy.<br />
"An observation-based approach towards self-managing<br />
web servers", Computer Communications, 2006, pp1174-<br />
1188.<br />
[2] V. Cardellini, E. Casalicchio, M. Colajanni, S. Tucci,<br />
"Mechanisms for quality of service in web clusters",<br />
Computer Networks, vol.37, No.6, 2001, pp761-771.<br />
[3] M.E. Crovella, A. Bestavros. "Self-Similarity in World<br />
Wide Web Traffic: Evidence and Possible Causes",<br />
IEEE/ACM Transactions on Networking, vol.5, No.6,<br />
1997, pp835-846.<br />
[4] V. Cardellini, E. Casalicchio, M. Colajanni, and P.S. Yu.<br />
"The State of the Art in Locally Distributed Web-Server<br />
Systems", ACM Computing Surveys, vol.34, No.2, 2002,<br />
pp 263-311.<br />
[5] M. Andreolini, E. Casalicchio. "A cluster-based web<br />
system providing differentiated and guaranteed services",<br />
Cluster Computing , vol.7, No.1, 2004, pp7-19.<br />
[6] E. Choi. "Performance test and analysis for an adaptive<br />
load balancing mechanism on distributed server cluster<br />
systems", Future Generation Computer Systems, No.20,<br />
2004, pp 237-247.<br />
[7] V.S. Pai, M. Aron, G. Banga. "Locality-Aware Request<br />
Distribution in Cluster-based Network Servers", ACM<br />
SIGOPS Operating Systems Review, USA:ACM , 1998,<br />
pp205-216.<br />
[8] L. Cherkasova, M. Karlsson. "Scalable Web Server<br />
Cluster Design with Workload-Aware Request<br />
Distribution Strategy WARD", Advanced Issues <strong>of</strong> E-<br />
Commerce and Web-Based Information Systems,<br />
Washington:IEEE Computer Society, 2001, pp212-221.<br />
[9] E. Casalicchio, M. Colajanni. "A client-aware dispatching<br />
algorithm for Web clusters providing multiple services",<br />
The International World Wide Web Conference<br />
Committee (IW3C2), 2001, pp535-544.<br />
[10] Sysstat-7.0.4. http://perso.orange.fr/sebastien.godard/<br />
[11] M. Andreolini, S. Casolari, M. Colajanni. "Models<br />
and framework for supporting runtime decisions in Web-based<br />
systems", ACM Transactions on the Web (TWEB),<br />
vol.2, No.3, 2008, pp1-43.<br />
[12] CHRISTIAN BENVENUTI. Understanding LINUX<br />
NETWORK INTERNALS. 2006.<br />
[13] http://linux.s<strong>of</strong>tpedia.com/get/System/Benchmarks/Webbench-1378.shtml<br />
[14] M.L. Chiang, Y.C. Lin, L.F. Guo. "Design and<br />
implementation of an efficient web cluster with content-based<br />
request distribution and file caching", Journal of<br />
Systems and Software, vol.81, No.11, 2008, pp 2044-2058.<br />
[15] S. Sharifian, S.A. Motamedi, M.K. Akbari. "A content-based<br />
load balancing algorithm with admission control for<br />
cluster web servers", Future Generation Computer<br />
Systems, vol.24, No.8, 2008, pp775-787.<br />
[16] M.L. Chiang, C.H. Wu, Y.J. Liao, Y.F. Chen. "New<br />
Content-aware Request Distribution Policies in Web<br />
Clusters Providing Multiple Services", Proceedings of the<br />
2009 ACM symposium on Applied Computing,<br />
USA:ACM, 2009, pp79-83.<br />
[17] Z.Y. Xu, J.Z. Han, L. Bhuyan. "Scalable and<br />
Decentralized Content-Aware Dispatching in Web<br />
Clusters", IEEE International Performance, Computing,<br />
and Communications, Washington:IEEE Computer<br />
Society, 2007, pp202-209.<br />
[18] Y.K. Chang. "Fully Pre-Splicing TCP for Web Switches",<br />
Proceedings of the First International Conference on<br />
Innovative Computing, Information and Control,<br />
Washington:IEEE Computer Society, 2006, pp737-740.<br />
[19] S. Chase, D.C. Anderson. "Managing energy and server<br />
resources in hosting centers", In Proc. of the eighteenth<br />
ACM symposium on Operating systems principles, 2001,<br />
pp103-116.<br />
[20] Tarek F. Abdelzaher, Kang G. Shin, and Nina Bhatti.<br />
"Performance Guarantees for Web Server End-Systems: A<br />
Control-Theoretical Approach", IEEE Transactions on<br />
Parallel and Distributed Systems, June 2001.<br />
[21] Yasushi Saito, Brian N. Bershad, and Henry M. Levy. "An<br />
approximation-based load-balancing algorithm with<br />
admission control for cluster web servers with dynamic<br />
workloads", Journal of Supercomputing, vol.53, No.3,<br />
2010, pp 440-463.<br />
Wei Zhang was born in Hebei Province, China,<br />
in December 1983. She is a PhD<br />
candidate in the Department of<br />
Computer Science and Technology at<br />
Beihang University. She received her<br />
master’s degree in 2008. Her research<br />
interests include virtualization, load<br />
balancing and cloud computing.<br />
Huan Wang was born in Hunan Province, China,<br />
in October 1986, and holds a master’s degree in Computer<br />
Science and Engineering from the<br />
Department of Computer<br />
Science, Beihang University. Research<br />
interests include operating systems,<br />
load balancing, parallel computing and<br />
massive data processing.<br />
Binbin Yu was born in July 1987 and is<br />
currently studying for a master’s degree<br />
in the Computer Science College at<br />
Beihang University, concentrating mainly on load<br />
balancing and cloud computing.<br />
Wei Xu was born in Fujian Province, China,<br />
in November 1986, and holds a master’s degree in Computer<br />
Science and Technology from the<br />
Department of Computer Science and<br />
Engineering, Beihang University. Research<br />
interests include virtualization,<br />
operating systems, high performance<br />
computing and cloud computing.
Mingfa Zhu was born in 1945. He is a Ph.D.,<br />
Professor, and Senior Member of the China<br />
Computer Federation. His main research<br />
areas are computer architecture,<br />
computer system software, high<br />
performance computing, virtualization<br />
and cloud computing.<br />
Limin Xiao was born in 1970. He is a Ph.D.,<br />
Professor, and Senior Member of the China<br />
Computer Federation. His main research<br />
areas are computer architecture,<br />
computer system software, high<br />
performance computing, virtualization<br />
and cloud computing.<br />
Li Ruan was born in 1978. She is a Ph.D., Lecturer,<br />
and Member of the China Computer<br />
Federation. Her main research areas are<br />
computer architecture, computer system<br />
software, high performance computing,<br />
virtualization and cloud computing.
Aims and Scope.<br />
Call for Papers and Special Issues<br />
Journal of Networks (JNW, ISSN 1796-2056) is a scholarly peer-reviewed international scientific journal published monthly, focusing on theories,<br />
methods, and applications in networks. It provides a high profile, leading edge forum for academic researchers, industrial professionals, engineers,<br />
consultants, managers, educators and policy makers working in the field to contribute and disseminate innovative new work on networks.<br />
The Journal of Networks reflects the multidisciplinary nature of communications networks. It is committed to the timely publication of high-quality<br />
papers that advance the state-of-the-art and practical applications of communication networks. Both theoretical research contributions<br />
(presenting new techniques, concepts, or analyses) and applied contributions (reporting on experiences and experiments with actual systems) and<br />
tutorial expositions of permanent reference value are published. The topics covered by this journal include, but are not limited to, the following:<br />
• Network Technologies, Services and Applications, Network Operations and Management, Network Architecture and Design<br />
• Next Generation Networks, Next Generation Mobile Networks<br />
• Communication Protocols and Theory, Signal Processing for Communications, Formal Methods in Communication Protocols<br />
• Multimedia Communications, Communications QoS<br />
• Information, Communications and Network Security, Reliability and Performance Modeling<br />
• Network Access, Error Recovery, Routing, Congestion, and Flow Control<br />
• BAN, PAN, LAN, MAN, WAN, Internet, Network Interconnections, Broadband and Very High Rate Networks<br />
• Wireless Communications & Networking, Bluetooth, IrDA, RFID, WLAN, WMAX, 3G, Wireless Ad Hoc and Sensor Networks<br />
• Data Networks and Telephone Networks, Optical Systems and Networks, Satellite and Space Communications<br />
Special Issue Guidelines<br />
Special issues feature specifically aimed and targeted topics <strong>of</strong> interest contributed by authors responding to a particular Call for Papers or by<br />
invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are <strong>of</strong> interest to the <strong>Journal</strong>.<br />
Preference will be given to proposals that cover some unique aspect <strong>of</strong> the technology and ones that include subjects that are timely and useful to the<br />
readers <strong>of</strong> the <strong>Journal</strong>. A Special Issue typically consists <strong>of</strong> 10 to 15 papers, each 8 to 12 pages in length.<br />
The following information should be included as part <strong>of</strong> the proposal:<br />
• Proposed title for the Special Issue<br />
• Description <strong>of</strong> the topic area to be focused upon and justification<br />
• Review process for the selection and rejection <strong>of</strong> papers<br />
• Name, contact, position, affiliation, and biography <strong>of</strong> the Guest Editor(s)<br />
• List <strong>of</strong> potential reviewers<br />
• Potential authors for the issue<br />
• Tentative time-table for the call for papers and reviews<br />
If a proposal is accepted, the guest editor will be responsible for:<br />
• Preparing the “Call for Papers” to be included on the <strong>Journal</strong>’s Web site.<br />
• Distribution <strong>of</strong> the Call for Papers broadly to various mailing lists and sites.<br />
• Collecting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be<br />
informed <strong>of</strong> the Instructions for Authors.<br />
• Providing us with the completed and approved final versions <strong>of</strong> the papers formatted in the <strong>Journal</strong>’s style, together with all authors’ contact<br />
information.<br />
• Writing a one- or two-page introductory editorial to be published in the Special Issue.<br />
Special Issue for a Conference/Workshop<br />
A Special Issue for a Conference/Workshop is usually released in association with committee members <strong>of</strong> the Conference/Workshop, such as<br />
general chairs and/or program chairs, who are appointed as the Guest Editors <strong>of</strong> the Special Issue. A Special Issue for a Conference/Workshop<br />
typically consists <strong>of</strong> 10 to 15 papers, each 8 to 12 pages in length.<br />
Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop:<br />
• Selecting a Title for the Special Issue, e.g. “Special Issue: Selected Best Papers <strong>of</strong> XYZ Conference”.<br />
• Sending us a formal “Letter <strong>of</strong> Intent” for the Special Issue.<br />
• Creating a “Call for Papers” for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees.<br />
Information about the <strong>Journal</strong> and <strong>Academy</strong> <strong>Publisher</strong> can be included in the Call for Papers.<br />
• Establishing criteria for paper selection/rejection. The papers can be nominated based on multiple criteria, e.g. rank in review process plus<br />
the evaluation from the Session Chairs and the feedback from the Conference attendees.<br />
• Selecting and inviting submissions, arranging review process, making decisions, and carrying out all correspondence with the authors.<br />
Authors should be informed <strong>of</strong> the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.<br />
• Providing us with the completed and approved final versions <strong>of</strong> the papers formatted in the <strong>Journal</strong>’s style, together with all authors’ contact<br />
information.<br />
• Writing a one- or two-page introductory editorial to be published in the Special Issue.<br />
More information is available on the web site at http://www.academypublisher.com/jnw/.
(Contents Continued from Back Cover)<br />
Covert Flow Graph Approach to Identifying Covert Channels<br />
XiangMei Song and ShiGuang Ju<br />
A Novel HAVE Message <strong>of</strong> Peer-to-peer Protocol in BitTorrent Systems<br />
Jianyong Li, Jianchun Li, Daoying Huang, and Qiang Wei<br />
Image-based Position Estimation and Adaptive Modulation Coding in Vehicular Communication<br />
Hao Yang, Qingmin Meng, Xiong Gu, and Baoyu Zheng<br />
A Request Distribution Algorithm for Web Server Cluster<br />
Wei Zhang, Huan Wang, Binbin Yu, Wei Xu, Mingfa Zhu, Limin Xiao, and Li Ruan<br />