Journal of Networks - Academy Publisher

Journal of Networks
ISSN 1796-2056
Volume 6, Number 12, December 2011

Contents

REGULAR PAPERS

Botnet Detection Architecture Based on Heterogeneous Multi-sensor Information Fusion (p. 1655)
HaiLong Wang, Jie Hou, and ZhengHu Gong

XOEM plus OWL-based STEP Product Information Uniform Description and Implementation (p. 1662)
Chengfeng Jian and Haizhong Meng

Design of Greenhouse Control System Based on Wireless Sensor Networks and AVR Microcontroller (p. 1668)
Yongxian Song, Chenglong Gong, Yuan Feng, Juanli Ma, and Xianjin Zhang

Simulation of Networked Control System based on Smith Compensator and Single Neuron Incomplete Differential Forward PID (p. 1675)
Haitao Zhang and Zhen Li

A Web Crawler System Design Based on Distributed Technology (p. 1682)
Shaojun Zhong and Zhijuan Deng

A Ranking Method of Retrieval Results Based on Web Comprehending (p. 1690)
Zhijuan Deng and Shaojun Zhong

An Encryption Scheme with Hidden Keyword Search for Outsourced Database (p. 1697)
Xiaoming Wang, Guoxiang Yao, and Zhen Zhang

A Method of Object-based De-duplication (p. 1705)
Fang Yan and YuAn Tan

Analysis on E-consumers’ Purchasing Behavior Based on Data-driving Model (p. 1713)
Lijuan Huang

Repair Method of Complex Network Based on Matthew Effect (p. 1719)
Minsheng Tan, Qiang Cui, Lingfeng Zhu, and Hui Zhao

Study and Design an Anycast Routing Protocol for Wireless Sensor Networks (p. 1726)
Demin Gao, Huanyan Qian, Zheng Wang, and Jiguang Chen

Management Model Research of Low-power Wireless Sensor Network (p. 1734)
LinGe Wang and YueDou Qi

Covert Flow Graph Approach to Identifying Covert Channels (p. 1740)
XiangMei Song and ShiGuang Ju

A Novel HAVE Message of Peer-to-peer Protocol in BitTorrent Systems (p. 1747)
Jianyong Li, Jianchun Li, Daoying Huang, and Qiang Wei

Image-based Position Estimation and Adaptive Modulation Coding in Vehicular Communication (p. 1754)
Hao Yang, Qingmin Meng, Xiong Gu, and Baoyu Zheng

A Request Distribution Algorithm for Web Server Cluster (p. 1760)
Wei Zhang, Huan Wang, Binbin Yu, Wei Xu, Mingfa Zhu, Limin Xiao, and Li Ruan


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1655

Botnet Detection Architecture Based on Heterogeneous Multi-sensor Information Fusion

HaiLong Wang and Jie Hou
National University of Defense Technology, Changsha, 410073, China
Email: {hlwang1981, jhou1983}@gmail.com

ZhengHu Gong
National University of Defense Technology, Changsha, 410073, China
Email: gzh@nudt.edu.cn

Abstract: As technology develops rapidly, botnet threats to the global cyber community are increasing, and botnet detection has recently become a major research topic in network security. Most current detection approaches rely on evidence from a single information source, which cannot capture all the traces of a botnet and rarely achieves high accuracy. In this paper, a novel botnet detection architecture based on heterogeneous multi-sensor information fusion is proposed. The architecture integrates information at three fusion levels: data, feature, and decision. As the core component, a feature extraction module is designed in detail. An extended algorithm of the Dempster-Shafer (D-S) theory is proved and adopted for decision fusion. Furthermore, a representative case illustrates that the detection architecture can effectively fuse complicated information from various sensors and thereby achieve better detection results.

Index Terms: botnet, botnet detection, network security, information fusion, D-S theory

I. INTRODUCTION

Internet threats have recently shifted from highly visible, disruptive attacks to stealthy attacks carried out for profit, and botnets are at the center of this change [1]. Botnets have been the workhorses of many disastrous attacks, such as information theft [2], distributed denial of service (DDoS) [3], and spam [4]. These threats can disable infrastructure and cause financial damage, which poses a severe challenge to global network security. Hence, to detect botnet attacks effectively, we need a correct and comprehensive understanding of them. In particular, we must fuse all the gathered information related to botnet activities from heterogeneous multi-source sensors, and then carry out further analysis for decision-making. Information fusion is therefore a necessary component of botnet detection [5].

Manuscript received January 1, 2011; revised June 1, 2011; accepted July 1, 2011.
© 2011 ACADEMY PUBLISHER
doi:10.4304/jnw.6.12.1655-1661

A botnet is a network of computers on which software called a ‘bot’ is installed automatically, without user intervention, and which is remotely controlled through a command and control (C&C) channel for malicious purposes [6]. Botnet activities share the following characteristics. First, they have more action phases and forms than traditional malware attacks. The activity cycle of a botnet attack usually consists of four stages: propagation, infection, communication, and attack [7]. Even within the same stage, different botnets can behave differently, for example by propagating through system vulnerabilities or through email. Second, botnet activities are wide-ranging, from a private host through the local area network to the backbone [8]. Third, botnet activities are usually hidden. Because the resulting network traffic is small, bots can upgrade themselves without exposure for a long time [9]. These three characteristics make botnet detection very difficult. However, a botnet leaves traces across the wide range over which it acts [10]. Diverse types of information sources can be retrieved, such as network packets, network flows, system logs, alerts from anti-virus software or intrusion detection systems, and the analysis results of botnet detection tools. Although this information can be used to identify the traces of a botnet, it is usually large-scale, uncertain, and redundant.

Despite the importance of information fusion for botnet detection, most existing work does not focus on it. To the best of our knowledge, existing botnet detection schemes can discover bots to some extent, but they do not make full use of the varied information related to botnet activities and cannot capture the overall situation of a botnet infiltration. In recent years, multi-sensor information fusion has developed rapidly and has been applied in many sophisticated application areas, especially network security [11]. To integrate complicated information from heterogeneous, multi-source sensors efficiently, we propose a botnet detection architecture based on information fusion techniques. In the architecture, we design a novel feature extraction module and adopt an extended decision fusion algorithm, which enables detection to fuse information at three levels: data, feature, and decision.

The remainder of the paper is organized as follows. Section 2 discusses background technologies and related work. Section 3 presents the botnet detection architecture based on heterogeneous multi-sensor information fusion. Section 4 introduces the fusion algorithm used in the architecture and gives its proof. Section 5 illustrates botnet detection with a scenario. Finally, Section 6 concludes the paper.

II. RELATED WORK

A botnet is a newer type of attack that evolved from, and combines, network worms, Trojans, backdoor tools, and other traditional forms of malicious code [12]. Compared with these traditional attacks, the major difference is that a botnet has a one-to-many control relationship between the attacker and the bots [13]. This feature makes botnets stealthier, more flexible, and more efficient than other malicious programs.

As botnets have evolved, so have the techniques for detecting them. Many diverse schemes for botnet detection have been proposed, such as honeypots or honeynets for capture and analysis [14], correlation analysis of malicious behaviors [15], detection approaches for different C&C mechanisms (e.g., IRC, HTTP, DNS, or P2P) [16-19], and identification of bots from DDoS and spam activity [20, 21]. However, these techniques mainly focus on network traffic and obtain evidence of botnet activities indirectly. For example, evidence of a bot upgrade is obtained by identifying the upgrade binaries in the traffic, rather than directly from the code server that logs the download event. Reliance on a single information source and on indirect evidence causes three problems for botnet detection. First, it usually produces false positives and false negatives. Second, it lengthens the detection cycle, because multiple rounds of observation are generally required to reach a correct result. Third, because information collection is inadequate, it is very difficult to become aware of new botnets or botnet variants. Therefore, detection architectures that can integrate heterogeneous multi-sensor information deserve more attention.

Robert et al. [22] design a multi-layered architecture for detecting a wide range of existing and new botnets. The architecture can integrate many detection techniques and gathers information from all the available information sources: network traffic data, system process information, and file system information. Napoleon et al. [23] introduce a risk-aware, network-centric management framework to detect and prevent targeted botnet attacks as well as propagation attempts within the network. The framework systematically collects network traffic and software vulnerabilities, comprehensively analyzes and discovers the characteristics and unique behaviors of bots, and dynamically determines the associated risks and generates corresponding detection rules. Zhang et al. [24] develop a top-down analytical framework as a basis for the critical evaluation of existing countermeasures. The framework correlates and integrates the observations and reports of anti-botnet tools at different layers (Internet, intranet, and host) to obtain a complete snapshot of the botnet. Alireza et al. [25] propose an architecture called “Visual Threat Monitor” that combines data mining and visualization to enhance botnet traffic detection. Its processing pipeline consists of correlation, statistical analysis, clustering, aggregation, and visualization. Building on the studies in [15, 26, 27], Gu et al. [28] present a general detection framework for more accurate botnet detection over a local area network. It analyzes traffic and network flows and correlates them with multiple alerts or events from an intrusion detection system.

These detection architectures have several problems with respect to information fusion. First, the types of information source are incomplete, and the sources are not properly divided according to botnet activities, which causes highly redundant information and poorly targeted collection. Second, although the schemes adopt some correlation analysis methods, they lack a powerful algorithm to fuse large-scale information from different sources and to obtain the correlation between attackers and their botnets. Third, most existing frameworks have no independent feature extraction module, or their feature extraction function is too simple.

III. ARCHITECTURE OVERVIEW

Information fusion is an information processing technique that uses information from multiple sensors, together with related information from associated databases, to achieve improved accuracy and more specific inferences than could be achieved with a single sensor alone [29]. Network security is a recent application area of information fusion, and work in this area has mainly aimed at improving intrusion detection systems (IDS) [30]. Information fusion processes are often categorized as data, feature, or decision level fusion, depending on the processing stage at which fusion takes place [31]. Data level fusion combines several sources of raw information to produce new information that is expected to be more informative and synthetic than the inputs. Feature level fusion combines various features into a feature map that may be used by further processing. Decision level fusion combines decisions coming from several experts. Following these fusion processes, we present a botnet detection architecture based on heterogeneous multi-sensor information fusion, which consists of the parts shown in Figure 1.

A. Information Collection

This part adopts the role-based collaborative information collection model that we presented in [32]. It comprises the software and hardware of the information collection system, and its main task is to collect all the information related to botnet activities from heterogeneous multi-source sensors. The information can be gathered from computers in the network, from network security equipment such as firewalls and intrusion detection systems (IDS), and from network equipment such as routers and switches. This function is implemented by the information collection agent.

Figure 1. Botnet detection architecture based on heterogeneous multi-sensor information fusion.


B. Pre-processing

Pre-processing is needed to increase the effectiveness and efficiency of the subsequent information fusion. The pre-processing module is composed of classification and refinement. Classification divides the information sources into original information, indirect information, and direct information. Original information is the raw record of network and system behaviors without any security analysis, such as packet payloads and system process information. Indirect information is the alarm information from general security software, such as anti-virus software, firewalls, and honeypots. Indirect information, usually combined with original information, can serve as indirect evidence for botnet detection. Direct information is the analysis result of dedicated botnet detection tools (e.g., BotHunter [15]), which can serve as direct evidence for botnet detection. Refinement filters out unwanted information, detects suspicious information based on preset rules, unifies the presentation, and stores the result in the information database.
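The classification and refinement steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source-type names, the record fields (`source`, `host`, `detail`), and the `is_suspicious` rule are hypothetical and are not specified in the paper.

```python
from enum import Enum


class InfoClass(Enum):
    ORIGINAL = "original"   # raw records without any security analysis
    INDIRECT = "indirect"   # alarms from general security software
    DIRECT = "direct"       # results of dedicated botnet detection tools


# Hypothetical mapping from a record's source type to its information class.
SOURCE_CLASS = {
    "packet_payload": InfoClass.ORIGINAL,
    "system_process": InfoClass.ORIGINAL,
    "antivirus_alert": InfoClass.INDIRECT,
    "firewall_alert": InfoClass.INDIRECT,
    "honeypot_alert": InfoClass.INDIRECT,
    "bothunter_report": InfoClass.DIRECT,
}


def classify(record):
    """Return the information class of a record based on its source type."""
    return SOURCE_CLASS[record["source"]]


def refine(records, is_suspicious):
    """Filter, select suspicious records by a preset rule, and unify their form."""
    refined = []
    for record in records:
        if record.get("source") not in SOURCE_CLASS:
            continue  # filter out unwanted information
        if not is_suspicious(record):
            continue  # keep only what the preset rule flags
        refined.append({
            "source": record["source"],
            "class": classify(record).value,
            "host": record.get("host"),
            "detail": record.get("detail", ""),
        })
    return refined  # in the architecture this would go to the information database
```

A real deployment would replace the dictionary lookup and the rule callback with whatever the collection agents actually report; the point here is only the three-way division followed by filtering and normalization.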

C. Feature Extraction

Feature extraction is the core module of the architecture. Existing detection techniques use the following two extraction modes:

• Utilizing the botnet samples captured by a honeypot or honeynet (including bots, message contents, etc.). Because the sample data is relatively pure, the extracted data features can be adopted directly as the essential features (signatures or patterns) of a botnet.

• Utilizing general information (such as flow data and logs) and indirect information. The main process is as follows: first, try to discover data features; then, compare them with the results found by a proven botnet detection system; finally, verify whether the data features belong to the essential features of botnets.

Figure 2. Feature extraction.

Our feature extraction covers both modes. As shown in Figure 2, the feature extraction module consists of four parts: attribute selection, data feature analysis, validation, and scheme management. Data feature analysis integrates data mining methods, such as statistical data analysis, pattern recognition, artificial neural networks, and support vector machines. Its goal is to provide a mechanism for identifying new features in the data sets produced by attribute selection. Features derived from general and indirect information must be verified before being stored in the feature database as signatures or patterns. Meanwhile, based on the extracted features, scheme management gives feedback to the data feature analysis and attribute selection modules for dynamic optimization. In addition, this part divides the analytic results into four main categories as the inputs of the fusion analysis, according to the stages of botnet activities: propagation, infection, communication, and attack.

D. Fusion Analysis

Fusion analysis is also a key part of the architecture and the main module for producing high-level, qualified alerts. This part gives the final conclusion for decision-making, reflecting the real situation of the botnet activities. The detailed process is described in Section 4.

E. Database

The information database stores the results produced by the pre-processing module. The results are divided into three categories, which is useful for the subsequent fusion. The feature database stores the signatures and patterns from feature extraction. The knowledge database includes the vulnerability database, security policies, client configuration records, etc., which provide data support for decision-making.

F. Control and Collaboration Mechanism

The control mechanism is used to react to offending events taking place on or within the detection system. Depending on the result of analysis and synthesis, this part takes response measures directed at the main modules. Some of the response work can be completed automatically by the control mechanism through adjusting system parameters. The collaboration mechanism provides communication and function calls among the detection systems or with other security products.

IV. METHOD OF FUSION ANALYSIS

Decision fusion algorithms in the fusion analysis face three critical requirements:

• Flexibility. The algorithm should require no prior or conditional probabilities. Since botnet behaviors are often random and uncertain, it is difficult to obtain prior knowledge.

• Compatibility. The algorithm should effectively integrate heterogeneous multi-sensor information; in particular, as evidence accumulates, the decision should become more accurate.

• Scalability. The algorithm should be able to fuse new evidence from emerging sensors easily, without changing its framework.

Among the main algorithms, the Dempster-Shafer (D-S) theory meets these requirements. The D-S theory is a mathematical theory of evidence, introduced in the 1960s by Arthur Dempster [33] and developed in the 1970s by Glenn Shafer [34]. It is viewed as a mechanism for reasoning under epistemic (knowledge) uncertainty. First, we give a brief introduction to the D-S theory [35]. Then we introduce the extended D-S theory proposed in [36], which our architecture uses to fuse the results from feature extraction, and we give a proof of the extended theory, which was not proved in [36].

D-S Theory

The frame of discernment ($\Theta$) is a finite set of mutually exclusive propositions and hypotheses about some problem domain. The basic probability assignment (bpa) is stated in [34] as: “If $\Theta$ is a frame of discernment, then a function $m: 2^{\Theta} \to [0, 1]$ is called a basic probability assignment whenever

$$ m(\emptyset) = 0 \tag{1} $$

$$ \sum_{A \subseteq \Theta} m(A) = 1 \tag{2} $$

The mass value of A (m(A)) is also called A’s basic probability number, and it is understood to be the measure of the belief that is committed exactly to A.”

The belief function (Bel) sums the masses of all subsets of A:

$$ \mathrm{Bel}(A) = \sum_{B \subseteq A} m(B) \tag{3} $$

The plausibility function (Pl) takes into account all the elements related to A (either supported by evidence or unknown):

$$ \mathrm{Pl}(A) = 1 - \mathrm{Bel}(\neg A) \tag{4} $$

For the subset A, Bel(A) and Pl(A) represent the lower and upper belief bounds, respectively, and the interval [Bel(A), Pl(A)] represents the belief range.
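Definitions (1) to (4) can be checked on a small example. The following Python sketch represents a mass function as a dictionary from `frozenset` to mass; the two-hypothesis frame and the mass values are hypothetical and chosen only for illustration.

```python
def is_bpa(m, frame, tol=1e-9):
    """Check conditions (1) and (2) for a mass function m over the given frame."""
    if m.get(frozenset(), 0.0) != 0.0:
        return False  # (1): the empty set carries no mass
    if any(not A <= frame or v < 0 for A, v in m.items()):
        return False  # masses are non-negative and sit on subsets of the frame
    return abs(sum(m.values()) - 1.0) <= tol  # (2): masses sum to 1


def bel(m, A):
    """Belief (3): total mass of all subsets of A."""
    return sum(v for B, v in m.items() if B <= A)


def pl(m, A, frame):
    """Plausibility (4): one minus the belief in the complement of A."""
    return 1.0 - bel(m, frame - A)


# Hypothetical frame and masses: 0.6 committed to "bot", 0.4 left unassigned.
BOT, BENIGN = "bot", "benign"
FRAME = frozenset({BOT, BENIGN})
m = {frozenset({BOT}): 0.6, FRAME: 0.4}
```

With these assumed values, Bel({bot}) is 0.6 and Pl({bot}) is 1.0, so the belief range for "bot" is [0.6, 1.0]; the mass left on the whole frame is what separates the two bounds.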

For two mass functions $m_1$ and $m_2$ over $\Theta$, Dempster’s rule of combination for $A \neq \emptyset$ is

$$ m_{12}(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{\sum_{B \cap C \neq \emptyset} m_1(B)\, m_2(C)} \tag{5} $$

Dempster’s rule of combination can be used to combine the mass values of all features from each individual sensor into overall summary mass values for that sensor. The summary values from all sensors are then combined to give the summary mass values for the system. Initially, the bpas are used to assign mass values to the appropriate hypotheses. The resulting mass values are then used to calculate the belief for each hypothesis. Finally, all beliefs are combined with Dempster’s rule of combination to obtain the overall belief for each hypothesis, as shown in (5).
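Rule (5) is short enough to state directly in code. The sketch below is a minimal Python version for two mass functions stored as dictionaries from `frozenset` to mass; the frame and the numeric masses are hypothetical.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule (5)."""
    combined = {}
    norm = 0.0  # denominator of (5): total mass of non-conflicting pairs
    for B, mass_b in m1.items():
        for C, mass_c in m2.items():
            A = B & C
            if not A:
                continue  # empty intersection: conflicting evidence, excluded
            product = mass_b * mass_c
            combined[A] = combined.get(A, 0.0) + product
            norm += product
    if norm == 0.0:
        raise ValueError("the two sources are in total conflict")
    return {A: v / norm for A, v in combined.items()}


# Hypothetical evidence from two sensors about the same host.
BOT, BENIGN = "bot", "benign"
FRAME = frozenset({BOT, BENIGN})
m1 = {frozenset({BOT}): 0.6, FRAME: 0.4}
m2 = {frozenset({BOT}): 0.7, frozenset({BENIGN}): 0.1, FRAME: 0.2}
m12 = dempster_combine(m1, m2)
```

For these assumed inputs the only conflicting pair is ({bot}, {benign}) with product 0.06, so the normalizer is 0.94 and the combined mass on {bot} is 0.82 / 0.94, higher than either sensor reported alone.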

Extended D-S Theory

Dempster’s rule of combination, as shown in (5), gives equal trust to the evidence provided by different sensors. In practice this is not the case. Observations show that, for the same type of sensor, local sensors should provide more credible information than remote ones; even the same sensor has different detection capacity when installed at different locations in the network; and different types of sensors may have different detection capability and accuracy for the same type of attack. The evidence provided therefore differs greatly in importance and reliability. To solve these problems, the extended D-S theory assigns different weights to different sensors, which means that different sensors are given different levels of trust. Equation (6) applies the weights as exponents, and the evidence after combination must still satisfy the basic property of a bpa, that is, (2).

$$ m_{12}(A) = \frac{\sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \tag{6} $$

In (6), $m_1$ and $m_2$ are mass functions over $\Theta$, $w_1$ and $w_2$ are the weights of the two sensors, and $m_{12}(\emptyset) = 0$. We only need to prove that $\sum_{A \subseteq \Theta} m_{12}(A) = 1$, which is shown in (7). Hence $m_{12}$ is also a mass function, and the evidence combined by rule (6) does satisfy the basic property of a bpa.

$$
\begin{aligned}
\sum_{A \subseteq \Theta} m_{12}(A)
&= m_{12}(\emptyset) + \sum_{A \subseteq \Theta,\, A \neq \emptyset} m_{12}(A) \\
&= \sum_{A \subseteq \Theta,\, A \neq \emptyset} \frac{\sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \\
&= \frac{\sum_{A \subseteq \Theta,\, A \neq \emptyset} \sum_{B \cap C = A} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} \\
&= \frac{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}}{\sum_{B \cap C \neq \emptyset} [m_1(B)]^{w_1}\, [m_2(C)]^{w_2}} = 1
\end{aligned}
\tag{7}
$$

In the extended D-S theory, the weights can be obtained from training samples based on maximum entropy or minimum mean square error, or they can be empirical values obtained from several tests.
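The weighted rule (6) differs from Dempster’s rule only in that each mass is raised to its sensor’s weight before the products are summed. A minimal Python sketch follows; the masses and weights are hypothetical, and entries with zero mass are skipped so that they cannot contribute when a weight is 0 (an implementation choice made here, not something the paper specifies).

```python
def weighted_combine(m1, m2, w1=1.0, w2=1.0):
    """Combine two mass functions with the weighted rule (6)."""
    combined = {}
    norm = 0.0  # denominator of (6)
    for B, mass_b in m1.items():
        for C, mass_c in m2.items():
            A = B & C
            if not A or mass_b <= 0.0 or mass_c <= 0.0:
                continue  # skip conflicting pairs and non-focal (zero-mass) sets
            term = (mass_b ** w1) * (mass_c ** w2)
            combined[A] = combined.get(A, 0.0) + term
            norm += term
    if norm == 0.0:
        raise ValueError("the two sources are in total conflict")
    return {A: v / norm for A, v in combined.items()}


# Hypothetical evidence; the second sensor is trusted less (w2 < w1).
BOT, BENIGN = "bot", "benign"
FRAME = frozenset({BOT, BENIGN})
m1 = {frozenset({BOT}): 0.6, FRAME: 0.4}
m2 = {frozenset({BOT}): 0.7, frozenset({BENIGN}): 0.1, FRAME: 0.2}
m12_weighted = weighted_combine(m1, m2, w1=1.0, w2=0.5)
```

With w1 = w2 = 1 the function reduces to rule (5). Because every term is divided by the same sum over non-conflicting pairs, the result sums to 1 for any choice of weights, which is exactly what (7) proves.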

Weight Assignment

In addition to the situation of the sensors, our research indicates that weight assignment should take the stages of botnet activities into account. Features extracted from the communication and attack stages are more credible than those from the propagation and infection stages. Moreover, other factors, such as vulnerabilities, might also affect the weight assignment.

V. SCENARIO

Figure 3 shows a typical environment for botnet attacks. The environment contains the local area network and the backbone network, with three application servers (EMAIL, WWW, and DNS), one management server, three firewalls, an attacker, a honeynet, several zombies, etc. The attacker sends commands to the zombies through the command and control channel. According to the commands, the bots on the zombies carry out actions such as propagation, information theft, DDoS attacks, and spam. The thin one-way arrow in Figure 3 shows the process of command and control communication. In this environment, BotHunters are deployed for two local area networks; a spam filter is used on the EMAIL server; servers and hosts are equipped with terminal monitors and log analysis tools; a network traffic monitor collects flow and traffic information; and vulnerability scanning systems collect vulnerability, configuration, and port information for servers and hosts. All the sensors send their information through the collection agents to the management server for fusion analysis. The management server then gives the final results to the administrator. The thick one-way arrow shows this process.

Figure 3. Illustration of botnet detection.

To illustrate how the botnet detection architecture works, Figure 3 gives an example of sending spam. The attacker discovers the zombies online and commands the bots on them to send spam to victim hosts A and B. On the one hand, the BotHunters can detect the bots in the network by monitoring the traffic. On the other hand, the other sensors can capture every stage of the spam botnet's activity. The log analysis tools on the DNS server can discover suspicious hosts, because spam bots usually perform DNSBL lookups on the DNS server to determine whether they are blacklisted [37]. The terminal and traffic monitors can retrieve direct evidence of anomalies from the communications. In short, all the sensors send their suspicious information to the management server. The management server then carries out pre-processing, feature extraction, and fusion analysis to integrate and analyze the received information, so that the administrator can quickly obtain full evidence of the interactions between the attacker and the zombies. This fusion process can also identify the zombies [38] as well as the position of the attacker [39]. If the administrator only monitored the traffic or only checked email records to identify the spam activity, it might take more time and cause more false-positive alarms. In theory, the results from fusion analysis should be more accurate than those from the BotHunters alone.<br />
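To make the fusion step concrete, the following is a minimal sketch of the classic Dempster rule of combination from D-S theory [33], [34]. It is not the extended algorithm proposed in this paper. The two-hypothesis frame (BOT, BENIGN), the class name `DempsterFusion`, and the mass values used with it are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Classic Dempster rule of combination over a two-hypothesis frame (illustrative only). */
class DempsterFusion {
    // Focal elements are bitmasks over the assumed frame {BOT, BENIGN}.
    static final int BOT = 0b01;
    static final int BENIGN = 0b10;
    static final int UNKNOWN = BOT | BENIGN; // whole frame: the sensor cannot tell

    /** Combines two mass functions; throws if the two sources are in total conflict. */
    static Map<Integer, Double> combine(Map<Integer, Double> m1, Map<Integer, Double> m2) {
        Map<Integer, Double> fused = new HashMap<>();
        double conflict = 0.0;
        for (Map.Entry<Integer, Double> a : m1.entrySet()) {
            for (Map.Entry<Integer, Double> b : m2.entrySet()) {
                int intersection = a.getKey() & b.getKey();
                double product = a.getValue() * b.getValue();
                if (intersection == 0) {
                    conflict += product;          // mass assigned to the empty set
                } else {
                    fused.merge(intersection, product, Double::sum);
                }
            }
        }
        if (conflict >= 1.0) {
            throw new IllegalArgumentException("sources are in total conflict");
        }
        final double norm = 1.0 - conflict;       // renormalize the remaining mass
        fused.replaceAll((set, mass) -> mass / norm);
        return fused;
    }
}
```

For example, one sensor might assign mass 0.6 to BOT and 0.4 to UNKNOWN, and a second sensor 0.7 to BOT, 0.1 to BENIGN, and 0.2 to UNKNOWN; `combine` then returns a single mass function in which the agreeing evidence for BOT is reinforced.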


1660 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />

VI. CONCLUSIONS<br />

In this paper, we have introduced a botnet detection architecture based on heterogeneous multi-sensor information fusion. We described the functionality and features of each component in the architecture, highlighting the feature extraction module. In addition, we introduced an extended algorithm of D-S theory and gave its proof. Finally, we elaborated a representative case of detecting a spam botnet to demonstrate the feasibility of our architecture.<br />

In future work, we will implement the prototype, deploy it in a real network, and evaluate the accuracy of the fusion algorithm against existing detection methods. Our ultimate goal is to develop a practical botnet detection system that follows the architecture proposed in this paper, integrates multiple information fusion techniques, and eventually provides identification, evaluation, and prediction for botnets.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank Tao Li for his helpful comments for improving this paper. This material is based upon work supported in part by the National Natural Science Foundation of China under Grant No.61070200 and No.61003303, the National Science and Technology Support Program of China under Grant No.2008BAH37B03, the National High-Tech Research and Development Plan of China under Grant No.2009AA01Z432, and the National Grand Fundamental Research 973 Program of China under Grant No.2009CB320503.<br />

REFERENCES<br />

[1] K. Singh, A. Srivastava, J. Giffin, and W. Lee, “Evaluating Email’s Feasibility for Botnet Command and Control,” Proc. 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2008), USA, 2008, pp. 376-385.<br />
[2] K. Bohn, “Teen questioned in computer hacking probe,” CNN [Online], 2004, Available: http://www.cnn.com/2007/TECH/11/29/fbi.botnets/index.html.<br />
[3] J. Davis, “Hackers take down the most wired country in europe,” WIRED MAGAZINE: ISSUE 15.09 [Online], 2007, Available: http://www.wired.com/politics/security/magazine/15-09/ff_estonia.<br />
[4] T. Holz, M. Steiner, and F. Dahl, “Measurements and mitigation of peer-to-peer-based botnets: A case study on storm worm,” Proc. 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET’08), 2008.<br />
[5] H. Wang and Z. Gong, “Collaboration-based botnet detection architecture,” Proc. 2nd International Conference on Intelligent Computation Technology and Automation, Zhangjiajie, China, 2009.<br />
[6] Z. Zhu, G. Lu, and Y. Chen, “Botnet Research Survey,” Proc. 32nd Annual IEEE International Computer Software and Applications Conference, Finland, 2008.<br />
[7] J. Govil, “Examining the criminology of bot zoo,” Proc. 6th International Conference on Information, Communications and Signal Processing, Singapore, 2007.<br />

© 2011 ACADEMY PUBLISHER<br />

[8] J. Govil, “Criminology of botnets and their detection and defense methods,” Proc. 2007 IEEE International Conference on Electro/Information Technology (EIT’07), 2007.<br />
[9] D. Geer, “Malicious bots threaten network security,” IEEE Computer, vol. 38, no. 1, pp. 18-20, 2005.<br />
[10] M. Rajab, J. Zarfoss, and F. Monrose, “A multi-faceted approach to understanding the Botnet phenomenon,” Proc. ACM SIGCOMM/USENIX Internet Measurement Conference (IMC’06), Brazil, 2006.<br />
[11] G. Giorgio, R. Fabio, and S. Carlo, “Information fusion in computer security,” Information Fusion, vol. 10, no. 4, pp. 272-273, 2009.<br />
[12] J. Zhuge, X. Han, Y. Zhou, Z. Ye, and W. Zou, “Research and Development of Botnets,” Journal of Software, vol. 19, no. 3, pp. 702-715, 2008.<br />
[13] J. Zhuge, X. Han, Z. Ye, and W. Zou, “Discover and track botnets,” Proc. Chinese Symposium on Network and Information Security (NetSec), Beijing, 2005.<br />
[14] J. Cheng, J. Yin, Y. Liu, and J. Zhong, “Advances in the Honeypot and Honeynet Technologies,” Journal of Computer Research and Development, vol. 45, no. 1, pp. 375-378, 2008.<br />
[15] G. Gu, P. Porras, V. Yegneswaran, M. Fong, and W. Lee, “BotHunter: Detecting malware infection through IDS-driven dialog correlation,” Proc. 16th USENIX Security Symposium (Security’07), 2007.<br />
[16] J. R. Binkley and S. Singh, “An algorithm for anomaly-based Botnet detection,” Proc. USENIX SRUTI’06, 2006, pp. 43–48.<br />
[17] J. Lee, H. Jeong, J. Park, M. Kim, and B. Noh, “The activity analysis of malicious http-based botnets using degree of periodic repeatability,” Proc. 2008 International Conference on Security Technology, 2008, pp. 83-86.<br />
[18] H. Choi, H. Lee, and H. Lee, “Botnet detection by monitoring group activities in DNS traffic,” Proc. 7th IEEE International Conference on Computer and Information Technology, Aizu-Wakamatsu City, Japan, 2007.<br />
[19] S. Matthew and I. Igor, Detection of peer-to-peer botnets, Masters Thesis, University of Amsterdam, 2008.<br />
[20] F. Freiling, T. Holz, and G. Wicherski, “Botnet Tracking: Exploring a Root-cause Methodology to Prevent Denial of Service Attacks,” Proc. 10th European Symposium on Research in Computer Security (ESORICS’05), 2005.<br />
[21] Z. Duan, P. Chen, F. Sanchez, Y. Dong, M. Stephenson, and J. Barker, “Detecting Spam Zombies by Monitoring Outgoing Messages,” Proc. IEEE INFOCOM’09 Conference, Brazil, 2009.<br />
[22] E. Robert, C. Adele, and B. Pranab, “A Multi-Layered Approach to Botnet Detection,” Proc. 2008 International Conference on Security and Management (SAM’08), USA, 2008.<br />
[23] N. Paxton, G.J. Ahn, and B. Chu, “Towards practical framework for collecting and analyzing network-centric attacks,” Proc. IEEE International Conference on Information Reuse and Integration, USA, 2007.<br />
[24] Z. Zhang and Y. Kadobayashi, “A holistic perspective on understanding and breaking botnets: Challenges and countermeasures,” Journal of the National Institute of Information and Communications Technology, vol. 55, no. 2, pp. 43-59, 2008.<br />
[25] S. Alireza, F. Maryam, and A. Rodina, “Architecture for applying data mining and visualization on network flow for botnet traffic detection,” Proc. 2009 International Conference on Computer Technology and Development, Cairo, Egypt, 2009.<br />



[26] G. Gu, J. Zhang, and W. Lee, “BotSniffer: Detecting Botnet command and control channels in network traffic,” Proc. 15th Annual Network and Distributed System Security Symposium (NDSS’08), USA, 2008.<br />
[27] G. Gu, J. Zhang, and R. Perdisci, “Botminer: Clustering analysis of network traffic for protocol- and structure-independent Botnet detection,” Proc. 17th USENIX Security Symposium (Security’08), USA, 2008.<br />
[28] G. Gu, Correlation-based Botnet Detection in Enterprise Networks, PhD Thesis, Georgia Institute of Technology, USA, 2008.<br />
[29] B.V. Dasarathy, “A special issue on information fusion in computer security,” Information Fusion, vol. 10, no. 4, pp. 271, 2009.<br />
[30] Y. Niu, Q. Zheng, and H. Peng, “Network security management based on data fusion technology,” Proc. 7th International Conference on Computer-Aided Industrial Design and Conceptual Design, China, 2006.<br />
[31] B.V. Dasarathy, “Decision Fusion,” IEEE Computer Society Press, 1994.<br />
[32] H. Wang and Z. Gong, “Role-based collaborative information collection model for botnet detection,” Proc. 2010 International Symposium on Collaborative Technologies and Systems (CTS 2010), Chicago, USA, 2010.<br />
[33] A.P. Dempster, “Upper and lower probabilities induced by a multivalued mapping,” Ann. Math. Statist., 1967, pp. 325-339.<br />
[34] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton and London, 1976.<br />
[35] Q. Chen and U. Aickelin, “Dempster-Shafer for Anomaly Detection,” Proc. the International Conference on Data Mining (DMIN 2006), Las Vegas, USA, 2006, pp. 232-238.<br />
[36] L. Ma, L. Yang, and J. Wang, “Research on Security Information Fusion from Multiple Heterogeneous Sensors,” Journal of System Simulation, vol. 20, no. 4, pp. 181-187, 2008.<br />
[37] A. Ramachandran, N. Feamster, and D. Dagon, “Revealing Botnet membership using DNSBL counterintelligence,” Proc. USENIX SRUTI’06, 2006.<br />
[38] S. Kondo and N. Sato, “Botnet traffic detection techniques by C&C session classification using SVM,” Proc. 2nd International Workshop on Security, Japan, 2007.<br />
[39] M. Rajab, J. Zarfoss, and F. Monrose, “My botnet is bigger than yours (maybe, better than yours): Why size estimates remain challenging,” Proc. 1st Workshop on Hot Topics in Understanding Botnets (HotBots 2007), Boston, USA, 2007.<br />

HaiLong Wang was born in Jilin Province, China, in May 1981. He received the B.E. degree in electronic engineering from the Department of Electronic Engineering, Naval University of Engineering, Wuhan, China, in 2004. His research interests include network and information security, distributed computing, and computer network architecture.<br />

He is currently working toward the Ph.D. degree at the School of Computer, National University of Defense Technology, Changsha, China.<br />

Jie Hou was born in Hebei Province, China, in July 1983. She received the B.E. degree in communication engineering from the Department of Communication Engineering, Chinese People’s Armed Police Force Institute of Engineering, Xi’an, China, in 2005. Her research interests include next-generation computer network architecture and network and information security.<br />

She is currently working toward the Ph.D. degree at the School of Computer, National University of Defense Technology, Changsha, China.<br />

ZhengHu Gong was born in Hunan Province, China, in August 1945. He received the B.E. degree in electronic engineering from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1970. His research interests include computer networks and communication, network security, and computer network architecture.<br />

He is currently a Professor with the School of Computer, National University of Defense Technology, Changsha, China.<br />



XOEM plus OWL-based STEP Product Information Uniform Description and Implementation<br />

Chengfeng Jian<br />
Zhejiang University of Technology, Hangzhou, 310023, China<br />
Email: jiancf@zjut.edu.cn<br />

Haizhong Meng<br />
Zhejiang University of Technology, Hangzhou, 310023, China<br />
Email: mhz_1986@126.com<br />

Abstract—To address the current inconsistencies in OWL-based STEP descriptions, mapping rules between EXPRESS and OWL are established on the basis of a uniform semantic model named XOEM+OWL. An implementation method for a STEP-OWL converter is then put forward, and corresponding examples are shown.<br />

Index Terms—OWL, STEP, XOEM, EXPRESS<br />

I. INTRODUCTION<br />

With the development of the semantic web and the semantic grid, knowledge sharing and exchange of product information over the Internet has become a main research focus. Many studies currently realize the semantic description of STEP [1] by means of semantic web languages such as RDF, DAML, and OWL [2], among others [3-5]. In summary, these studies follow the same idea as the XML data representation of STEP: they try to use RDF or OWL in place of the EXPRESS description. The limitation of this approach is that, unlike data format conversion, OWL allows many ways of expressing semantics. For the same kind of product information, OWL can describe the internal semantics in many different ways, and even within the same OWL language it is difficult to standardize a semantically consistent understanding. It is therefore difficult to achieve a unified semantic description of product information through OWL description alone, which is why these studies currently find it hard to progress further. Overall, although STEP has been expressed in OWL, this has been done mainly by matching EXPRESS and OWL syntax. The resulting mappings lack consistency between ontology definitions and descriptions, and lack a unified model and appropriate constraint definitions.<br />

project number: 60603087<br />


doi:10.4304/jnw.6.12.1662-1667<br />

In this paper, for OWL-based STEP semantic description, the mapping rules between EXPRESS and OWL are established on the basis of a uniform semantic model named XOEM+OWL. The implementation method of a STEP-OWL converter is then put forward, and corresponding examples are shown.<br />

II. XOEM+OWL-BASED SCHEMA MAPPING<br />

XOEM [6] is the data model of the XML-based STEP representation. A direct mapping between XOEM and OWL is difficult because OWL belongs to the semantic layer while XOEM belongs to the data layer. XOEM is strong at describing data objects but weak at reasoning about constraints. It is therefore necessary to build a model that realizes the conversion from XOEM and introduces the OWL pattern graph. This model is called XOEM+OWL [7]. Table I compares the corresponding concepts from the object-oriented (OO) point of view.<br />

The XOEM+OWL model is based on the XOEM model. With reference to XOEM, we obtain the following definitions:<br />

Object := Atomic Object | Complex Object<br />
Atomic Object := (oid, label_name, attribute_type, attribute_value)<br />
Complex Object := (oid, label_name, Reference)<br />
Reference := (link, oid, label_name)<br />
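The object definitions above can be read as a small type hierarchy. The following is a minimal sketch under that reading; the Java type names, the use of `String` for every field, and the single `Reference` held by a complex object are assumptions taken directly from the tuple notation, not from the authors' implementation.

```java
/** Object := Atomic Object | Complex Object */
interface XoemObject {
    String oid();
    String labelName();
}

/** Reference := (link, oid, label_name) */
final class Reference {
    final String link;
    final String oid;
    final String labelName;

    Reference(String link, String oid, String labelName) {
        this.link = link;
        this.oid = oid;
        this.labelName = labelName;
    }
}

/** Atomic Object := (oid, label_name, attribute_type, attribute_value) */
final class AtomicObject implements XoemObject {
    private final String oid;
    private final String labelName;
    final String attributeType;
    final String attributeValue;

    AtomicObject(String oid, String labelName, String attributeType, String attributeValue) {
        this.oid = oid;
        this.labelName = labelName;
        this.attributeType = attributeType;
        this.attributeValue = attributeValue;
    }

    public String oid() { return oid; }
    public String labelName() { return labelName; }
}

/** Complex Object := (oid, label_name, Reference) */
final class ComplexObject implements XoemObject {
    private final String oid;
    private final String labelName;
    final Reference reference; // points at another object by oid and label

    ComplexObject(String oid, String labelName, Reference reference) {
        this.oid = oid;
        this.labelName = labelName;
        this.reference = reference;
    }

    public String oid() { return oid; }
    public String labelName() { return labelName; }
}
```

A complex object thus never embeds its target; it carries only a labelled link to the target's oid, which is what allows the structure to be treated as a directed graph.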



TABLE I.<br />
EXPRESS-XML-OWL<br />
OO | EXPRESS | XML | OWL<br />
Object | Entity | Element type | Class<br />
Object instance | Entity instance | Element | Individual<br />
Object property | Entity attribute | Element | ObjectProperty/DatatypeProperty<br />
Method | ENTITY Function | Element | Class<br />
Declaration | Entity Schema | DTD | Ontologies<br />
Relationship | Complex Constraint express | Hierarchy | Complex Constraint express<br />

Definition 1.<br />
Given a directed graph $G = (V, E)$.<br />
Assumption: $v_0, v_1, \ldots, v_i, \ldots, v_n \in V$ and $e_1, e_2, \ldots, e_n \in E$.<br />
There exists $r$: $d > 0$, $r \in V$, $v_i \in (V - \{r\})$, $i = 0, 1, \ldots, n$.<br />

Definition 2.<br />
Given a directed graph $G(V, E, r)$.<br />
There exists $G(V^i, E^i)$, $v_i \in V$, with<br />
$V^i = \{\, v_j \mid v_j \in V \wedge v_k, v_j \,\}$;<br />
$E^i = \{\, \langle v_j, v_k \rangle \mid v_j \in V^i \wedge v_k \in V^i \wedge \langle v_j, v_k \rangle \in E \,\}$.<br />

Rule 1.<br />
For an XOEM+OWL object, the Node of the directed graph represents the Object. It is mapped to the Class of OWL.<br />

Rule 2.<br />
For an XOEM+OWL object’s property, the Edge of the directed graph represents the Property. It corresponds to a property of the Class, or to the “hasSubClass” relation among the classes, in OWL.<br />

Convention 1: If the relevant concepts or data types of EXPRESS can be directly expressed in OWL, they are expressed with the OWL keyword by preference, to ensure accuracy under reasoning tools.<br />

Convention 2: If the relevant concepts or data types of EXPRESS cannot be directly expressed in OWL, but can be expressed for the same purpose by combining relevant OWL concepts while keeping the semantics accurate, the combination approach is preferred.<br />

Convention 3: If the above rules cannot be applied, or are difficult to apply, the original (literal) translation is used.<br />

Following the above description, Table II shows the corresponding schema graph relations.<br />

TABLE II.<br />
XOEM+OWL-BASED SCHEMA GRAPH DESCRIPTION<br />
OWL | Schema Graph<br />
Class | Node<br />
Property with basic datatypes as range (Attribute) | Node with edge joining it to the class with name “hasProperty”<br />
Property with other class as range (Attribute) | Edge between the two class nodes<br />
Individual | Node with edge joining it to the class with name “hasInstance”<br />
Class – subclass relationship | Edge between class node to subclass node with name “hasSubClass”<br />

III. MAPPING RELATIONSHIP BETWEEN OWL AND EXPRESS EXPRESSION<br />

A. SCHEMA definition<br />
A SCHEMA is defined as a collection of STEP ENTITY and type definitions, which can refer to each other for the purpose of type reuse. The definition of a SCHEMA corresponds to the Ontology in OWL, as part #1 of Figure 1 shows.<br />

B. Basic data type definition<br />
The EBNF expression of the basic data types is shown in Figure 2. OWL uses the embedded XML Schema data types, so the mapping is as follows:<br />
Simple data types are mapped directly to the xsd data types of XML Schema.<br />
Constructed data types are mapped to owl:oneOf.<br />
Aggregate data types are mapped to an owl:Class aggregate with the attributes (lowerboundary, upperboundary, repetitiveness, if ordered, storage type).<br />

TypeDecl::=’’<br />
TYPE_HEAD::=TYPE_ID+ TYPE_ID::MarkupDecl*<br />
TYPE_BODY::=TYPE_DECLARATION+SMarkupDecl*<br />
TYPE_DECLARATION::=’’<br />
BASE_TYPE::=SimpleTypes|ConstuctedTypes|AggregationTypes|TypeRef<br />
WHERE CLAUSE::=’WHERE’|RuleDecl<br />

Figure 2. The EBNF expression of basic data types in EXPRESS<br />




SCHEMA config_control_design; #1<br />

Entity action; #2<br />

name:label; #3<br />

description: OPTIONAL text; #4<br />

…<br />

DERIVE<br />

scl: REAL:=NVL(scale, 1); #5<br />

;<br />

[OWL fragments #1 to #5 corresponding to the numbered EXPRESS lines above; the XML markup was lost in extraction]<br />

Figure 1. Mapping example between EXPRESS and OWL<br />

C. Entity definition<br />

ENTITY is an important concept in EXPRESS, so the mapping of entities is the most important part. The definition of an entity in EXPRESS is shown in Figure 3 (the EBNF expression of an entity in EXPRESS). The concept of a class in OWL is equivalent to that of an entity, so in this paper we map each entity to a class. However, in OWL the attributes and the classes are defined separately, whereas in EXPRESS they are defined together. To resolve property name conflicts, we prefix each attribute name with its entity name.<br />

① Entity name<br />
The entity name is mapped to an OWL class; see Figure 1, #2.<br />

② Entity inheritance<br />
We adopt rdfs:subClassOf to represent “SUPERTYPE OF” and “SUBTYPE OF”.<br />

③ Simple attribute<br />
We call an attribute a simple attribute if its type is a simple data type or another entity. If the type is a simple data type, the attribute is mapped to owl:DatatypeProperty; otherwise it is mapped to owl:ObjectProperty, as in Figure 1, #3.<br />

④ Aggregate attribute<br />
For an aggregate attribute, first define the attribute’s class as a subclass of the aggregate class introduced for aggregate data types in Section III.B, then set the attribute’s lower boundary, upper boundary, repetitiveness, order, and storage type.<br />

⑤ OPTIONAL attribute<br />
An optional attribute is mapped to owl:DatatypeProperty or owl:ObjectProperty and is additionally given a cardinality constraint (maxCardinality). It is shown in Figure 1, #4.<br />

⑥ DERIVE attribute<br />
A DERIVE attribute is mapped to owl:DatatypeProperty or owl:ObjectProperty and is additionally given a value constraint (allValuesFrom). It is shown in Figure 1, #5.<br />
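The entity mapping can be sketched as a few string-building helpers that emit OWL in RDF/XML syntax. This is an illustration under stated assumptions rather than the converter's actual output: the class name `EntityOwlSketch`, the "_" separator between entity name and attribute name, and the exact shape of the emitted fragments are all hypothetical, since the OWL listing of Figure 1 is not available.

```java
/** Illustrative only: emits OWL (RDF/XML) fragments for one EXPRESS entity attribute. */
class EntityOwlSketch {
    private static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    /** Prefixes the attribute with its entity name; the "_" separator is an assumption. */
    static String propertyId(String entity, String attribute) {
        return entity + "_" + attribute;
    }

    /** OWL class declaration for an entity name. */
    static String owlClass(String entity) {
        return "<owl:Class rdf:ID=\"" + entity + "\"/>";
    }

    /** Datatype property for a simple attribute whose type maps to an xsd data type. */
    static String datatypeProperty(String entity, String attribute, String xsdType) {
        String id = propertyId(entity, attribute);
        return "<owl:DatatypeProperty rdf:ID=\"" + id + "\">\n"
             + "  <rdfs:domain rdf:resource=\"#" + entity + "\"/>\n"
             + "  <rdfs:range rdf:resource=\"" + XSD + xsdType + "\"/>\n"
             + "</owl:DatatypeProperty>";
    }

    /** maxCardinality restriction, one possible way to express an OPTIONAL attribute. */
    static String optionalRestriction(String entity, String attribute) {
        return "<owl:Restriction>\n"
             + "  <owl:onProperty rdf:resource=\"#" + propertyId(entity, attribute) + "\"/>\n"
             + "  <owl:maxCardinality rdf:datatype=\"" + XSD + "nonNegativeInteger\">1</owl:maxCardinality>\n"
             + "</owl:Restriction>";
    }
}
```

Because the property identifier carries the entity name, two entities that both declare an attribute called `name` produce two distinct OWL properties, which is the conflict the prefixing rule is meant to avoid.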

EntityDecl∷= ’’<br />
ENTITY_HEAD∷=ENTITY_ID S INHERITANCE?<br />
ENTITY_BODY∷=ENTITY_DECLARATION+ S MarkupDecl*<br />
ENTITY_DECLARATION∷=ENTITY_AttrDecl* S ENTITY_ClauseDecl?<br />
ENTITY_ClauseDecl∷=INVERSE_ClAUSE | UNIQUE_ClAUSE | WHERE_ClAUSE<br />
WHERE_ClAUSE∷=’WHERE’ | RuleDecl<br />
UNIQUE_ClAUSE∷=’UNIQUE’ S Unique_Rule+<br />
ENTITY_AttrDecl∷=Explicit_AttrDecl | Derive_AttrDecl | Inverse_AttrDecl<br />

Figure 3. The EBNF expression of an entity in EXPRESS<br />

D. Function and rule definition<br />

Functions and rules contain a wealth of mathematical operations and constraint mechanisms on objects, but OWL is limited in expressing them, so we adopt a literal translation with SWRL according to Convention 3.<br />

EXPRESS contains many other concepts in addition to the above; their mapping methods are similar.<br />

Ⅳ. DESIGN AND IMPLEMENTATION OF STEP-OWL CONVERTER<br />

A. Conversion of EXPRESS-OWL<br />

The method for mapping an EXPRESS file to OWL has been described in detail in Section III, so the most important task in implementing the EXPRESS-OWL file conversion is lexical analysis. We adopt a two-step conversion consisting of a pre-converter and a post-converter.<br />

① Pre-converter<br />
The pre-converter parses the EXPRESS file into Java classes (Figure 4) according to an established EXPRESS keyword vocabulary file (Figure 5).<br />



ENTITY<br />

entityName : String<br />

superEntity : List<br />

subEntity : List<br />

SimpleAttribute : Map<br />

deriveAttribute : Map<br />

condition : Map<br />

Figure4. ENTITY class diagram<br />

public interface Vocabulary<br />
{<br />
// EXPRESS reserved words, declared as string constants<br />
public final static String ABSTRACT = "ABSTRACT";<br />
public final static String AGGREGATE = "AGGREGATE";<br />
public final static String ALIAS = "ALIAS";<br />
public final static String AS = "AS";<br />
public final static String BAG = "BAG";<br />
public final static String BEGIN = "BEGIN";<br />
public final static String BINARY = "BINARY";<br />
public final static String BOOLEAN = "BOOLEAN";<br />
public final static String CASE = "CASE";<br />
// …<br />
}<br />

Figure 5. Part of the EXPRESS keyword vocabulary<br />

The conversion method is roughly the same for all constructs, so we take entity conversion as an example. From the EBNF description of entities and the characteristics of the EXPRESS entity definition in Figure 3, we find that the keyword vocabulary of an EXPRESS entity definition is ENTITY, END_ENTITY, DERIVE, INVERSE, WHERE, SUPERTYPE OF, and SUBTYPE OF. The pre-converter’s process is shown in Figure 6.<br />

[Flowchart: init → read a line of the EXPRESS file → has keyword ENTITY? (no: read the next line) → yes: save the information and continue reading lines → has keyword END_ENTITY? (no: continue reading) → yes: split the string on ';' and write it to the Java class → end of document? (no: read the next line; yes: end)]<br />

Figure6. Pre-converter’s physical process<br />
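The loop of Figure 6 can be sketched as follows. This is a simplified illustration, not the authors' pre-converter: the class and method names are hypothetical, the sketch returns the ';'-separated statements of each entity block instead of filling the ENTITY class of Figure 4, and it assumes that ENTITY and END_ENTITY each start their own line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative pre-converter loop: collects the statements of each ENTITY block. */
class PreConverterSketch {
    /** Returns, for every ENTITY ... END_ENTITY block, its ';'-separated statements. */
    static List<List<String>> extractEntities(List<String> lines) {
        List<List<String>> entities = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an entity block
        for (String raw : lines) {
            String line = raw.trim();
            String upper = line.toUpperCase(Locale.ROOT); // EXPRESS keywords are case-insensitive
            if (current == null) {
                if (upper.startsWith("ENTITY ")) {
                    current = new StringBuilder(line); // keyword found: start saving
                }
            } else if (upper.startsWith("END_ENTITY")) {
                List<String> statements = new ArrayList<>();
                for (String part : current.toString().split(";")) {
                    String statement = part.trim();
                    if (!statement.isEmpty()) {
                        statements.add(statement);
                    }
                }
                entities.add(statements);
                current = null; // block closed: look for the next ENTITY
            } else {
                current.append(' ').append(line); // keep saving lines of the block
            }
        }
        return entities;
    }
}
```

A later pass would then inspect each statement for DERIVE, INVERSE, WHERE, SUPERTYPE OF, and SUBTYPE OF to decide which field of the ENTITY class it belongs to.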

② Post-converter<br />
The post-converter is a document writer that builds on the work of the pre-converter and generates OWL documents according to the mapping method in Section III.<br />


Figure 7 shows part of the STEP-OWL converter’s result for STEP AP203, displayed in Protégé.<br />

Figure 7. AP203 converted entity relationship results in Protégé<br />

B. STEP Part21 file conversion<br />

A STEP Part21 [8] [9] file can be divided into two parts: HEADER and DATA. The HEADER section describes the file name and the application protocol that the file references. The DATA section is composed of a number of data instances, each consisting of an ID, "=", and a function statement. Although every instance has this same structure, the statements describe a variety of functions, so the focus of the conversion is how to design for the diversity of data instances.<br />

① Lexical analysis<br />
The STEP file is read from left to right: the character stream is scanned and words are identified according to the word formation rules. This step divides the words into data instances ("#" plus a number), variable values (integer, string, data value), and reserved words (the special characters and other reserved symbols of the Part21 physical file).<br />

② Syntax analysis<br />
Building on the lexical analysis, syntax analysis combines the word sequence into grammatical phrases such as "program", "statement", and "expression". It checks whether the STEP file is structurally correct and analyzes the expression phrases hierarchically.<br />

③ Semantic analysis<br />
Semantic analysis translates the syntax mapping on the basis of the lexical and syntax analysis. Using the keywords generated by syntax analysis, we look them up in the STEP Application Protocol library and insert them into the file at the appropriate location according to the conversion rules. Take the data instance #5 = AXIS2_PLACEMENT_3D ('NONE', #6, #7, #8); as an example.<br />

Step 1. Divide the data instance into #5, =, AXIS2_PLACEMENT_3D, (, 'NONE', #6, #7, #8).<br />



Step 2. Decompose the words from Step 1 hierarchically, which gives #5 -> AXIS2_PLACEMENT_3D -> 'NONE', #6, #7, #8.<br />

Step 3. The semantic processor looks up AXIS2_PLACEMENT_3D in the AP203 definition and maps the values onto it. For example, 'NONE' is the value of the property ‘name’, which is inherited from representation_item; #6, #7, and #8 are the values of ‘location’ (inherited from placement), ‘axis’, and ‘ref_direction’.<br />
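Steps 1 and 2 can be illustrated with a small parser for a single, simple data instance. The sketch below is an assumption-laden simplification: the class name is hypothetical, one regular expression stands in for separate lexical and syntax phases, and the naive comma split does not handle nested parentheses or commas inside quoted strings, both of which occur in real Part21 files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for one simple Part21 data instance. */
class Part21InstanceSketch {
    final String id;              // e.g. "#5"
    final String entityType;      // e.g. "AXIS2_PLACEMENT_3D"
    final List<String> arguments; // raw argument tokens, in order

    // "#<digits> = <KEYWORD> ( <arguments> ) ;"
    private static final Pattern INSTANCE =
        Pattern.compile("^(#\\d+)\\s*=\\s*([A-Z0-9_]+)\\s*\\((.*)\\)\\s*;$");

    private Part21InstanceSketch(String id, String entityType, List<String> arguments) {
        this.id = id;
        this.entityType = entityType;
        this.arguments = arguments;
    }

    /** Splits a line into id, entity keyword, and argument tokens. */
    static Part21InstanceSketch parse(String line) {
        Matcher m = INSTANCE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a simple data instance: " + line);
        }
        List<String> args = new ArrayList<>();
        for (String token : m.group(3).split(",")) {
            args.add(token.trim()); // naive split: no nested lists or quoted commas
        }
        return new Part21InstanceSketch(m.group(1), m.group(2), args);
    }
}
```

Step 3 would then pair the argument tokens, in order, with the attribute names that the application protocol defines for the entity type, here ‘name’, ‘location’, ‘axis’, and ‘ref_direction’.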

Through the above three steps, the STEP-OWL converter converts the data instance #5 = AXIS2_PLACEMENT_3D ('NONE', #6, #7, #8); into an OWL file, as shown in Figure 8.<br />

<br />

<br />

[OWL listing for data instance #5, carrying the name value NONE; the XML markup was lost in extraction]<br />

Figure 8. Data instance #5<br />

C. Example<br />

Examples of the STEP-OWL converter’s results for STEP Part21 files are shown in Figure 9 and Figure 10.<br />


Figure9. AP203 in OWL format<br />

Figure10. STEP Part21 in OWL format<br />

V. CONCLUSION<br />

The XOEM+OWL-based STEP product information description realizes semantic description while maintaining semantic consistency and effectiveness. Some issues still need further study, such as semantic consistency for the Entity Function and Procedure constructs with SWRL.<br />

ACKNOWLEDGMENT<br />

Supported by the National Natural Science Foundation of China (No. 60603087) and the Project of the Science and Technology Department of Zhejiang Province (No. 2009C320076).<br />

REFERENCES<br />

[1] ISO 10303-28, Industrial automation systems and integration, Product data representation and exchange, Part 28: Implementation methods: XML representations of EXPRESS schemas and data.<br />
[2] OWL Web Ontology Language, http://www.w3.org/TR/owl-features/<br />
[3] Pan, Wen-Lin, “A formal EXPRESS-to-OWL mapping algorithm,” Key Engineering Materials, vol. 419, pp. 689-692, 2010.<br />
[4] Zhao, W., Liu, J.K, “OWL/SWRL representation methodology for EXPRESS-driven product information model. Part I. Implementation methodology,” Computers in Industry, vol. 59, pp. 580-589, August, 2008.<br />
[5] Ricardo Jardim-Goncalves, Nicolas Figay, Adolfo Steiger-Garcao, “Enabling interoperability of STEP Application Protocols at meta-data and knowledge level,” International Journal of Technology Management, vol. 36, pp. 402-421, April, 2006.<br />
[6] Jian Cheng-Feng, Tan Jian-Rong, “Description and Identification of STEP Product Data with XML,” Journal of Computer-Aided Design and Computer Graphics, vol. 13, pp. 983-990, November, 2001.<br />


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1667<br />

[7] Jian Cheng-Feng, Zhang Mei-yu, “A Uniform Product Knowledge Representation Semantic Model,” 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong, pp. 953-956, December 2006.<br />

[8] ISO 10303-11, Industrial automation systems and integration, Product data representation and exchange, Part 11: Description methods: The EXPRESS language reference manual.<br />

[9] ISO 10303-21, Industrial automation systems and integration, Product data representation and exchange, Part 21: Clear text encoding of the exchange structure.<br />


Chengfeng Jian, Zhejiang Province, China. Born in June 1973. He received his Ph.D. from Zhejiang University. His research interests include CAD/PDM and the Semantic Web/Semantic Grid.<br />

He is an associate professor in the Department of Computer Science and Technology, Zhejiang University of Technology.<br />

Haizhong Meng, Zhejiang Province, China. Born in July 1986. He received his B.A. from Zhejiang Sci-Tech University. His research interests include the Semantic Web and STEP.<br />

He is currently a postgraduate student in the Department of Computer Science and Technology, Zhejiang University of Technology.<br />



Design of Greenhouse Control System Based on Wireless Sensor Networks and AVR Microcontroller<br />

Yongxian Song<br />

The Institute of Electronic Engineering, Huaihai Institute of Technology, Lianyungang, 222005, China<br />

Email: soyox@126.com<br />

Chenglong Gong, Yuan Feng, Juanli Ma and Xianjin Zhang<br />

The Institute of Electronic Engineering, Huaihai Institute of Technology, Lianyungang, 222005, China<br />

Email: soyox@163.com<br />

Abstract—To accurately determine the growth of greenhouse crops, a system based on an AVR single-chip microcontroller and wireless sensor networks is developed. It transfers data through wireless transceiver devices without any electric wiring, so the system structure is simple. The monitoring and management center can control the temperature and humidity of the greenhouse, measure the carbon dioxide content, and collect information such as the intensity of illumination. In addition, the system adopts multilevel energy storage. It combines energy management with energy transfer so that the energy collected by the solar cells is used reasonably, which establishes a self-managing energy supply system. The system has the advantages of low power consumption, low cost, good robustness, and flexible extensibility. It provides an effective tool for monitoring, analysis, and decision-making for the greenhouse environment.<br />

Index Terms—wireless sensor networks, AVR, greenhouse<br />

I. INTRODUCTION<br />

A greenhouse is a facility that can change the plant growth environment, create the best conditions for plant growth, and shield plants from the influence of changing seasons and severe weather [4-5]. To increase crop yield, improve quality, regulate the growth period, and improve economic efficiency, a greenhouse measurement and control system obtains the optimum conditions for crop growth by making full use of natural resources and adjusting environmental factors such as temperature, humidity, light, and CO2 concentration. Such a system is complex: it requires automatic monitoring of the various greenhouse parameters, information processing, real-time control, and online optimization. Greenhouse measurement and control systems have made considerable progress in developed countries and have reached the level of multi-factor comprehensive control, but importing existing foreign systems is very expensive and their maintenance is inconvenient. In recent years, many studies on greenhouse structure and control have been launched in China and have produced many achievements, but most greenhouse measurement and control systems are still cable-based, which makes the wiring complex and hinders improvement of system efficiency. With the rapid development of low-cost, low-power sensors and wireless communication technology, the conditions for constructing a wireless greenhouse measurement and control system have matured, and such a system is important for agricultural modernization [1-3]. To meet the need for fast and accurate acquisition of greenhouse environment information, this paper further studies the collection, processing, and transmission of that information and develops a greenhouse measurement and control system based on an AVR microcontroller and wireless sensor networks. The system has high practical value for bringing information technology and automation to large-scale greenhouse monitoring and for improving work efficiency.<br />

Manuscript received March 5, 2011; revised March 25, 2011; accepted April 10, 2011.<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1668-1674<br />

II. THE GENERAL STRUCTURE OF THE SYSTEM<br />

The greenhouse measurement and control system is composed of the monitoring center, sensor nodes, and control equipment. Sensor nodes are deployed throughout the greenhouse; they are responsible for periodically acquiring greenhouse environment information and sending it to the control center. The control center analyzes the received data, makes the relevant decisions, and sends control messages to the greenhouse control equipment, which regulates the environmental parameters to obtain the best growth environment for the crops. A modern greenhouse is very large, so the system adopts a hierarchical structure. Assuming that the greenhouse is a rectangular area, the overall structure of the measurement system is shown in Fig. 1.<br />



Figure 1. The system structure of greenhouse WSN measurement and control<br />

In Fig. 1, the rectangular greenhouse is divided into several measurement and control areas of equal size. Each area is managed by a base station and is divided into many non-overlapping virtual grids of the same size. The sensor nodes deployed in a virtual grid form a cluster, and each cluster includes a cluster head (sink node) and several cluster member nodes. The cluster head is chosen from the member nodes by a cluster head election algorithm. The cluster member nodes comprise sensor nodes, which collect environmental data, and control nodes, which drive actuators to adjust environmental parameters. A control node does not participate in cluster head election; it obtains the commands sent by the monitoring center from the cluster head node and executes the corresponding control operation. The cluster head node, sensor nodes, and control nodes form a star network that mainly carries out data acquisition and control of the greenhouse environment. Collected data is transmitted directly from the sensor nodes to the cluster head, the cluster heads forward the data to the base station over multiple hops, and finally the base station packages the data of each cluster head node and transfers it to the monitoring center. The base station is the relay between the monitoring center and the greenhouse WSN nodes; network control is realized by managing all the nodes of a single greenhouse measurement and control area. The monitoring center is both the master console of the multiple greenhouse networks and the data center of the measurement and control system, and it takes charge of the control and management of the entire system.<br />
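The virtual-grid clustering described above can be illustrated with a short Python sketch. It assumes that every node knows its planar coordinates, that grid cells are squares of side `d`, and that the cluster head is simply the member with the most residual energy. The election rule, the function names, and the sample values are illustrative assumptions; the paper does not specify its election algorithm.<br />

```python
from collections import defaultdict


def grid_cell(x, y, d):
    """Return the (row, col) index of the virtual grid cell containing (x, y)."""
    return int(y // d), int(x // d)


def form_clusters(nodes, d):
    """Group node ids by grid cell; every non-empty cell is one cluster."""
    clusters = defaultdict(list)
    for node_id, (x, y) in nodes.items():
        clusters[grid_cell(x, y, d)].append(node_id)
    return dict(clusters)


def elect_head(members, energy):
    """Hypothetical election rule: the member with the most residual energy."""
    return max(members, key=lambda node_id: energy[node_id])


# Hypothetical node positions (metres) and residual energy levels.
nodes = {"n1": (3.0, 4.0), "n2": (10.0, 12.0), "n3": (30.0, 4.0)}
energy = {"n1": 0.4, "n2": 0.9, "n3": 0.7}

clusters = form_clusters(nodes, d=25.0)
heads = {cell: elect_head(members, energy) for cell, members in clusters.items()}
```

With these sample values, `n1` and `n2` fall into the same cell and `n2` becomes its cluster head, while `n3` forms a single-node cluster in the neighboring cell.<br />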

III. GREENHOUSE WIRELESS SENSOR NETWORK NODE<br />

DESIGN<br />

The greenhouse wireless sensor network measurement and control system consists of two types of nodes: sensor nodes and sink nodes. A sensor node is composed of a CPU module, a wireless communication module, a power supply module, a sensor module, and a position switch that sets the node's physical location information. A sink node contains four modules: a CPU module, a wireless communication module, a continuous power supply module, and a serial communication module.<br />

A. Sensor node module design<br />

A sensor node is composed of a CPU module, a wireless communication module, a sensor module, a position switch, and an energy supply module. Its structure is shown in Fig. 2. The sensor module is responsible for information collection and data transfer in the monitored area; depending on the application requirements, it can include a temperature sensor, a humidity sensor, a light sensor, a carbon dioxide concentration sensor, and so on. The processor module controls the operation of the sensor node and stores and processes both the data collected by the node itself and the data forwarded by other nodes. The wireless communication module handles wireless communication with other nodes, exchanging control information and sending and receiving the collected data. The position setting switch sets the specific physical location of a sensor node in the greenhouse. The energy supply module provides the energy that the sensor node needs to work; in this paper, a solar self-supply module powers the node.<br />

Figure 2. Sensor node structure chart<br />

Figure 3. Sink node structure chart<br />

B. Sink node module design<br />

The sink node mainly gathers and fuses the data of the sensor nodes within the communication network and performs protocol conversion between the uplink and downlink communication. It issues the monitoring tasks of the management node and forwards the collected data to the external network through a serial port. A sink node can be either an enhanced sensor node or a special gateway device that has no monitoring function and only a wireless communication interface. Its structure is shown in Fig. 3. It is composed of a power supply module, a storage module, a processor module, a node communication module, a serial interface communication module, and so on. Because the sink node must process the data of many sensor nodes, it works for long periods and sleeps only briefly, so a battery alone cannot meet its energy consumption. The sink node therefore also adopts the solar self-supply module as its power supply in this paper.<br />

C. Power supply module<br />

To solve the energy supply problem of the sensor nodes, we adopt a solar energy supply system, whose structure is shown in Fig. 4. As Fig. 4 shows, the power supply module comprises an energy collector, an energy storage, a backup energy storage, and a power management and control section. The energy collector consists of solar panels and converts solar energy into electrical energy. The energy storage is the primary storage level; it consists of supercapacitors, stores the energy collected by the solar cells, and supplies energy to the wireless sensor network nodes. The backup energy storage consists of a lithium battery and supplies the system in an emergency. The power management and control section monitors the state of the energy storage units, selects the power source according to their energy state, and uses the solar cells to replenish the stored energy.<br />
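The source-selection role of the power management and control section can be sketched in a few lines of Python. This is only an illustration of the priority order described above (supercapacitor first, lithium battery as backup, solar panel replenishing the stores). The voltage thresholds, the function name, and the return values are hypothetical and do not come from the paper.<br />

```python
def select_source(supercap_v, battery_v, solar_available,
                  supercap_min=2.0, battery_min=3.3):
    """Choose the supply for the node and report whether the stores are recharging.

    supercap_min and battery_min are hypothetical cut-off voltages.
    """
    if supercap_v >= supercap_min:
        source = "supercapacitor"    # primary energy storage
    elif battery_v >= battery_min:
        source = "lithium_battery"   # backup storage, emergency use only
    else:
        source = "none"              # neither store can power the node
    charging = bool(solar_available)  # the solar panel replenishes the stores
    return source, charging
```

For example, `select_source(1.0, 3.7, False)` falls back to the lithium battery because the supercapacitor voltage is below the assumed cut-off.<br />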

Figure 4. Solar self-supply module structure<br />

IV. THE DESIGN OF MONITORING CENTER<br />

The monitoring center controls the operation of the whole network through the base stations of all measurement and control areas. Its main tasks include sending control commands to the network, collecting and processing the monitoring data of each node in the network, storing the data in a database, and querying and analyzing historical data. The monitoring center is mainly composed of a PC and a wireless communication module. The hardware structure is shown in Fig. 5.<br />

In Fig. 5, the PC serves as the host computer and a CC2430 serves as the wireless communication module; the two communicate through a serial port. The main functions of the monitoring center are described below.<br />

1. Network management and control, such as starting or stopping network operation and configuring network parameters. Network parameters include the data acquisition frequency of the sensor nodes, the frequency of submitting data to the base station, the length of each task time slot, the routing probability vector, and so on. The monitoring center can also query the operation state and environmental data and send control commands to the control nodes.<br />

2. Data storage. The monitoring center needs to preserve historical monitoring data for queries; this function is realized through the database.<br />

3. Data analysis and decision support. The monitoring data is analyzed by an agricultural expert system to establish the most suitable greenhouse environment control strategy.<br />

The base station of each measurement and control area not only controls all nodes of its area but also serves as the communication hub between the monitoring center and the area, mainly providing data forwarding and data buffering.<br />
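The data storage function listed above can be sketched as follows. SQLite (through Python's standard `sqlite3` module) is used here purely as a stand-in database so that the example is self-contained. The table layout, column names, and helper functions are assumptions made for illustration.<br />

```python
import sqlite3

# In-memory stand-in database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "node_id TEXT, ts INTEGER, temperature REAL, humidity REAL)"
)


def store(node_id, ts, temperature, humidity):
    """Persist one monitoring record received from the network."""
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        (node_id, ts, temperature, humidity),
    )
    conn.commit()


def history(node_id, ts_from, ts_to):
    """Query the historical records of one node in a time window."""
    cursor = conn.execute(
        "SELECT ts, temperature, humidity FROM readings "
        "WHERE node_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (node_id, ts_from, ts_to),
    )
    return cursor.fetchall()


store("n1", 1, 22.5, 60.0)
store("n1", 5, 23.0, 58.0)
store("n2", 2, 20.0, 70.0)
```

Calling `history("n1", 0, 3)` then returns only the first record of node `n1`, which is the kind of historical query the monitoring center needs for analysis.<br />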

Figure 5. The monitoring center hardware structure<br />

V. SYSTEM SOFTWARE<br />

A. System software design<br />

Figure 6. System software flowchart<br />

The system software follows a modular design and is mainly composed of the greenhouse data acquisition system and the wireless control system. In the data acquisition system, each wireless sensor node collects information about its surrounding environment and transfers it to the sink node over the wireless network. The sink node fuses the data and sends the resulting message to the controller. Meanwhile, the sink node receives instructions from the controller and forwards them to the sensor nodes. The flowchart of the system software is shown in Fig. 6.<br />

B. The software design of monitoring center<br />

The monitoring center sends the system start command in the idle time slot (Tidle) and receives the monitoring data of each node in the inter-cluster communication time slot (Tinter). If necessary, other management control commands can be sent in the idle time slot and the routing time slot. In the network formation time slot and the intra-cluster communication time slot, every node is busy with networking and does not listen for commands from the control center, so the monitoring center sends no management control commands in these slots and completes data processing tasks instead. We adopt Microsoft Access for the monitoring center database. The program flowchart of the monitoring center in the idle time slot is shown in Fig. 7.<br />
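The slot-dependent behavior of the monitoring center can be summarized in a small Python sketch. It only encodes the rules stated in the paragraph above; the enum names, the returned action strings, and the simplification that commands are sent only in the idle and routing slots are assumptions of this example.<br />

```python
from enum import Enum


class Slot(Enum):
    IDLE = "idle"                    # T_idle
    FORMATION = "formation"          # network formation
    INTRA_CLUSTER = "intra_cluster"  # communication within the cluster
    ROUTING = "routing"
    INTER_CLUSTER = "inter_cluster"  # T_inter


def center_action(slot, has_pending_command):
    """Return what the monitoring center does in the given time slot."""
    if slot in (Slot.FORMATION, Slot.INTRA_CLUSTER):
        return "process_data"        # nodes are not listening for commands
    if slot is Slot.INTER_CLUSTER:
        return "receive_data"        # monitoring data arrives in this slot
    # Remaining slots: IDLE and ROUTING, where commands may be sent.
    return "send_command" if has_pending_command else "wait"
```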

Figure 7. The program flowchart of the monitoring center in the idle time slot<br />

In the idle time slot, the monitoring center mainly completes system start-up. If the system is starting for the first time, it must first connect to the database. The monitoring center then sends the start command to the base stations of all measurement and control areas in the greenhouse. If no confirmation is received from a base station and the retransmission limit has not been exceeded, the start command is resent. If the retransmission limit is exceeded, the fault diagnosis module is run. If the confirmation frame returned by the base station is received and the idle time slot is not over, the monitoring center can carry out other management control commands.<br />
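The start-up handshake with retransmission described above can be sketched as a simple loop. The callback `send_start`, the default retransmission limit, and the returned status strings are hypothetical names chosen for this illustration.<br />

```python
def start_base_station(send_start, max_retransmissions=3):
    """Send the start command until it is confirmed or the limit is exceeded.

    send_start is a callable that transmits the command once and returns True
    when a confirmation frame comes back from the base station.
    """
    # One initial transmission plus up to max_retransmissions resends.
    for _ in range(1 + max_retransmissions):
        if send_start():
            return "started"
    return "fault_diagnosis"  # hand over to the fault diagnosis module
```

A base station that confirms on the third attempt yields `"started"`, while one that never answers leads to `"fault_diagnosis"` after the initial transmission and all allowed retransmissions have been used.<br />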

In the inter-cluster communication time slot, the main task of the monitoring center is to collect the data submitted by the greenhouse WSN and store it in the database. If the user issues management control requests, they may be executed with priority. The program flowchart of the monitoring center in the inter-cluster communication time slot is shown in Fig. 8.<br />


Figure 8. The program flowchart of the monitoring center in the inter-cluster communication time slot<br />

C. The node deployment algorithm of the WSN-based greenhouse measurement and control system<br />

In the greenhouse WSN measurement and control system, the sensor nodes deployed in the greenhouse periodically collect various environmental data and send it to the control center by multi-hop communication, so the system is a typical centralized data collection network. In such a system, the nodes near the base station forward large quantities of data and die prematurely, which partitions or even completely paralyzes the network. This energy consumption hotspot, caused by unbalanced load distribution among the nodes, is called the funnel effect [6-7]. This paper addresses the funnel effect in the greenhouse WSN measurement and control system with redundant node technology. Taking a single measurement and control area of the greenhouse as the research object and the next-hop routing probability of each node as the fuzzy edge weight, we introduce fuzzy graph theory and calculate the probability that data travels from a source cluster head to a destination cluster head in m hops. From this we obtain the network data load distribution in the measurement and control area, and on that basis we design the redundant nodes deployed algorithm (RNDA) based on cluster load balancing. To balance the network load, the algorithm combines three mechanisms: multi-path routing, redundant node deployment, and cluster head election. The key to RNDA is determining the routing probability vector $P_v$ of each cluster head, from which the network topology can be constructed. In the greenhouse WSN measurement and control system, $P_v$ of cluster head $v$ is preset according to the geographical location of the node. $P_v$ thus becomes the basis of the routing algorithm: when the network begins to run, the nodes communicate with each other using the same preset $P_v$. If a neighboring cluster within communication range can no longer produce a cluster head because the energy of all its nodes is exhausted, the cluster head topology changes and the $P_v$ of the cluster head should be adjusted. The inter-cluster communication model is shown in Fig. 9. For convenience of description, the monitoring area is divided into a 5 × 5 grid; the number of grids can be set freely in the simulation.<br />
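The idea of deriving an m-hop delivery probability from per-hop routing probabilities can be illustrated with a small Python sketch. It is not the paper's fuzzy-graph computation, whose details are not given here. Instead it uses an ordinary sum-of-products calculation in which each cluster head's routing probabilities form one row of a matrix, and the m-hop probability is read from the m-th matrix power. The four-node topology and all probability values are hypothetical.<br />

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rows_are_valid(P, tol=1e-9):
    """Each row sums to 1, except all-zero rows (nodes that do not forward)."""
    return all(sum(row) == 0 or abs(sum(row) - 1.0) <= tol for row in P)


def m_hop_probability(P, src, dst, m):
    """Probability that data leaving src reaches dst in exactly m hops (m >= 1)."""
    result = P
    for _ in range(m - 1):
        result = mat_mul(result, P)
    return result[src][dst]


# Hypothetical routing probabilities: cluster heads 0, 1, 2 and destination 3.
P = [
    [0.0, 0.6, 0.4, 0.0],  # head 0 forwards to head 1 or head 2
    [0.0, 0.0, 0.0, 1.0],  # head 1 always forwards to the destination
    [0.0, 0.5, 0.0, 0.5],  # head 2 forwards to head 1 or the destination
    [0.0, 0.0, 0.0, 0.0],  # the destination does not forward
]
```

In this example, data from head 0 reaches the destination in two hops with probability 0.6 × 1.0 + 0.4 × 0.5 = 0.8 and in three hops with probability 0.4 × 0.5 × 1.0 = 0.2. Summing such probabilities over the paths through each cluster head gives an estimate of how the forwarding load is distributed.<br />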

$P_v = \{p_v(e_{v1}), p_v(e_{v2}), p_v(e_{v3}), p_v(e_{v4}), p_v(e_{v5}), p_v(e_{v6}), p_v(e_{v7}), p_v(e_{v8})\}$<br />

Figure 9. Inter-cluster communication model<br />

Fig. 9 (b) shows that each cluster head has at most eight routing directions, so $P_v$ has eight components. Depending on the category of the cluster head, routing probability values are assigned to one or several of these directions. The routing probabilities $P_v(e)$ can be chosen freely, provided that they sum to 1. In Fig. 9 (a), the cluster heads are classified by geographical position into hot cluster heads H (black dots), boundary cluster heads, general cluster heads (hollow circles), and so on. The simulation considers how adopting or not adopting a data fusion strategy at the cluster heads affects the network lifetime. The main purpose of WSN data fusion is to reduce the amount of network data by merging the redundant information of the sensor nodes. In the simulation experiments, data fusion is performed at the cluster head nodes, and the data fusion coefficient is taken as 1 ($\sigma = 1$) when the data fusion strategy is not executed. When the data fusion strategy is adopted, the data fusion coefficient is chosen according to the degree of fusion. Because the sensor nodes here are homogeneous, the type of information they collect is the same, and statistically the environmental parameters within a small range do not differ much. We therefore fuse the data of all child nodes of one grid into a single datum that describes the environmental information of the grid (e.g. temperature, humidity). In the simulation experiments, the data fusion coefficient is taken as $\sigma = 1/a$ when the data fusion strategy is adopted, where $a$ is the number of active nodes inside the grid; $a$ is set to 5 in all the following simulations. The algorithm was implemented as an M-file program in Matlab 7.0 to study the performance of RNDA and compare it with uniform deployment. In the uniform deployment mode, the redundant nodes are evenly distributed among the clusters. The network operates in three-task time slot mode.<br />
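The effect of the data fusion coefficient can be made concrete with a short Python sketch. It assumes that $\sigma$ is the ratio of the data a cluster head forwards to the data it receives from its active nodes, and that fusion is done by averaging the readings of one grid. Both the averaging rule and the function names are assumptions of this example.<br />

```python
from statistics import mean


def fusion_coefficient(active_nodes, fusion_enabled):
    """sigma = 1/a with data fusion, sigma = 1 without it."""
    return 1.0 / active_nodes if fusion_enabled else 1.0


def packets_forwarded(active_nodes, fusion_enabled):
    """Packets a cluster head forwards per round for its own grid."""
    return fusion_coefficient(active_nodes, fusion_enabled) * active_nodes


def fuse(readings):
    """Hypothetical fusion rule: one averaged value describes the whole grid."""
    return mean(readings)
```

With $a = 5$ active nodes, a cluster head forwards one packet per round when fusion is enabled and five packets when it is not, which is why fusion reduces the forwarding load on the cluster heads near the base station.<br />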

1. Fig. 10 shows the results for a 4 × 4 grid with side length $d = 25\,\mathrm{m}$. The communication distance within the cluster is $d_{CI} = \sqrt{2}\,d$, and $d_{CO} = 2d_{CI}$. In Fig. 10 (a) the data fusion coefficient is $\sigma = 1/a$; in Fig. 10 (b) the data fusion coefficient is $\sigma = 1$.<br />

Figure 10. The impact of redundant nodes on the network lifetime (4 × 4 grid): (a) data fusion coefficient $\sigma = 1/a$; (b) data fusion coefficient $\sigma = 1$. Axis labels: network lifetime/round, redundant nodes. Curves: uniform deployment, RNDA deployment.<br />

2. Fig. 11 shows the results for a 5 × 5 grid with side length $d = 20\,\mathrm{m}$. In Fig. 11 (a) the data fusion coefficient is $\sigma = 1/a$; in Fig. 11 (b) the data fusion coefficient is $\sigma = 1$.<br />

Figure 11. The impact of redundant nodes on the network lifetime (5 × 5 grid): (a) data fusion coefficient $\sigma = 1/a$; (b) data fusion coefficient $\sigma = 1$.<br />

As can be seen from the graphs above, the network lifetime with the data fusion strategy is roughly 2 to 3 times the lifetime without it. The number of virtual grids also affects the network lifetime: the more virtual grids the monitoring area is divided into, the greater the amount of network data and the shorter the network lifetime. Comparing RNDA with the uniform deployment mode, Fig. 11 (a) shows that the network lifetime improves by 35.8 percent at points A and B. For the same network lifetime, RNDA saves a large number of redundant nodes: when the network lifetime is $3.5 \times 10^4$ rounds, RNDA deploys only 24% of the redundant nodes required by the uniform mode.<br />


VI. CONCLUSION<br />

In line with the characteristics of modern greenhouse production, this paper introduces wireless sensor network technology into a greenhouse wireless measurement and control system. By combining wireless sensor network technology with greenhouse control technology, the whole greenhouse system can be adjusted automatically. In hardware, the WSN nodes are mainly composed of an ATmega128L microcontroller and the CC2420 wireless transceiver chip. In software, a modular design is adopted and the deployment of the sensor nodes is analyzed in depth. The simulation results show that the proposed algorithm can effectively prolong the network lifetime.<br />

REFERENCES<br />

[1] Du Xiaoming, Chen Yan. The Realization of Greenhouse Controlling System Based on Wireless Sensor Network[J]. Journal of Agricultural Mechanization Research, 2009(6): 141-144.<br />

[2] Qiao Xiaojun, Zhang Xin, Wang Cheng, et al. Application of the wireless sensor networks in agriculture[J]. Transactions of the CSAE, 2005, 9(21): 232-234.<br />

[3] S.L. Speetjens, H.J.J. Janssen, et al. Methodic design of a measurement and control system for climate control in horticulture[J]. Computers and Electronics in Agriculture, 2008, (64): 162-172.<br />

[4] Wang Linji. The Design of Realizing Change Temperature Control in Greenhouse by PLC[J]. Electrical Engineering, 2008, 5: 81-83.<br />

[5] Liu Yanzheng, Teng Guanghui, Liu Shirong. The problem of the control system for Greenhouse Climate[J]. Chinese Agricultural Science Bulletin, 2007, 23: 154-157.<br />

[6] C. Y. Wan, S. B. Eisenman, A. T. Campbell, et al. Overload traffic management for sensor networks[J]. ACM Transactions on Sensor Networks, 2007, 3, Article No. 18.<br />

[7] G. S. Ahn, E. Miluzzo, A. T. Campbell, et al. Funneling-MAC: A Localized, Sink-Oriented MAC For Boosting Fidelity in Sensor Networks[C]. Proceedings of the 4th international conference on Embedded networked sensor systems. New York: ACM, 2006: 293-306.<br />

[8] Li Nan, Liu Chengliang, Li Yanming, Zhang Jiabao, Zhu Anning. Development of remote monitoring system for soil moisture based on 3S technology alliance[J]. Transactions of the CSAE, 2010, 26(4): 169-173.<br />

[9] P. Santi, J. Simon. Silence Is Golden with High Probability: Maintaining a Connected Backbone in Wireless Sensor Network[C]. 1st European Workshop on Wireless Sensor Networks. Berlin: wireless sensor networks, proceedings, 2004: 106-121.<br />

[10] F. Chen, P. Jiang, Q. He. Phased waking coverage scheme based on hibernation of redundant nodes for wireless sensor networks[C]. Proceedings-International Symposium on Computer Science and Computational Technology. NJ: Institute of Electrical and Electronics Engineers Computer Society, 2008: 709-713.<br />

[11] Z.M. Li, L. Lei. Sensor Node Deployment in Wireless Sensor Networks Based on Improved Particle Swarm Optimization[C]. Proceedings of 2009 IEEE International Conference on Applied Superconductivity and Electromagnetic Devices. 2009: 25-27.<br />

[12] J.H. Tarng, B.W. Chuang, P.C. Liu. A relay node deployment method for disconnected wireless sensor networks: Applied in indoor environments[J]. Journal of Network and Computer Applications, 2009, 32: 652-659.<br />

[13] Y.C. Wang, Y.C. Tseng. Distributed Deployment Schemes for Mobile Wireless Sensor Networks to Ensure Multilevel Coverage[J]. IEEE Transactions on Parallel and Distributed Systems, 2008, 19(9): 1280-1294.<br />

[14] P. Gajbhive, A. Mahajan. A Survey of Architecture and Node deployment in Wireless Sensor Network[C]. 1st International Conference on the Applications of Digital Information and Web Technologies, ICADIWT 2008: 426-430.<br />

[15] W.T. Xu, X.H. Hao, C.L. Dang. Connectivity Probability Based on Star Type Deployment Strategy for Wireless Sensor Networks[C]. Proceedings of the 7th World Congress on Intelligent Control and Automation. 2008: 1738-1742.<br />

Yongxian Song was born in Xuzhou on April 1, 1975. He received the B.S. degree in Applied Electronic Technology from Huaihai Institute of Technology, Lianyungang, China, in 1997, and the M.S. degree in Control Theory and Control Engineering from Jiangsu University, Zhenjiang, China, in 2006. Since 2009, he has been studying for the Ph.D. degree in Control Theory and Control Engineering at Jiangsu University, Zhenjiang, China. Since 2006, he has been a teacher at Huaihai Institute of Technology, Lianyungang, China. His current research interests include signal processing, intelligent control, and industrial control.<br />


Chenglong Gong (male) was born in 1964. He received the B.S. degree in Automatic Control from University of Electronic Science and Technology, Chengdu, China, in 1984, and the M.S. degree in Automation Control from China University of Mining and Technology, Xuzhou, China, in 1988.<br />

He is currently a professor with the Department of Electronic Engineering, Huaihai Institute of Technology, Lianyungang 222005, China. His main research interests are automatic measurement, control and system theory, and computer network applications.<br />

Yuan Feng was born in Lianyungang on March 28, 1978. He received the B.S. degree in Computer Hardware and Application from Huaihai Institute of Technology, Lianyungang, China, in 1999, and the M.S. degree in Industrial Control from Nanjing University of Science, Nanjing, China, in 2007. Since 1999, he has been a teacher at Huaihai Institute of Technology, Lianyungang, China. His current research interests include signal processing and computer control technology.<br />

Juanli Ma (female), lecturer, was born in 1976. From 1995 to 1999 she studied electrical automation at Gansu University of Technology and obtained a bachelor's degree. From 2004 to 2007 she studied control theory and control engineering at Northwestern Polytechnical University and obtained a Master's degree in Engineering. Since 1999, she has been working at Huaihai Institute of Technology.<br />

Xianjin Zhang was born in Suqian in 1975. He received the B.S. degree in Applied Electronic Technology from Guilin University of Electronic Technology, Guilin, China, in 1998, and the M.S. degree in Power Electronic and Control Engineering from Nanjing University of Aeronautics & Astronautics, Nanjing, China, in 2005. Since 2005, he has been a teacher at Huaihai Institute of Technology, Lianyungang, China. His current research interests include electric and electronic conversion techniques.<br />


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1675<br />

Simulation of Networked Control System based<br />

on Smith Compensator and Single Neuron<br />

Incomplete Differential Forward PID<br />

Haitao Zhang<br />

Electronic and Information Engineering College, Henan University of Science and Technology, Luoyang, China<br />

Email: zhang_haitao@163.com<br />

Zhen Li<br />

Electronic and Information Engineering College, Henan University of Science and Technology, Luoyang, China<br />

Email: lizhenzhen1228@163.com<br />

Abstract—In the networked control system with random<br />

time delay in the forward and feedback channels, a controller based on a Smith compensator and a single neuron incomplete differential forward PID is presented. First, using the root locus method and Simulink simulation software, the influence of network time delay on system stability and dynamic performance is analyzed. Then, combined with the incomplete differential forward PID control algorithm, a Smith compensation model is established. Compared with existing Smith compensators, the proposed control model is easy to implement and achieves better control performance when the compensator model is mismatched. Finally, a simulation study on a DC motor is carried out, and the results show the effectiveness of the proposed method.<br />

Index Terms—networked control system; Smith<br />

compensator; incomplete differential forward PID; single<br />

neuron<br />

I. INTRODUCTION<br />

With the extensive application of large-scale control systems, the networked control system (NCS) has attracted the attention of many researchers. In a networked control system, communication among controllers, sensors and actuators is performed through the network. In comparison with traditional control, it offers resource sharing, high reliability, low cost, and easy maintenance and extension. However, since the carrying capacity and communication bandwidth of the network are fixed and limited, collisions and retransmissions of information are inevitable, which causes network-induced delay in the process of information transmission. The network-induced delay degrades the real-time capability of the system and can even lead to system instability. At present, for the random network-induced delay, two main research<br />

Manuscript received Mar. 1, 2011; revised Apr. 1, 2011; accepted<br />

Apr. 12, 2011.<br />

Project number: 61040010<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1675-1681<br />

methods are adopted in NCS design: the deterministic method and the stochastic method. The deterministic method converts the random delay to a fixed delay by introducing a data buffer, and then uses existing methods to design the controller [1][2]. However, this approach artificially extends the delay seen by the controller and lowers the system control performance. The stochastic method is based on a random discrete-time model. Nilsson discusses the design of an LQG optimal controller within the framework of a discrete control system in which the independent random delay is less than one sampling period and the time delay obeys a Markov distribution [3]. However, this method requires the probability characteristics of the time delay, including the mean, variance and other properties, to be known in advance, and the amount of computation is so large that it is not easy to implement. Hu proposes the use of stochastic optimal control and optimal state estimation methods [4]; this method is mainly used when the time delay is more than one sampling period. Bauer uses a Smith predictor to compensate the time delay in the networked control system; the control structure is simple, but the exact value of the network delay must be known in advance [5].<br />

The rest of the paper is organized as follows. In Section 2, we present the system structure of the NCS and analyze the influence of network-induced delay on system stability and dynamic performance. Section 3 presents a design method for the NCS based on a Smith compensator and a single neuron incomplete differential forward PID controller. Section 4 gives simulation results for a DC motor model, which show the effectiveness of the proposed method. Finally, a brief summary is given in Section 5.<br />

II. SYSTEM DESCRIPTION<br />

A. System Structure<br />

The basic structure of the networked control system is shown in Figure 1. The controller, actuator and sensor transmit data over the network, so there are essentially three kinds of delay in the system: the communication delay $\tau^{sc}$ between the sensor and the controller, the computational delay $\tau^{c}$ in the controller, and the communication delay $\tau^{ca}$ between the controller and the actuator. Because $\tau^{c}$ is very small, it is usually merged into $\tau^{ca}$, so the system delay is expressed as $\tau = \tau^{sc} + \tau^{ca}$. To analyze the effect of network delay, we treat the networked control system as a continuous-time system; a typical block diagram is shown in Figure 2, where $R(s)$, $U(s)$, $Y(s)$ and $E(s) = R(s) - Y(s)$ are the reference, control, output and error signals in the $s$ domain, respectively. $G_c(s)$ is the transfer function of the controller, and $G_p(s)$ is the transfer function of the controlled object.<br />

Figure 1. The basic structure of networked control system<br />

Figure 2. A typical block diagram of networked control systems<br />

The transfer function of the closed-loop system shown in Figure 2 can be expressed as follows:<br />

$$\frac{Y(s)}{R(s)} = \frac{G_c(s)\,G_p(s)\,e^{-\tau^{ca}s}}{1 + G_c(s)\,G_p(s)\,e^{-\tau^{sc}s}\,e^{-\tau^{ca}s}} \tag{1}$$<br />

In Figure 2, $\tau^{ca}$ and $\tau^{sc}$ are the time delays of the forward and feedback channels, respectively. $\tau^{ca}$ prevents the control signal from acting on the controlled object in time, so the response of the system lags behind its input. $\tau^{sc}$ prevents the system from producing a new control signal in time.<br />

In order to analyze the closed-loop control system under the effects of network delay, a typical approach is to approximate the delays by a rational function [6]:<br />

$$e^{-\tau s} \cong \left(1 + \frac{\tau s}{n}\right)^{-n} \tag{2}$$<br />

where $\tau$ may be $\tau^{ca}$ or $\tau^{sc}$.<br />

Because the primary branches of the root locus of a control system usually contain the dominant eigenvalues of the system, this approximation is adequate for practical applications.<br />
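As a rough illustration of formula (2), the sketch below compares the exact delay $e^{-\tau s}$ with the rational form $(1 + \tau s/n)^{-n}$ on the imaginary axis $s = j\omega$. The function names and the sample values of $\tau$, $\omega$ and $n$ are illustrative assumptions, not taken from the paper.

```python
import cmath

def delay_exact(tau, w):
    """Frequency response of the pure delay e^{-tau*s} at s = j*w."""
    return cmath.exp(-1j * w * tau)

def delay_rational(tau, w, n=4):
    """Rational approximation (1 + tau*s/n)^(-n) of formula (2) at s = j*w."""
    return (1 + 1j * w * tau / n) ** (-n)

def approx_error(tau, w, n=4):
    """Magnitude of the difference between the exact delay and its approximation."""
    return abs(delay_exact(tau, w) - delay_rational(tau, w, n))

# Illustrative use: the error shrinks as the order n grows.
for order in (1, 4, 16):
    print(order, approx_error(tau=0.1, w=5.0, n=order))
```

A larger $n$ tracks the delay more closely at the price of a higher-order model; the paper uses $n = 4$.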

B. System Stability Analysis<br />

In this paper, we use a DC motor as the controlled object to analyze the system stability. The transfer function of the controlled plant is expressed as follows [7]:<br />

$$G_p(s) = \frac{2029.826}{(s + 26.29)(s + 2.296)} \tag{3}$$<br />

The controller uses PID control, and its transfer function is expressed as follows:<br />

$$G_c(s) = \frac{K_p\left((K_d/K_p)\,s^2 + s + (K_i/K_p)\right)}{s} = K_p\left(T_d s + 1 + \frac{1}{T_i s}\right) \tag{4}$$<br />

where $K_p$, $T_d$ and $T_i$ are the proportional gain, differential time constant and integral time constant, respectively.<br />

Using formulas (1) to (4) and selecting the controller parameters $K_p = 0.1701$, $T_d = 0$, $T_i = 0.45$ and $n = 4$, the open-loop transfer function is expressed as follows:<br />

$$
\begin{aligned}
G_c(s)\,e^{-\tau^{ca}s}\,G_p(s)\,e^{-\tau^{sc}s}
&\cong G_c(s)\,G_p(s)\left(1 + \frac{\tau s}{n}\right)^{-n} \\
&= K_p\left(T_d s + 1 + \frac{1}{T_i s}\right) G_p(s)\,\frac{1}{\left(\frac{\tau s}{n} + 1\right)^{n}} \\
&= \frac{0.1701\,(s + 2.222)}{s}\cdot\frac{2029.826}{(s + 26.29)(s + 2.296)}\cdot\frac{1}{\left(\frac{\tau s}{4} + 1\right)^{4}} \\
&= \frac{345.2734\,(s + 2.222)}{s\,(s + 26.29)(s + 2.296)\left(\frac{\tau s}{4} + 1\right)^{4}}
\end{aligned}
\tag{5}
$$<br />

As seen from formula (5), as $\tau$ changes from 0 to positive infinity, the delay adds a four-fold open-loop negative real pole at $s = -4/\tau$, which moves from negative infinity toward 0. These poles raise the system order, change the distribution of the root locus on the real axis and shift the root locus to the right, which is disadvantageous to the stability of the system.<br />
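The constants in formula (5) can be checked with a few lines of arithmetic. The sketch below recomputes the PI zero $-1/T_i$, the combined gain $K_p \cdot 2029.826$, and the location $-n/\tau$ of the repeated pole introduced by the delay approximation. The function name is a hypothetical helper chosen for this example.

```python
KP, TI = 0.1701, 0.45      # controller parameters quoted in the text
PLANT_GAIN = 2029.826      # numerator of the plant model, formula (3)

def open_loop_constants(tau, n=4):
    """Return (PI zero, combined gain, n-fold delay pole) of formula (5)."""
    zero = -1.0 / TI               # K_p (1 + 1/(T_i s)) = K_p (s + 1/T_i) / s
    gain = KP * PLANT_GAIN         # constant in front of the open-loop function
    delay_pole = -n / tau          # (tau*s/n + 1)^n vanishes at s = -n/tau
    return zero, gain, delay_pole

# Illustrative use with the delays plotted in Figure 3.
for tau in (0.1, 0.2, 0.5):
    print(tau, open_loop_constants(tau))
```

The repeated pole moves toward the origin as the delay grows, which is the effect described in the text.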

Figure 3. Primary branches of root locus with different delay (curves for τ = 0.1, τ = 0.2 and τ = 0.5; axes: Real Axis, Imaginary Axis)<br />

We select different time delays to analyze the networked control system and to provide a reference for analyzing networked control systems under the effect of delay. The primary branches of the root locus for different time delays are shown in Figure 3, which was obtained by programming in Matlab.<br />

As seen from Figure 3, the system stability region changes with the time delay: the greater the delay, the smaller the stability region.<br />
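The shrinking stability region can also be illustrated numerically. The sketch below does not reproduce the paper's Matlab root locus; as a stand-in, it computes the gain margin of the open-loop function in formula (5), i.e. the factor by which the loop gain could be scaled before the closed loop reaches the stability boundary. The bisection bounds and function names are assumptions made for this example.

```python
import math

GAIN, ZERO = 345.2734, 2.222      # constants of formula (5)
POLES = (26.29, 2.296)            # plant poles (magnitudes)

def phase_lag(w, tau, n=4):
    """Total phase lag in radians of formula (5) at s = j*w."""
    lag = math.pi / 2                               # integrator 1/s
    lag += sum(math.atan(w / p) for p in POLES)     # plant poles
    lag += n * math.atan(w * tau / n)               # n-fold delay pole
    lag -= math.atan(w / ZERO)                      # PI zero contributes lead
    return lag

def magnitude(w, tau, n=4):
    """Magnitude of formula (5) at s = j*w."""
    num = GAIN * math.hypot(w, ZERO)
    den = w * math.prod(math.hypot(w, p) for p in POLES)
    den *= (1 + (w * tau / n) ** 2) ** (n / 2)
    return num / den

def gain_margin(tau, n=4):
    """Bisect for the frequency where the lag reaches pi, then invert the magnitude."""
    lo, hi = 1e-6, 1e4
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phase_lag(mid, tau, n) < math.pi:
            lo = mid
        else:
            hi = mid
    return 1.0 / magnitude(0.5 * (lo + hi), tau, n)

for tau in (0.1, 0.2, 0.5):
    print(tau, gain_margin(tau))
```

Under these assumptions the margin falls as $\tau$ increases, in line with the trend read from Figure 3.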

C. Dynamic Performance Analysis of System<br />

From the previous subsection, we know that the stable region shrinks as the time delay increases. In this subsection, we analyze the influence of random time delay on the performance of networked control systems in Simulink. The simulation model is shown in Figure 4, and the parameters are set as follows: the input signal is 50 rad/s, the sampling period is 10 ms, the control algorithm is formula (4), the model of the controlled object is formula (3), and the time delay obeys a uniform distribution, which is simulated by a Gauss random number generator and a network delay module.<br />

Figure 4. Simulation model of network control system<br />

Under a typical input signal, the performance criteria that reflect the time response of the control system consist of two parts: static performance criteria and dynamic performance criteria. We choose the mean square error $MSE$, the overshoot $M_P$ and the adjusting time $t_s$ to reflect the tracking error, control accuracy, stability and rapidity of the response of the control system. The performance cost function is as follows [8]:<br />

$$J = \omega_1 J_1 + \omega_2 J_2 + \omega_3 J_3 \tag{6}$$<br />

$$
\begin{aligned}
J_1 &= \begin{cases} (MSE - MSE_0)^2, & MSE > MSE_0 \\ 0, & MSE \le MSE_0 \end{cases} \\
J_2 &= \begin{cases} (M_P - M_{P0})^2, & M_P > M_{P0} \\ 0, & M_P \le M_{P0} \end{cases} \\
J_3 &= \begin{cases} (t_s - t_{s0})^2, & t_s > t_{s0} \\ 0, & t_s \le t_{s0} \end{cases}
\end{aligned}
\tag{7}
$$<br />

where<br />

$$MSE = \frac{1}{N}\sum_{k=0}^{N} e^{2}(k) \tag{8}$$<br />

$MSE$ represents the mean square error of the system, and $e(k) = y(k) - r(k)$ represents the output error of the system at $t = kh$, where $k$ is the sampling index and $h$ is the sampling period. $MSE_0$, $M_{P0}$ and $t_{s0}$ are the nominal mean square error, nominal overshoot and nominal adjusting time when the system has no time delay. $J_1$, $J_2$ and $J_3$ are the performance criteria measuring how far $MSE$, $M_P$ and $t_s$ deviate from their nominal values, and satisfy $J_1 = J_2 = J_3 = 0$ when the system has no time delay. $\omega_1$, $\omega_2$ and $\omega_3$ are the weight coefficients of $J_1$, $J_2$ and $J_3$, respectively; they range from 0 to 1 and satisfy $\omega_1 + \omega_2 + \omega_3 = 1$.<br />
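The cost function of formulas (6) to (8) is simple to evaluate once the three indices are measured. The following is a minimal sketch, assuming the nominal values reported in this section as defaults and equal weights as a placeholder; the function names are hypothetical, and the mean in `mse` is taken over the number of supplied samples, which is an assumption about how $N$ is counted.

```python
def deviation_cost(value, nominal):
    """One-sided squared deviation used for J1, J2 and J3 in formula (7)."""
    return (value - nominal) ** 2 if value > nominal else 0.0

def mse(errors):
    """Mean square error of formula (8); N is taken as len(errors) (assumption)."""
    return sum(e * e for e in errors) / len(errors)

def cost(mse_value, overshoot, settle_time,
         nominal=(0.00595, 0.05, 0.309),
         weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted cost J = w1*J1 + w2*J2 + w3*J3 of formula (6)."""
    measured = (mse_value, overshoot, settle_time)
    parts = [deviation_cost(m, n0) for m, n0 in zip(measured, nominal)]
    return sum(w * j for w, j in zip(weights, parts))

# Illustrative use: only the overshoot exceeds its nominal value.
print(cost(0.00595, 0.15, 0.309, weights=(0.0, 1.0, 0.0)))
```

Setting one weight to 1 and the others to 0 isolates a single index, which is how the three curves of Figure 5 are defined in the text.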

When the system has no time delay, we obtain the step response curve of the system by running the simulation model of Figure 4 and compute the nominal values of $MSE$, $M_P$ and $t_s$. The nominal values are: $MSE_0 = 0.00595$, $M_{P0} = 5\%$, $t_{s0} = 0.309$, $J = 0$.<br />

When $\omega_1 = 1$ and $\omega_2 = \omega_3 = 0$, $J = J_1$, and the cost function reflects the response process of the system and the relative stability at the steady state. Its variation with time delay is shown in Figure 5(a).<br />

When $\omega_2 = 1$ and $\omega_1 = \omega_3 = 0$, $J = J_2$, and the cost function reflects the stability of the system. Its variation with time delay is shown in Figure 5(b).<br />

When $\omega_3 = 1$ and $\omega_1 = \omega_2 = 0$, $J = J_3$, and the cost function reflects the rapidity of the system response. Its variation with time delay is shown in Figure 5(c).<br />

As seen from Figure 5, the time delay lowers the stability and dynamic performance of the system. If $\tau < 12$ s, the control accuracy becomes slightly lower, but the system still has good stability and dynamic performance. If $\tau \ge 12$ s, the dynamic performance, stability and control accuracy of the system become poor. If the time delay reaches 40 s, the system becomes unstable.<br />

Figure 5. The waveforms of $J_1$, $J_2$ and $J_3$ with delay change<br />

In the following section, we improve the performance of the networked control system by introducing a Smith compensator and single neuron incomplete differentiation on the basis of classical PID control.<br />

III. SMITH COMPENSATOR AND SINGLE NEURON<br />

INCOMPLETE DIFFERENTIAL FORWARD PID CONTROLLER<br />

A. The Principle of Smith Compensator<br />

The control system model with the Smith compensator is shown in Figure 6.<br />

Figure 6. Structure of Smith predictor<br />

$G_c(s)$ is the transfer function of the controller, $G_p(s)e^{-\tau_p s}$ is the transfer function of the controlled object with pure time delay, and $G'_p(s)\left(1 - e^{-\tau'_p s}\right)$ is the compensation function introduced by the Smith compensator. Then the closed-loop transfer function is expressed as follows:<br />

$$\frac{Y(s)}{R(s)} = \frac{G_c(s)\,G_p(s)\,e^{-\tau_p s}}{1 + G_c(s)\,G'_p(s) + G_c(s)\left(G_p(s)\,e^{-\tau_p s} - G'_p(s)\,e^{-\tau'_p s}\right)} \tag{9}$$<br />

When the system satisfies $G'_p(s) = G_p(s)$ and $\tau'_p = \tau_p$, formula (9) simplifies to:<br />

$$\frac{Y(s)}{R(s)} = \frac{G_c(s)\,G_p(s)}{1 + G_c(s)\,G_p(s)}\,e^{-\tau_p s} \tag{10}$$<br />

Its characteristic equation is:<br />

$$1 + G_c(s)\,G_p(s) = 0 \tag{11}$$<br />

Formula (11) does not include the pure time delay, so the Smith compensator eliminates the influence of pure time delay on system stability, which could otherwise make the control system unstable.<br />
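The step from formula (9) to formula (10) can be spelled out as follows. This is only the substitution stated above written in full, using the same symbols; nothing beyond the matched-model condition is assumed.

```latex
% Matched model: G'_p(s) = G_p(s) and \tau'_p = \tau_p
G_p(s)\,e^{-\tau_p s} - G'_p(s)\,e^{-\tau'_p s} = 0
\;\Longrightarrow\;
\frac{Y(s)}{R(s)}
  = \frac{G_c(s)\,G_p(s)\,e^{-\tau_p s}}{1 + G_c(s)\,G_p(s) + G_c(s)\cdot 0}
  = \frac{G_c(s)\,G_p(s)}{1 + G_c(s)\,G_p(s)}\,e^{-\tau_p s}
```

The delay survives only as a factor outside the loop, which is why it drops out of the characteristic equation.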

B. System Structure of NCS<br />

The Smith compensator is a control algorithm commonly used in systems with time delay. To reduce the influence of time delay on networked control system performance, the Smith compensator is introduced into the networked control system, as shown in Figure 7.<br />

Figure 7. The structure of networked control system with Smith compensator<br />

In Figure 7, $G_p(s)$ is the transfer function of the controlled object, $G'_p(s)$ is the predicted model of the controlled object, and $\tau^{cam}$ and $\tau^{scm}$ are the predicted models of the time delays $\tau^{ca}$ and $\tau^{sc}$, respectively. Then the closed-loop transfer function of the system is expressed as follows:<br />

$$\frac{Y(s)}{R(s)} = \frac{G_c(s)\,e^{-\tau^{ca}s}\,G_p(s)}{1 + G_c(s)\,G'_p(s) + G_c(s)\left(e^{-\tau^{ca}s}\,G_p(s)\,e^{-\tau^{sc}s} - e^{-\tau^{cam}s}\,G'_p(s)\,e^{-\tau^{scm}s}\right)} \tag{12}$$<br />

When the system designed with the Smith compensator has no model mismatch, i.e. $G'_p(s) = G_p(s)$, $\tau^{scm} = \tau^{sc}$ and $\tau^{cam} = \tau^{ca}$, the transfer function of the system shown in Figure 7 simplifies to:<br />

$$\frac{Y(s)}{R(s)} = \frac{G_c(s)\,G_p(s)}{1 + G_c(s)\,G_p(s)}\,e^{-\tau^{ca}s} \tag{13}$$<br />

In this case, the networked control system can be simplified to Figure 8. As this figure shows, when the mathematical model of the object is exact, no pure time delay remains in the closed loop after the Smith compensator is applied, so the delay no longer affects the characteristic equation of the system. Compared with a control system with no network-induced delay, it is simply a control system whose output is postponed by the time delay $\tau^{ca}$. For this reason, after adding the Smith compensator, the control quality is improved and the stability of the system can be ensured.<br />
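The reduction from formula (12) to formula (13) can be checked numerically at any test point $s$. The sketch below evaluates both expressions with the plant of formula (3) and the PI controller used earlier ($T_d = 0$). The delay values and the test point $s = 3j$ are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
import cmath

def G_p(s):
    """Plant model of formula (3)."""
    return 2029.826 / ((s + 26.29) * (s + 2.296))

def G_c(s, kp=0.1701, ti=0.45):
    """PI case of formula (4) with T_d = 0."""
    return kp * (1 + 1 / (ti * s))

def closed_loop_12(s, t_ca, t_sc, t_cam, t_scm, model=G_p):
    """Closed-loop transfer function of formula (12); `model` plays the role of G'_p."""
    def delay(t):
        return cmath.exp(-t * s)
    num = G_c(s) * delay(t_ca) * G_p(s)
    mismatch = delay(t_ca) * G_p(s) * delay(t_sc) - delay(t_cam) * model(s) * delay(t_scm)
    den = 1 + G_c(s) * model(s) + G_c(s) * mismatch
    return num / den

def closed_loop_13(s, t_ca):
    """Simplified closed-loop transfer function of formula (13)."""
    loop = G_c(s) * G_p(s)
    return loop / (1 + loop) * cmath.exp(-t_ca * s)

s = 3j
matched = closed_loop_12(s, 0.05, 0.03, 0.05, 0.03)
mismatched = closed_loop_12(s, 0.05, 0.03, 0.02, 0.01)
print(abs(matched - closed_loop_13(s, 0.05)), abs(mismatched - closed_loop_13(s, 0.05)))
```

With matched delay estimates the two expressions agree to rounding error; with mismatched estimates the delay terms remain in the denominator, which is the motivation for the improved compensators discussed next.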

Figure 8. The simplified diagram when the model is matched<br />

C. Related Research of Smith Compensator<br />

Since the Smith compensator relies on an accurate mathematical model of the controlled object and the network delay, random network-induced delay and disturbances cause the model in the Smith compensator to mismatch the controlled object.<br />

For these reasons, it is difficult to compensate network delay well using the Smith compensator alone. To overcome the impact of these factors, effective control measures must be introduced. Researchers have presented many improved methods, covering both structural improvement [9] and parameter tuning [10].<br />

To apply the Smith predictor to networked control systems and achieve a satisfactory control effect, Du Feng presents two new kinds of Smith compensator. One is a double dynamic Smith compensator for the pure time delay of the controlled object and the network delay. This structure does not need to measure, predict or identify the time delay online, and adapts to networked systems with random, time-varying and uncertain time delay [11]. The other moves the pure time delay of the controlled object and the time delay between the controller and actuator in the forward path out of the closed-loop control circuit, and completely eliminates the time delay between the transducer and actuator in the feedback control loop. The feedback channel then does not need to be scheduled to adjust network flow, so the network bandwidth can be utilized effectively and the robustness of the system to packet loss in the feedback control loop is improved [12].<br />

Sujuan Wang et al. regard the network and the controlled object as a time-varying controlled system, estimate the time delay of the system using fading-memory LSM, and compensate it with a Smith compensator. By combining immune feedback control with PI control, they obtain a fuzzy immune PI controller that adjusts the parameters of the PI controller according to the change of the control amount, and thereby overcomes the control error caused by errors in the time delay estimate [13-14].<br />

For networked systems with long time delay, Huiying Chen et al. combine the Smith compensator with a T-S fuzzy model and obtain a PI controller with Smith compensator. The approach gives the closed-loop system better stability and robustness [15].<br />

Peng Chen et al. construct an adaptive Smith compensator, used to compensate the long time delay of an NCS based on an IP network, by adding a first-order filter to the feedback loop. This approach efficiently eliminates the negative effects caused by time delay and achieves better robustness [16-17].<br />

Reiquan Lin et al. give a design method for a neuron adaptive controller based on the Smith compensator. They study its application to an electric heating furnace by simulation, and show that this controller efficiently makes up for the poor robustness and poor anti-jamming capability of the conventional Smith compensator [18-19].<br />

As the basic unit of neural networks, the neuron has self-learning ability and adaptability. A control system constructed with a neuron has a simple algorithm, is easy to realize, and has good robustness. Moreover, its most prominent characteristic is that it does not need accurate identification of the controlled object or of its structure and parameters, so the design of a single neuron adaptive controller does not require system modeling. Considering these characteristics of the single neuron, we present an incomplete differential forward PID algorithm based on a single neuron and apply it to the Smith compensator, so as to balance ease of implementation against the control performance of the existing Smith compensator.<br />

D. Single neuron incomplete differential forward PID<br />

The single neuron model is shown in Figure 9. The single neuron is an information processing cell with many inputs and a single output. $x_1, x_2, \ldots, x_n$ are the inputs of the neuron, and $\omega_1, \omega_2, \ldots, \omega_n$ are the respective weights of the inputs $x_1, x_2, \ldots, x_n$. $\theta$ is the threshold of the neuron, $f[\cdot]$ is the excitation function, and $y_P$ is the output of the neuron. $y_P$ can be expressed as:<br />

$$y_P = f\left[\sum_{i=1}^{n} \omega_i x_i - \theta\right] \tag{14}$$<br />

Figure 9. Model of single neuron<br />
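Formula (14) amounts to a weighted sum, a threshold and an excitation function. A minimal sketch follows; the choice of `math.tanh` as the default excitation function is an assumption made for illustration, since the paper does not fix $f[\cdot]$ at this point.

```python
import math

def neuron_output(x, w, theta, f=math.tanh):
    """Single neuron output y_P = f(sum_i w_i * x_i - theta), formula (14)."""
    if len(x) != len(w):
        raise ValueError("inputs and weights must have the same length")
    return f(sum(wi * xi for wi, xi in zip(w, x)) - theta)

# Illustrative use with an identity excitation function.
print(neuron_output([1.0, 2.0], [0.5, 0.25], theta=1.0, f=lambda v: v))
```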

The single neuron model has been applied to PID control systems. Figure 10 shows the model structure of single neuron PID control systems.<br />

Figure 10. Structure of single neuron PID<br />

In Figure 10, $r$ and $y$ are the input and output of the system, respectively, and the error is $e(k) = r(k) - y(k)$. $x_1$, $x_2$ and $x_3$ are the inputs of the neuron and satisfy the following relations:<br />

$$
\begin{aligned}
x_1(k) &= e(k) - e(k-1) \\
x_2(k) &= e(k) \\
x_3(k) &= e(k) - 2e(k-1) + e(k-2)
\end{aligned}
\tag{15}
$$<br />

$\omega_1$, $\omega_2$ and $\omega_3$ are the weights of the inputs $x_1$, $x_2$ and $x_3$, respectively.<br />

Suppose that the proportional coefficient is $K$, with $K > 0$. Then the output of the controller can be expressed as follows:<br />

$$u(k) = u(k-1) + K\sum_{i=1}^{3} \omega_i(k)\,x_i(k) \tag{16}$$<br />

In the single neuron control algorithm, the coefficient $K$ reflects the adjusting amplitude. Generally, if the error is larger, the adjusting amplitude is also larger so as to satisfy the requirement of rapidity of the system; if the error is smaller, the adjusting amplitude is also smaller so as to satisfy the requirement of stability of the system.<br />

We use the Delta learning rule as the learning method for the weights, which can be expressed as follows:<br />

$$\Delta\omega_{ij}(k) = \eta\,[d_i(k) - o_i(k)]\,o_j(k) \tag{17}$$<br />

where $\Delta\omega_{ij}$ denotes the weight increment from the $i$-th to the $j$-th neuron, $\eta$ is the learning rate, $o_i$ and $o_j$ are the activation values of $i$ and $j$, respectively, and $d_i$ is the expected output value.<br />

The single neuron PID control method implements adaptive control of the system by adjusting the weight coefficients. In order to ensure the convergence and robustness of the learning method, we normalize formulas (16) and (17) and obtain the following expressions:<br />

$$u(k) = u(k-1) + K\sum_{i=1}^{3} w'_i(k)\,x_i(k) \tag{18}$$<br />

$$w'_i(k) = w_i(k) \Big/ \sum_{i=1}^{3} \lVert w_i(k) \rVert \tag{19}$$<br />

$$
\begin{aligned}
w_1(k+1) &= w_1(k) + \eta_P\,e(k)\,x_1(k) \\
w_2(k+1) &= w_2(k) + \eta_I\,e(k)\,x_2(k) \\
w_3(k+1) &= w_3(k) + \eta_D\,e(k)\,x_3(k)
\end{aligned}
\tag{20}
$$<br />

where $\eta_P$, $\eta_I$ and $\eta_D$ are the proportional, integral and differential learning coefficients, respectively.<br />
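Formulas (15) and (18) to (20) together define one controller update per sample. The class below is a minimal sketch of that update, not the authors' implementation: the class name, the initial weights and the parameter values in the usage lines are assumptions, and a guard against a zero weight sum is added for safety.

```python
class SingleNeuronPID:
    """Incremental single neuron PID following formulas (15) and (18)-(20)."""

    def __init__(self, K, eta_p, eta_i, eta_d, w0=(0.1, 0.1, 0.1)):
        self.K = K
        self.eta = (eta_p, eta_i, eta_d)
        self.w = list(w0)          # w1, w2, w3 (initial values are an assumption)
        self.e1 = 0.0              # e(k-1)
        self.e2 = 0.0              # e(k-2)
        self.u = 0.0               # u(k-1)

    def step(self, r, y):
        e = r - y
        # Neuron inputs, formula (15)
        x = (e - self.e1, e, e - 2 * self.e1 + self.e2)
        # Normalized weights, formula (19); fall back to 1 if all weights are zero
        norm = sum(abs(w) for w in self.w) or 1.0
        # Control law, formula (18)
        self.u += self.K * sum(w / norm * xi for w, xi in zip(self.w, x))
        # Weight update, formula (20)
        self.w = [w + eta * e * xi for w, eta, xi in zip(self.w, self.eta, x)]
        self.e2, self.e1 = self.e1, e
        return self.u

# Illustrative use: one update for a unit reference and zero output.
ctrl = SingleNeuronPID(K=0.3, eta_p=0.4, eta_i=0.35, eta_d=0.4)
print(ctrl.step(r=1.0, y=0.0), ctrl.w)
```

The control value is computed with the weights of step $k$ before they are advanced to step $k+1$, matching the indexing in formulas (18) and (20).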



The single neuron PID controller improves on the traditional PID controller and overcomes its sensitivity to parameter changes. It has better learning ability and readily ensures real-time capability [20]. In addition, the system achieves a better control effect when the object model is mismatched.<br />

The differentiation term is sensitive to changes of the input value and to random disturbance; the incomplete differential forward PID controller remedies this deficiency.<br />

The incomplete differential forward PID differentiates only the feedback value and adds a first-order filter, so a change of the input value does not affect the derivative term of the controller, and a change of the output value does not produce a very large control value. We can therefore combine the merits of the single neuron and the incomplete differential forward PID, and design the single neuron incomplete differential forward PID controller.<br />

The model <strong>of</strong> single neuron incomplete differential<br />

forward PID control is shown in Figure 11.<br />

Figure.11 Structure <strong>of</strong> incomplete differential forward PID control<br />

Since we use the incomplete differential, $x_1$, $x_2$ and $x_3$ should satisfy the following relations:<br />

$$\begin{aligned} x_1(k) &= e(k) - e(k-1) \\ x_2(k) &= e(k) \\ x_3(k) &= y_1(k) - 2y_1(k-1) + y_1(k-2) = \Delta^2 y_1(k) \end{aligned} \qquad (21)$$<br />

$w_1$, $w_2$ and $w_3$ are the weights of the neuron and are still updated according to equation (20). The control law still follows equations (18)-(20).<br />
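The three neuron inputs of equation (21) can be computed from the last error samples and the last three samples of $y_1$. In the sketch below the function name is an assumption; $y_1$ is simply the signal that appears in (21), presumably the filtered feedback value, and its filter is not modeled here.<br />

```python
def neuron_inputs(e, e_prev, y1, y1_prev, y1_prev2):
    """Inputs x1, x2, x3 of equation (21).

    e, e_prev:             error samples e(k) and e(k-1)
    y1, y1_prev, y1_prev2: samples y1(k), y1(k-1), y1(k-2)
    """
    x1 = e - e_prev                        # first difference of the error
    x2 = e                                 # error itself
    x3 = y1 - 2.0 * y1_prev + y1_prev2     # second difference of y1
    return x1, x2, x3
```

Because $x_3$ depends only on $y_1$ and not on the reference input, a step change of the set point does not enter the differential path.<br />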

The neuron controller uses an incremental algorithm, so the differential time constant and the weights of the new learning algorithm satisfy:<br />

$$\frac{w_3}{w_1} = \frac{w_3'}{w_1'} = \frac{T_d}{h} \qquad (22)$$<br />

In this paper, single neuron control and incomplete differential forward PID control, both widely applied in practical control, are introduced into the control system with the Smith compensator so as to improve the robustness of the controller.<br />

IV. SIMULATION<br />

To verify the effectiveness of the method, we use a DC motor as the controlled object and simulate it in the Matlab/Simulink environment. The sampling period is T = 10 ms and the reference input is r = 50 rad/s. The network delays in the forward and feedback channels are produced by the Gaussian random generator in the Simulink toolbox. The initial neuron weights are $w_1(0) = w_2(0) = w_3(0) = 0.1$, the neuron learning rates are $\eta_P = 5$, $\eta_I = 0.03$ and $\eta_D = 1.5$, the proportional coefficient is $K = 0.2$, and the incomplete differential coefficient is $\gamma = 0.1$. We apply simple PID control, PID control with Smith compensator, and single neuron incomplete differential forward PID with Smith compensator in turn, and observe the step responses under different random delays and under model mismatch. The results are shown in Figures 12 to 14.<br />

Figure 12. The response when the mean of delay is 5 ms. 1: PID; 2: PID control with Smith compensator; 3: single neuron incomplete differential forward PID with Smith compensator<br />

Figure 13. The response when the mean of delay is 30 ms. 1: PID; 2: PID control with Smith compensator; 3: single neuron incomplete differential forward PID with Smith compensator<br />

Figure 14. The response when the object model is $2500/(s^2+30s+80)$. 1: PID; 2: PID control with Smith compensator; 3: single neuron incomplete differential forward PID with Smith compensator<br />

The simulation results show that when the delay is small, all of the algorithms achieve stable control performance.<br />

As the delay increases, the control effects of these methods differ significantly. With simple PID control there is obvious oscillation in the response curve: the system still reaches a stable state, but the response time becomes longer, the response becomes slower, and the overshoot becomes larger. The other two methods, however, still reach a stable state quickly, and the speed of their response is not affected by the delay. This demonstrates the validity of the Smith compensator.<br />

When the model of the Smith compensator does not fully match the object model, the simple PID control system oscillates seriously and does not reach a stable state within the simulation time. The PID control system with Smith compensator reaches a stable state, but its response time is longer and it shows small oscillation. Model mismatch, in contrast, makes no great difference to the single neuron incomplete differential forward PID with Smith compensator. This shows the validity of the proposed method.<br />

V. CONCLUSION<br />

By combining the Smith compensator with the single neuron incomplete differential forward PID algorithm, the static and dynamic performance of networked control systems is improved. The proposed method is easy to implement, and the simulation results show that it achieves a better control effect than the conventional Smith compensator.<br />

ACKNOWLEDGMENT<br />

This work was supported by Project 61040010 of the National Science Foundation of China.<br />

REFERENCES<br />

[1] R. Luck, A. Ray, “An observer based compensator for distributed delays”, Automatica, vol. 26, no. 5, pp. 903-908, May 1990.<br />

[2] Z. X. Yu, H. T. Chen, Y. J. Wang, “Research on Markov Delay Characteristic-Based Closed Loop Network Control System”, Control Theory and Applications, vol. 19, no. 2, pp. 263-267, February 2002.<br />

[3] J. Nilsson, “Real-time Control Systems with Delays”, Lund, Sweden: Lund Institute of Technology, 1998.<br />

[4] S. S. Hu, Q. X. Zhu, “Stochastic Optimal Control and Analysis of Stability of Networked Control Systems with long delay”, Automatica, vol. 39, no. 11, pp. 1877-1884, July 2003.<br />

[5] P. H. Bauer, M. Sichitiu, C. Lorand, et al., “Total Delay Compensation in LAN Control Systems and Implications for Scheduling”, Proc. of the American Control Conference, Arlington, vol. 6, pp. 4300-4305, 2001.<br />

[6] Y. Tipsuwan, M. Y. Chow, “Gain Scheduler Middleware: a Methodology to Enable Existing Controllers for Networked Control and Teleoperation-Part I: Networked control”, IEEE Transactions on Industrial Electronics, vol. 5, no. 6, pp. 1218-122, 2004.<br />

[7] Y. Tipsuwan, M. Y. Chow, “Control Methodologies in Networked Control Systems”, Control Engineering Practice, vol. 11, no. 10, pp. 1099-1111, October 2003.<br />

[8] Y. Tipsuwan, M. Y. Chow, “On the Gain Scheduling for Networked PI Controller Over IP Network”, Mechatronics, vol. 9, no. 3, pp. 491-498, 2004.<br />

[9] K. Watanabe, “A process-model control for linear system with delay”, IEEE Transaction on Automatic Control, vol. 26, no. 6, pp. 261-1269, 1981.<br />

[10] J. J. Liu, W. M. Ni, Y. P. Yang, “New method for designing robust Smith predictor”, Journal of TsingHua University (Science and Technology), vol. 39, no. 9, pp. 54-57, 1999.<br />

[11] F. Du, Q. Q. Qian, W. C. Du, “Networked Control Systems Based on New Smith Predictor”, Journal of Southwest Jiaotong University, vol. 4, no. 1, pp. 65-69, 2010.<br />

[12] F. Du, “Research of Networked Control Systems Based on New Smith Predictor”, Chengdu: Southwest Jiaotong University, 2008.<br />

[13] T. N. Shi, S. J. Wang, H. W. Fang, “Fuzzy Immune PI Control of Networked Control System Based on Prediction Compensation”, Journal of TianJin University, vol. 42, no. 11, pp. 959-964, 2009.<br />

[14] S. J. Wang, “Fuzzy Immune PI Control of Networked Control System Based on Prediction Compensation”, Tianjin: Tianjin University, 2008.<br />

[15] H. Y. Chen, Q. Guan, W. L. Wang, “Design of a fuzzy PI controller with Smith predictor for networked control systems with long time delay”, vol. 33, no. 4, pp. 418-420, 2005.<br />

[16] P. Chen, L. K. Dai, “Adaptive Smith compensator for NCSs over IP networks”, Control Theory and Application, vol. 23, no. 1, pp. 115-118, 2006.<br />

[17] P. Chen, “Modeling and Controller Design for NCS over IP Network”, Zhejiang: Zhejiang University, 2005.<br />

[18] R. Q. Lin, G. W. Lin, “Models and simulation of neuron PID applied in electric oven based on MATLAB language”, vol. 30, no. 1, pp. 55-58, 2002.<br />

[19] R. Q. Lin, F. W. Yang, “Realization of a Class of Neuron Controller Based on Smith Predictor”, Information and Control, vol. 33, no. 2, pp. 137-140, 2004.<br />

[20] Y. H. Tao, “New PID Control and Application”, Beijing: Mechanic Industry Press, 1998.<br />

Haitao Zhang was born in Henan Province, China, in November 1972. He holds a Ph.D. in Control Theory and Control Engineering from the Institute of Automation, Chinese Academy of Sciences. His research interests include intelligent control and computer application technology.<br />

He is an associate professor at the Electronic and Information Engineering College, Henan University of Science and Technology.<br />

Zhen Li was born in Henan Province, China, in January 1987. She holds a B.S. in Automation from the Electronic and Information Engineering College, Henan University of Science and Technology, China.<br />

She is a graduate student at the Electronic and Information Engineering College, Henan University of Science and Technology. Her research interests include networked control systems.<br />



A Web Crawler System Design Based on Distributed Technology<br />

Shaojun Zhong<br />

Jiangxi University of Science and Technology / Faculty of Science, Ganzhou, China<br />

infor2000@qq.com<br />

Zhijuan Deng<br />

Jiangxi University of Science and Technology / Faculty of Science, Ganzhou, China<br />

66162815@qq.com<br />

Abstract—A practical distributed web crawler architecture<br />

is designed. A distributed cooperative grasping algorithm is put forward to solve the problem of distributed Web Crawler grasping. A log structure and a hash structure are combined into a large-scale web page store that supports both a large volume of random accesses and the addition of new pages. Experimental results show improved performance, scalability and load balance for the distributed Web Crawler.<br />

Index Terms—Search Engine, Web Crawler, Grasping<br />

Strategy, Distributed System<br />

I. INTRODUCTION<br />

The production, transmission, collection and query of information are among the most basic human activities. For information carried in writing, libraries, their cataloguing systems and professional staff have traditionally helped us quickly find the information we need at the granularity of a “book” or an “article”. With the development of computer and information technology came the field of Information Retrieval (IR) and full-text retrieval systems for books and literature, which make it convenient to obtain relevant information at the granularity of “key words”. The openness of the World Wide Web and the widespread accessibility of the information on it greatly encourage people to create content, while bringing both new development opportunities and technological challenges for information retrieval on the World Wide Web.<br />

The scale of traditional IR is relatively limited, and the retrieved objects usually undergo strict screening and preprocessing. The number of queries it responds to is generally not very large. An information query system that provides services on the web (here, a search engine), however, differs from traditional IR in both scale and response time. A search engine has to deal with large-scale information (which pours in continuously, and some of which is even fake) and with a great number of accesses, while still responding quickly.<br />

A search engine is an application system that is developed on the basis of IR, suits the features of the web (or WWW)<br />


doi:10.4304/jnw.6.12.1682-1689<br />

and provides an information query service. A search engine is generally defined as a software system used on the web that collects and discovers information with certain strategies, processes and organizes that information, and finally offers a web information query service to users. How does a software system like a search engine work? Viewed as a system that operates on a data set, its data includes not only unpredictable user queries but also a huge and dynamically changing number of web pages; these pages do not come to the system automatically, so the system has to grasp them. Faced with a large volume of user queries, however, the system cannot “search” the web online each time a query arrives. The basis of a large-scale search engine should therefore be a batch of web pages gathered beforehand [1].<br />

The component that gathers these web pages is called a Web Crawler. As the foremost part of a search engine, it is an all-important object of study. Like the propulsion system of a carrier rocket in aerospace, the Web Crawler is the basis of the search engine: all of the data the engine collects comes from the work of the Web Crawler, which therefore has to be smart, reasonable and powerful.<br />

Search engines are among the most advanced and complex Internet technologies, and companies keep the core technology to themselves. Some big companies already have mature solutions for large web crawlers and have put them into use. However, these large search engines provide ordinary users only with common, non-customized search services. They cannot take the various requirements of different users into consideration, and a single web crawler falls short in many cases. The flexible customization of distributed web crawlers, together with the speed and scale at which they acquire information, meets people’s growing demand for user-oriented web information. This paper therefore presents a distributed design of a web crawler and strives to achieve a robust, scalable and efficient hybrid strategy for a distributed search engine.<br />

II. CORE TECHNOLOGY OF DISTRIBUTED WEB CRAWLER



A. Priority Strategy of Webpage Grasping<br />

The priority strategy of webpage grasping determines the grasping efficiency. Grasping strategies can be roughly divided into three kinds: depth-first, breadth-first and best-first. A depth-first strategy can be employed when the amount of information is not huge. However, given the rapid development of the Internet and the massive amount of web data, a depth-first crawl inevitably descends into a huge volume of data. The grasping strategies of search engines are therefore generally breadth-first and best-first, together with some of their improved variants [2].<br />

B. Diameter of the World Wide Web<br />

The diameter of the World Wide Web, or “Web Diameter”, is defined as follows: if d denotes the length of the shortest path from web page u to web page v, then the average of d over all the different pairs of connected pages on the World Wide Web is called the Web Diameter. According to this definition and calculations over large-scale sets of web pages, the Web Diameter is about 17 [3]. The formula for the Web Diameter is<br />

$$d = 0.35 + 2.06\,\log(N) \qquad (1)$$<br />

where N is the number of web pages. Studies show that the diameter of China’s World Wide Web is 16.26 [4]: if there is a path between two web pages, one can reach one from the other in fewer than 17 clicks on average, as shown in Figure 1.<br />

Figure 1. Diagram of Diameter of the World Wide Web<br />
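Equation (1) is easy to evaluate numerically. The sketch below assumes that the logarithm is base 10, which the text does not state explicitly; the function name is likewise an assumption for illustration.<br />

```python
import math


def web_diameter(num_pages):
    """Estimate of equation (1): d = 0.35 + 2.06 * log10(N).

    The base-10 logarithm is an assumption; the text only writes log(N).
    """
    if num_pages <= 0:
        raise ValueError("num_pages must be positive")
    return 0.35 + 2.06 * math.log10(num_pages)
```

Because the dependence on N is logarithmic, multiplying the number of pages by ten adds only 2.06 to the estimated diameter, which is why a fixed depth limit remains usable as the web grows.<br />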

Analyzing the diameter of the World Wide Web leads to the following two conclusions:<br />

(1) The traversal algorithm affects the crawler’s efficiency to a large extent. The page structure of the World Wide Web is not as deep as one might imagine, but unexpectedly wide. The crawler therefore generally traverses breadth-first. Page importance is a further reason: a breadth-first traversal helps to grasp more of the important web pages.<br />

(2) The World Wide Web is so complex that a chosen grasping path cannot always be guaranteed to be the best one. To guard against this, the diameter of the web needs to be taken into account and a depth-limit strategy should be adopted to control the grasping depth, which solves the problem [5].<br />

Consider the following example.<br />

Suppose there are three paths to web page P, starting from seed sites A, B and C, with lengths 3, 19 and 127 respectively (CostAP = 3; CostBP = 19; CostCP = 127). Grasping web page P from seed site A is very quick, whereas seed sites B and C reach P only after a long path, which is clearly uneconomical.<br />

Figure 2. Path cost diagram of different seed sites<br />

To prevent the Crawler from grasping breadth-first without limit, a maximum depth must be set; once this depth is reached, grasping stops. The value of this depth is the diameter of the World Wide Web. Web pages that lie too deep and remain un-grasped when the crawl stops at the maximum depth can be expected to be reached more economically from other seed sites. For example, the Crawlers starting from seed sites B and C stop once they reach depth 17, leaving web page P to be grasped by the Crawler starting from seed site A. It is not hard to see that limiting the grasping depth removes the conditions for infinite loops: any loops that exist stop after a limited number of iterations. Moreover, combining the depth limit with the breadth-first strategy effectively keeps the crawl local, that is, the crawl mostly grasps web pages under the same domain name and only rarely pages under other domain names [6].<br />
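The combination of breadth-first traversal and a depth limit can be sketched as follows. This is a minimal illustration, not the system described in this paper: the link graph is an in-memory dictionary standing in for page fetching and link extraction, and the function and parameter names are assumptions.<br />

```python
from collections import deque


def crawl_breadth_first(links, seed, max_depth=17):
    """Breadth-first traversal of a link graph that stops at max_depth.

    links: dict mapping a page to the list of pages it links to
           (a stand-in for fetching a page and extracting its links).
    Returns a dict mapping every grasped page to its depth from the seed.
    """
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # depth limit reached: do not follow links any further
        for target in links.get(page, []):
            if target not in depth_of:  # a page is grasped at most once
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


# The graph contains a loop (B links back to A); the crawl still terminates.
example_links = {"A": ["B"], "B": ["C", "A"], "C": ["D"], "D": []}
example_result = crawl_breadth_first(example_links, "A", max_depth=2)
```

With the default limit of 17, a page that is 19 links away from one seed is simply left for a crawler that starts from a closer seed.<br />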

C. Judgement of the Web Importance<br />

Within the priority strategy of web page grasping, important web pages should be grasped first, so that limited resources are spent on the more important pages. Which web pages are more important, and how is importance measured?<br />

The measure of importance is determined by three quantities: IB(P), IL(P) and ID(P).<br />

1) IB(P)<br />

IB(P) is determined mainly by the number and quality of back links. First, the more back links a web page has, the more it is recognized by other pages; it also has more opportunities to be visited by Internet users, and its importance is more evident. Second, the more important the web pages that point to it, the more important it is. The classic counterexample is cheating web pages, which artificially create large numbers of back links pointing to themselves to inflate their importance. If the quality of back links is not considered, the result is only locally optimal rather than globally optimal.<br />

2) IL(P)<br />

IL(P) is a function of the URL string that examines only the string itself. It is realized mainly through simple patterns: for example, it attaches more importance to URLs containing “com” or “home”, and it regards a URL with fewer slashes as more important.<br />

3) ID(P)<br />

ID(P) measures how a web page can be reached from a set of seed sites: from a seed site there is a link path (following breadth-first traversal rules) that arrives at the web page. ID(P) is another important index of a web page. The closer a page is to a seed site, the more opportunities it has to be visited and the more important it is; the seed sites are where the most important web pages are. The farther a page is from the seed sites, the less important it is.<br />
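Of the three measures, IL(P) is the simplest to compute because it needs only the URL string. The following sketch scores a URL with the two patterns named above; the weights, the keyword list and the function name are illustrative assumptions, not values from this paper.<br />

```python
from urllib.parse import urlsplit


def url_location_score(url, keywords=("com", "home"),
                       keyword_bonus=1.0, slash_penalty=0.5):
    """Heuristic IL(P)-style score computed from the URL string only.

    A URL gains keyword_bonus for each keyword it contains and loses
    slash_penalty for each slash in its path (hypothetical weights).
    """
    parts = urlsplit(url)
    text = (parts.netloc + parts.path).lower()
    score = sum(keyword_bonus for word in keywords if word in text)
    return score - slash_penalty * parts.path.count("/")
```

For instance, under these assumed weights `http://www.example.com/home` scores higher than `http://www.example.org/a/b/c.html`, since it contains both keywords and has a shallower path.<br />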

D. Non-Repeated Grasping Strategy<br />

A large number of mirrored web pages is another important characteristic of the web. According to statistics gathered by the Google system over 24 million pages, 22% of the web pages are mirrors. The existence of so many duplicated web pages is unfavorable to users’ queries: it not only wastes the storage space of search engines but also decreases system efficiency.<br />

Repeated grasping has two causes. On the one hand, the collecting program may not keep a clear record of the URLs it has visited. On the other hand, domain names and IP addresses can correspond to each other in multiple ways. The first problem can be solved by recording the visited URLs and comparing each new URL against that record. The second problem is more complex, because different URLs may refer to the same IP address.<br />
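The record of visited URLs can be sketched with a set of normalized URL strings. This sketch addresses only the first cause (missing records), not the mapping between domain names and IP addresses; the class, the method names and the normalization rules are assumptions chosen for illustration.<br />

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Map trivially different spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    path = parts.path or "/"  # treat an empty path as the root
    # Lower-case scheme and host, and drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


class VisitedUrls:
    """Record of visited URLs that every new URL is compared against."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True and record the URL if it has not been visited yet."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A crawler would call `add_if_new` before fetching and skip the URL when the call returns False.<br />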

There are four kinds of correspondence between domain names and IP addresses: one-to-one, one-to-many, many-to-one and many-to-many. A one-to-one relationship does not cause repeated collection, but the others are likely to do so.<br />

1) Algorithm Based on B-tree<br />

Because the number of web pages is huge, web page grasping consumes network bandwidth, machines, time and other resources. Grasping the same web page repeatedly greatly reduces the efficiency of the system, so the Crawler system should adopt a strategy that avoids repeated grasping and ensures that a web page is grasped only once within a certain period of time [7].<br />

A B-tree is a balanced multiway search tree. File systems of operating systems use the B-tree search algorithm, which can also be used to design the URL-matching algorithm that avoids repeated grasping in the Crawler. A B-tree is either empty or a multiway tree. A B-tree of order m must meet the following requirements:<br />

(1) Each node has at most m subtrees;<br />

(2) If the root node is not a leaf node, it has at least two subtrees;<br />

(3) All non-terminal nodes except the root have at least ⌈m/2⌉ subtrees;<br />

(4) All non-terminal nodes contain the following information: (n, A0, K1, A1, K2, A2, …, Kn, An)<br />

Each node includes n pointers pointing to each<br />

keyword record. Ki(i=1, …, n) is keyword and<br />

Ki



make sure whether it has been grasped by looking into<br />

the Hash table.<br />

E. Webpage Revisiting Strategy<br />

The popularity of the web results from the information it brings. Information changes constantly, so updates to webpage content are unavoidable, and information grasped earlier may become out of date or of no use at all. A strategy is thus needed to keep the information timely; it is called the webpage revisiting strategy. Through revisiting, the stored webpages keep pace with the changes of the World Wide Web.<br />

In 2000, Cho and Garcia-Molina of Stanford University randomly chose 500,000 web pages as samples and found that 23% of the web pages were updated on a daily basis, while 40% of the web pages under .com domain names were updated every day. The half-life of web pages was 10 days. In addition, the study shows that the way web pages change can be modeled as a Poisson process [8].<br />

To describe the Poisson process model, let X(t) denote the number of changes of a web page in the period (0, t). A Poisson process with parameter λ has the following property.<br />

For s ≥ 0 and t ≥ 0, the random variable X(s + t) − X(s) follows a Poisson distribution, namely<br />

$$\Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda t)^k}{k!}\,e^{-\lambda t} \qquad (1)$$<br />

in which k = 0, 1, 2, …<br />

The expected value of the random variable X(s + t) − X(s) is λt:<br />

$$E[X(s+t) - X(s)] = \lambda t \qquad (2)$$<br />

This can be proved in a simple way. Suppose that the time interval is 1; then<br />

$$E[X(t+1) - X(t)] = \sum_{k=0}^{\infty} k \Pr\{X(t+1) - X(t) = k\} = \sum_{k=1}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!} = \lambda \qquad (3)$$<br />

in which $\sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = e^{\lambda}$. Together with (2), we obtain<br />

$$E[X(t+1) - X(t)] = \lambda t = \lambda \qquad (4)$$<br />

Through a trace analysis of 500,000 random web pages, Cho and Garcia-Molina came to the important conclusion that the updates of most web pages follow a Poisson process [9].<br />
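The Poisson model gives a direct way to estimate how likely a stored page is to be stale, which is the quantity a revisiting strategy needs. The sketch below evaluates the Poisson probability for k changes and the probability of at least one change; the function names are assumptions for illustration, and no particular change rate is implied by the text.<br />

```python
import math


def poisson_change_probability(k, lam, t):
    """Pr{X(s+t) - X(s) = k} for change rate lam and interval length t."""
    mean = lam * t
    return mean ** k * math.exp(-mean) / math.factorial(k)


def probability_changed(lam, t):
    """Probability that a page changed at least once within an interval t."""
    return 1.0 - math.exp(-lam * t)


def expected_changes(lam, t, terms=60):
    """Numerical check of the expected value lam * t by summing k * Pr{k}."""
    return sum(k * poisson_change_probability(k, lam, t) for k in range(terms))
```

A revisiting scheduler could, for example, revisit a page once `probability_changed` exceeds a chosen threshold, so that frequently changing pages are revisited more often.<br />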

F. Robots Protocol<br />

The Robots Protocol is a standard that a Web Crawler should conscientiously observe; its main content is the Robots.txt file. Crawler writers generally observe this protocol. A Crawler can still acquire web information without observing the Robots.txt standard; but if a webmaster finds that a Crawler causes problems, he will contact its owner through its identifier, or even prevent the Crawler from extracting some web pages by other means. Crawler developers should therefore conscientiously observe this protocol [10].<br />

On entering a web site, a web spider first visits the text file that carries the Robots Protocol, which is usually in the root directory of the web server, such as www.163.com/Robots.txt. With the protocol file Robots.txt, webmasters can define the directories that no Web Crawler may visit, or specific directories that certain Web Crawlers may not visit [11]. For instance, if a site does not want its executable directory and temporary file directory to be searched by search engines, the webmaster can define these two directories as directories to which access is denied.<br />

The format of the Robots file is as follows.<br />

User-agent:<br />

The value of this record is the name of a Crawler. If the file “Robots.txt” contains more than one User-agent record, more than one Crawler is restricted by this protocol; the file must contain at least one User-agent record. If the value of this record is set to *, the protocol applies to any Crawler. The file “Robots.txt” may contain only one record of the form “User-agent:*”.<br />

Disallow:<br />

The value of this record describes a URL that should not be visited. It can be a complete path or a prefix of one. Any URL that starts with the value of Disallow must not be visited by a Robot [12].<br />

For example:<br />

A: “Disallow:/help”means that neither /help.html nor<br />

/help/index.html allows Crawler to grasp.<br />

B: “Disallow:/help/”means that Crawler can grasp<br />

/help.html but can not grasp /help/index.html.<br />

C: If the record <strong>of</strong> Disallow is empty, all pages <strong>of</strong> this<br />

website can be grasped by Crawler and in file<br />

“/robots.txt”, there are two or more Disallow records. If<br />

“/robots.txt”is an empty file, this website is open to any<br />

Crawler and can be grasped.<br />
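The three Disallow cases above can be checked with a short sketch. It uses Python's standard `urllib.robotparser` module; the rule strings and the agent name `ExampleCrawler` are illustrative assumptions, not part of the system described here.<br />

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str, agent: str = "ExampleCrawler") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical rule sets mirroring examples A, B and C.
RULES_A = "User-agent: *\nDisallow: /help"
RULES_B = "User-agent: *\nDisallow: /help/"
RULES_C = "User-agent: *\nDisallow:"

if __name__ == "__main__":
    for name, rules in (("A", RULES_A), ("B", RULES_B), ("C", RULES_C)):
        for path in ("/help.html", "/help/index.html"):
            print(name, path, is_allowed(rules, path))
```

A crawler would typically run such a check before every request and skip any URL for which it returns False.<br />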

Apart from observing the Robots Protocol, a Crawler should plan its grasping strength (the rate at which it requests pages) sensibly: reduce it during the daytime and moderately increase it at night, when the number of visits to a Web host is low. Because of the time difference, when it is daytime in the Eastern Hemisphere it is night in the Western Hemisphere. The Crawler can therefore increase the strength of grasping American and European websites during the day and increase the strength of grasping websites of its own country at night [13].<br />

Even so, a Crawler inevitably places some load on the Web hosts it visits, so a program that monitors website grasping is indispensable. This program records the grasping traffic for every website so that problems caused by occasionally excessive grasping strength can be avoided.<br />



III. A DISTRIBUTED DESIGN OF WEB CRAWLER SYSTEM<br />

Thousands of WWW servers on the web form a mass of information through the links between them, while the connections between hosts remain relatively independent. A single-processor system is restricted by its CPU capacity, disk storage capacity, network bandwidth and other resources. It cannot process such a huge amount of information, let alone keep up with the rapid growth of web information, so distributed technology becomes the natural choice. The design of the distributed system pursues the following goals: (1) The grasping ability of a single machine should not decrease much when the number of grasping machines increases, i.e. the communication and management overhead of the system should be kept to a minimum while load balance is pursued. (2) With actual operation in mind, dynamic configuration of the system should be supported, i.e. one or more machines can be added or removed during operation.<br />

A. A Distributed Structure Design <strong>of</strong> Web Crawler<br />

System<br />

A robust and efficient web crawler needs to distribute its tasks across multiple machines that work concurrently. The huge number of webpages is spread independently over the network, which makes concurrent access both feasible and reasonable. Meanwhile, concurrent distribution saves network bandwidth resources. In addition, to improve the recall ratio, precision and search speed of the whole system, the internal search algorithm should have a certain degree of intelligence. The distributed web crawler therefore adopts the structure shown in Figure 3.<br />


The core of system distribution is data distribution. The chief dispatcher is responsible for distributing URLs to every distributed crawler. The distributed crawlers grasp webpages using the HTTP protocol. To improve speed, hundreds of distributed crawlers can usually be launched simultaneously. The distributed crawlers analyze and process the collected web pages in parallel, extract URL links and other relevant information, and submit them to their respective dispatchers, which in turn submit them to the chief dispatcher.<br />

B. Basic Process Design for a Distributed Web Crawler<br />

Grasping<br />

Figure 4 is a brief flow chart that shows only the error-free processing of pages. In this process, the web crawler starts working when a URL is added to the waiting queue. As long as a webpage remains in the waiting queue or a web crawler is still processing a webpage, the web crawler program continues working. When the waiting queue is empty and no webpage is being processed, the web crawler stops working.<br />
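The start and stop conditions of this process can be illustrated with a minimal single-process sketch. The `fetch_links` callable is a hypothetical stand-in for the download-and-parse step, and the `seen` set is an added assumption to keep the toy loop from revisiting pages; in a single process, an empty queue also implies that no page is still being processed.<br />

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seed: str, fetch_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Process URLs until the waiting queue is empty; return them in visit order."""
    waiting = deque([seed])   # the crawler starts once one URL is queued
    seen = {seed}             # URLs already queued, so none is queued twice
    visited = []
    while waiting:            # stop when the waiting queue is empty
        url = waiting.popleft()
        visited.append(url)
        for link in fetch_links(url):   # hypothetical fetch-and-extract step
            if link not in seen:
                seen.add(link)
                waiting.append(link)
    return visited


if __name__ == "__main__":
    # Toy link graph standing in for real web pages.
    graph = {"a": ["b", "c"], "b": ["c", "a"], "c": []}
    print(crawl("a", lambda url: graph.get(url, [])))
```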

C. The Design <strong>of</strong> a Cooperative Grasping Algorithm <strong>of</strong><br />

the Distributed Web Crawler<br />

When multiple crawlers grasp concurrently, the major problem is how to decompose the workload. If the division is not clear, several crawlers are likely to grasp the same webpage, which causes additional cost. There are two options to solve this problem.<br />

Scheme 1: Decompose by the IP address of the web host, so that a given crawler grasps only the webpages within a certain range of addresses.<br />

Scheme 2: Decompose by domain name, so that a given crawler grasps only the webpages of a certain set of domain names.<br />

The World Wide Web locates a host by its IP address in the network infrastructure, but an IP address in dotted decimal notation is hard to remember, so domain names are mapped to IP addresses. Because domain names are so convenient for people, a problem arises: many domain names may correspond to the same IP address. Small and medium-sized websites usually use this method to provide different Web services for economic reasons, since only one server is needed. Large websites such as Sina, Sohu and other portals, by contrast, generally adopt load-balancing IP multicast technology, which means that the same domain name corresponds to many IP addresses. In this way, the robustness of the system is enhanced and load balance is achieved.<br />

Figure 3. A distributed structure design of web crawler system<br />

Figure 4. Basic process design for a distributed web crawler grasping<br />

Given that many domain names can correspond to the same IP address and the same domain name can correspond to many IP addresses, a fairly good way is to decompose tasks by domain name: as long as the web pages of large websites are not grasped repeatedly, some repeated grasping of small websites is an acceptable cost of this allocation strategy. The method allocates domain names to different Crawlers, and each Crawler grasps only the web pages of its “appointed” set of domain names. For example, sina.com.cn is “appointed” to spider1, jxust.cn to spider2 and sim.jx.cn to spider3.<br />

The main differences between the two schemes can be understood through the following two examples.<br />

Suppose that we have 3 spiders to analyze 2 websites, www.jxust.cn and www.sim.jx.cn. They have different domain names but the same IP address (218.87.136.5). The homepages are http://www.jxust.cn/index.html and http://www.sim.jx.cn/index.html. After DNS resolution, both are actually http://218.87.136.5/index.html. The domain decomposition scheme will make spider2 and spider3 grasp this page repeatedly. However, since this site holds little information, the loss resulting from the repeated grasps can be tolerated.<br />

The IP decomposition scheme allocates grasping tasks differently. For example, sina.com.cn (71.5.7.138) is “appointed” to spider1, sina.com.cn (71.5.6.136) to spider2, and both jxust.cn (218.87.136.5) and sim.jx.cn (218.87.136.5) to spider3. In this allocation scheme, different domains that point to the same IP cause no repetition, because the grasping tasks of jxust.cn and sim.jx.cn are both completed by spider3. However, sina.com.cn corresponds to several IPs, which are allocated to spider1 and spider2 respectively, so the grasping tasks of spider1 and spider2 overlap. Since Sina is a large-scale website, the loss resulting from this repeated grasping will be huge.<br />

The comparison shows that the domain decomposition strategy is more reasonable because it takes large websites into consideration. Therefore, in the Crawler system, a general scheduler should decompose grasping tasks by domain name and assign web pages to different Crawlers accordingly. A formal description of the scheduling is as follows:<br />

First, suppose that n crawlers can work concurrently, and define a function domain that extracts the domain name of a URL. For example, for the URL<br />

http://news.163.com.cn/20090824/08116145133.shtml<br />

Domain (URL) = news.163.com.cn<br />

(1) For any URL, use the function domain to extract the domain name of the URL.<br />

(2) Apply the MD5 signature function to the domain name: MD5 (domain (URL)).<br />

(3) Take the MD5 signature value modulo n: int spider_no = MD5 (domain (URL)) % n.<br />

(4) Allocate this URL to the crawler numbered spider_no to grasp.<br />
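Steps (1) to (4) can be sketched as follows. This is one possible reading, under the assumptions that the domain name is the host part of the URL and that the MD5 digest is interpreted as one large integer; the function names are illustrative.<br />

```python
import hashlib
from urllib.parse import urlsplit


def domain(url: str) -> str:
    """Step (1): extract the host name of a URL."""
    return urlsplit(url).hostname or ""


def spider_no(url: str, n: int) -> int:
    """Steps (2)-(3): MD5 of the domain name, reduced modulo n crawlers."""
    digest = hashlib.md5(domain(url).encode("utf-8")).hexdigest()
    return int(digest, 16) % n


if __name__ == "__main__":
    # Step (4): every URL of one domain lands on the same crawler.
    for url in ("http://www.jxust.cn/index.html",
                "http://www.jxust.cn/news/1.html",
                "http://www.sim.jx.cn/index.html"):
        print(url, "-> spider", spider_no(url, 3))
```

Because only the domain name is hashed, no lookup table of domains is needed: any dispatcher can recompute the assignment from the URL alone.<br />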

A modulo operation divides a universal set into several equivalence classes. The union of the equivalence classes equals the universal set, and an element of one equivalence class never belongs to another equivalence class. Formally, let U be the universal set, mapped by a certain equivalence relation to S1, S2, …, Sn. The following two conditions hold:<br />

(1) S1 ∪ S2 ∪ ... ∪ Sn = U<br />

(2) if a ∈ Si, b ∈ Sj and Si ≠ Sj, then a ≠ b<br />

Generally, n is an integral power of 2. For n = 4, 8, 16, 32, …, the modulus can be obtained rapidly by a bitwise AND (&), i.e. int spider_no = MD5 (domain (URL)) & (n-1). This shortcut is valid only when n is an integral power of 2.<br />
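A small sketch of the bitwise shortcut, with a guard for the power-of-two requirement. The function name `mod_pow2` is an illustrative assumption.<br />

```python
def mod_pow2(value: int, n: int) -> int:
    """Compute value % n with a bitwise AND; n must be a power of two."""
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a positive power of two")
    return value & (n - 1)


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        # Both columns agree for every power of two.
        print(n, 1234567 % n, mod_pow2(1234567, n))
```

The mask n-1 has all low bits set, so the AND keeps exactly the remainder; for any other n the low bits do not form a complete mask and the result would differ from the modulus.<br />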

D. Large-Scale Web Storage Structure Design<br />

The World Wide Web changes all the time, so a web page database must be able to delete the old version of a page once the page itself has been deleted. Deletion may leave storage voids. Updating can be understood as a deletion followed by an addition, with the new version appended to the web database. Therefore, some disk space compaction technique has to be adopted to reclaim the storage voids. In addition, updating and reading should be mutually exclusive to avoid synchronization errors. A good page storage structure can therefore bring excellent access performance.<br />

Combining the log structure and the Hash structure to exploit the advantages of both is quite a good choice. For a new web page, the page's signature is calculated from its URL. Then, through a modulo computation, the web page is mapped to a unit of the Hash table, and each Hash table unit corresponds to the location of a log file. For example, newly added pages that the Hash function maps to Hash [1] are written to the log file Log1. To read an already stored web page at random by its URL, the same Hash function calculation maps the URL to the specific log file, and the B-tree index on that log file is then searched for the corresponding page. The random access performance is equivalent to, or even slightly better than, that of plain log files, because the number of files that must be searched is greatly reduced. Most notably, this approach supports batch writing, which is a great improvement over the pure Hash structure. A write queue is attached to each log file, and a batch is written only when the queue has accumulated a certain number of pages, as shown in Figure 5.<br />



Figure 5. batch writing-in <strong>of</strong> newly added pages<br />

Figure 6. n node distributed cooperative grasping system performance<br />

decreases as time varies<br />

A Hash table turns the uncertain insertion position of a newly added web page into a definite one, so the pages waiting in an insertion queue can be written to their target log file in batch mode. Because the pages are partitioned by the Hash function, each log in the Hash-Log structure is far smaller than the single log of the log structure, and at the same time much larger than a Hash bucket of the Hash structure.<br />

In addition, it must be ensured that each log can be held in memory. So, to determine the size of the Hash table in Hash-Log, it is necessary to consider the size of the actual physical memory and the number of web pages that need to be stored.<br />
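The Hash-Log idea can be sketched as a toy store. This is a simplified illustration rather than the system's implementation: the class name, file naming, bucket count and batch size are assumptions, and a plain dictionary stands in for the B-tree index on each log file.<br />

```python
import hashlib
import os
import tempfile


class HashLogStore:
    """Toy Hash-Log store: hash(URL) selects a log file; writes are batched."""

    def __init__(self, directory, buckets=4, batch_size=2):
        self.directory = directory
        self.buckets = buckets
        self.batch_size = batch_size
        self.queues = [[] for _ in range(buckets)]  # write queue per log file
        self.index = [{} for _ in range(buckets)]   # url -> (offset, length); stand-in for the B-tree

    def _bucket(self, url):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.buckets

    def _path(self, bucket):
        return os.path.join(self.directory, "log%d.dat" % bucket)

    def add(self, url, page):
        """Queue a page; flush the queue once it holds batch_size pages."""
        bucket = self._bucket(url)
        self.queues[bucket].append((url, page))
        if len(self.queues[bucket]) >= self.batch_size:
            self.flush(bucket)

    def flush(self, bucket):
        """Append every queued page of one bucket to its log in one batch."""
        path = self._path(bucket)
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        with open(path, "ab") as log:
            for url, page in self.queues[bucket]:
                log.write(page)
                self.index[bucket][url] = (offset, len(page))
                offset += len(page)
        self.queues[bucket].clear()

    def read(self, url):
        """Random read: hash to the log, then look the page up in its index."""
        bucket = self._bucket(url)
        for queued_url, page in reversed(self.queues[bucket]):
            if queued_url == url:  # page is still waiting in the write queue
                return page
        offset, length = self.index[bucket][url]
        with open(self._path(bucket), "rb") as log:
            log.seek(offset)
            return log.read(length)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        store = HashLogStore(directory)
        store.add("http://www.jxust.cn/index.html", b"<html>jxust</html>")
        print(store.read("http://www.jxust.cn/index.html"))
```

Raising the batch size trades memory for fewer, larger sequential writes, which is the advantage over the pure Hash structure described above.<br />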

Table I gives a qualitative evaluation <strong>of</strong> three storage<br />

ways <strong>of</strong> web pages.<br />

To sum up, when random access is rare, the log structure is the best way to store web pages. When a great deal of random access is expected and many new web pages have to be added, Hash-Log is the more suitable way to store web pages. It effectively supports distributed web page storage: when multiple machines are used to store web pages in a larger environment, it spreads the pages over all storage nodes, which increases the reliability and stability of web page storage. The overall search result is not affected much even if one storage node fails.<br />

TABLE I.<br />
QUALITATIVE EVALUATION OF THREE STORAGE WAYS OF WEB PAGES<br />
Storage way | Log structure | Structure based on Hash | Hash-Log<br />
Ordered access | ++ | - | +<br />
Random access | +- | ++ | +<br />
Increase webpages | ++ | - | +<br />

IV. SYSTEM EVALUATIONS<br />

A. Operating System Environment<br />

TABLE Ⅱ.<br />
HARDWARE DEVICES<br />
Device type | Device Purposes | Device configuration | Count<br />
Lenovo M6000 | Server | Intel P4 3.2GHz/ Memory 1G/ HardDisk 160G/ NIC 10M-100M | 1<br />
Lenovo T2900V | Client | Intel Celeron 2GHz/ Memory 1G/ HardDisk 160G/ NIC 1000M | 3<br />
Lenovo 428E | Client | Intel P4 3.2GHz/ Memory 512M/ HardDisk 800G/ NIC 10M-100M | 10<br />

B. Performance Evaluation<br />

The statistics show that grasping rates and the number of webpages grasped differ considerably between sites. They depend on the access speed of each site, and some sites place restrictions on crawlers’ grasping, including speed restrictions as well as some web accessibility restrictions. With three nodes grasping cooperatively, the average rate can reach 7 pages per second, which is a very satisfactory result.<br />

TABLE Ⅲ.<br />
GRASPING CONDITIONS OF SOME FAMOUS WEBSITES<br />
Web Site | Web Size | Count of Webs | Crawl Time (Hour) | Average Rate (Pages/Second) | Crawl Nodes<br />
163.com | 10.5G | 180596 | 5 | 10.033 | 3<br />
sina.com.cn | 9.3G | 132769 | 5 | 7.376 | 3<br />
yahoo.com.cn | 8.7G | 150101 | 5 | 8.339 | 3<br />
qq.com | 8.2G | 142969 | 5 | 7.943 | 3<br />
sohu.com | 7.7G | 131094 | 5 | 7.283 | 3<br />

C. Scalability Evaluation<br />

A system with good scalability gains performance linearly as cost is added. It is also easy to scale down or expand.<br />

The influence of the number of cooperative grasping nodes on the grasping result is examined below. Figure 6 shows the results of running systems of four different scales (number of cooperative grasping nodes = 1, 2, 4, 10) during the first 10 hours. The abscissa denotes the running time of the crawler system in hours, and the y-coordinate represents the accumulated number of grasped webpages.<br />

Figure 6 shows that the basic system performance increases linearly with the number of cooperative grasping nodes. Therefore, this distributed system has good scalability and stability.<br />

D. Task Load Balance Evaluation<br />

The load balance of the system relies on the cooperative grasping algorithm of the distributed web crawler, which uses a Hash function to allocate URLs dynamically among the nodes. Whether load balance has been attained cannot be judged from the number of URLs allocated to each node at a single moment of the process. Instead, all phases of the whole cooperative grasping process should be analyzed (the whole grasping process is divided into several phases in time). The experiment is carried out with 3 nodes cooperatively grasping 163.com. TABLE IV shows the URL distribution of each node over the system's 5 hours of running.<br />

TABLE Ⅳ.<br />
THE VARIATION OF THE NUMBER OF URL ALLOCATION OF THE 3 NODES AS TIME VARIES<br />
Nodes \ Running Time | 10 Minutes | 30 Minutes | 1 Hour | 2 Hours | 5 Hours<br />
Node 1 | 1893 | 5574 | 11753 | 22961 | 53742<br />
Node 2 | 1967 | 5782 | 12587 | 24323 | 57933<br />
Node 3 | 1952 | 5637 | 12127 | 22939 | 55161<br />

It is shown in TABLE IV that each node has grasped a basically equal number of webpages. The load balance of the distributed web crawler system has reached the expected basic objective.<br />
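One simple way to quantify "basically equal" is the largest relative deviation of a node from the mean allocation. The sketch below applies this measure, which is an assumption of this example and not a metric defined in the paper, to the 5-hour column of TABLE IV.<br />

```python
# URLs allocated to each node after 5 hours (last column of TABLE IV).
allocated = {"Node 1": 53742, "Node 2": 57933, "Node 3": 55161}


def max_relative_deviation(counts):
    """Largest |count - mean| / mean over all nodes."""
    counts = list(counts)
    mean = sum(counts) / len(counts)
    return max(abs(c - mean) / mean for c in counts)


if __name__ == "__main__":
    print("max deviation from mean: %.2f%%"
          % (100 * max_relative_deviation(allocated.values())))
```

For these figures the largest deviation is roughly 4 percent of the mean (Node 2), which is consistent with the statement that the nodes carry nearly equal loads.<br />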

REFERENCES<br />

[1] Li Xiaoming, Yan Hongfei, Wang Jimin, Search Engine: Principle, Technology and System. Beijing: Science Press, 2005.<br />

[2] M. Najork, J. Wiener, “Breadth-first search crawling yields high-quality pages,” in 10th International World Wide Web Conference, 2001.<br />

[3] Reka Albert, Hawoong Jeong, Albert-Laszlo Barabasi, “Diameter of the World-Wide Web,” Nature 401, pp. 130-131, 1999.<br />

[4] Li Xiaoming, “Estimation of the Number of Static Web Pages in China,” PKU_CS_NET_TR2002006, 2002.<br />

[5] A. Broder, R. Kumar, F. Maghoul, A. Tomkins, J. Wiener, “Graph structure in the web: experiments and models,” presented at Proceedings of the 9th World-Wide Web Conference, Amsterdam, 2000.<br />

[6] A. Arasu, J. Cho, H. Garcia-Molina, “Searching the Web,” ACM Transactions on Internet Technology, pp. 42.<br />

[7] Narayanan Shivakumar, Hector Garcia-Molina, “Finding near-replicas of documents on the web,” WebDB 1998, pp. 204-212.<br />

[8] J. Cho, H. Garcia-Molina, “Estimating Frequency of Change,” ACM Transactions on Internet Technology, Vol. 3, 2003.<br />

[9] A Standard for Robot Exclusion [EB/OL], http://www.robotstxt.org/wc/norobots.html<br />

[10] J. Talim, Z. Liu, Ph. Nain, E. G. Coffman, “Controlling the robots of Web search engines,” Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, Cambridge, Massachusetts, United States, 2001.<br />

[11] Junghoo Cho, Hector Garcia-Molina, “Parallel crawlers,” in Proceedings of the eleventh international conference on World Wide Web, Honolulu, Hawaii, USA, ACM Press, pp. 124-135, 2002.<br />

[12] Paolo Boldi, Bruno Codenotti, Massimo Santini and Sebastiano Vigna, UbiCrawler: A Scalable Fully Distributed WebCrawler, 2003.<br />

[13] Yan Hongfei, “Primary Exploration on Design, Realization and Application of Extensible Web Information Collection System,” Beijing University Doctoral Dissertation, 2002.<br />


1690 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />

A Ranking Method <strong>of</strong> Retrieval Results Based on<br />

Web Comprehending<br />

Zhijuan Deng<br />

Jiangxi University <strong>of</strong> Science and Technology/ Faculty <strong>of</strong> Science, Ganzhou, China<br />

66162815@qq.com<br />

Shaojun Zhong<br />

Jiangxi University <strong>of</strong> Science and Technology/ Faculty <strong>of</strong> Science, Ganzhou, China<br />

infor2000@qq.com<br />

Abstract—This paper puts forward a method for calculating the query similarity of webpage search results based on Web comprehending. According to the user's query input, the method uses Web comprehending technology to display the important web pages closest to the user's query on the first page of the result list, which makes users more satisfied with the response of the search engine and ensures precision while pursuing recall ratio.<br />

Index Terms—similarity, Web comprehending, search<br />

engine, search results<br />

I. INTRODUCTION<br />

As is well known, the scale of the World Wide Web is enormous. According to the analysis of Lawrence and Giles, the number of pages on the World Wide Web doubles every two years. As early as 1998, researchers estimated the scale of the World Wide Web to be on the order of a billion pages. By this reckoning, the scale of the present World Wide Web has reached the order of ten billion pages [1]. A search engine should display a list of information related to the query strings that users input. Although the precision of a search strategy based on keyword matching is very high, it ignores many semantically related words, which limits its ability to offer users effective information.<br />

A topic-specific search engine is devoted to offering users more comprehensive and professional service related to a topic. Combined with Web comprehending, it offers correlation search and actively returns webpage search results that are highly professional and highly relevant. For a topic-specific search engine, if the query words that users input are common, the number of pages related to the input is large. If the captured pages are displayed without any ranking rule, many unimportant pages of low relevance will appear in the front positions. Investigations show that most users of a search engine pay attention only to the first page of search results and do not look at the next page. Thus, if the first page is unimportant or of low relevance, precision suffers and users are dissatisfied with the results of the search engine. Ranking the retrieved pages is therefore necessary.<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1690-1696<br />

However, the pages obtained through exact-match queries and the pages obtained through Web comprehending technology are different. To follow the principle of displaying important pages on the first page, a quantitative basis must be provided to distinguish these two kinds of pages when ranking. That is, the method of calculating the similarity value between query words and documents should be adapted, starting from the one used for exact matching.<br />

II. CALCULATION OF THE SIMILARITY BETWEEN WEB<br />

COMPREHENDING TECHNOLOGY AND SEARCH RESULTS<br />

A. Comprehending the Importance <strong>of</strong> Web Pages with<br />

PageRank Technology<br />

PageRank is a link analysis method put forward in 1998 by Sergey Brin and Lawrence Page, who were then doctoral candidates at Stanford University. It evaluates all pages, assigns every page a value that measures its importance, and finally uses this value in the ranking of search results [2].<br />

Specifically, PageRank assumes that a surfer browses for several steps by following links, then jumps to a random starting page and follows links again, so the value of a page is decided by how frequently it is visited during this random surfing. The basic ideas of the PageRank algorithm are as follows: (1) if a page is referenced by many other pages, it is likely to be an important page; (2) if a page is not referenced many times but is referenced by an important page, it is also likely to be an important page; (3) the importance of a page is divided evenly among, and transferred to, the pages that it references. PageRank calculates the importance of all pages from the link structure of the entire Web, assuming that users can visit the entire network via the hyperlinks between pages. The Web graph, however, is usually not strongly connected, so PageRank applies random surfing: with probability 1 − d, where d is the damping coefficient defined below, a visitor jumps at random to another node of the Web graph, which is equivalent to adding a link between two pages that have no link. The application of PageRank in Google has shown that integrating the PR value, obtained by analyzing and comprehending Web page links, into the results can greatly improve the precision of search results [3].<br />

The value of PageRank is defined as follows:<br />

Assume that pages T1…Tn point to pageA (that is, T1…Tn reference pageA). Parameter d is a damping coefficient set between 0 and 1. C(A) is defined as the number of links going out of pageA. The PageRank value of pageA is then given by (1).<br />

$$PR(A) = (1 - d) + d \times \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)} \qquad (1)$$<br />
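Formula (1) can be evaluated by simple iteration, as in the following sketch. The value d = 0.85, the fixed iteration count and the toy link graph are assumptions made for illustration; the sketch also assumes that every page has at least one outgoing link, that each link target is itself a key of the graph, and that no link is listed twice.<br />

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], d: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pr = {page: 1.0 for page in links}   # initial score of every page
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Each page T that links to `page` passes on PR(T) / C(T).
            incoming = sum(pr[src] / len(out)
                           for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


if __name__ == "__main__":
    # Toy graph: A links to B and C, B links to C, C links to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 3))
```

With formula (1) in this form, the scores of the sketch average 1 per page; dividing each score by the number of pages rescales them so that they sum to 1.<br />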

PageRank values form a probability distribution over the entire set of web pages, so the sum of the PageRank values of all web pages is 1. In the formula, PR(A) is the PageRank value of pageA; d is the damping factor,<br />

0



similarity. 3) rank the documents and return them to users<br />

according to the similarity with a given query word. So<br />

vector space model reports the comprehended Web<br />

document content back to retrieval system to optimize<br />

users’ query.<br />

C. Utilize Latent Semantic Analysis to Comprehend<br />

Web Documents<br />

Latent semantic indexing (LSI), also called LSA (Latent Semantic Analysis), was put forward to improve the effect of the vector space model. The basis of latent semantic indexing is the feature item-text matrix. Singular value decomposition is applied to this matrix to obtain the latent semantic structural model.<br />

The object of LSI is the relation between words in texts: some latent association is implied in the contextual usage patterns of terms. Statistical calculation is therefore applied to a large number of texts to find this latent association. The method needs no explicit semantic coding and relies only on the relation between objects and their contexts. It uses the latent semantics to represent words as well as texts, and thereby eliminates the correlation between words and simplifies the text vectors. Because there is strong correlation between words, LSI uses this correlation to apply a statistical transform to the contextual usage patterns of the words in the text collection and obtain a new semantic space [8].<br />

Some incidental usages of vocabulary should be removed from the main semantic structure, for example misused words, unrelated words that happen to occur in the same document, and “noise” words that cannot represent the topic of the text, such as very high-frequency and very low-frequency words. Truncated singular value decomposition reduces the dimension and thereby filters the information and removes this noise. LSI projects the high-dimensional representation of texts and vocabulary into a low-dimensional latent semantic space, which reduces the dimension of the problem; at the same time, the low-dimensional representation shows the semantic relation between vocabulary and texts [9].<br />

D. Singular Value Decomposition<br />

Latent semantic indexing mainly relies on the singular value decomposition (SVD) of a matrix. SVD is a common method in mathematical statistics, used mainly for unconstrained least squares problems, matrix rank estimation, and canonical correlation analysis.<br />

Definition of singular values: let A be a real m×n matrix. The arithmetic square roots of the non-zero eigenvalues of the n-order square matrix $A^{T}A$ are the singular values of matrix A.<br />

Singular value decomposition theorem: let $A \in R^{m \times n}$ have rank r. Then there exist an m-order orthogonal matrix U and an n-order orthogonal matrix V such that<br />

$$U^{T} A V = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix},$$<br />

and<br />

$$A = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^{T}$$<br />

is the singular value decomposition of matrix A.<br />

© 2011 ACADEMY PUBLISHER<br />

Information retrieval uses a special form of singular value decomposition, because the matrix to be decomposed in information retrieval is usually a high-order sparse matrix [10].<br />

More precisely, suppose the vocabulary-text matrix A is a sparse matrix with m rows and n columns, where $m \gg n$ and rank(A) = r. By the singular value decomposition theorem, the singular value decomposition of A is $A = T_0 S_0 D_0^{T}$.<br />

The columns of $T_0$ are mutually orthogonal and have length 1, that is, $T_0^{T} T_0 = I$; the column vectors of $T_0$ are called the left singular vectors of matrix A. $S_0$ is the diagonal matrix of the singular values of A, that is, $\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ with $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_r \ge \lambda_{r+1} \ge \ldots = 0$, in which $\lambda_i$ is the i-th singular value of A.<br />

The columns of $D_0$ are mutually orthogonal and have length 1, that is, $D_0^{T} D_0 = I$; the column vectors of $D_0$ are called the right singular vectors of matrix A.<br />

In general, in $A = T_0 S_0 D_0^{T}$ the matrices $T_0$, $S_0$ and $D_0$ are all of full rank and together carry all the information of the original matrix A. The advantage of SVD lies in obtaining a best-fit approximation with smaller matrices [11]. If the diagonal elements of $S_0$ are ordered by size, the k largest singular values are kept and the others are set to 0, the result is a matrix $A_k$ of rank k that approximates the original matrix A. It can be proved that among all matrices of rank k, $A_k$ is the one closest to A in the Frobenius norm. After the zeros are introduced into $S_0$, it can be simplified by deleting the corresponding rows and columns, which gives a new diagonal matrix S; taking the first k columns of $T_0$ and $D_0$ gives matrices T and D respectively. The rank-k approximation matrix $A_k$ of A can then be constructed.<br />

$$A \approx A_k = T S D^{T} \qquad (4)$$<br />

This is the optimal rank-k model in the least-squares sense, and it can be used to estimate the necessary data.<br />
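For reference, the size of the approximation error can be stated explicitly. The identity below is the standard Eckart-Young result for truncated SVD rather than a formula taken from the paper; it assumes $\lambda_i$ denotes the singular values of A in decreasing order, as in the text.<br />

```latex
\| A - A_k \|_F^{2} \;=\; \sum_{i=k+1}^{r} \lambda_i^{2},
\qquad
\| A \|_F^{2} \;=\; \sum_{i=1}^{r} \lambda_i^{2}
```

Under this reading, discarding the smallest singular values discards the smallest contributions to the Frobenius norm of A.<br />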

The choice of the dimension k affects the effectiveness of the semantic space model: too small a k loses useful information, while too large a k makes the computation expensive. In general, given $\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ with $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_r \ge \lambda_{r+1} \ge \ldots = 0$, k is chosen to satisfy the contribution rate inequality<br />

$$\sum_{i=1}^{k} \lambda_i \Big/ \sum_{i=1}^{r} \lambda_i \ge \theta \quad (\theta \text{ can be } 40\%,\ 50\%) \qquad (5)$$<br />



Here θ is a threshold on the share of the original information that is retained. The contribution rate inequality is proposed according to the corresponding concept in factor analysis, so as to measure how well the k-dimensional space represents the whole space.<br />
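The contribution rate inequality (5) can be turned into a simple selection rule for k. The sketch below returns the smallest k that reaches the threshold; the function name `choose_k` and the sample singular values are assumptions made for illustration only.<br />

```python
def choose_k(singular_values, theta):
    """Return the smallest k with (sum of the k largest values) / (sum of all values) >= theta."""
    # Keep only the non-zero singular values, largest first.
    values = sorted((v for v in singular_values if v > 0), reverse=True)
    total = sum(values)
    running = 0.0
    for k, value in enumerate(values, start=1):
        running += value
        if running / total >= theta:
            return k
    return len(values)  # reached only when there are no positive values


# Illustrative singular values (not data from the paper).
lambdas = [5.0, 3.0, 1.0, 1.0, 0.0]
k_half = choose_k(lambdas, 0.5)
k_most = choose_k(lambdas, 0.8)
```

With these sample values the largest singular value alone carries half of the total, so a threshold of 50% is already met at k = 1, while a threshold of 80% needs k = 2.<br />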

Figure 2. Singular value decomposition diagram of the vocabulary-text matrix.<br />

For the approximation matrix $A_k$, the row vectors of T are called vocabulary vectors and the row vectors of D are called text vectors. Text retrieval and other text processing built on these vectors is what is called latent semantic indexing (LSI). Vocabulary vectors and text vectors can be projected into the same low-dimensional space of dimension k, which is called the latent semantic space. Figure 3 is an example of vocabulary and text in the latent semantic space.<br />

Figure 3. Expression of vocabulary and text in latent semantic space.<br />

Through singular value decomposition and the choice of a rank-k approximation matrix, LSI effectively addresses the problems of synonymy and polysemy. Consider, for instance, “computer”, “computing machine”, “programming” and “home”: “computer” and “computing machine” are synonyms, “programming” is related to both, and “home” is unrelated to the other three words.<br />

In a keyword-based retrieval system, if “computer” does not appear literally in a text, a query for “computer” retrieves neither the texts containing “computing machine” nor those containing “home”. Users, however, expect a query for “computer” to also return texts about “computing machine”, and possibly texts about “programming”, whose degree of association is lower than that of “computing machine”; they do not expect texts about “home”.<br />

In the latent semantic space obtained by singular value decomposition, latent semantic indexing can express the inner relation between these words well. In this space the contexts of “computer”, “computing machine” and “programming” are consistent to some degree, so these words lie close to one another and far from “home”, which brings the semantic relation between vocabulary to the foreground.<br />

The same holds between vocabulary and text and between text and text [12]. In general, a fairly small k suffices: the resulting semantic space represents most of the information in the original matrix A, while information regarded as “noise” is removed. In addition, the factors of the rank-k approximation are much smaller than the original high-dimensional sparse m×n matrix. The smaller matrices reduce the computational complexity, which helps to improve retrieval efficiency.<br />

E. Calculation of Similarity Relations in Latent Semantic Indexing<br />

There are three important relations in the semantic space: vocabulary and vocabulary, text and text, and vocabulary and text. Because the approximation matrix $A_k$ of the original matrix A captures the most important and reliable latent semantic structure of A, and vocabularies and texts are all projected into the same space, the similarity for all three relations can be conveniently calculated from the matrices T, S and D [10].<br />

1) To compare two vocabularies, multiply $A_k$ by its transpose on the right (forward multiplication).<br />

$$A_k \times A_k^{T} = T \times S \times D^{T} \times D \times S \times T^{T} = T \times S^{2} \times T^{T} \qquad (6)$$<br />

Here $D^{T} \times D = I$, because the columns of D are orthonormal. The entry in row i and column j of the product represents the similarity between vocabulary i and vocabulary j.<br />

2) To compare two texts, multiply $A_k$ by its transpose on the left (backward multiplication).<br />

$$A_k^{T} \times A_k = D \times S \times T^{T} \times T \times S \times D^{T} = D \times S^{2} \times D^{T} \qquad (7)$$<br />

In the above formula $T^{T} \times T = I$, because the columns of T are orthonormal. The entry in row i and column j of the product represents the similarity between text i and text j.<br />

3) To compare a vocabulary and a text, use the approximation matrix $A_k$ of the original matrix A itself.<br />

$$A_k = T \times S \times D^{T} \qquad (8)$$<br />
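The identities in (6) and (7) can be checked numerically on a small example. The sketch below uses plain Python lists; the helper functions and the values of T, S and D are hypothetical and chosen only so that the columns of T and D are orthonormal.<br />

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical rank-2 factors for 3 vocabularies and 2 texts (illustrative values).
T = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # vocabulary vectors as rows, orthonormal columns
S = [[2.0, 0.0], [0.0, 1.0]]              # diagonal matrix of the k = 2 kept singular values
D = [[0.6, 0.8], [0.8, -0.6]]             # text vectors as rows, orthonormal columns

A_k = matmul(matmul(T, S), transpose(D))  # formula (8): A_k = T S D^T
S2 = matmul(S, S)

vocab_sim = matmul(matmul(T, S2), transpose(T))  # formula (6): T S^2 T^T
text_sim = matmul(matmul(D, S2), transpose(D))   # formula (7): D S^2 D^T

# The same similarities computed directly from A_k, for comparison.
direct_vocab = matmul(A_k, transpose(A_k))
direct_text = matmul(transpose(A_k), A_k)
```

Working with T, S and D instead of $A_k$ keeps every intermediate matrix at size k in one dimension, which is the practical benefit of the factored forms.<br />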

4) The similarity between users’ query requests and texts.<br />

In retrieval, a user’s query request can consist of vocabularies, texts, or any combination of both. The system first preprocesses the query, generates a query vector q from the word frequency information, treats it as a “pseudo-text”, and represents it in the k-dimensional semantic space. With q as the original query vector, its representation in the k-dimensional semantic space is $q^{*} = q^{T} T S^{-1}$. The similarity between $q^{*}$ and the text vectors can then be calculated in the k-dimensional space. Three common formulas are as follows:<br />

a) Inner-product formula<br />

$$Sim_1(q^{*}, d_j) = \sum_{i=1}^{k} d_{ji} \times q_i^{*} \qquad (9)$$<br />

b) Cosine formula<br />

$$Sim_2(q^{*}, d_j) = \frac{\sum_{i=1}^{k} d_{ji} \times q_i^{*}}{\sqrt{\sum_{i=1}^{k} d_{ji}^{2}} \times \sqrt{\sum_{i=1}^{k} (q_i^{*})^{2}}} \qquad (10)$$<br />

c) Pearson formula<br />

$$Sim_3(q^{*}, d_j) = \frac{\sum_{i=1}^{k} (d_{ji} - \bar{d}_j)(q_i^{*} - \bar{q}^{*})}{\sqrt{\sum_{i=1}^{k} (d_{ji} - \bar{d}_j)^{2}} \times \sqrt{\sum_{i=1}^{k} (q_i^{*} - \bar{q}^{*})^{2}}} \qquad (11)$$<br />

In (9), (10) and (11), $q_i^{*}$ is the weight of the i-th vocabulary of the query vector, $d_{ji}$ is the weight of the i-th vocabulary of the j-th text vector, and k is the dimension of the semantic space. Finally, the texts are ranked by similarity, and the text list is returned to users in response to their query request.<br />
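The three similarity measures can be sketched in a few lines. The code below assumes the usual square-root normalization for the cosine and Pearson forms; the function names and the sample vectors are illustrative and not taken from the paper.<br />

```python
import math


def inner_product(q, d):
    """Inner-product similarity: sum of d_i * q_i."""
    return sum(di * qi for di, qi in zip(d, q))


def cosine(q, d):
    """Cosine similarity: inner product divided by the product of the vector lengths."""
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return inner_product(q, d) / norm if norm else 0.0


def pearson(q, d):
    """Pearson similarity: cosine similarity of the mean-centered vectors."""
    q_mean = sum(q) / len(q)
    d_mean = sum(d) / len(d)
    return cosine([qi - q_mean for qi in q], [di - d_mean for di in d])


# Illustrative query vector and text vectors in a 3-dimensional semantic space.
q_star = [1.0, 2.0, 3.0]
texts = {"d1": [2.0, 4.0, 6.0], "d2": [3.0, 2.0, 1.0]}

# Rank the texts by cosine similarity to the query, most similar first.
ranked = sorted(texts, key=lambda name: cosine(q_star, texts[name]), reverse=True)
```

The inner product grows with vector length, whereas the cosine and Pearson forms are normalized, which is one reason a normalized measure is often preferred for ranking texts of different sizes.<br />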

Ⅲ. IMPROVED RANKING METHOD OF WEB PAGE RETRIEVAL RESULTS<br />

A. The Ranking Method of Search Results under Web Comprehending<br />

Quantifying the importance and the relevance of web pages is the basis for ranking them. As shown above, the PageRank value of a web page reflects its importance, and computing it requires only the link relations within the set of web pages, according to (12).<br />

$$PR(A) = (1 - d) + d \times \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)} \qquad (12)$$<br />

Once the PageRank values of the web pages are obtained, the query similarity is calculated only for the subset of web pages with high scores, which greatly reduces the cost of building vectors and calculating vector similarity.<br />

After word segmentation, document d and query q are reduced to sets of vocabularies. Let $\Sigma = \{t_1, t_2, \ldots, t_N\}$ be the dictionary, where $t_i$ is a lexical item and N is the dictionary size, so<br />

$$d = \{t_1^{m_1}, t_2^{m_2}, \ldots, t_N^{m_N}\}$$<br />

$$q = \{t_1^{n_1}, t_2^{n_2}, \ldots, t_N^{n_N}\}$$<br />

In the above formulas, $m_i$ and $n_i$ (i = 1, 2, …, N) represent the weights of the corresponding words. Because the dictionary is fixed, the vectors can be represented by the weight values alone.<br />

$$d = \{m_1, m_2, \ldots, m_N\}$$<br />

$$q = \{n_1, n_2, \ldots, n_N\}$$<br />

The typical TF*IDF scheme is applied to calculate the weights, and the vector representation of a document is obtained by normalizing $m_i$:<br />

$$d = (w_1, w_2, \ldots, w_N)$$<br />

where<br />

$$w_i = TF_i \times IDF_i = \frac{m_i}{\sum m_i} \times \lg\left(\frac{M}{k_i}\right) \qquad (13)$$<br />

In (13), $k_i$ is the number of documents in the document set that contain lexical item $t_i$, and M is the size of the document set. In this way the weight values of all feature items of a document are obtained. In the same way, query q can be converted into a vector of feature item weights.<br />
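A minimal sketch of the weight calculation in (13) follows. It assumes $m_i$ is the raw count of term $t_i$ in the document and that lg denotes the base-10 logarithm; the function name, the dictionary and the document frequencies are hypothetical.<br />

```python
import math
from collections import Counter


def tfidf_vector(doc_terms, dictionary, doc_freq, num_docs):
    """Return the weight vector (w_1, ..., w_N) of one document following formula (13).

    doc_terms:  list of terms in the document after word segmentation
    dictionary: ordered list of lexical items t_1 ... t_N
    doc_freq:   maps each lexical item to k_i, the number of documents containing it
    num_docs:   M, the size of the document set
    """
    counts = Counter(doc_terms)
    total = sum(counts[term] for term in dictionary)  # sum of m_i over the dictionary
    weights = []
    for term in dictionary:
        if total == 0 or counts[term] == 0:
            weights.append(0.0)  # term absent from the document
            continue
        tf = counts[term] / total                      # m_i / sum(m_i)
        idf = math.log10(num_docs / doc_freq[term])    # lg(M / k_i)
        weights.append(tf * idf)
    return weights


# Illustrative data (not from the paper's experiment).
dictionary = ["national", "day", "parade"]
doc_freq = {"national": 10, "day": 100, "parade": 1}
weights = tfidf_vector(["national", "day", "day", "parade"], dictionary, doc_freq, num_docs=100)
```

In this example the term that appears in every document receives weight 0, while the rare term receives the largest weight even though it occurs only once.<br />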

The similarity between a document and the query string ultimately decides the display order of the web pages. It is calculated with the cosine formula for the similarity between users’ queries and texts.<br />

$$Sim(q, d) = \frac{\sum_{i=1}^{k} d_i \times q_i}{\sqrt{\sum_{i=1}^{k} d_i^{2}} \times \sqrt{\sum_{i=1}^{k} q_i^{2}}} \qquad (14)$$<br />

The cosine of the angle between the document weight vector and the query weight vector is the similarity value of document d and query q, and the web pages are ranked according to this value.<br />

B. Improvement of the Similarity Calculation Formula<br />

A web page retrieved by exactly matching the query words is, after all, different from a web page retrieved through related words. To follow the principle that the first result page should be the most prominent, the ranking needs a quantitative basis for distinguishing the two kinds of web page. This means the method for calculating the similarity between query words and texts should be changed accordingly.<br />

Assume the query content has been preliminarily filtered at input time. To respect the query strings entered by users, the traditional cosine formula is still used to calculate the similarity between query q and the texts obtained by exact matching.<br />

For documents obtained through semantically related words, the value should be appropriately reduced when the query similarity is calculated. As analyzed above, the similarity between vocabularies can be calculated by forward multiplication of the matrix. After the product matrix is normalized, the vocabulary similarity θ (0 ≤ θ ≤ 1) is taken as a reduction factor and used to calculate the similarity between the query vector and the documents obtained through semantically related terms. On this basis, this paper puts forward a new formula for the similarity between a Web document and the query vector. It is derived from the cosine formula and calculates the similarity between the query string and the documents obtained from either the original index terms or semantically related terms, as follows.<br />


$$Sim(q, d) = \begin{cases} \dfrac{\sum_{i=1}^{k} (d_i \times q_i)}{\sqrt{\sum_{i=1}^{k} d_i^{2}} \times \sqrt{\sum_{i=1}^{k} q_i^{2}}}, & \text{d obtained from original index terms} \\[2ex] \theta \times \dfrac{\sum_{i=1}^{k} (d_i \times q_i)}{\sqrt{\sum_{i=1}^{k} d_i^{2}} \times \sqrt{\sum_{i=1}^{k} q_i^{2}}}, & \text{d obtained from semantically related terms} \end{cases} \quad (0 \le \theta \le 1) \qquad (15)$$<br />

In (15), θ is the threshold value obtained in the singular value decomposition of the matrix when calculating latent semantic similarity; it satisfies the contribution rate inequality $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{r} \lambda_i \ge \theta$ and indicates the share of the original information that is retained. The contribution rate inequality measures how well the k-dimensional subspace represents the entire space.<br />
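The two-case formula (15) amounts to scaling the cosine similarity by θ whenever a document was reached through a semantically related term. The sketch below illustrates this reading; the function names and the flag `matched_by_original_term` are assumptions introduced for the example.<br />

```python
import math


def cosine(q, d):
    """Cosine similarity between a query weight vector and a document weight vector."""
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return sum(di * qi for di, qi in zip(d, q)) / norm if norm else 0.0


def improved_similarity(q, d, matched_by_original_term, theta):
    """Similarity following the two cases of formula (15).

    matched_by_original_term is True when the document was found through an original
    index term, and False when it was found through a semantically related term.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    base = cosine(q, d)
    return base if matched_by_original_term else theta * base


# Illustrative vectors: the same document scored under both cases.
query = [1.0, 0.0, 1.0]
document = [1.0, 0.0, 1.0]
exact_score = improved_similarity(query, document, True, theta=0.5)
related_score = improved_similarity(query, document, False, theta=0.5)
```

Because θ never exceeds 1, a document reached only through related terms can never outrank an otherwise identical document that matches the query words directly.<br />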

Applying the improved similarity formula to the search results for a user’s query makes it possible to maintain precision while pursuing recall. Ranking web pages by the similarity value from the above formula can make users more satisfied with the response of the search engine.<br />

C. Experimental Results and Their Analysis<br />

The experiment used a Web document set on the topic “The 60th Anniversary of National Day”, with “National Day”, “military parade” and other query words. Two web page ranking algorithms were implemented for the retrieval system, one based on the traditional cosine formula and one based on the improved similarity formula described above. Because common search engines display 10 to 20 pages on the first page of the result list, the experiment tracked how users browsed the returned list and recorded the number of user clicks within the first 10 pages and the first 20 pages in order to analyze user satisfaction. The detailed data are shown in Table Ⅰ.<br />

TABLE I.<br />
COMPARISON OF USERS’ NUMBER OF CLICKS IN THE FIRST 10 PAGES / THE FIRST 20 PAGES<br />
<table>
<tr><th>Query Words</th><th>Traditional Cosine Formula</th><th>Improved Similarity Calculation Formula</th></tr>
<tr><td>National Day</td><td>7/10</td><td>7/12</td></tr>
<tr><td>Hoisting the Flag</td><td>5/6</td><td>6/13</td></tr>
<tr><td>Evening Party</td><td>4/7</td><td>8/9</td></tr>
<tr><td>Military Parade</td><td>5/11</td><td>10/15</td></tr>
</table>

The table shows that, for the same query words, user satisfaction differs between the web page sets produced by the two similarity formulas. The number of clicks within the first 10 pages and the first 20 pages obtained with the improved similarity formula is at least as high for every query word, which indicates higher precision. The pages in the result list therefore meet users’ query requests better, and users are more satisfied with the list. From the user’s point of view, the content returned on the first page of the list is closer to the query words, so the retrieval quality of the system is higher. Applying the improved similarity formula thus makes the query similarity value of pages more precise, which optimizes the result list and improves retrieval quality.<br />

C. Retrieval through the Related Inverted Index File<br />

Retrieval is the process of extracting query words from the user’s query string, matching them against the index word list of the inverted file, and generating the result set.<br />

Concretely, after the user inputs a query string, the retrieval system first performs word segmentation on the Chinese characters in the string, removes stop words and punctuation, and extracts the query words. It then searches the inverted file provided by the indexing system and matches the query words. For each successfully matched index item, it reads the word frequency information and the inverted list of the index word, and records the numbers of the documents containing the index word together with the positions of the index word. Next, depending on whether the user has selected the option on the retrieval interface to display semantically related documents, it decides whether to examine the related-word attribute. If the user chooses to display related results, it reads the pointer field for the semantically related words of the index word and obtains the information about each semantically related word through its offset in the list. After that, it calculates the PageRank score of the original web page corresponding to each obtained document, selects the documents with higher scores, and calculates their query similarity. Finally, it prepares the web page list for display, ordering the web pages by similarity value.<br />

Displaying the result list also includes showing the web page abstract and the web page snapshot; moreover, query words contained in the web page title, abstract and snapshot should be highlighted. The positions to highlight can be taken from the position list of the index word stored in the inverted file.<br />

The retrieval system receives the query string input by the user. The algorithm for searching the related inverted file and obtaining the search result is as follows:<br />

Input: query string q<br />

Output: web page result list<br />

Algorithm: Searcher<br />

1. Initialize the web page set Res=Φ;<br />

2. Perform word segmentation on q and delete stop words to obtain the query vector expression q={t1,t2,…,tm};<br />

3. For each ti∈q do<br />

4. Initialize the set of result text numbers Rid=Φ;<br />

5. Match term by term in the word list of the related inverted file to find index word ti;<br />

6. Read the inverted list of index item ti and store the document numbers in Rid; record the word frequency and the occurrence positions of the item;<br />

7. If the user has chosen to show related retrieval results && the linked list of semantically relevant words of ti is non-empty, then<br />

8. Read the in-list offset attribute of the semantically relevant words to find term t-ri;<br />

9. Record the information about relevant item t-ri as in step 6;<br />

10. End if<br />

11. Search the web page index file by the document numbers in Rid and store the obtained web pages in Res;<br />

12. Calculate the PageRank values of the web pages in Rid by formula (1);<br />

13. Choose the web pages with larger PageRank values and calculate their query similarity by formula (15);<br />

14. End for<br />

15. Rank the web pages in Res by similarity to decide the order of the displayed list;<br />

16. Display the web page title, abstract and web page snapshot in the result list;<br />

17. Return the result list;<br />
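The main flow of the Searcher algorithm can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the in-memory dictionaries standing in for the inverted file and the related-word lists, the precomputed PageRank scores, the `similarity` callback and the `top_n` cutoff are all hypothetical, and word segmentation and result display are left out.<br />

```python
def search(query_terms, inverted_index, related_terms, pagerank, similarity,
           include_related=False, top_n=20):
    """Return document numbers ordered by similarity, following the Searcher steps.

    inverted_index: term -> {document number: [positions]}      (hypothetical layout)
    related_terms:  term -> list of semantically related terms  (hypothetical layout)
    pagerank:       document number -> precomputed PageRank value
    similarity:     callable(doc, matched_by_original_term) -> similarity score
    """
    # doc -> True if reached through an original query term, False if only through a related term
    candidates = {}
    for term in query_terms:
        for doc in inverted_index.get(term, {}):
            candidates[doc] = True
        if include_related:
            for related in related_terms.get(term, []):
                for doc in inverted_index.get(related, {}):
                    candidates.setdefault(doc, False)  # never downgrade a direct match

    # Keep only the candidates with the largest PageRank values.
    shortlisted = sorted(candidates, key=lambda doc: pagerank.get(doc, 0.0), reverse=True)[:top_n]

    # Score the shortlisted documents and order them by similarity, highest first.
    scored = [(similarity(doc, candidates[doc]), doc) for doc in shortlisted]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Illustrative data (not from the paper).
index = {"computer": {1: [0], 2: [3]}, "programming": {3: [1]}, "home": {4: [0]}}
related = {"computer": ["programming"]}
ranks = {1: 0.9, 2: 0.5, 3: 0.7, 4: 0.1}
base_scores = {1: 0.8, 2: 0.9, 3: 0.95}


def score(doc, matched_by_original_term, theta=0.5):
    """Toy similarity: a fixed base score, scaled by theta for related-term matches."""
    return base_scores[doc] if matched_by_original_term else theta * base_scores[doc]


exact_only = search(["computer"], index, related, ranks, score)
with_related = search(["computer"], index, related, ranks, score, include_related=True)
```

In this toy run, document 3 is reached only through the related term “programming”, so its score is scaled down and it is listed after the two documents that contain “computer” directly.<br />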

REFERENCES<br />

[1] Xiaoming Li, Hongfei Run, Jiming Wang, Search Engine-Principle, Technology and System. Beijing: Science Press, 2005.<br />

[2] Page L, Brin S, Motwani R, The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries SIDL-WP, 1999.<br />

[3] Haveliwala T H, “Topic-sensitive PageRank,” Proceedings of the 11th International World Wide Web Conference. Hawaii, pp. 517-526, 2002.<br />

[4] Kleinberg J, “Authoritative sources in a hyperlinked environment,” Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco, California, pp. 668-677, 1998.<br />

[5] Xing Wenpu, Ghorbani A, “Weighted PageRank Algorithm,” Communication Networks and Services Research, Proceedings of Second Annual Conference. pp. 305-314, 2004.<br />

[6] Hai Liu, Yuanyuan Wang, Xueren Zhang, “Study of Text Retrieval Problems Based on Latent Semantic Space,” Information Science, vol. 5, pp. 748-753, 2007.<br />

[7] Yuchang Lu, Mingyu Lu, “The Analysis and Construction of Word Weight Function in Sector Space Method,” Journal of Computer Research and Development, vol. 10, pp. 1205-1210, 2002.<br />

[8] Jiang Lu, “Study of the Application of Latent Semantic Analysis in Text Information Retrieval,” Wuhan: Huazhong University of Science and Technology, vol. 4, pp. 21-22, 2005.<br />

[9] Todd A. Letsche, Michael W. Berry, “Large-Scale Information Retrieval with Latent Semantic Indexing,” Information Science, vol. 1, pp. 105-137, 1997.<br />

[10] Nicholas Lester, Justin Zobel, Hugh Williams, “Efficient Online Index Maintenance for Contiguous Inverted Lists,” Information Processing and Management, vol. 4, pp. 916-933, 2006.<br />

[11] Jiang Jiahui, Matrix Theoretical Basis. Dalian: Dalian University of Technology, p. 65, 1995.<br />

[12] Sheng Jun, “Study on Markov Network Retrieval Model Based on Latent Semantics,” Nanchang: Dissertation from Jiangxi Normal University, pp. 5-13, 2006.<br />

[13] Foltz P W, “The Measurement of Textual Coherence with Latent Semantic Analysis,” Discourse Processes, vol. 1, pp. 285-307, 1998.<br />

Zhijuan Deng, female, was born in Nov. 1979; her native place is Ganzhou, Jiangxi Province. She works as an instructor in the Faculty of Science, Jiangxi University of Science and Technology. Her research directions are Web information mining and software project management.<br />

Shaojun Zhong was born in Guzhou, China, in Oct. 1979. He is now a Lecturer in the Faculty of Science, Jiangxi University of Science and Technology, China. His research interests include data mining, network technology, and intelligent computation.<br />


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1697<br />

An Encryption Scheme with Hidden Keyword Search for Outsourced Database<br />

Xiaoming Wang<br />

Jinan University/ Department <strong>of</strong> Computer Science, Guangzhou, 510632, China<br />

Email: wxm_gz@hotmail.com<br />

Guoxiang Yao and Zhen Zhang<br />

Jinan University/ Department <strong>of</strong> Computer Science, Guangzhou, 510632, China<br />

Abstract—An encryption scheme with hidden keyword search is proposed for outsourced databases. The proposed scheme employs both a pseudorandom function and a polynomial function in order to reduce computation and storage overhead. It provides controlled searching, hidden searching and provable secrecy for encryption, and it also supports dynamic changes to the group of permitted users. Adding and removing users is transparent to the existing users, since they are not involved in the process. Moreover, no interaction is needed between database owner and server, server and user, or database owner and user when the decryption key is set up. Each user only needs to receive messages to set up their decryption key, and can then query over the encrypted data and decrypt it. Therefore, the proposed scheme is more efficient and more practical for outsourced databases.<br />

Index Terms—outsourced database, hidden keyword search, added and revoked users<br />

I. INTRODUCTION<br />

The management of large databases is quite expensive, as it needs not only storage capacity but also skilled personnel. One solution to this problem is the outsourced database, in which data owners store their data with a third-party service provider (server) that is not trusted. The server provides services to the users of the database. The main problem in outsourced database systems is that sensitive data are stored on a third-party site that is not under the data owner’s direct control; thus, data privacy and security can be put at risk. To protect resources from being disclosed to the server and to outside attackers, and to realize access control on the server side, encryption methods are used to protect the sensitive data. By encrypting the data, the database owner should ensure that no one except the permitted users can read the data. Although this solution can protect the data from outside attackers and the server, the fundamental problem is that searching over the encrypted data seems very difficult, and it is hard to protect user privacy when performing queries over encrypted data.<br />

Manuscript received Mar. 1, 2011; revised April 5, 2011; accepted April 12, 2011.<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1697-1704<br />

To resolve this problem, a solution is needed that lets the user search over the encrypted domain in such a way that the server does not learn any unauthorized information from performing the search. In 2000, Song et al.[1] first studied a secure keyword search scheme based on a symmetric cipher. In their scheme, a user stores his encrypted data in a non-trusted database and later searches the data with a keyword that he encrypts under his secret key. Their techniques provide provable secrecy for encryption, in the sense that the non-trusted server cannot learn anything about the plaintext given only the ciphertext. Their scheme is simple and fast. However, it applies only to the private-key setting, in which a user owns his data and wishes to upload it to a third-party database that he does not trust, so it cannot be used for practical applications such as email routing systems or outsourced databases [11]. In their scheme, only the user himself can search the encrypted data. If another user is to be allowed to search for a word, either the encryption key must be disclosed to him or a list of potential locations where the word might occur must be disclosed to the server. If the server is allowed to search for too many words, it may be able to use statistical techniques to start learning important information about the documents. One possible defense is to change the key periodically and re-encrypt the data under the new key. The user must then re-encrypt the data and transmit them to the server over a limited-capacity channel. If the owner does not have the resources stored locally, a further preliminary step is needed to re-acquire them from the service, decrypt them, encrypt them again, and transmit them back to the server over the same channel. This involves so much performance overhead that it becomes practically impossible.<br />

In an outsourced database, the permitted users are allowed to search and read the data that the data owner stores at the external server. The permitted users wish to retrieve or search for data without revealing to the server which data they are after. To meet these requirements, an encryption scheme with hidden keyword search for outsourced databases is proposed, based on Song et al.’s scheme but different from it. The proposed scheme allows a group of permitted users to search and read the data stored at<br />


1698 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />

the external server by the data owner, whereas Song et al.’s scheme allows only the user himself to search the encrypted data. The proposed scheme employs both a polynomial function and a pseudorandom function in order to reduce computation and storage overhead. It provides controlled searching, hidden searching, and provable secrecy for encryption; it also supports dynamic changes to the group of permitted users and is transparent to the users when members are added or removed, since the users are not involved in that process. Moreover, no interaction is needed between the database owner and the server, the server and a user, or the database owner and a user when the decryption key is set up or updated. Each user only has to receive a message to set up his decryption key, after which he can query the encrypted data and decrypt the results. The proposed scheme is therefore more efficient and more practical for outsourced databases.<br />

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 presents the encryption scheme with hidden keyword search for outsourced databases. Section 4 analyzes the security and properties of the proposed scheme. Finally, concluding remarks are given.<br />

II. RELATED WORK<br />

Some existing schemes for designing encrypted outsourced databases [2-5] assume that the entire database is encrypted with a single key and that the users are granted this key. This assumption only protects the data on the server side, and every user has complete access to the database. In the real world, however, complete access to the encrypted outsourced data is not acceptable; users should have only selective access to the encrypted data. Moreover, whenever the authorization policy is updated, these proposals require the resources to be re-encrypted and resent to the service. If the owner does not have the resources stored locally, a further preliminary step is needed to re-acquire them from the service over a finite channel and decrypt them, and a great number of new decryption keys must frequently be transmitted to all authorized users. This involves so much performance overhead that it becomes practically impossible for large databases accessed by a dynamic group of users. To resolve these problems, Vimercati et al.[6] proposed the over-encryption approach, which avoids shipping resources back to the owner for re-encryption when security requirements change. In their scheme, the resources are encrypted by the owner to provide initial protection and are encrypted again by the outsourced server to reflect policy modifications. One potential limitation of the over-encryption scheme is that it may require publishing too many tokens when the number of users is large [7]. In 2008, Liu et al.[7] proposed a new key-assignment approach based on secret sharing. In their scheme, resources are divided into different sets based on access control lists, and each set corresponds to a distinct encryption key. Users use their own keys to derive the encryption key needed to access a resource.<br />


However, we consider Liu et al.’s scheme insecure against collusion attacks. If two users share a resource, their scheme treats the two users as a subset and builds a linear equation in two variables for deriving the encryption key. It randomly chooses points (x, y) on this line and assigns them to the users as key pairs, and it chooses another point (xpub, ypub) on the line to publish as a public token. Each user combines the public token (xpub, ypub) with his key pair to derive the decryption key. But a user can also reconstruct the line itself from the public token (xpub, ypub) and his key pair, and can then compute arbitrarily many key pairs for unauthorized users, who can thereby access the resource.<br />
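The weakness described above can be illustrated with a small sketch. It is not Liu et al.’s construction: it assumes, purely for illustration, that the linear equation has the form y = a·x + b (mod q) over a toy prime q, and the names `line_through` and `point_on_line` are hypothetical.<br />

```python
Q = 1019  # toy prime modulus (assumption; a real system would use a large prime)

def line_through(p1, p2, q=Q):
    """Recover (a, b) of y = a*x + b (mod q) from two distinct points."""
    (x1, y1), (x2, y2) = p1, p2
    a = (y2 - y1) * pow((x2 - x1) % q, -1, q) % q
    b = (y1 - a * x1) % q
    return a, b

# Owner's secret line (hypothetical values).
A_SECRET, B_SECRET = 321, 654

def point_on_line(x):
    return (x, (A_SECRET * x + B_SECRET) % Q)

user_pair = point_on_line(17)      # key pair handed to one legitimate user
public_token = point_on_line(99)   # published token (x_pub, y_pub)

# The legitimate user interpolates the line from two known points...
a_rec, b_rec = line_through(user_pair, public_token)

# ...and can then mint a fresh, valid key pair for anyone else.
forged_pair = (500, (a_rec * 500 + b_rec) % Q)
```

Two points determine a line, so one honest key pair plus the public token is enough to reproduce every other point on it; this is the property the proposed scheme avoids by publishing only values of the form g raised to a secret exponent.<br />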

Database encryption prevents unauthorized users, including intruders breaking into a network and database administrators, from seeing sensitive data in databases. However, it is very hard to protect user privacy when performing queries over encrypted data. To resolve this problem, keyword search over encrypted data has received close attention in various environments such as encrypted web hard-systems, intelligent email routing, and encrypted vendor systems. In 2000, Song et al.[1] studied a secure keyword search scheme based on a symmetric cipher and proposed a technique for searching on encrypted data. Their work deals with the search problem between a user and a non-trusted server and gives some practical solutions. In 2004, Golle et al.[8] first proposed the notion of conjunctive keyword searchable encryption and presented a solution to it. They defined a security model for conjunctive keyword search over encrypted data and provided two secure constructions. In 2008, Wang et al.[9] gave the first Keyword Field-Free Conjunctive Keyword Search scheme, which answers the open problem posed by Golle et al. In their scheme, the target ciphertext includes a keyword set and the user can generate a trapdoor that consists of a keyword set; subset keyword search means that if the keyword set of the target ciphertext includes the keyword set of the trapdoor, then the trapdoor and the ciphertext match. However, reference [10] points out that there is a mistake in the proof of Golle et al.’s scheme.<br />

The schemes above were constructed in a symmetric-key setting. In this setting, a user encrypts his private data and stores them on a remote server, and can later retrieve the data matching a particular keyword from the remote storage. However, these systems cannot be used for practical applications such as email routing systems or outsourced databases [11].<br />

In 2005, Boneh et al.[12] first proposed Public Key Encryption with Keyword Search (PEKS). With PEKS, a sender stores encrypted data on a server, and the receiver makes a trapdoor for a keyword and sends the trapdoor to the server. The server can then test whether the ciphertext and the trapdoor were made with the same keyword; if they were, the server sends the ciphertext to the receiver. Byun et al.[13] showed that the PEKS scheme is insecure against off-line keyword-guessing attacks.<br />



That is, given a trapdoor, an attacker can learn which keyword was used to generate it. Since users usually query commonly used keywords with low entropy, keyword-guessing attacks are a meaningful threat. Rhee et al.[14] and Tang and Camenisch et al.[15] gave ways to cope with this attack. Later, many papers were published that extend PEKS. However, PEKS only supports an efficient remote storage system for a single designated receiver and does not provide one for a number of users.<br />

III. PROPOSED SCHEME<br />

A. System model<br />

The proposed scheme uses the DAS (database-as-a-service) model. The system includes three entities: the data owner, the server, and the users. This model is best suited to a one-to-many group in which there is a single database owner and a large number of users. The database owner is responsible for producing, distributing, and updating encryption keys. The server is responsible for producing the query result on the encrypted data and sending the encrypted result to the user. The user decrypts the result from the server with the decryption key to obtain the plaintext result.<br />

We assume that the data owner defines an access control policy to regulate access to the distributed resources. All users of the outsourced database are divided into groups according to their access privileges. Users with the same database access privilege are grouped together and can access the same part of the outsourced data. The outsourced database is protected with encryption. For the sake of simplicity, we assume that the encryption operations refer to a single group.<br />

B. Setup<br />

Let $p$, $q$ be distinct large primes with $q \mid (p-1)$, and let $g$ be a generator of order $q$ in $GF(p)$. Let $F$ be a pseudorandom function, $f$ an additional pseudorandom function keyed independently of $F$, $G$ a pseudorandom generator, and $H$ a secure hash function. We write $f_\tau(x)$ for the result of applying $f$ to input $x$ with secret key $\tau$.<br />

The database owner, the server, and each user $u_i$ hold the key pairs $(\varepsilon_o,\ y_o = g^{\varepsilon_o} \bmod p)$, $(\varepsilon_s,\ y_s = g^{\varepsilon_s} \bmod p)$, and $(\varepsilon_i,\ y_i = g^{\varepsilon_i} \bmod p)$, respectively. In the setup phase, the encrypted key is established by the database owner and sent to each group user. Without loss of generality, assume that a group contains a set of privileged users $U = (u_1, u_2, \ldots, u_n)$. The setup consists of the following steps.<br />

(1) The database owner chooses at random a polynomial<br />

$$f(x) = a_0 + a_1 x + \cdots + a_{m-1} x^{m-1} \bmod q \qquad (1)$$<br />

where $k = f(0) = a_0$, the $a_i$ are the coefficients of $f(x)$, and $m$ ($m > n$) is a large positive integer.<br />


(2) The database owner chooses a random integer $\alpha$ and a set of random integers $D_j = (d_1, d_2, \ldots, d_m)$, and computes for each group user $u_i$<br />

$$k_i = f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l} \bmod q, \qquad v = g^{\alpha} \bmod p, \qquad (2)$$<br />

$$\delta_i = y_i^{\varepsilon_o} \bmod p, \qquad V_i = E_{\delta_i}(k_i), \qquad y_i = g^{k_i} \bmod p, \qquad (3)$$<br />

$$z_i = g^{\,\alpha \sum_{j=1}^{m-1} f(d_j) \left( \prod_{l=1,\, l \neq j}^{m-1} \frac{-d_l}{d_j - d_l} \right) \frac{-x_i}{d_j - x_i}} \bmod p, \qquad (4)$$<br />

for $1 \le i \le n$. The owner sends $V_i$ to each group user $u_i$ and publishes $(v, z_i\ (i = 1, \ldots, n))$. Here $E_{\delta_i}\{\cdot\}$ denotes encryption under the key $\delta_i$ with a symmetric encryption algorithm such as AES, and $x_i$ is the identifier of group user $u_i$.<br />

(3) On receiving $V_i$, the group user $u_i$ computes<br />

$$\delta_i = y_o^{\varepsilon_i} \bmod p \qquad (5)$$<br />

and then decrypts $V_i$ to obtain $k_i$.<br />
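The setup arithmetic can be sketched in a few lines of Python. This is a minimal illustration under loudly stated assumptions, not the authors' implementation: the group parameters (p = 2039, q = 1019, g = 4) are toy values chosen only so that q divides p − 1 and g has order q, all secrets are fixed hypothetical numbers, the AES wrapping of k_i is omitted, and the sketch uses the m − 1 points d_l that appear in Eqs. (2) and (4).<br />

```python
# Toy parameters (assumption): q | p - 1 and g has order q in GF(p).
P, Q, G = 2039, 1019, 4

def f_eval(coeffs, x):
    """Evaluate f(x) = a_0 + a_1 x + ... + a_{m-1} x^{m-1} mod q (Horner)."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % Q
    return acc

def inv(x):
    """Modular inverse mod q (Python 3.8+)."""
    return pow(x % Q, -1, Q)

def share(coeffs, x_i, d):
    """k_i as in Eq. (2)."""
    k_i = f_eval(coeffs, x_i)
    for d_l in d:
        k_i = k_i * (-d_l) * inv(x_i - d_l) % Q
    return k_i

def token(coeffs, x_i, d, alpha):
    """z_i as in Eq. (4)."""
    exponent = 0
    for j, d_j in enumerate(d):
        term = f_eval(coeffs, d_j) * (-x_i) * inv(d_j - x_i) % Q
        for l, d_l in enumerate(d):
            if l != j:
                term = term * (-d_l) * inv(d_j - d_l) % Q
        exponent = (exponent + term) % Q
    return pow(G, alpha * exponent % Q, P)

coeffs = [123, 45, 67, 89]   # a_0..a_3, so m = 4 and k = f(0) = 123
d = [11, 22, 33]             # the m - 1 owner-chosen points d_l
alpha, x_i = 77, 5           # owner's random alpha, user identifier x_i
eps_o, eps_i = 301, 702      # private keys of owner and user (hypothetical)

k_i = share(coeffs, x_i, d)
v = pow(G, alpha, P)
z_i = token(coeffs, x_i, d, alpha)

# Eqs. (3) and (5): both sides derive the same wrapping key delta_i.
y_o, y_i = pow(G, eps_o, P), pow(G, eps_i, P)
delta_owner = pow(y_i, eps_o, P)
delta_user = pow(y_o, eps_i, P)
```

In this sketch the owner and the user arrive at the same δ_i without exchanging any further messages, and the published pair (v, z_i) combines with the private k_i to give g^{αk} mod p, the value the later sections use as σ.<br />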

C. Encryption<br />

When the database owner encrypts a data item $M$ that contains the sequence of words $w_1, w_2, \ldots, w_l$, he performs the following steps:<br />

(1) The database owner computes<br />

$$\sigma = g^{k\alpha} \bmod p, \qquad X_i = H(w_i, \sigma), \qquad c_i = f_\tau(X_i) \qquad (6)$$<br />

for $1 \le i \le l$, where $X_i$ is $n$ bits long. He then generates a sequence of pseudorandom values $e_i$ using the pseudorandom generator $G$, where each $e_i$ is $n-m$ bits long. Finally, he computes $F_{c_i}(e_i)$, appends it to $e_i$, and obtains the $n$-bit block $B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle$.<br />

(2) The database owner computes<br />

$$CT_i = X_i \oplus B_i, \qquad C = M \oplus H(\sigma), \qquad (7)$$<br />

and sends $\{C, CT_i\}$ to the server.<br />

D. Trapdoor<br />

When a group user $u_i$ needs to query the outsourced database for data containing the words $w_1, w_2, \ldots, w_l$, he generates a trapdoor as follows:<br />

$$\sigma' = z_i v^{k_i} \bmod p, \qquad X_i' = H(w_i, \sigma'), \qquad c_i' = f_\tau(X_i'). \qquad (8)$$<br />

He forms the trapdoor $T_i = \langle c_i', X_i' \rangle$ for $i = 1, 2, \ldots, l$ and sends $T_i$ to the server.<br />

E. Test<br />

On receiving $T_i$, the server computes $B_i' = CT_i \oplus X_i'$ and splits $B_i'$ into two parts, $B_i' = \langle e_i' \,\|\, r \rangle$, where $e_i'$ denotes the first $n-m$ bits of $B_i'$ and $r$ denotes the last $m$ bits of $B_i'$. The server then computes $F_{c_i'}(e_i')$ with the key $c_i'$ taken from the trapdoor and tests whether $F_{c_i'}(e_i') = r$. If the equality holds, then $T_i$ is correct and the server sends $\{C, CT_i\}$ to group user $u_i$.<br />

The group user $u_i$ computes<br />

$$M = C \oplus H(\sigma') \qquad (9)$$<br />

and thus obtains the data $M$.<br />
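The encryption, trapdoor, and test steps of Sections C to E can be sketched as a round trip. The instantiation below is an assumption made only for illustration and is not specified by the paper: SHA-256 stands in for H, truncated HMAC-SHA256 stands in for both F and f, fresh random bytes stand in for the output of G, and n = 256 bits with m = 128 bits. The key `tau` and the byte string `sigma` are hypothetical placeholders.<br />

```python
import hashlib
import hmac
import secrets

N_BYTES, M_BYTES = 32, 16   # block length n and check length m, in bytes

def H(*parts: bytes) -> bytes:
    """Hash H, instantiated here with SHA-256 over length-prefixed parts."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()

def prf(key: bytes, data: bytes, out_len: int) -> bytes:
    """Stand-in for both F and f: HMAC-SHA256 truncated to out_len bytes."""
    return hmac.new(key, data, hashlib.sha256).digest()[:out_len]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_word(word: bytes, sigma: bytes, tau: bytes) -> bytes:
    """Eqs. (6)-(7): CT_i = X_i xor <e_i || F_{c_i}(e_i)>."""
    x_i = H(word, sigma)
    c_i = prf(tau, x_i, N_BYTES)
    e_i = secrets.token_bytes(N_BYTES - M_BYTES)  # plays the role of G's output
    b_i = e_i + prf(c_i, e_i, M_BYTES)
    return xor(x_i, b_i)

def trapdoor(word: bytes, sigma: bytes, tau: bytes):
    """Eq. (8): T_i = <c'_i, X'_i>."""
    x_q = H(word, sigma)
    return prf(tau, x_q, N_BYTES), x_q

def server_test(ct_i: bytes, t_i) -> bool:
    """Section E: split CT_i xor X'_i and check the PRF tag."""
    c_q, x_q = t_i
    b_q = xor(ct_i, x_q)
    e_q, r = b_q[:N_BYTES - M_BYTES], b_q[N_BYTES - M_BYTES:]
    return hmac.compare_digest(prf(c_q, e_q, M_BYTES), r)

sigma = (1234).to_bytes(2, "big")   # stands in for g^{k*alpha} mod p
tau = b"hypothetical-prf-key"
ct = encrypt_word(b"invoice", sigma, tau)
match = server_test(ct, trapdoor(b"invoice", sigma, tau))
mismatch = server_test(ct, trapdoor(b"payroll", sigma, tau))
```

With these stand-ins the server never sees the keyword itself; it only checks whether the last m bits of the unmasked block are the PRF tag of the first n − m bits, which holds for the matching keyword and fails for any other except with negligible probability.<br />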



F. Adding group users<br />

Adding a new group user $u_{new}$ to a group does not require updating the decryption key. To add $u_{new}$, the database owner first picks an unused identifier $x_{new}$ and computes<br />

$$k_{new} = f(x_{new}) \prod_{l=1}^{m-1} \frac{-d_l}{x_{new} - d_l} \bmod q, \qquad (10)$$<br />

$$\delta_{new} = y_{new}^{\varepsilon_o} \bmod p, \qquad V_{new} = E_{\delta_{new}}(k_{new}), \qquad (11)$$<br />

$$z_{new} = g^{\,\alpha \sum_{j=1}^{m-1} f(d_j) \left( \prod_{l=1,\, l \neq j}^{m-1} \frac{-d_l}{d_j - d_l} \right) \frac{-x_{new}}{d_j - x_{new}}} \bmod p \qquad (12)$$<br />

and sends $V_{new}$ to the new group user $u_{new}$ and publishes $z_{new}$. Upon receiving $V_{new}$, $u_{new}$ computes $\delta_{new} = y_o^{\varepsilon_{new}} \bmod p$, decrypts $V_{new}$ with $\delta_{new}$, and obtains the secret key $k_{new}$. The group member $u_{new}$ can then access the server, query the outsourced database, and decrypt the ciphertext received from the server with his secret key $k_{new}$.<br />

G. Removing group users<br />

When a group user $u_B$ is removed from a group, the encryption key $\sigma$ has to be changed to prevent the removed user from querying and reading the restricted data. In what follows, a bar marks an updated value; for example, $\bar{\sigma}$ is the new encryption key. The database owner and the server go through the following steps:<br />

(1) The database owner chooses a random integer $\rho$ and computes, for $t = 1, \ldots, n$, $t \neq B$, and $i = 1, 2, \ldots, l$,<br />

$$\bar{z}_t = g^{\,\rho \sum_{j=1}^{m-1} f(d_j) \left( \prod_{l=1,\, l \neq j}^{m-1} \frac{-d_l}{d_j - d_l} \right) \frac{-x_t}{d_j - x_t}} \bmod p, \qquad (13)$$<br />

$$\bar{v} = g^{\rho} \bmod p, \qquad \bar{\sigma} = g^{\rho k} \bmod p, \qquad (14)$$<br />

$$\bar{X}_i = H(w_i, \bar{\sigma}), \qquad Y_i = H(w_i, \sigma) \oplus H(w_i, \bar{\sigma}), \qquad (15)$$<br />

$$s = H(\sigma) \oplus H(\bar{\sigma}), \qquad \delta_s = y_s^{\varepsilon_o} \bmod p, \qquad (16)$$<br />

$$V_s = E_{\delta_s}(Y_1 \,\|\, Y_2 \,\|\, \cdots \,\|\, Y_l \,\|\, s) \qquad (17)$$<br />

and publishes $(\bar{z}_t\ (t = 1, 2, \ldots, n,\ t \neq B),\ \bar{v})$, then sends $V_s$ to the server.<br />

(2) The server first computes $\delta_s = y_o^{\varepsilon_s} \bmod p$ and decrypts $V_s$ to get $(Y_i, s)$, where $i = 1, 2, \ldots, l$, and then computes<br />

$$\bar{C} = C \oplus s = M \oplus H(\bar{\sigma}), \qquad (18)$$<br />

$$\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \bar{\sigma}). \qquad (19)$$<br />

The updated ciphertext $\{\bar{C}, \overline{CT}_i\}$ is stored on the outsourced server.<br />

A non-removed user $u_t$ can compute $\bar{\sigma} = \bar{z}_t \bar{v}^{k_t} \bmod p$ and can therefore generate a valid query trapdoor $T_i$ and recover the decryption key $\bar{\sigma}$, since $\bar{\sigma} = \bar{z}_t \bar{v}^{k_t} = g^{\rho k} \bmod p$. However, the removed user $u_B$ cannot get $\bar{\sigma}$, since $u_B$ cannot obtain $\bar{z}_B$ from the public information $\bar{z}_t$ ($t = 1, 2, \ldots, n$, $t \neq B$). If $u_B$ computes $\sigma'$ using his $k_B$ together with his old $z_B$ or another user's $\bar{z}_t$, then $\sigma' \neq \bar{\sigma}$, since<br />

$$\bar{z}_t \bar{v}^{k_B} \neq g^{\rho k} \quad \text{and} \quad z_B \bar{v}^{k_B} \neq g^{\rho k} = \bar{\sigma} \pmod p. \qquad (20)$$<br />

Therefore, the removed user $u_B$ cannot recover the decryption key $\bar{\sigma}$; he is prevented from generating query trapdoors and obtaining data, and is thus removed from the group.<br />
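The effect of re-keying on remaining and removed users can be checked numerically. The sketch below is an illustration under assumptions, not the authors' code: it reuses a toy group (p = 2039, q = 1019, g = 4), fixed hypothetical secrets, and the helper names `lagrange_at_zero`, `share`, and `token`, and it works only with the group elements, leaving out the symmetric encryption of Eq. (17).<br />

```python
P, Q, G = 2039, 1019, 4   # toy group (assumption): g has order q, q | p - 1

def f_eval(coeffs, x):
    """Evaluate the owner's polynomial f at x, mod q."""
    return sum(a * pow(x, e, Q) for e, a in enumerate(coeffs)) % Q

def lagrange_at_zero(nodes, j):
    """Weight of nodes[j] when interpolating f(0) from all nodes (mod q)."""
    w = 1
    for l, x_l in enumerate(nodes):
        if l != j:
            w = w * (-x_l) * pow((nodes[j] - x_l) % Q, -1, Q) % Q
    return w

def share(coeffs, x, d):
    """Private value k for the user with identifier x (Eq. 2)."""
    return f_eval(coeffs, x) * lagrange_at_zero(d + [x], len(d)) % Q

def token(coeffs, x, d, rand):
    """Public value z for identifier x under randomness rand (Eqs. 4, 13)."""
    nodes = d + [x]
    exponent = sum(f_eval(coeffs, d_j) * lagrange_at_zero(nodes, j)
                   for j, d_j in enumerate(d)) % Q
    return pow(G, rand * exponent % Q, P)

coeffs, d = [123, 45, 67, 89], [11, 22, 33]
alpha, rho = 77, 500                     # old and new owner randomness
x_t, x_B = 5, 6                          # remaining user and removed user
k_t, k_B = share(coeffs, x_t, d), share(coeffs, x_B, d)

v_new = pow(G, rho, P)                   # Eq. (14)
z_t_new = token(coeffs, x_t, d, rho)     # published for u_t, Eq. (13)
z_B_old = token(coeffs, x_B, d, alpha)   # all the removed user still holds
sigma_new = pow(G, rho * coeffs[0] % Q, P)

remaining_view = z_t_new * pow(v_new, k_t, P) % P
removed_view_old_token = z_B_old * pow(v_new, k_B, P) % P
removed_view_other_token = z_t_new * pow(v_new, k_B, P) % P
```

For these values the remaining user recovers the new key, while both strategies named in Eq. (20) give the removed user a different group element. Note that the old token only fails because ρ differs from α, so the sketch assumes the owner draws fresh randomness at every removal.<br />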

IV. ANALYSIS<br />

A. Correctness<br />

Lemma 1. For a given ciphertext $\{C, CT_i\}$, if the database owner follows the correct encryption procedure, then any privileged group user can correctly generate a query trapdoor and decrypt the ciphertext to obtain the data $M$.<br />

Proof: Because<br />

$$\sigma' = z_i v^{k_i} = g^{\,\alpha \sum_{j=1}^{m-1} f(d_j) \left( \prod_{l=1,\, l \neq j}^{m-1} \frac{-d_l}{d_j - d_l} \right) \frac{-x_i}{d_j - x_i} \;+\; \alpha f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l}} = g^{\alpha k} = \sigma \pmod p,$$<br />

we have<br />

$$X_i' = H(w_i, \sigma') = H(w_i, \sigma) = X_i, \qquad c_i' = f_\tau(X_i') = c_i,$$<br />

$$B_i' = \langle e_i' \,\|\, r \rangle = CT_i \oplus X_i' = X_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle,$$<br />

so $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct. Because $\sigma' = \sigma$,<br />

$$C \oplus H(\sigma') = M \oplus H(\sigma) \oplus H(\sigma') = M. \qquad \square$$<br />
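The step from the exponent to $g^{\alpha k}$ in the proof of Lemma 1 rests on Lagrange interpolation at zero. Spelled out, under the assumption (implicit in the setup) that the $m$ nodes $d_1, \ldots, d_{m-1}, x_i$ are distinct and nonzero modulo $q$, a polynomial of degree $m-1$ is determined by its values at these nodes:<br />

```latex
\begin{aligned}
k = f(0) &= \sum_{j=1}^{m-1} f(d_j)\,
      \Biggl(\prod_{l=1,\,l\neq j}^{m-1}\frac{-d_l}{d_j-d_l}\Biggr)\frac{-x_i}{d_j-x_i}
   \;+\; \underbrace{f(x_i)\prod_{l=1}^{m-1}\frac{-d_l}{x_i-d_l}}_{k_i} \pmod q,\\[4pt]
z_i\,v^{k_i} &= g^{\alpha\,(k-k_i)}\cdot g^{\alpha k_i} \;=\; g^{\alpha k} \;=\; \sigma \pmod p .
\end{aligned}
```

The first line is the interpolation formula evaluated at $0$; the second uses it to rewrite the exponent of $z_i$ in Eq. (4) as $\alpha(k - k_i)$, which is legitimate because $g$ has order $q$.<br />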

Lemma 2. For a given ciphertext $\{C, CT_i\}$, if the database owner and the server follow the correct procedure for removing a group user, then equations (18) and (19) hold, where $\bar{\sigma}$ denotes the updated key and $\bar{C}$, $\overline{CT}_i$ the updated ciphertext.<br />

Proof: Equation (18) holds since $s = H(\sigma) \oplus H(\bar{\sigma})$ and<br />

$$\bar{C} = C \oplus s = M \oplus H(\sigma) \oplus H(\sigma) \oplus H(\bar{\sigma}) = M \oplus H(\bar{\sigma}).$$<br />

Equation (19) holds since $Y_i = H(w_i, \sigma) \oplus H(w_i, \bar{\sigma})$ and<br />

$$\overline{CT}_i = CT_i \oplus Y_i = B_i \oplus H(w_i, \sigma) \oplus H(w_i, \sigma) \oplus H(w_i, \bar{\sigma}) = B_i \oplus H(w_i, \bar{\sigma}). \qquad \square$$<br />
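The two XOR identities of Lemma 2 can be checked directly on byte strings. The following sketch makes several assumptions for illustration only: SHA-256 stands in for H, H(w, σ) is modelled as the hash of the concatenation of w and σ, the old and new keys are placeholder byte strings rather than group elements, and B_i is an arbitrary 32-byte placeholder.<br />

```python
import hashlib

def H(data: bytes) -> bytes:
    """Stand-in for the hash H (SHA-256, 32-byte output)."""
    return hashlib.sha256(data).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

sigma_old, sigma_new = b"old-sigma", b"new-sigma"   # placeholders for the keys
word = b"invoice"
M = b"record".ljust(32, b"\0")   # one 32-byte data block
B_i = bytes(range(32))           # placeholder for <e_i || F_{c_i}(e_i)>

# Owner, at encryption time (Eq. 7).
C = xor(M, H(sigma_old))
CT_i = xor(H(word + sigma_old), B_i)

# Owner, at revocation time (Eqs. 15-16).
s = xor(H(sigma_old), H(sigma_new))
Y_i = xor(H(word + sigma_old), H(word + sigma_new))

# Server (Eqs. 18-19): re-keys the stored ciphertext with XOR only.
C_new = xor(C, s)
CT_i_new = xor(CT_i, Y_i)

# A remaining user who knows the new key recovers the data as before.
recovered = xor(C_new, H(sigma_new))
```

The server never learns either key in this exchange: it only applies the masks s and Y_i, and the old hash values cancel out.<br />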

Lemma 3. If $u_i$ is a non-removed user, then he can generate a valid query trapdoor and decrypt the ciphertext to obtain the data $M$.<br />

Proof: Because $u_i$ holds<br />

$$k_i = f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l} \bmod q,$$<br />

he can compute the correct updated key $\bar{\sigma}$ from the updated public values $\bar{z}_i$ and $\bar{v}$:<br />

$$\sigma' = \bar{z}_i \bar{v}^{k_i} = g^{\,\rho \sum_{j=1}^{m-1} f(d_j) \left( \prod_{l=1,\, l \neq j}^{m-1} \frac{-d_l}{d_j - d_l} \right) \frac{-x_i}{d_j - x_i} \;+\; \rho f(x_i) \prod_{l=1}^{m-1} \frac{-d_l}{x_i - d_l}} = g^{\rho k} = \bar{\sigma} \pmod p.$$<br />

He therefore computes<br />

$$X_i' = H(w_i, \sigma') = H(w_i, \bar{\sigma}), \qquad c_i' = f_\tau(X_i')$$<br />

and generates the trapdoor $T_i = \langle c_i', X_i' \rangle$. By equations (15) and (19), $\overline{CT}_i = B_i \oplus \bar{X}_i$, so<br />

$$X_i' = H(w_i, \bar{\sigma}) = \bar{X}_i,$$<br />

$$B_i' = \langle e_i' \,\|\, r \rangle = \overline{CT}_i \oplus X_i' = \bar{X}_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle,$$<br />

then $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct, and the data are recovered from $M = \bar{C} \oplus H(\bar{\sigma})$ since $\bar{C} = C \oplus s = M \oplus H(\bar{\sigma})$. $\square$<br />

Lemma 4. If $u_B$ is a removed user, then he cannot generate a valid query trapdoor or recover the updated decryption key $\bar{\sigma}$.<br />

Proof: With his old $z_B$ and the updated public value $\bar{v} = g^{\rho} \bmod p$, the removed user obtains<br />

$$\sigma' = z_B \bar{v}^{k_B} = g^{\,\alpha \sum_{j=1}^{m-1} f(d_j) \left( \prod_{l=1,\, l \neq j}^{m-1} \frac{-d_l}{d_j - d_l} \right) \frac{-x_B}{d_j - x_B} \;+\; \rho f(x_B) \prod_{l=1}^{m-1} \frac{-d_l}{x_B - d_l}} \neq \bar{\sigma} \pmod p.$$<br />

Therefore, the removed user $u_B$ cannot compute the decryption key $\bar{\sigma}$, and he is prevented from generating query trapdoors and obtaining data. $\square$<br />

B. Security Proof<br />

The security of the proposed scheme is based on the security of the pseudorandom functions and on the computational Diffie-Hellman problem (CDHP). To show that the proposed scheme is secure, we first state a useful lemma. Due to space considerations we omit the proof of the lemma and refer to the full version of the paper [1].<br />

Lemma 1[1]: If $F$ is a $(t, l, e_F)$-secure pseudorandom function, $f$ is a $(t, l, e_f)$-secure pseudorandom function, $G$ is a $(t, e_G)$-secure pseudorandom generator, and the key material is chosen as described above, then the algorithm described above for generating the sequence is a $(t - \psi, e_H)$-secure pseudorandom generator, where $e_H = l\,e_F + e_f + e_G + l(l-1)/(2|X|)$ and $X = \{0,1\}^{n-m}$.<br />

Definition 1: An encryption scheme with hidden keyword search for outsourced databases is semantically secure against chosen keyword attack if $F$ and $f$ are secure pseudorandom functions, $G$ is a secure pseudorandom generator, $H$ is a secure one-way hash function, and there exists no polynomial-time adversary with a non-negligible advantage in the following game:<br />

(1) Setup: A challenger C first generates the system parameters, the data owner’s key pair, and the key pairs of the $n$ group users as in Section 3 (B). The challenger C gives the system parameters and public keys to an adversary A.<br />

Phase 1: The adversary A adaptively issues the following kinds of queries:<br />

(2) Encryption queries: A produces a message $M$ and sends an encryption query for $M$ to C. C gives A the result $\{CT_i, C\}$ of encryption with input $(M, k, w_i)$.<br />

(3) Trapdoor queries: The adversary A makes trapdoor queries to the challenger C for any keyword of his choice. If the trapdoor is valid, C responds with the result; otherwise, C returns the symbol $\perp$.<br />

(4) Challenge: The adversary A produces two keywords $w_0$ and $w_1$. The challenger C chooses a random bit $b \in \{0, 1\}$, computes a trapdoor $T_i$ with input $(z_i, k_i, v)$, and gives it to the adversary A as a challenge.<br />

Phase 2: The adversary A issues new queries as in Phase 1. It is not allowed to make a trapdoor query for the target challenge $T_i$.<br />

Guess: At the end of the game, A outputs a bit $b'$. The adversary A wins the game if $b' = b$. The advantage of A is defined as $Adv(A) = \Pr[b' = b] - 1/2$.<br />

Theorem 1: The proposed encryption scheme with hidden keyword search is $(t, \varepsilon)$-secure against chosen keyword attacks if $F$ and $f$ are secure pseudorandom functions, $G$ is a secure pseudorandom generator, $H$ is a secure one-way hash function, and there exists no polynomial-time algorithm that solves CDHP with $(t_1, \varepsilon_1)$, where $t$ denotes the running time and $\varepsilon$ the advantage with which the adversary A succeeds.<br />

Proof: Assume that there exists a $(t, \varepsilon)$-adversary A that can break the encryption scheme with hidden keyword search in the game of Definition 1. In the following, we demonstrate how to use A to construct a $(t_1, \varepsilon_1)$-algorithm $\eta_1$ that solves one-way hash function with the advantage $\varepsilon_1$. $\eta_1$ simulates the challenger C and interacts with A as follows:<br />

Phase 1: The adversary A adaptively issues the following kinds of queries:<br />

(1) Setup: $\eta_1$ outputs the system parameters and the data owner’s key pair as in Definition 1, together with the group users’ public keys ($y_i = g^{\vartheta_i} \bmod p$, $i = 1, 2, \ldots, n$, and $v = g^{\zeta} \bmod p$), where $\zeta$ and $\vartheta_i$ are random integers. After receiving the system parameters, the data owner’s public key, and $z_i$, A outputs the target $u_i \in U$ with public key $(v, z_i)$.<br />

(2) Encryption queries: For an encryption query on a message $M$ chosen by A, $\eta_1$ first computes<br />

$$\sigma = g^{k\alpha} \bmod p, \qquad X_i = H(w_i, \sigma), \qquad c_i = f_\tau(X_i)$$<br />

for $1 \le i \le l$, where $X_i$ is $n$ bits long, then generates a sequence of pseudorandom values $e_i$ using the pseudorandom generator $G$, where each $e_i$ is $n-m$ bits long. Next, $\eta_1$ computes $F_{c_i}(e_i)$, appends it to $e_i$, and obtains the $n$-bit block $B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle$. It then computes<br />

$$CT_i = X_i \oplus B_i, \qquad C = M \oplus H(\sigma).$$<br />

Finally, $\{C, CT_i\}$ is returned as the encryption result of this query.<br />

(3) Trapdoor queries: For a trapdoor query for $w_i$, the algorithm $\eta_1$ first computes<br />

$$\sigma' = z_i v^{k_i} \bmod p, \qquad X_i' = H(w_i, \sigma'), \qquad c_i' = f_\tau(X_i'),$$<br />

and generates the trapdoor $T_i = \langle c_i', X_i' \rangle$. By the setting of $CT_i$ above, we have<br />

$$\sigma' = z_i v^{k_i} = g^{\alpha k} \bmod p = \sigma,$$<br />

$$X_i' = H(w_i, \sigma') = H(w_i, \sigma) = X_i, \qquad c_i' = f_\tau(X_i') = c_i,$$<br />

$$B_i' = \langle e_i' \,\|\, r \rangle = CT_i \oplus X_i' = X_i \oplus B_i \oplus X_i' = B_i = \langle e_i \,\|\, F_{c_i}(e_i) \rangle,$$<br />

so $e_i = e_i'$ and $F_{c_i}(e_i) = r$; therefore $T_i = \langle c_i', X_i' \rangle$ is correct. Hence $T_i$ is a valid trapdoor for $w_i$, and $\eta_1$ outputs $T_i$. Otherwise, $\eta_1$ returns the symbol $\perp$.<br />

(4) Challenge: The adversary A produces two keywords $w_0$ and $w_1$. The challenger C chooses a random bit $b \in \{0, 1\}$ and computes a trapdoor as follows:<br />

$$\sigma' = z_i v^{\vartheta_i} \bmod p, \qquad X_i' = H(w_i, \sigma'), \qquad c_i' = f_\tau(X_i').$$<br />

The trapdoor $T_i = \langle w_b, c_i', X_i' \rangle$ is returned as the result of this query.<br />

Recall that $v = g^{\zeta} \bmod p$ and $y_i = g^{\vartheta_i} \bmod p$. If<br />

$$\sigma' = z_i v^{\vartheta_i} = g^{\vartheta_i \zeta} \bmod p,$$<br />

then $\sigma'$ is indeed a random trapdoor of $w_b$. If $\sigma'$ is a random integer, then the last element of $T_i$ is a random element and therefore $\sigma'$ is independent of $b$.<br />

Phase 2: The adversary A issues new queries as in Phase 1. It is not allowed to make a trapdoor query for the target challenge $T_i$.<br />

Analysis: If $\sigma' = g^{\vartheta_i \zeta} \bmod p$, the adversary A’s view in the simulated experiment is distributed identically to A’s view in the real experiment. Hence,<br />

$$\Pr[\eta_1 = 1] = \Pr[b = b'].$$<br />

On the other hand, when $\sigma'$ is uniformly distributed in $Z_p$, the adversary A has no information about the value of $b$, and hence the probability that it outputs $b' = b$ is at most 1/2. Therefore, the advantage of $\eta_1$ is<br />

$$Adv(\eta_1) = \varepsilon_1 \ge \Pr[b = b'] - 1/2 \ge \varepsilon. \qquad \square$$<br />

C. Security Analysis<br />

(1) The proposed scheme provides data confidentiality, in the sense that the untrusted server cannot learn anything about the contents of the owner’s outsourced data when given only the ciphertext, since server administrators do not know the encryption key $\sigma$. For the same reason, outsiders cannot read the owner’s outsourced data. Authorized users, however, can access the outsourced data since they obtain the decryption key $\sigma$. An authorized user can access only the part that the owner allows him to see and cannot access the whole database, because the outsourced database is divided among different groups $G_i$ and the data of different groups are encrypted under different keys. A user $u_i$ who is granted access to group $G_i$’s resources by the database owner can obtain only group $G_i$’s specific decryption key, sent by the database owner, and cannot obtain other groups’ decryption keys. The proposed scheme thus assures that no one except the permitted users can search over the encrypted data and read the data, and it therefore satisfies data confidentiality.<br />

(2) In the proposed scheme, a removed user can never again search or access restricted data. When a user uB is removed from a group, the database owner updates the encryption key σ to a new key σ′, and the server updates the encrypted data with the pair (Yi, s) sent by the data owner, so that C′ = C ⊕ s = M ⊕ H(σ′) (see equation (18)) and CT′i = CTi ⊕ Yi = Bi ⊕ H(wi, σ′) (see equation (19)), where primes mark the updated values. It is computationally infeasible for the revoked user uB to obtain any information about σ′. Therefore, the removed user uB cannot recover the new decryption key σ′ and is thus prevented from accessing restricted data.
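The key update in item (2) can be illustrated with a small sketch. It assumes, as the equality C ⊕ s = M ⊕ H(σ) for the updated key suggests, that a ciphertext has the form M ⊕ H(key) and that the update token is s = H(old key) ⊕ H(new key); the function `H` below is a hypothetical SHA-256 based stand-in, and all key values are toy data rather than anything taken from the scheme.

```python
import hashlib

def H(key: bytes, length: int) -> bytes:
    """Hypothetical stand-in for the scheme's hash H: expand key to `length` bytes."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

message = b"patient record 42"
sigma_old = b"old group key"   # toy values, not real key material
sigma_new = b"new group key"

# Owner side: initial ciphertext C = M xor H(sigma_old).
c_old = xor(message, H(sigma_old, len(message)))

# Owner side: update token s = H(sigma_old) xor H(sigma_new) (assumed form).
s = xor(H(sigma_old, len(message)), H(sigma_new, len(message)))

# Server side: re-encrypt using only C and s; neither key is needed.
c_new = xor(c_old, s)

# A remaining user who knows sigma_new can decrypt; the old key no longer works.
recovered = xor(c_new, H(sigma_new, len(message)))
stale = xor(c_new, H(sigma_old, len(message)))
```

Under these assumptions the server never sees either key, and a revoked user holding only the old key obtains garbage from the updated ciphertext.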

(3) The proposed scheme resists collusion attacks. Even if all users collude and give their secret shares ki to each other, they cannot reconstruct the polynomial f(x), since they obtain at most n shares, which is fewer than the threshold m. Moreover, a user cannot use the public information together with his key pair to derive the decryption key, since the public information is zi rather than f(dj).
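The threshold argument in item (3) follows the general pattern of polynomial secret sharing. The sketch below is a generic Shamir-style illustration, not the paper's exact construction: the modulus, the coefficients and the evaluation points are toy values chosen for the example. It shows that m shares of a degree m-1 polynomial recover f(0), while m-1 shares interpolate to an unrelated value.

```python
P = 1613  # small prime modulus, for illustration only

def make_shares(coeffs, xs):
    """Evaluate f(x) = coeffs[0] + coeffs[1]*x + ... mod P at each x."""
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P) for x in xs]

def interpolate_at_zero(shares):
    """Lagrange interpolation of f(0) from the given (x, y) shares, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for k, (xk, _) in enumerate(shares):
            if k != j:
                num = num * (-xk) % P
                den = den * (xj - xk) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yj * num * pow(den, -1, P)) % P
    return total

secret = 1234
coeffs = [secret, 166, 94]            # degree 2, so the threshold is m = 3
shares = make_shares(coeffs, [1, 2, 3, 4])

with_threshold = interpolate_at_zero(shares[:3])    # m shares
below_threshold = interpolate_at_zero(shares[:2])   # m - 1 shares
```

With three shares the interpolation returns the secret; with two it returns a different field element, which is the behaviour the collusion argument relies on.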

(4) The proposed scheme achieves hidden searching. To search for keyword wi, a user must compute the trapdoor Ti and send it to the server, where

$$\sigma = z_i v^{k_i} \bmod p, \quad X_i = H(w_i, \sigma), \quad c_i = f_\tau(X_i)$$

The server searches for wi in the ciphertext according to Ti, evidently without wi itself being revealed. Therefore, the proposed scheme allows a user to ask the server to search for keyword wi without revealing wi to the server.
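The quantities a user computes for a search in item (4) can be sketched as follows. This is only an illustration under stated assumptions: the prime p, the public values v and z_i, the share k_i and the key tau are toy numbers; SHA-256 stands in for H and HMAC-SHA256 stands in for the keyed function f; how these values are packed into the trapdoor is not reproduced here.

```python
import hashlib
import hmac

# Toy parameters, hypothetical and far too small or simple for real use.
p = 2**127 - 1          # a prime modulus (assumption)
v = 5                   # public value v (assumption)
z_i = 123456789         # public information z_i for this user (assumption)
k_i = 987654321         # the user's secret share k_i (assumption)
tau = b"prf key tau"    # key of the function f_tau (assumption)

def recover_sigma() -> int:
    """sigma = z_i * v^{k_i} mod p: one modular exponentiation, one multiplication."""
    return (z_i * pow(v, k_i, p)) % p

def H(w: str, sigma: int) -> bytes:
    """Stand-in for H(w_i, sigma) using SHA-256."""
    return hashlib.sha256(w.encode() + b"|" + str(sigma).encode()).digest()

def f(x: bytes) -> bytes:
    """Stand-in for the keyed function f_tau, here HMAC-SHA256."""
    return hmac.new(tau, x, hashlib.sha256).digest()

def trapdoor_parts(w: str):
    """Return (X_i, c_i) for keyword w; the keyword itself is never sent."""
    sigma = recover_sigma()
    x_i = H(w, sigma)
    c_i = f(x_i)
    return x_i, c_i
```

The same keyword always yields the same pair, so the server can match it against stored values, while a user without σ cannot produce a valid pair.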

(5) The proposed scheme achieves controlled searching. Only privileged group users can ask the server to search for keyword wi, since other users do not know σ and cannot generate a valid trapdoor Ti.

D. Performance Analysis<br />

(1) The proposed scheme supports dynamic changes to the set of permitted group users, and the addition or removal of a user is transparent to the other users, since they are not involved in the process. Granting a new user access to a resource, that is, adding the user to a group, requires neither re-encrypting the resource nor updating the decryption keys of the users already in the group. When a new user is added to a group, the new user's decryption key is encrypted and sent by the database owner to the new group user. Upon receiving the decryption key, the new group user can access the server, query the outsourced database and decrypt the results received from the server. Therefore, the proposed scheme can grant new users access to a resource easily, efficiently and quickly.

To remove a user from a resource, the data owner only needs to replace the encryption key σ with a new key, and the server re-encrypts the resource under the new key without the key being revealed to the server (see Section III). The secret keys of the users who can still access the resource need not be updated, since these users can recover the new σ from the public information. Therefore, the proposed scheme removes a user from the group easily, efficiently and quickly.

In many previous schemes, however, when a group user is removed from the group, the database owner has to re-encrypt the data, transmit the encrypted data to the server, and transmit a great number of new decryption keys to all the authorized users. If a large encrypted database is frequently transmitted to the server over a limited channel and a great number of new decryption keys are frequently transmitted to all the authorized users, the resulting overhead becomes practically prohibitive for large databases accessed by a dynamic group of users. The proposed scheme avoids re-encrypting the outsourced database under a new key at the owner, transmitting the re-encrypted data to the server, and transmitting new decryption keys to all the authorized users. Therefore, the proposed scheme is efficient and practical for large databases accessed by a dynamic group of users.

(2) In the proposed scheme, the major computation in the system is shifted from the user to the database owner and can be done in the initialization phase. In terms of efficiency, recovering the key σ costs each user only one multiplication and one modular exponentiation. Because a symmetric encryption algorithm is used, the computational cost of trapdoor generation, encryption and decryption is minimized, so the efficiency of the proposed scheme is high.

The storage overhead consists only of a constant-size key for each user and is therefore very low. Moreover, the scheme does not require any interaction between database owner and server, server and user, or database owner and user when the decryption key is set up and updated.

V. CONCLUSION<br />

In this paper, we have presented an efficient and secure encryption scheme with hidden keyword search for outsourced databases. We have also analyzed its security and performance and shown that the scheme is secure and practical for outsourced databases. Whenever the set of permitted group users changes, the data owner does not need to re-encrypt the data, transmit the encrypted data to the server, or send a great number of new decryption keys to all the authorized users. Adding or removing a user is also simple, quick and efficient. The proposed scheme ensures the privacy and confidentiality of sensitive data against both inside and outside attackers.

ACKNOWLEDGMENT<br />

This work was supported in part by grant 61070164 from the National Natural Science Foundation of China; by grant 81510632010000022 from the Natural Science Foundation of Guangdong Province, China; and by grants 2010B010600025 and 2010A032000002 from the Science and Technology Planning Project of Guangdong Province, China.

REFERENCES<br />

[1] D. Song, D. Wagner, A. Perrig. Practical Techniques for Searching on Encrypted Data. In: IEEE Symposium on Research in Security and Privacy 2000, pp. 44–55.

[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order preserving encryption for numeric data. In Proc. of ACM SIGMOD 2004, Paris, France, June 2004.

[3] E. Damiani, S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati. Metadata management in outsourced encrypted databases. In Proc. of the 2nd VLDB Workshop on Secure Data Management (SDM'05), Trondheim, Norway, September 2005.

[4] R. Brinkman, J. Doumen, and W. Jonker. Using secret sharing for searching in encrypted data. In Proc. of the Secure Data Management Workshop, Toronto, Canada, August 2004.

[5] S. Paraboschi and P. Samarati. Modeling and assessing inference exposure in encrypted databases. ACM Transactions on Information and System Security, 8(1), pp. 119–152, February 2005.

[6] S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati. Over-encryption: Management of access control evolution on outsourced data. In VLDB, 2007.

[7] S. Liu, W. Li, L. Y. Wang. Towards Efficient Over-Encryption in Outsourced Databases Using Secret Sharing. New Technologies, Mobility and Security, pp. 1–5, 2008.

[8] P. Golle, J. Staddon, B. Waters. Secure conjunctive search over encrypted data. In: ACNS 2004, Lecture Notes in Computer Science, vol. 3089. Springer; 2004. pp. 31–45.

[9] P. Wang, H. Wang, J. Pieprzyk. Keyword field-free conjunctive keyword searches on encrypted data and extension for dynamic groups. In: CANS 2008, Lecture Notes in Computer Science, vol. 5339. Springer; 2008. pp. 178–95.

[10] B. Zhang, F. Zhang. An efficient public key encryption with conjunctive-subset keywords search. Journal of Network and Computer Applications 34 (2011), pp. 262–267.

[11] Y. H. Hwang, P. J. Lee. Public Key Encryption with Conjunctive Keyword Search and Its Extension to a Multi-user System. Lecture Notes in Computer Science, 2007, Volume 4575/2007, 2–22.

[12] J. W. Byun, H. S. Rhee, H. A. Park, D. H. Lee. Off-line keyword guessing attacks on recent keyword search schemes over encrypted data. In: Proceedings of SDM'06. LNCS, vol. 4165, pp. 75–83.

[13] H. S. Rhee, J. H. Park, W. Susilo, D. H. Lee. Trapdoor security in a searchable public-key encryption scheme with a designated tester. Journal of Systems and Software 2010, 83(5), pp. 763–71.

[14] Q. Tang. Revisit the concept of PEKS: problems and a possible solution. Technical report TR-CTIT-08-54, Centre for Telematics and Information Technology, University of Twente, Enschede. ISSN 1381-3625, 2008.

[15] J. Camenisch, M. Kohlweiss, A. Rial, C. Sheedy. Blind and anonymous identity-based encryption and authorised private searches on public key encrypted data. In: PKC, Lecture Notes in Computer Science, vol. 433, 2009. pp. 196–214.

Xiaoming Wang received her Ph.D. degree from the Department of Mathematics, Nankai University, in 2003. She is a professor in the Department of Computer Science, Jinan University. Her research areas include database security, cryptography and network security.



JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1705<br />

A Method of Object-based De-duplication

Fang Yan

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China

School of Information, BeiJing WuZi University, BeiJing, China

Email: yanfang.joy@gmail.com

YuAn Tan

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China

Email: victortan@yeah.net

Abstract—Today, the world is increasingly awash in more<br />

and more unstructured data, not only because of the Internet, but also because data that used to be collected on paper or on media such as film, DVDs and compact discs has moved online [1]. Most of this data is unstructured and comes in diverse formats such as e-mail, documents, graphics, images, and videos. Object storage has a clear advantage in managing the complexity and scalability of unstructured data. Object-based data de-duplication is currently the most advanced method and is an effective solution for detecting duplicate data. It can detect common embedded data on the first backup, across completely unrelated files, and even when the physical block layout changes. However, almost all current research on data de-duplication does not consider the content of different file types and has no knowledge of the backup data format. It has been proven that such methods cannot achieve optimal performance for compound files.

In our proposed system, we first extract objects from files; Object_IDs are then obtained by applying a hash function to the objects. The resulting Object_IDs are used as indexing keys in a B+ tree-like index structure. We thus avoid the need for a full object index, and the search time for duplicate objects is reduced to O(log n). We introduce a new concept, the duplicate object resolver. The object resolver mediates access to all the objects and is a central point for managing all the metadata and indexes for all the objects. All objects are addressable by their IDs, which are globally unique. The resolver stores metadata in triple format. This improved metadata management strategy allows us to set, add and resolve object properties with high flexibility, and allows the same metadata to be reused among duplicate objects.

Index Terms—data de-duplication, object-based, backup,<br />

object index, metadata<br />

doi:10.4304/jnw.6.12.1705-1712

I. MOTIVATION

Limited storage capacity is increasingly becoming the bottleneck of IT systems. There are two main reasons. First, the information revolution has led to far more data than in the past, with a flood of new data produced all the time. Second, as computing and storage capacity increase, people tend to save all data permanently, and physical capacity must be purchased for all allocated storage. Under this trend, more and more storage systems come under pressure, and the cost of storing huge amounts of data often reaches a shocking level. To address these problems, data de-duplication technology is used to effectively reduce the duplication of user data in daily backups, so that the volume of backup data is greatly reduced [2, 3].

Broadly speaking, there are three approaches to data de-duplication: file-level, block-level and object-level data de-duplication.

File-level de-duplication is the most basic form of de-duplication; it identifies identical files and stores them only once. Also known as Single Instance Storage, it is perhaps also the easiest approach to implement. Its weak point is that if a file is changed by even a single byte, the entire file needs to be stored again [4]. If a file is changed and saved under a different name, the entire file will also be backed up again. This happens more often than one may think.
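The weakness of file-level de-duplication can be seen in a minimal sketch. The class and file names below are hypothetical, and SHA-1 is used only as an example fingerprint; the point is that an identical copy costs nothing extra, whereas a one-byte edit forces the whole file to be stored again.

```python
import hashlib

class SingleInstanceStore:
    """Toy file-level de-duplication: one stored copy per distinct file content."""

    def __init__(self):
        self.blobs = {}   # fingerprint -> file content
        self.names = {}   # file name -> fingerprint

    def put(self, name: str, content: bytes) -> bool:
        """Store a file; return True if new content had to be written."""
        fp = hashlib.sha1(content).hexdigest()
        is_new = fp not in self.blobs
        if is_new:
            self.blobs[fp] = content
        self.names[name] = fp
        return is_new

store = SingleInstanceStore()
original = b"A" * 1000
edited = b"B" + original[1:]                     # differs in a single byte

first = store.put("report.doc", original)        # new content, stored
copy = store.put("report_copy.doc", original)    # duplicate, not stored again
changed = store.put("report_v2.doc", edited)     # whole file stored again
stored_bytes = sum(len(b) for b in store.blobs.values())
```

After the three calls the store holds 2000 bytes for three files: the copy was free, but the single-byte change cost a full second copy.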

Disk-based backup commonly uses block-level data de-duplication, in which identical blocks from different files are stored only once. Block-level de-duplication generally involves three steps: chunking, computing the hash, and finding and storing the unique chunk data. The backup file is partitioned into multiple data chunks, and duplicate chunks are identified by comparing their fingerprints, which are hash values computed by a hash function. If an identical data chunk is found, a pointer to the already stored chunk is inserted into the index node of the backup file; only non-duplicate chunks are stored. The biggest differences among current block de-duplication implementations are the use of fixed-size versus variable-sized data chunks, and the use of sliding windows versus fixed offsets to define chunk addresses. Fixed-size chunking partitions files into fixed-size data chunks whose size is always equal to the physical block size of the storage device, for example 8 KB or 16 KB. Variable-sized chunking, in contrast, tolerates shifted content by breaking a file into a sequence of chunks whose boundaries are determined by the local contents of the file [5]. The Basic Sliding Window Algorithm [6] is the prototypical variable-sized chunking algorithm.
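The difference between fixed-size and content-defined chunking can be sketched as follows. This is a simplified illustration, not the Basic Sliding Window Algorithm itself: the window hash is recomputed with SHA-1 at every position instead of using a true rolling hash, and the chunk size, window length and mask are arbitrary choices for the example.

```python
import hashlib
import random

def fixed_chunks(data: bytes, size: int = 64):
    """Cut the data at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Cut after any position where a hash of the last `window` bytes matches the mask."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        h = int.from_bytes(hashlib.sha1(data[i - window:i]).digest()[:4], "big")
        if h & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fingerprints(chunks):
    return {hashlib.sha1(c).hexdigest() for c in chunks}

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
shifted = b"\x00" + original   # one byte inserted at the front

fixed_shared = len(fingerprints(fixed_chunks(original))
                   & fingerprints(fixed_chunks(shifted)))
cdc_shared = len(fingerprints(content_defined_chunks(original))
                 & fingerprints(content_defined_chunks(shifted)))
```

Inserting one byte shifts every fixed-offset boundary, so the fixed-size chunks no longer match, while the content-defined boundaries move with the data and most chunks are still recognised as duplicates.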

The most useful area for file-level and block-level de-duplication is in backup workflows where exactly the same set of files is archived routinely and the change rate of the files is relatively low. In these workflows, files are backed up regardless of whether they have changed, so it is highly likely that many blocks are common from one backup to another. In general, these techniques work well for text-based or simple content and do not work very well for compound file formats and workflows. Furthermore, in online versioning schemes such as snapshots, or in backup workflows where only the modified files are backed up, the likelihood of finding common blocks is very low. In such schemes, block de-duplication will not yield any benefit, and existing technologies for online archives (backups), snapshots and mirroring become expensive.

This paper presents an object-based data de-duplication solution to these problems. In our proposed system, after file type detection, we first extract objects from files. Object_IDs are then obtained by applying a hash function according to the size and content of each object. The object resolver is a central point for managing all the metadata and indexes for all the objects. The advantage of object-based data de-duplication is that even if the physical layout of a file changes (which can happen with a simple save operation), the logical objects can still be detected and stored only once. Unlike file-level and block-level technologies, object-based de-duplication chunks the file into well-known logical objects such as images, paragraphs, worksheets and slides.

II. SYSTEM ARCHITECTURE<br />

In many cases, because the same files or different versions of the same information are used, objects in compound files have the same name and location. In other cases, how the relevant documents were created is unknown, so we first parse the file before extracting objects. Accordingly, the system architecture is designed as shown in Figure 1.

[Figure 1 (diagram): Input files, File parser, Object extractor, Duplicated Object Resolver, Storage, File update log, MetaData]

Figure 1. Object-based data de-duplication system structure


The system includes four major components: file parser, object extractor, duplicate object resolver and storage. Input files may be in formats such as .pdf, .ppt, .doc and .jpg.

A. File Parser<br />

The system parses a file to determine whether it is compound or primitive and to determine its file type and attributes. It also determines the boundaries of the primitive objects within a compound file.

We divide files into two categories: compound objects and atomic (primitive) objects. A compound object, such as a ZIP file, a PPT file or a Word document, encapsulates a number of other objects; it is typically an encoded representation of the union of its contained objects. There may be as many as 20 kinds of file extension and more than 10 kinds of file encoding format. Primitive objects are the most basic representations of discrete data structures, such as images and executable files.
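One common way to make the compound-versus-primitive decision is to inspect the leading bytes of a file. The sketch below is a hypothetical illustration rather than the parser described here: the signature table covers only a handful of formats, and treating PDF as compound is an assumption made for the example.

```python
# Leading-byte signatures of a few common formats.
SIGNATURES = [
    (b"PK\x03\x04", "zip", True),                          # ZIP container
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2", True),   # legacy .doc/.ppt container
    (b"%PDF", "pdf", True),                                # assumed compound here
    (b"\xff\xd8\xff", "jpeg", False),
    (b"\x89PNG\r\n\x1a\n", "png", False),
]

def classify(header: bytes):
    """Return (file_type, is_compound) from the first bytes of a file."""
    for magic, file_type, is_compound in SIGNATURES:
        if header.startswith(magic):
            return file_type, is_compound
    return "unknown", False
```

For example, `classify(b"PK\x03\x04...")` reports a compound ZIP container, while a JPEG header is reported as a primitive object that can be fingerprinted directly.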

B. Object Extractor<br />


• Step 1: extract objects

Atomic objects, such as JPEG images, CAD drawings and AVI clips, go directly to step 2. Compound files differ in the specific document headers used to identify the encoded sections and objects. Object extraction is a recursive process: layer after layer is uncovered until the lowest-level atomic objects are reached. Some compound files, such as PPT files, do not contain clearly delimited elements like HTML tags, so objects should be extracted with different algorithms for different types of compound documents. In some cases the file header can be analyzed to determine the potential combination of objects and the object encoding format. For example, TIFF images have specific header information that describes the representation of the image and the compression algorithm that may have been used.
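The recursive nature of step 1 can be sketched for one container type. The example below handles only ZIP archives, using Python's standard `zipfile` module, and treats everything else as atomic; the function and file names are hypothetical, and a real extractor would need a separate handler for each compound format.

```python
import io
import zipfile

def extract_atomic(name: str, data: bytes):
    """Recursively unpack ZIP containers; yield (path, bytes) for each atomic object."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                yield from extract_atomic(f"{name}/{member}", archive.read(member))
    else:
        yield name, data

def make_zip(files: dict) -> bytes:
    """Helper for the demo: build an in-memory ZIP from {name: bytes}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for member, content in files.items():
            archive.writestr(member, content)
    return buf.getvalue()

inner = make_zip({"logo.jpg": b"\xff\xd8\xff fake image bytes"})
outer = make_zip({"notes.txt": b"plain text", "slides.zip": inner})
objects = dict(extract_atomic("backup.zip", outer))
```

The nested archive is unwrapped layer by layer, so the image inside `slides.zip` surfaces as its own atomic object alongside the text file.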

• Step 2: compute object fingerprints

Using a collision-resistant hash function such as SHA-1, each atomic object is assigned a globally unique 160-bit identifier called an object ID (Object Identifier). The first 32 bits of the fingerprint hold the size of the object. The size costs little to obtain, and objects of different sizes are clearly not the same. The remaining 128 bits are computed by applying a 128-bit hash function to the contents of the object. The object ID is used not only for verification but also as a unique virtual address with which to request and locate an object: the underlying storage mechanism stores objects by their fingerprints and retrieves them by that name, so the system is independent of where the objects are physically stored.
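The 160-bit layout described in step 2 can be sketched as a 32-bit size field followed by a 128-bit content hash. The 128-bit hash function is not named in the text, so BLAKE2b truncated to 16 bytes is used below purely as a stand-in; the function names are hypothetical.

```python
import hashlib
import struct

def object_id(content: bytes) -> bytes:
    """160-bit ID: 32-bit big-endian size followed by a 128-bit content hash."""
    # Sizes of 4 GiB or more wrap around in this toy 32-bit field.
    size_field = struct.pack(">I", len(content) & 0xFFFFFFFF)
    # Stand-in 128-bit hash (assumption; the source does not name the function).
    content_hash = hashlib.blake2b(content, digest_size=16).digest()
    return size_field + content_hash

def may_be_duplicates(id_a: bytes, id_b: bytes) -> bool:
    """Cheap size check on the first 4 bytes before comparing the hash part."""
    if id_a[:4] != id_b[:4]:
        return False   # different sizes: cannot be the same object
    return id_a[4:] == id_b[4:]
```

Putting the size first means that two objects of different length are rejected by comparing four bytes, before the hash part is examined at all.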



C. Duplicate object resolver<br />

The duplicate object resolver mediates access to all the objects and is a central point for managing all the metadata and indexes for all the objects. The resolver knows the total set of objects. All objects are addressable by their IDs, which are globally unique. The resolver is a singleton object that may be created in multiple threads or processes, all of which access the same underlying data storage and synchronization engine. The resolver provides the following services:

1) Metadata Services<br />

Metadata is an abstract concept that can exist independently of the data itself. Many variations are possible for each object, and each object requires different parameters. Rather than having different constructors for each object type, the resolver maintains consistency and flexibility by following a very simple pattern:

• We define metadata as a set of statements about objects, expressed in triple notation (Subject, Attribute, Value), where Subject is the object_ID that the statement is about. An Attribute can be any kind of property or relationship, such as the size of an object, the number of the file from which the object was extracted, or a timestamp. A Value is the value of the attribute, which is either a textual value or another object_ID. All metadata reduce to this triple representation.

• Using this system, we are able to store arbitrary attributes about any object. The triples record which attributes an object has and their values. We call these "relations" or "facts". In what follows, Obj denotes the global object domain.

• This flexibility allows duplicate objects to share the same metadata, allows different storage strategies for different types of objects, and allows third parties to extend the types of object properties or to introduce new types to improve de-duplication efficiency.

• The duplicate object resolver constructs the object index tree from these facts and relations, and stores object metadata in triple storage format. This includes setting, adding and resolving attributes for a given object_ID.
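The triple pattern in the list above can be sketched as a small in-memory store offering the three operations named there: setting, adding and resolving attributes. The class, the attribute names and the object IDs below are all hypothetical and only illustrate the (Subject, Attribute, Value) shape.

```python
from collections import defaultdict

class TripleStore:
    """Toy metadata store holding (subject, attribute, value) facts."""

    def __init__(self):
        self._facts = defaultdict(lambda: defaultdict(list))

    def add(self, subject: str, attribute: str, value) -> None:
        """Append one more value for an attribute (attributes may be multi-valued)."""
        self._facts[subject][attribute].append(value)

    def set(self, subject: str, attribute: str, value) -> None:
        """Replace all values of an attribute with a single value."""
        self._facts[subject][attribute] = [value]

    def resolve(self, subject: str, attribute: str) -> list:
        """Return the values recorded for an attribute of an object."""
        return list(self._facts.get(subject, {}).get(attribute, []))

store = TripleStore()
store.set("obj-1a2b", "size", 2048)
store.add("obj-1a2b", "extracted_from", "file-7")
store.add("obj-1a2b", "extracted_from", "file-9")   # same object found in a second file
store.set("obj-1a2b", "thumbnail", "obj-9f3c")      # a value may be another object ID
```

Because a duplicate object found in a second file only adds one more `extracted_from` fact, the rest of its metadata is stored once and shared.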

2) Object index and object de-duplication services<br />

First, it must be noted that two objects can be compared for equality only if they have the same encoding format; otherwise, only an approximate comparison is possible. Encoded files have the following property: two documents that appear to contain similar or identical information may be represented by totally different bits on the storage medium. Most common compound file formats use different encoding schemes [7]. Thus, duplicate objects should be compared on the basis of their content encoding format.

Indexing plays an important role in the de-duplication process. In this work, the duplicate object resolver builds and searches a B+ tree-like structure for object indexing (see Section IV) to identify two or more duplicate atomic objects from one or more files.

III. OBJECT EXTRACTION GRANULARITY

Two similar large objects may differ in only one byte within a large body of data, yet this difference prevents de-duplication because the index is keyed by hash code. Therefore, the object de-duplication granularity can be chosen according to object type during de-duplication processing. We classify object content types into text, images, audio, video and executable programs. Here we introduce the object size threshold, which is used as the basis for object extraction.

The object size threshold is determined as follows:

A. Generate a sample file collection

To generate a sample file collection in the storage pool, we randomly select backup file sets from the backup system, once or twice, and place them in the storage pool as the sample file collection.

B. Sample objects classification<br />

The system extracts and analyzes objects according to file type; sample objects of the same type are placed in the same collection.

C. Determine the range of candidate size thresholds

Objects of different sizes are clearly not the same. Suppose there are n objects in the sample object collection; the distribution of object sizes in the collection is represented by the set S:

$$S = \{s_1, s_2, \ldots, s_k\}, \quad k \le n, \quad s_i \ne s_{i+1}, \quad 1 \le i \le k \tag{1}$$

Let $d_{\min} = \mathrm{MIN}(s_1, s_2, \ldots, s_k)$ denote the minimum object size in the sample object collection, and let $d_{\max} = \mathrm{MAX}(s_1, s_2, \ldots, s_k)$ denote the maximum object size in the sample object collection.

Determine the range of candidate size thresholds:

$$D = [d_1, d_2, \ldots, d_m], \quad 1 \le m \le k \tag{2}$$

To be consistent with the specified minimum average block size of 256 B in the backup system, the candidate thresholds satisfy the following conditions ((3)~(6)):

$$d_1 = d_{\min} \tag{3}$$

$$\text{if } (d_{\min} + \delta) \bmod 256 = 0 \text{ then } d_2 = \min(d_{\min} + \delta), \quad \delta = 1, 2, 3, \ldots \tag{4}$$

$$d_{i+1} = d_i + 256, \quad 2 \le i \le m - 2 \tag{5}$$

$$\text{if } (d_{\max} + \delta) \bmod 256 = 0 \text{ then } d_m = \min(d_{\max} + \delta), \quad \delta = 1, 2, 3, \ldots \tag{6}$$
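Conditions (3)~(6) can be read as: start at the minimum object size, jump to the next multiple of 256 above it, step by 256, and stop at the first multiple of 256 above the maximum object size. The sketch below follows that reading, taking the added offset to be at least 1 as stated; the function names are hypothetical, and the bound on m from (2) is not enforced.

```python
def next_multiple(value: int, block: int = 256) -> int:
    """Smallest multiple of `block` strictly greater than `value` (offset >= 1)."""
    return (value // block + 1) * block

def candidate_thresholds(sizes, block: int = 256):
    """Candidate size thresholds d_1..d_m for a collection of object sizes."""
    d_min, d_max = min(sizes), max(sizes)
    d2 = next_multiple(d_min, block)     # condition (4)
    d_m = next_multiple(d_max, block)    # condition (6)
    # d_1 = d_min (3), then multiples of `block` from d2 up to d_m (5).
    return [d_min] + list(range(d2, d_m + 1, block))

thresholds = candidate_thresholds([100, 300, 700, 1000])
```

For object sizes between 100 and 1000 bytes this gives the candidates 100, 256, 512, 768 and 1024.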



D. Generate object size thresholds<br />

For each type of object in the sample collection, the system traverses the range of candidate thresholds. For each candidate threshold, any object larger than the threshold is divided into smaller objects of the threshold size. We then calculate the data compression ratio (DCR) produced by that candidate threshold, using the following equation:

$$\mathrm{DCR} = \frac{\text{Initial Dedup\_ObjTS}}{\text{Dedup\_ObjTS}} \tag{7}$$

where Initial Dedup_ObjTS is the total amount of data after de-duplication based on the original object sizes, and Dedup_ObjTS is the total amount of data after de-duplication based on the candidate threshold value.

The candidate threshold that produces the maximum DCR is selected as the size threshold for that object type.
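The threshold selection by equation (7) can be sketched on toy data. The example assumes that de-duplication keeps one copy of each distinct piece, identified here by a SHA-1 fingerprint, and that splitting simply cuts an oversized object at multiples of the threshold; the function names and sample objects are hypothetical.

```python
import hashlib

def dedup_size(objects, threshold=None):
    """Total bytes kept after de-duplication, optionally splitting large objects."""
    unique = {}
    for obj in objects:
        if threshold is not None and len(obj) > threshold:
            parts = [obj[i:i + threshold] for i in range(0, len(obj), threshold)]
        else:
            parts = [obj]
        for part in parts:
            unique[hashlib.sha1(part).digest()] = len(part)
    return sum(unique.values())

def best_threshold(objects, candidates):
    """Return (threshold, DCR) for the candidate with the highest DCR."""
    initial = dedup_size(objects)   # Initial Dedup_ObjTS
    ratios = {d: initial / dedup_size(objects, d) for d in candidates}
    best = max(ratios, key=ratios.get)
    return best, ratios[best]

# Two large objects that differ only in their final 256 bytes.
a = b"x" * 768 + b"a" * 256
b = b"x" * 768 + b"b" * 256
threshold, dcr = best_threshold([a, b], [256, 512, 1024])
```

Without splitting, the two objects are distinct and 2048 bytes are kept; with a 256-byte threshold only three distinct pieces (768 bytes) remain, so that candidate gives the highest DCR.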

E. Save threshold<br />

We establish a mapping between each type of object and its corresponding size threshold, and save it in the object-type threshold library.

IV. OBJECT INDEX MECHANISM<br />

In a de-duplication system, data block comparison is the most frequent operation, because the most important task in de-duplication is to compare all the data blocks to determine whether the data has already been stored. Traditional methods of comparing data blocks generally use a hash value database that retains a unique hash value for each block. But the complexity of the hash query is generally linear or logarithmic; that is, as the data size grows, the efficiency of data block comparison gradually declines. In a large-scale de-duplication system, this has a great impact and lowers the operating efficiency of the system. Therefore, the main problem in a data de-duplication system is how to use a fast data comparison technique that makes comparison efficiency independent of the size of the backup data, so as to improve the operating efficiency of large-scale backup systems [8, 10].

Our proposed object index mechanism for data de-duplication is based on the B+ tree index structure. Its search time is O(log n), which is more efficient than the O(n) of scanning a full index. The duplicate object resolver constructs the index tree from the extracted object fingerprints and object information. Because of the properties of the B+ tree, the left and right sub-trees of every non-leaf node stay balanced. Compared with binary search over a contiguous memory space, its advantage is that changing the B+ tree (inserting and deleting nodes) does not require moving large segments of memory data, and usually incurs only a constant overhead.<br />
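To make the O(log n) descent concrete, the following sketch looks up an object identifier in a small, hand-built two-level tree. It is only an illustration under simplifying assumptions: integer keys stand in for Object_IDs, the tree is built by hand rather than by insertion and node splitting, and the `Node` class and `search` function are hypothetical names.<br />

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                      # sorted Object_ID keys
    children: list = field(default_factory=list)    # empty for a leaf node
    values: list = field(default_factory=list)      # leaf payloads, one per key


def search(root, key):
    """Descend from the root to a leaf; return the payload or None."""
    node = root
    while node.children:
        # Keys >= keys[i] live in children[i + 1], as in a B+ tree.
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None


leaves = [
    Node([5, 10, 20], values=["meta5", "meta10", "meta20"]),
    Node([27, 30, 50], values=["meta27", "meta30", "meta50"]),
    Node([64, 75], values=["meta64", "meta75"]),
]
root = Node([27, 64], children=leaves)
print(search(root, 30))   # meta30
print(search(root, 42))   # None
```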

The proposed indexing mechanism is shown in Figure 2, in which Object_IDn is the object identifier, Object_IDn's Metadata is the metadata of a particular object, and Objectn is the content of the object.<br />

Figure 2. The object index structure (index nodes keyed by Object_ID; each identifier links to an object metadata node, an object relation node such as "file123, file789, Object_ID5", and the object content nodes Object5, Object7, ...)<br />

The path of an object index contains the following types of nodes:<br />

© 2011 ACADEMY PUBLISHER



A. Object index node<br />

An object index node consists of object identifiers (Object_IDn). The objects in each node are ordered according to their size.<br />

B. Object metadata node<br />

The metadata record the object identifier, object size, object type, object encoding format, the object's location in the document, etc., and are stored in the form of triples. Object metadata can be stored in an external SQL server.<br />

C. Object Relation node<br />

An object relation node describes the relationship between two objects. Relations are stored in a file format in which each line contains the object hash code and the filename. In practice, it is much more efficient to refer to the filename and its long directory path via a short index number into a separate table of filenames stored in a database [9].<br />

Multiple file numbers referring to an identical object are listed as follows: first the number of the file whose copy of the object has been stored, then the number of the file that contains the duplicate object, followed by the fingerprint of the identical object. In effect, the relation nodes implicitly include the file-file similarity pairs as desired. In the future, we can use the well-known union-find algorithm to determine clusters of interconnected files, and then compare the similarity of the files.<br />
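A minimal sketch of that union-find step is given below. It assumes relation lines of the form "first file, duplicate file, fingerprint"; the function name `cluster_files` is hypothetical, and the third relation line is made-up sample data added only to produce a second cluster.<br />

```python
def cluster_files(relations):
    """Group file numbers connected through shared (duplicate) objects."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for line in relations:
        first_file, dup_file, _fingerprint = line.split(",")
        union(int(first_file), int(dup_file))

    clusters = {}
    for f in list(parent):
        clusters.setdefault(find(f), set()).add(f)
    return sorted(sorted(c) for c in clusters.values())


relations = [
    "237,169,1010B34F2C127EBDF18526F6323F3E2D2E3D",
    "237,169,3035805C4E32BF559232DDA4D1FBF161D068",
    "301,302,HYPOTHETICALFINGERPRINT",   # invented line, for illustration only
]
print(cluster_files(relations))   # [[169, 237], [301, 302]]
```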

D. Object Content node<br />

An object content node stores the contents of the object.<br />

V. OBJECT DE-DUPLICATION PROCESS<br />

For each of the different file types, such as PDF, Word, PPT, TXT, ZIP, RAR and TAR, the system performs the following steps:<br />

• Step 1: Accept the input file;<br />

• Step 2: Analyze the file type;<br />

• Step 3: Extract objects from the file and compute their Object_IDs;<br />

• Step 4: Check whether duplicate objects exist by comparing object fingerprints, composed of the object size and hash code, using the efficient object indexing mechanism;<br />

• Step 5: If the object is a duplicate, update the object relation node. Otherwise, insert the object index node and metadata, then store the new data.<br />
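Steps 3 to 5 above can be sketched as follows. This is only an illustration: a Python dictionary stands in for the index tree, objects are assumed to be already extracted as byte strings, and the fingerprint layout (object size followed by the SHA-1 hash) is an assumption. The class `ObjectStore` and its methods are hypothetical names.<br />

```python
import hashlib


class ObjectStore:
    """Toy in-memory stand-in for the index, metadata, relation and content nodes."""

    def __init__(self):
        self.index = {}       # fingerprint -> metadata (a dict replaces the index tree)
        self.content = {}     # fingerprint -> object bytes
        self.relations = []   # (first file, duplicate file, fingerprint)

    @staticmethod
    def fingerprint(data):
        # Assumed layout: object size followed by the SHA-1 content hash.
        return f"{len(data)}{hashlib.sha1(data).hexdigest().upper()}"

    def add_file(self, file_num, objects):
        """Steps 3-5 for objects already extracted from one file."""
        for data in objects:
            fp = self.fingerprint(data)
            if fp in self.index:                        # Step 4: duplicate found
                first = self.index[fp]["filenum"]
                self.relations.append((first, file_num, fp))   # Step 5: update relation
            else:                                       # Step 5: store new object
                self.index[fp] = {"filenum": file_num, "size": len(data)}
                self.content[fp] = data


store = ObjectStore()
store.add_file(237, [b"text-object", b"image-object"])
store.add_file(169, [b"image-object", b"new-object"])
print(len(store.content), store.relations[0][:2])   # 3 (237, 169)
```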

Figure 3. Object extraction diagram (objects a, b, c and d are extracted from File237 and from File169 and labeled with their sizes in bytes and content hashes; the hash values B34F2C12 and 805C4E32, belonging to the objects of 1010 and 3035 bytes, appear under both files and mark the duplicate objects)<br />



Figures 3 to 5 show the object de-duplication process. The example includes a PDF file (File237 in Figure 3) and a PPT file (File169 in Figure 3). The contents boxed by a dashed line represent a unit that can be treated as an independent object. As shown, objects a, b, c and d are extracted from each file (the number in brackets is the object size in bytes). A content hash is calculated for each object.<br />

Assume that the system has stored the objects in File237. Before the objects in File169 are inserted, the structure of the object index tree is as shown in Figure 4.<br />

Figure 4. Object index tree<br />

File169 contains two duplicate objects. The object index tree after the insert operation is shown in Figure 5.<br />

Index node keys: 2459D32..., 1010B34F..., 106732B5..., 196532BC..., 25694E31..., 3035805C...<br />

Object Metadata node: Obj:ID = 1010B34F2C127EBDF18526F6323F3E2D2E3D; Obj:ID:filenum = 237; Obj:ID:type = txt; Obj:ID:stored = 5bdbf7bcd8a540cb9af0fd7e4d0e2c9e<br />

Object Relation node: 237,169,1010B34F2C127EBDF18526F6323F3E2D2E3D<br />

Object Metadata node: Obj:ID = 3035805C4E32BF559232DDA4D1FBF161D068; Obj:ID:filenum = 237; Obj:ID:type = image; Obj:ID:stored = 479ef7bce9n340cb9af0fd7e4d0e18a<br />

Object Relation node: 237,169,3035805C4E32BF559232DDA4D1FBF161D068<br />

Object content nodes: Object_a, Object_b, ……, Object_d, Object_c<br />

Figure 5. Object index tree



As the above example shows, object-based de-duplication can detect common embedded data across unrelated files, even when the physical block layout changes. Block-level de-duplication, however, has no idea where a logical object begins and where it ends. As a result, the chunking process will split the images in files, and because the images sit at different positions, the duplicate data will not be detected at all.<br />
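The effect can be reproduced with a small experiment. The sketch below is illustrative only: it uses a 256-byte block size and synthetic data rather than the paper's data sets, and it assumes the object extractor already knows where the embedded object starts. The same "image" is embedded at two different offsets, so no fixed-size block is shared, while the object-level fingerprints match.<br />

```python
import hashlib


def block_hashes(data, block_size=256):
    """SHA-1 digests of consecutive fixed-size blocks."""
    return {hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}


image = bytes(range(256)) * 4          # stand-in for an embedded image object
file_a = b"A" * 100 + image            # image starts at offset 100
file_b = b"B" * 37 + image             # same image, different offset

# Block level: the blocks are cut at different positions inside the image.
shared_blocks = block_hashes(file_a) & block_hashes(file_b)

# Object level: the extracted objects hash identically regardless of offset.
object_a = hashlib.sha1(file_a[100:]).hexdigest()
object_b = hashlib.sha1(file_b[37:]).hexdigest()

print(len(shared_blocks), object_a == object_b)   # 0 True
```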

VI. EVALUATION<br />

This paper focuses on one evaluation aspect of data backup: the de-duplication ratio achieved by our proposed method. We chose two representative data sets. The first was a collection of compound files; a compound file often contains text, figures, and audio or video clips. The details of data set 1 are described in Table I, where "# of files" is the number of files. The second was a collection of source code, which is typically versioned; this data set consisted of 450 versions, from 1.2.1 to 2.5.75, with a total size of 26 GB.<br />

We used the two data sets and four full backups for our evaluation. We performed three different kinds of de-duplication: file-level, block-level and object-level. SHA-1 is used as the hash algorithm; it generates a 160-bit fingerprint for each file, chunk or object. Block-level de-duplication uses a fixed-size block, for which we chose 16 KB. The experimental results are shown in Figure 6 and Figure 7.<br />

TABLE I.<br />
BACKUP DATASET1<br />
Backup | Type | Size (KB) | # of files<br />
1st | PDF | 4,113,862 | 6020<br />
1st | PPT | 335,006 | 562<br />
2nd | PDF | 1,113,862 | 1420<br />
2nd | PPT | 34,019 | 108<br />
3rd | PDF | 5,002,635 | 6421<br />
3rd | PPT | 310,006 | 511<br />
4th | PDF | 263,943 | 2501<br />

We can draw a few conclusions from the results. The improvement differs between the data sets. Object-based de-duplication effectively improves the de-duplication ratio for data set 1, because it mainly improves the de-duplication ratio of unstructured data sets. According to our experiments, the improvement for data set 2 over block-level and file-level de-duplication is not obvious.<br />

Note that our implementation is currently a research prototype rather than a production-quality storage de-duplication system. Hence, our experimental results should not be used for absolute comparison with other storage de-duplication systems. We will carry out more comprehensive experiments in future work, especially on data indexing and metadata management.<br />

VII. CONCLUSION AND FUTURE WORK<br />

Existing file-based and block-based data de-duplication technology is very suitable for text and simple content, but not for compound documents. This paper proposes an object-based de-duplication framework and an efficient object index mechanism that speeds up the search for duplicate objects. It can detect common embedded data at the first backup across completely unrelated files, even when the physical block layout changes. As a result, object-based de-duplication is more efficient than block-based de-duplication for compound files.<br />

Future work includes: a) implementing the framework; b) improving the processing speed by moving most computations to the graphics processing unit (GPU), which we expect will reduce the time spent on intensive computations such as object extraction and fingerprint computation.<br />

Figure 6. De-duplication efficiency comparison of data set 1 (y-axis: de-duplication ratio, 0% to 45%; x-axis: 1 to 4; series: fixed-block, whole file, object)



Figure 7. De-duplication efficiency comparison of data set 2 (y-axis: de-duplication ratio, 0% to 45%; x-axis: 1 to 4; series: fixed-block, whole file, object)<br />

REFERENCES<br />

[1] Dell Product Group. Object Storage: A Fresh Approach to Long-Term File Storage. A Dell Technical White Paper.<br />

[2] Tony A, Biggar H. Data De-Duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. The Enterprise Strategy Group Technical Report. 2007.<br />

[3] Biggar H. Experiencing Data De-Duplication: Improving Efficiency and Reducing Capacity Requirements. The Enterprise Strategy Group Technical Report. 2007.<br />

[4] William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur. Single Instance Storage in Windows 2000. In Proceedings of the 4th Conference on USENIX Windows Systems Symposium, Volume 4, USENIX Association, Berkeley, CA, USA, 2000.<br />

[5] An In-Depth Look at Data Deduplication Methods. The Enterprise Strategy Group Technical Report, www.falconstor.com.<br />

[6] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP’01), pages 174–187, Banff, Canada, October 2001.<br />

[7] Goutham Rao, San Jose, Eric Brueggemann, Carter George. Object deduplication and application aware snapshots. Patent application publication, US, 2010.<br />

[8] Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the Data Domain deduplication file system. In: Proceedings of the 6th USENIX Conference on File and Storage Technologies. 2008.<br />

[9] George Forman, Kave Eshghi, Stephane Chiocchetti. Finding Similar Files in Large Document Repositories. In the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’05), Chicago, USA, August 2005.<br />

[10] R. Bayer and E. McCreight, "Organization and Maintenance of Large Ordered Indices", Acta Informatica, Volume 1, Springer, Berlin/Heidelberg, New York, 1972, pp. 173-189.<br />

[11] S. Walter, T. Thiago, M. Carla and Jr. Wagner Meira, "A Scalable Parallel Deduplication Algorithm", 19th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, Brazil, 2007, pp. 79-86.<br />

[12] W. You et al., "PRUN: Eliminating Information Redundancy for Large Scale Data Backup System", International Conference on Computational Sciences and Its Applications (ICCSA 2008), IEEE Computer Society, Italy, 2008.<br />

[13] V. Henson and R. Henderson. Guidelines for Using Compare-by-Hash. Forthcoming, 2005. http://infohost.nmt.edu/~val/review/hash2.html<br />

[14] Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse indexing: large scale, inline deduplication using sampling and locality. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies. 2009.<br />

[15] Quinlan S, Dorward S. Venti: a new approach to archival storage. In Proceedings of the Conference on File and Storage Technologies. 2002, 89–101.<br />

Fang YAN, born in 1980, is a Ph.D. candidate at Beijing Institute of Technology, Beijing, China. Her research interests include data de-duplication and network storage. She is a senior lecturer in the Department of Information, Beijing Wuzi University.<br />

Yuan TAN (Beijing, China), born in 1972, holds a Ph.D. in computer science. His current research interests include information security and network storage. He is a professor and Ph.D. supervisor at Beijing Institute of Technology, Beijing, China, and a senior member of the China Computer Federation.



Analysis on E-consumers’ Purchasing Behavior<br />

Based on Data-driving Model<br />

Lijuan Huang<br />

Information Management College of Jiangxi University of Finance and Economics, Nanchang, 330013, China<br />

Email: huanglijuan66s@126.com<br />

Abstract—The vast sea of online purchasing data on the Internet makes research models of e-consumers’ purchasing behavior very different from traditional ones. This paper first proposes three kinds of research models of consumers’ purchasing behavior and points out that the data-driving model is the best one for analyzing e-consumers’ purchasing behavior on the Internet. Second, it adopts the improved SOFM Neural Network as the tool of the data-driving model to analyze in detail e-consumers’ purchasing behavior in Internet marketing. Finally, experimental results demonstrate that the method offers better visualization, exactness and robustness. Because the analysis of consumers’ purchasing behavior based on the SOFM Neural Network is a comparatively novel method, the research results in this paper are offered for reference only.<br />

Index Terms—Internet marketing, purchasing behavior,<br />

neural network, data-driving model<br />

I. INTRODUCTION<br />

Research on the characteristics of consumers’ purchasing behavior dates back to eighteenth-century England. At that time, large numbers of farmers poured into the cities. These new urban residents put their faith in products that could demonstrate their social status, and their faith in and attitude toward these products drew people’s attention to consumer behavior [1]. Research on consumer behavior originated and developed from a western paper named Consumer Analysis, published by Guest in the Annual Review of Psychology in 1962 [2]. Afterwards, many celebrated scholars did active work on the characteristics of consumer behavior. For example, Engel, Kotler and Cliff Allen proposed the T-I-K model of consumer behavior in 1993; Solomon, Schiffman and Kanuk raised the U-S-E model of consumer behavior in 1999; and J. Paul Peter and Jerry C. Olson presented the S-C-T model of consumer behavior in 2000 [3-6]. However, these studies all belong to either the experience-driving research model or the theory-driving research model. The author believes that, with the development of modern science and technology, and especially of neural network technology, data mining, artificial intelligence and multi-disciplinary technology, research models of consumer behavior should include a data-driving model in addition to the experience-driving and theory-driving research models. These three kinds of research models are described in Table I.<br />


doi:10.4304/jnw.6.12.1713-1718<br />

TABLE I.<br />

RESEARCH MODEL OF CONSUMERS’ PURCHASING BEHAVIOR<br />

Method 1: Experience-driving model<br />

The researcher communicates with consumers through speech, facial expression and other body language, and then analyzes consumers’ purchasing behavior on the basis of the researcher’s own experience. In the virtual world of the Internet, however, there is a large amount of data about e-consumers’ purchasing behavior and the researcher loses the chance to communicate with consumers face to face, so analysis of consumers’ purchasing behavior based on the experience-driving model loses its effect.<br />

Method 2: Theory-driving model<br />

The research steps of the theory-driving model are shown in Fig. 1. In this kind of research model, the researcher first obtains a theoretical model from purchasing behavior theories, then makes full use of purchasing data to test and modify the model repeatedly, and finally deduces and analyzes consumers’ purchasing behavior on the basis of the final model. This kind of research model can yield an unreliable analysis result when the underlying purchasing behavior theories are imperfect or even wrong.<br />

Method 3: Data-driving model<br />

The research steps of the data-driving model are shown in Fig. 2. In this kind of research model, the researcher first selects an appropriate intelligent algorithm; a model is then drawn from the purchasing data and modified repeatedly using these data; finally, consumers’ purchasing behavior is deduced and analyzed on the basis of the final model. The data-driving model is based on real data rather than personal experience or pure theory, and it realizes the scientific idea of letting the data speak for themselves. The resulting analysis of consumers’ purchasing behavior is therefore more scientific, objective and fair.<br />

Table I shows that it is difficult to adopt the experience-driving model to analyze the characteristics of online consumers’ purchasing behavior, whereas the theory-driving model or the data-driving model may be appropriate. As Table I, Fig. 1 and Fig. 2 show, the data-driving model is more objective, scientific and unbiased than the experience-driving model or the theory-driving model for analyzing e-consumers’ purchasing behavior.



Figure 1. Analyzing consumers’ purchasing behavior based on theory-driving model (diagram boxes: purchasing behavior theories; ① model deducting; model; data warehouse; ② model modifying; ③ drawing conclusion from the model; feature of purchasing behavior)<br />

Figure 2. Analyzing consumers’ purchasing behavior based on data-driving model (diagram boxes: ① algorithm selecting, e.g. DM, ANN; data warehouse; model; ② model modifying; ③ drawing conclusion from the model; feature of purchasing behavior)<br />

Therefore, the data-driving model is the most suitable for analyzing the characteristics of online consumers’ purchasing behavior. Because all input data come from the consumers, it also fully reflects the idea that “Consumer is the God”. Because the Self-Organizing Feature Map Neural Network (SOFM NN) is a typical data-driving model, this paper takes the SOFM NN as a tool to analyze e-consumers’ purchasing behavior. The basic principles of the SOFM NN are described as follows.<br />

II. BASIC PRINCIPLES OF THE SOFM NEURAL NETWORK<br />

In 1981, the Finnish scholar Teuvo Kohonen first proposed the concept of the SOFM NN [7], which simulates the way the brain responds to different kinds of input signals (e.g. light signals, sound signals) and automatically sorts these input signals into different zones of the brain layer [8]. By feeding a large amount of consumers’ purchasing data into the SOFM NN, e-consumers can be objectively, scientifically and automatically clustered into different groups according to the similarity of their purchasing data; this means minimizing the differences between consumers in the same group and maximizing the differences between groups. Analyzing the distinct features of these consumer groups helps in designing targeted marketing strategies for promotion, service, price, etc. It avoids the risks of applying a uniform strategy to all consumers, of spending heavily on unimportant consumers, and of losing potential VIP consumers through unscientifically ranked service.<br />

A. Topology Structure of the SOFM NN<br />

The typical SOFM NN (see Fig. 3) forms a topology structure of the input signals on a one-dimensional or two-dimensional cellular array [8], so the SOFM NN has the ability to extract the features of the input signals’ model [9]. The SOFM NN commonly includes only a one-dimensional or two-dimensional array, but it can also be extended to handle a multi-dimensional cellular array [10-12]. To improve the stability and operating efficiency of the SOFM NN, we add a feedback loop to the traditional SOFM NN to obtain the improved SOFM NN (see Fig. 4).<br />

Figure 3. Topology structure of the traditional SOFM NN (labels: input layer, competitive layer, victorious neuron)<br />

Figure 4. Topology structure of the improved SOFM NN (labels: input layer, competitive layer, victorious neuron, feedback loop)<br />

The improved SOFM NN is composed <strong>of</strong> the<br />

following four parts.<br />

• Cellular array for recognizing: This is mainly<br />

used for receiving the input signals and forming<br />

the “discrimination function” to recognize the<br />

input signals.<br />

• Mechanism for comparing and choosing: This is<br />

used for comparing these “discrimination<br />

functions” and making a decision to choose a<br />

processing unit with stronger functional output<br />

signals.<br />

• Local inter-connection and inter-action: This is used for stimulating both the chosen signal processing unit and its nearby signal processing units.<br />

• Self-adapting process: This is used for modifying<br />

the parameters <strong>of</strong> stimulated processing unit so<br />

that it can increase the output value <strong>of</strong> the given<br />

“discrimination function”.<br />

B. The SOFM NN’s Algorithm<br />

The SOFM NN’s algorithm is described as follows.<br />

1) Initialization: choose the “nearby neuron” set $S_j(0)$ of output neuron $j$; the connection weight $w_{i,j}(0)$ between input neuron $i$ and output neuron $j$ is computed as in equation (1).<br />

$$w_{i,j}(0) = \frac{1}{n} \sum_{X \in S(k)} X_{PAM} \tag{1}$$<br />

2) Calculating the Euclidean distance: the Euclidean distance is the distance between the input sample and every output neuron $j$. The distance $d_j(t)$ is calculated as shown in equation (2).<br />

$$d_j(t) = \ln(\lVert X - w_j \rVert) = \ln\Big(\sum_{i=1}^{n} [x_i(t) - w_{i,j}(t)]^2\Big) \tag{2}$$<br />

3) Defining a neighborhood function: the neighborhood function $S_j(t)$ is expressed in equation (3), where $S_j(t)$ decreases as time goes on.<br />

$$S_j(t) = S_j(0) \exp\Big(-\frac{d_j(t)}{2\sigma^2}\Big) \tag{3}$$<br />

4) Working out the minimum distance: the minimum distance $\min(d_j)$ among the corresponding neurons is calculated as in equation (4).<br />

$$\min(d_j) = \arg\min_j \sum_{i=1}^{n} [x_i(t) - w_{i,j}(t)]^2 \tag{4}$$<br />

5) Setting the learning rate: the learning rate $\eta(t)$ is computed according to equation (5), where $\eta$ decreases to zero as time $t$ goes on.<br />

$$\eta(t) = \eta(0) \exp\Big(-\frac{t}{\tau}\Big) \tag{5}$$<br />

6) Modifying the weight value: the weight variation $\Delta w_{i,j}(t)$ is computed as in equation (6). When $\Delta w_{i,j}(t)$ reduces to zero, the topology structure of the SOFM NN is most stable.<br />

$$\Delta w_{i,j}(t) = \begin{cases} \eta(t)\,[x_i(t) - w_{i,j}(t)], & X \in S(k) \\ 0, & X \notin S(k) \end{cases} \tag{6}$$<br />

7) Offer new learning samples and repeat the learning process above, with $t \leftarrow t + 1$, until $\eta(t)$ decreases to 0 or becomes sufficiently small, at which point the network learning process terminates.<br />
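The core of steps 4) to 6) can be sketched in a few lines. The code below is a simplified, standard Kohonen-style update, not the improved network with the feedback loop: it assumes a one-dimensional chain of neurons, a fixed neighborhood radius in place of the shrinking set S, and the plain squared distance of equation (4). All function names and default parameter values are hypothetical.<br />

```python
import math


def learning_rate(t, eta0=0.5, tau=200.0):
    """Equation (5): eta(t) = eta(0) * exp(-t / tau)."""
    return eta0 * math.exp(-t / tau)


def winner(x, weights):
    """Equation (4): index of the neuron with the smallest squared distance."""
    def sq_dist(w):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j]))


def train_step(x, weights, t, radius=1, eta0=0.5, tau=200.0):
    """Equation (6): move the winner and its neighbours toward sample x."""
    c = winner(x, weights)
    eta = learning_rate(t, eta0, tau)
    for j, w in enumerate(weights):
        if abs(j - c) <= radius:          # neuron j lies in the neighbourhood
            weights[j] = [wi + eta * (xi - wi) for xi, wi in zip(x, w)]
    return c


weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
c = train_step([0.9, 0.8], weights, t=0, radius=0)
print(c, weights[1])   # winner index 1; its weights move halfway toward the sample
```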

III. AN EXAMPLE OF ANALYZING E-CONSUMERS’<br />

PURCHASING BEHAVIOR<br />

Because selling books is a typical form of e-business, this paper takes the consumers of a book-selling website as an example for analyzing e-consumers’ purchasing behavior [13].<br />

A. Main Clustering Variables<br />

Most of the customer data come from the online transaction records of a famous book website (dingdang.com) in China [1]. These data can be divided into two groups: customers’ attribute data and transaction data. Customers’ basic attribute data mainly include the customer’s name, gender, age, income, educational status, occupation, city, marital status, enrolment time, home address, hobbies, etc. Transaction data mainly include shopping time, frequency of shopping, amount spent, product name, price, payment method (e.g. cash on delivery, cash on postage and credit card), latest shopping time, etc.<br />

The main clustering variables of the SOFM neural network are listed in Table II, where the variables labeled (*) are the clustering variables.<br />

TABLE II.<br />
MAIN CLUSTERING VARIABLES OF THE SOFM NEURAL NETWORK<br />
Total amount of purchase | Monthly income | Frequency of shopping | Latest time of shopping<br />
x1(*) | x2(*) | x3(*) | x4(*)<br />
Age | Gender | Educational status | District<br />
x5 | x6 | x7 | x8<br />

B. Sample Data of Consumers’ Behavior<br />

There are 5000 sample records. Because of the length limit of this paper, only part of the samples are listed in Table III, where capitalized variables denote values standardized to the domain [0, 1].<br />

TABLE III.<br />
CONSUMING SAMPLE DATA FROM E-MARKET<br />
Cust-ID | Total amount of purchase (X1) | Monthly income (X2) | Frequency of shopping (X3) | Latest shopping time (X4)<br />
1001 | 0.9260 | 0.9454 | 0.9720 | 0.9335<br />
1002 | 0.7549 | 0.6950 | 0.7273 | 0.6918<br />
1003 | 0.8118 | 0.8975 | 0.7586 | 0.7324<br />
1004 | 0.7982 | 0.6825 | 0.8180 | 0.7817<br />
1005 | 0.6532 | 0.5816 | 0.6609 | 0.5141<br />
… | … | … | … | …<br />

Sample data can be normalized either with a library function such as premnmx() (which by default scales data to [-1, 1]) or with a user-defined function. In this paper, we adopt the Min-Max standardization method shown in equation (7), which maps each variable to the range [0, 1].<br />
$$X(i) = \frac{x(i) - \min\{x(i)\}}{\max\{x(i)\} - \min\{x(i)\}} \qquad (7)$$
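The Min-Max standardization in equation (7) can be sketched in a few lines. The following is a minimal Python illustration rather than the paper's Matlab code; the function name `min_max_normalize` and the raw values are hypothetical, and mapping a constant column to zeros is an assumption that equation (7) does not cover.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] following equation (7)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw purchase totals for five customers (not data from the paper).
raw_totals = [120.0, 80.0, 95.0, 90.0, 60.0]
normalized = min_max_normalize(raw_totals)
print(normalized)
```

Each clustering variable (x1 to x4) would be normalized separately in this way, so that no single variable dominates the distance computation of the network.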



C. Design <strong>of</strong> the SOFM Neural Network<br />

1) Topology Structure<br />

There are three kinds of topology structure: rectangular, hexagonal and random. They are described by the three functions gridtop(), hextop() and randtop() respectively [14]. Here we adopt the 6*4 random topology structure (shown in Fig. 5).<br />

Figure 5. 6*4 Random topology structure<br />

2) Main Programming Codes<br />

We first use the function newsom() to create an SOFM neural network, and then use the functions train() and sim() to train and simulate the newly created network in turn. Different numbers of training steps affect the clustering quality differently. Here, we set the training steps to 1000, 3000, 5000 and 10000 and observe the clustering result for each. The main code is as follows:<br />

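net=newsom(minmax(X),[6,4],'randtop'); % 6*4 SOFM with random topology<br />
a=[1000 3000 5000 10000]; % numbers of training steps to compare<br />
yc=rands(1,10);<br />
for i=1:4<br />
net.trainParam.epochs=a(i); % set the number of training steps<br />
net=train(net,X); % net is not re-initialized, so training continues from the previous run<br />
figure;<br />
w1=net.IW{1,1} % weight matrix of the trained network<br />
plotsom(w1,net.layers{1}.distances); % plot the weight value structure<br />
y=sim(net,X); % simulate the network on the samples<br />
yc=vec2ind(y) % index of the winning neuron for each sample<br />
end<br />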
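To make the training loop above more concrete, here is a minimal Python sketch of a self-organizing map. It is not the Matlab toolbox algorithm and not the improved SOFM of this paper: it assumes a rectangular 6*4 grid instead of the random topology, a Gaussian neighborhood, and linearly decaying learning rate and radius. The names `best_matching_unit` and `train_som` are hypothetical; the samples are the five rows listed in Table III.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))


def train_som(data, rows=6, cols=4, epochs=100, lr0=0.5, seed=0):
    """Train a small SOM on a rectangular grid (an assumption of this sketch)."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood radius shrinks
        for x in data:
            br, bc = grid[best_matching_unit(weights, x)]
            for k, (r, c) in enumerate(grid):
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                weights[k] = [w + lr * h * (xi - w)
                              for w, xi in zip(weights[k], x)]
    return weights


# The five samples (X1..X4) listed in Table III.
samples = [
    [0.9260, 0.9454, 0.9720, 0.9335],
    [0.7549, 0.6950, 0.7273, 0.6918],
    [0.8118, 0.8975, 0.7586, 0.7324],
    [0.7982, 0.6825, 0.8180, 0.7817],
    [0.6532, 0.5816, 0.6609, 0.5141],
]
som = train_som(samples, epochs=50)
winners = [best_matching_unit(som, s) + 1 for s in samples]  # 1-based, like vec2ind
print(winners)
```

The list `winners` plays the role of `yc` in the Matlab code: one neuron index between 1 and 24 per sample.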

D. Analysis on the Result <strong>of</strong> Training and Computing<br />

1) Network’s Weight Value Structure<br />
The performance of the SOFM neural network differs greatly with the number of training steps. In this paper, we consider only four settings, namely 1000, 3000, 5000 and 10000 training steps; the corresponding weight value structures of the network are shown in Fig. 6, Fig. 7, Fig. 8 and Fig. 9 respectively.<br />
Figure 6. Network’s weight value structure (training steps: 1000)<br />
Figure 7. Network’s weight value structure (training steps: 3000)<br />
Figure 8. Network’s weight value structure (training steps: 5000)<br />
Figure 9. Network’s weight value structure (training steps: 10000)<br />
From the above four figures, we can see that the network’s weight value structure becomes comparatively stable when the number of training steps is 5000 or 10000.<br />



2) Network’s Clustering Result<br />
Training and simulating the network with the four different numbers of training steps gives the clustering result shown in Table IV, where only 20 sample records are listed for demonstration.<br />

Training<br />

Steps<br />

1000<br />

3000<br />

5000<br />

10000<br />

Training<br />

Steps<br />

1000<br />

3000<br />

5000<br />

10000<br />

TABLE IV.<br />

NETWORK ’S CLUSTERING RESULT TABLE<br />

1 2 3 4 5<br />

11 12 13 14 15<br />

8 20 8 20 8<br />

15 20 20 8 15<br />

16 19 13 19 13<br />

13 7 19 13 16<br />

11 18 7 19 7<br />

7 11 13 11 13<br />

6 7 8 9 10<br />

16 17 18 19 20<br />

15 20 8 8 20<br />

8 20 8 8 15<br />

7 19 13 7 19<br />

19 13 19 16 7<br />

13 19 7 18 19<br />

18 11 13 19 7<br />

From Table IV, we can observe the following rules:<br />
• When the number of training steps is 1000, all the samples are divided into 1 group.<br />
• When the number of training steps is 3000, all the samples are divided into 2 groups.<br />
• When the number of training steps is 5000, all the samples are divided into 3 groups.<br />
• When the number of training steps is 10000, all the samples are divided into 3 groups.<br />
From Fig. 10, we can also see that the error surface of a single neuron has a unique minimum, so the structure of the improved SOFM NN is comparatively stable. This means that the customer clustering is robust.<br />

Figure 10. Single Neuron’s Error<br />

3) Customers’ Recognition and the Corresponding Marketing Strategies<br />
From the four weight value structure figures (Fig. 6-9) and the clustering result table (Table IV), we can reach a further conclusion: when the number of training steps is 5000 or more, the samples are steadily clustered into 3 groups. Observing these 3 groups and analyzing the customers’ purchasing behavior, we find that each group has its own special features, as illustrated in Table V, where 3 differentiated marketing strategies are suggested for the special features of the 3 groups. Recognizing customers’ features and adopting differentiated marketing strategies can help to reach a win-win result between customers and the business website, increase the loyalty of customers (especially VIPs), and maximize the profit of e-marketing.<br />

TABLE V.<br />
ANALYSIS RESULT OF E-CONSUMERS’ PURCHASING BEHAVIOR<br />
Cluster NO 1: Customers (5.71%), Consumption (0.13%)<br />
• Features of consumers’ purchasing behavior: occasional customers. Most of them are teenagers from different districts of the nation; their total amount of purchase is low, with low income and low shopping frequency. Most of them have a low-level educational status.<br />
• Marketing strategy: these customers deserve the normal service, such as accumulating points for discounts and receiving book information by e-mail, but without free reading of e-books on the Internet.<br />
Cluster NO 2: Customers (74.71%), Consumption (17.86%)<br />
• Features of consumers’ purchasing behavior: main customers. Most of them are youths from different districts of the nation; their total amount of purchase is higher, with middle-level income and higher shopping frequency. They have a middle-level educational status.<br />
• Marketing strategy: these customers deserve the middle-class service, such as subscribing to individualized book information by e-mail, accumulating points for higher discounts, reading some e-books free on the Internet once the amount of purchase accumulates to a certain point, receiving free e-cards or e-flowers on their birthday and so on.<br />
Cluster NO 3: Customers (19.58%), Consumption (82.01%)<br />
• Features of consumers’ purchasing behavior: mostly VIPs, who are young women often from highly developed districts or remote places; their total amount of purchase is the highest, with high or low income and the highest shopping frequency. Most of the VIPs have a middle-level or high-level educational status.<br />
• Marketing strategy: these customers deserve the top service, such as VIP service with free private cyberspace and the fastest green passage, free downloading or reading of some e-books on the Internet, the latest book catalogue in both paper and e-mail form, the best free cards and flowers on their birthday, the highest discount and so on.<br />

Table V is consistent with the Pareto 80/20 principle: about 20% of all customers are VIPs (Cluster NO 3), and they contribute about 80% of the consumption. The table also reveals some interesting phenomena. For example, VIPs are not necessarily customers with high income, most VIPs are young women rather than men, and VIP customers come not only from developed regions but also from less developed regions.<br />
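The 80/20 reading of Table V can be checked with simple arithmetic. The short Python sketch below uses only the percentages reported in Table V; the helper name `spend_index` (share of consumption divided by share of customers) is a hypothetical measure introduced here for illustration, not one defined in the paper.

```python
# (share of customers %, share of consumption %) per cluster, from Table V.
clusters = {1: (5.71, 0.13), 2: (74.71, 17.86), 3: (19.58, 82.01)}


def spend_index(customer_share, consumption_share):
    """Consumption share per unit of customer share (1.0 means average)."""
    return consumption_share / customer_share


indices = {k: spend_index(c, s) for k, (c, s) in clusters.items()}
total_customers = sum(c for c, _ in clusters.values())
total_consumption = sum(s for _, s in clusters.values())

for k in sorted(indices):
    print(k, round(indices[k], 3))
```

Under this measure a Cluster NO 3 customer accounts for roughly four times the average consumption, while the other two clusters fall well below the average.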

IV. CONCLUSION<br />

The well-known economist Christopher pointed out that in today’s unpredictable business competition, the market is no longer on the sellers’ side but on the buyers’ side [14]; as the saying goes, “the customer is god”. Therefore, accurately analyzing consumers’ purchasing behavior on the Internet and making scientific Internet marketing strategies for sales promotion accordingly are key factors in securing the profit of an e-business website. As for how to analyze e-consumers’ purchasing behavior, this paper proposes and compares three kinds of research models, and points out that the data-driving model is the best one for analyzing e-consumers’ purchasing behavior. The SOFM NN is a typical data-driving model, so this paper improves the traditional SOFM NN and uses the improved one as a tool to analyze e-consumers’ purchasing behavior. Because the analysis of e-consumers’ purchasing behavior based on the SOFM neural network is a comparatively novel method, the results of this paper are offered for reference only.<br />

ACKNOWLEDGMENT<br />

The author thanks the anonymous reviewers for their<br />

valuable remarks and comments. This work is supported<br />


by 2010 National Social Science Fund <strong>of</strong> China (Grant<br />

No. 10BGL028), National Natural Science Fund <strong>of</strong> China<br />

(Grant No. 70861002), China Postdoctoral Science Fund<br />

(Grant No. 200902535), 2010 Science and Technology<br />

Project <strong>of</strong> education department <strong>of</strong> Jiangxi Province<br />

(Grant No. GJJ10430), and 2010 Social Science Planning<br />

Project <strong>of</strong> Jiangxi Province (Grant No. 10GL35).<br />

REFERENCES<br />

[1] H.Rubost, “Consumer Behavior <strong>of</strong> Online Procurement<br />

and Book Supply Chain,” Service Operations, Logistics,<br />

and Informatics. May 2005. pp 49-66.<br />

[2] J.P.Peter, and J.C. Olson, “Consumer Behavior and<br />

Marketing Strategy,” McGraw-Hill Press, 2009.<br />

[3] Blanca Hernández, Julio Jiménez, and M. José, “Customer<br />

behavior in electronic commerce: The moderating effect <strong>of</strong><br />

e-purchasing,” <strong>Journal</strong> <strong>of</strong> Business Research, Volume 63,<br />

Issues 9-10, September-October 2010, pp. 964-971.<br />

[4] H. J. Chang, L. P. Hung and C. L. Ho, “An anticipation<br />

model <strong>of</strong> potential customers’ purchasing behavior based<br />

on clustering analysis and association rules analysis,”<br />

Expert Systems with Applications, Vol.32, Issue 3, April<br />

2007, pp. 753-764.<br />

[5] P.W Engel, “A View Coming from Database Management<br />

<strong>of</strong> Consumer’s Behavior,” New York: Dryden Press, 2008<br />

[6] I. C. Yeh, C. H. Lien, T. M. Ting, Y. Y Wang and C. M.<br />

Tu, “Cosmetics purchasing behavior–An analysis using<br />

association reasoning neural networks,” Expert Systems<br />

with Applications, Vol.37, Issue 10, October 2010, pp.<br />

7219-7226.<br />

[7] Simon Haykin, A Comprehensive Foundation, World<br />

publishing house, February 2004.<br />

[8] D J.Willshow, “How Patterned Neural Connections Can<br />

Be Set Up By Self-organizations,” Proc Roy Soc London<br />

B,1976,194: 431-445.<br />

[9] T.Kohonen, “Self-organized Formation <strong>of</strong> Topologically<br />

Correct Feature Maps,” Biological Cybernetic.<br />

1982,43(1):59-69.<br />

[10] FeiSi Science Research Center, Neural Network theory and<br />

Realization in Matlab 7, Beijing: Publishing House <strong>of</strong><br />

Industry Electronics, May 2005, pp. 165-178.<br />

[11] Z. H. Yang and Y. Yan, “Research and Development <strong>of</strong><br />

Self-organizing Maps Algorithm,” Computer Engineering,<br />

2006, 32 (16), pp. 201-228.<br />

[12] Mao Guojun, et al. Principle and Algorithm <strong>of</strong> Data<br />

Mining, Beijing: Tsinghua University Press, 2008.<br />

[13] Huang Lijuan. Yu Guoping. “Research on the Design for<br />

the National Unified E-marketing Platform <strong>of</strong> Chinese<br />

Book Supply Chain,” UESTC Press, 2006.<br />

[14] M.Christopher, “Logistics and Supply Chain<br />

Management,” London: Pitman Publishing House, 1992.<br />

Lijuan Huang was born in Jiangxi Province, China, in February 1971. She holds a Ph.D. in Management Science and Engineering from Nanchang University. Her research interests include e-commerce and logistics and supply chain management.<br />
She is a postdoctoral researcher at Jiangxi University of Finance and Economics.<br />


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1719<br />

Repair Method <strong>of</strong> Complex Network Based on<br />

Matthew Effect<br />

Minsheng Tan<br />

School <strong>of</strong> Computer Science and Technology, University <strong>of</strong> South China, Hengyang Hunan, 421001,China<br />

Email:tanminsheng65@163.com<br />

Qiang Cui, Lingfeng Zhu and Hui Zhao<br />

School <strong>of</strong> Computer Science and Technology, University <strong>of</strong> South China, Hengyang Hunan, 421001, China<br />

Email:{kiteblue@126.com, 407999562@qq.com, zhaohui.1006@yahoo.com.cn }<br />

Abstract — Complex network repair after suffering the<br />

deliberate assault has become extraordinarily important. In this paper, a repair method for complex networks based on the Matthew Effect is proposed. A single-node selective attack algorithm and a multi-node cluster attack algorithm are given. For these two kinds of attack, a linear detection algorithm and a BA network generation algorithm are put forward to obtain the experimental data, and the corresponding repair experiments are carried out. Experimental results show that the repair rate of the method is more than 95% in both the sampled Internet and the BA network. For the repair rate of complex networks, the concept of stability and its mathematical description are presented. Experiments show that a complex network can reach a steady topology state after some steps of attacks and repairs.<br />

Index Terms—Complex Network, Repair Method, Power-law,<br />

Matthew Effect, Stability<br />

I. INTRODUCTION<br />

Complex network repair has become a frontier topic in this field in recent years. Currently, research on this topic is scarce domestically and in its infancy abroad. Complex network repair has no uniform definition, and only connectivity is used to evaluate whether a repair method is good or bad. Repair still lacks unified consideration of factors such as the cost of restoration, the stability of the network and its ability to withstand attacks after repair [1-2].<br />
Repair methods and attack methods are inseparable. Studies on the attack efficiency, damage degree and attack principle of different attack strategies help to find fast and efficient repair strategies. Through repeated attacks and repairs on the network, we can observe changes in network topology, as well as the attack resistance and ease of repair of different types of network topology. Currently, the measures used to assess repair methods reflect only the connectivity of the network topology, not the performance of the network’s run-time communication services, which is precisely one of the greatest concerns of users and<br />

Manuscript received Feb.25, 2011; revised Mar.5, 2011; accepted Apr.<br />

2, 2011.<br />

project number: 60572137, 10JJ9025, 2009GK3036 , 10C1185.<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1719-1725<br />

managers [3-5]. The exploration of repair strategies for complex networks has important theoretical significance and application value. This paper proposes a new repair method for complex networks, based on the Matthew Effect, for power-law networks.<br />

II. MATTHEW EFFECT<br />

The Matthew Effect is the phenomenon that the good become better and the bad become worse, the many get more and the few get less; its name comes from a parable in the Gospel of Matthew in the Bible [6]. In 1968, the American historian of science Robert Merton proposed the term to summarize a social psychological phenomenon. Merton interpreted the "Matthew Effect" as follows: any individual, group or region that achieves success and progress in one respect (such as money, fame or status) gains a cumulative advantage and thus has more opportunities to achieve greater success and progress [7-9].<br />
The generation process of the real Internet is a true portrayal of the Matthew Effect in practice: when a new node is added to the network, it tends to connect to network nodes that have larger degrees [10-11].<br />
The BA network model reflects the Matthew Effect well. Its generation process takes the following two characteristics into account [12]:<br />
• Growth: the network keeps growing larger.<br />
• Preferential attachment: new nodes tend to connect to "big" nodes, that is, nodes with a high degree.<br />
Figure 1 shows the evolution process of a BA network when m = m0 = 2.<br />

Figure 1 Formation process <strong>of</strong> BA network
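The two characteristics above, growth and preferential attachment, can be illustrated with a small generator. This is a minimal Python sketch, not the generation algorithm used later in this paper: it assumes m0 = m initial nodes that are fully connected, and it samples attachment targets from a list in which every node appears once per incident edge, so that the probability of being chosen is proportional to degree. The function name `generate_ba` is hypothetical.

```python
import random
from collections import Counter


def generate_ba(n, m=2, seed=0):
    """Return the edge list of a BA-style network with n nodes (requires 2 <= m <= n)."""
    if m < 2 or n < m:
        raise ValueError("this sketch requires 2 <= m <= n")
    rng = random.Random(seed)
    # Assumption: start from m0 = m fully connected nodes (one edge when m = 2).
    edges = [(i, j) for i in range(m) for j in range(i + 1, m)]
    # Each node appears in stubs once per incident edge (degree-weighted pool).
    stubs = [v for e in edges for v in e]
    for new in range(m, n):  # growth: add one node at a time
        targets = set()
        while len(targets) < m:  # preferential attachment to m distinct nodes
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((t, new))
            stubs.extend((t, new))
    return edges


edges = generate_ba(n=100, m=2, seed=1)
degree = Counter(v for e in edges for v in e)
print(len(edges), max(degree.values()))
```

With m = m0 = 2, as in Figure 1, the network starts from a single edge and every later node brings two new edges, so early nodes tend to accumulate the largest degrees.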



III. REPAIR METHOD OF COMPLEX NETWORK<br />

BASED ON MATTHEW EFFECT<br />

This article studies how the Matthew Effect, which underlies scale-free complex networks, can be applied to network repair. We mainly consider two kinds of sustained attack: a single-node selective attack (for example, deleting the network node with the maximum degree) and a multi-node cluster attack (for example, a one-time attack on 30% of the network nodes of moderate degree). While one or more nodes are deleted preferentially at a rate r, reconnection takes place: the nodes that lose neighbors reconnect to other nodes to replace the lost ones, and each attacked node is reconnected to the network as a new node. Under a sustained linear preferential attack, this compensation dynamic leads to a power-law degree distribution with an exponential truncation that depends on the rate of preferential deletion. Thus, when the node with the maximum degree is attacked, the compensation mechanism can still preserve the exponent of the power-law distribution. Even under a high rate of preferential attack, or attacks on nodes of large degree, as long as new nodes connect to the network randomly with m ≥ 2, the network is able to maintain a large connected component, and lost connections are no longer a lasting result of the sustained attack. The repair method considered here changes over time and is described as follows:<br />

A. Repair Algorithm under Single-node Selective Attack<br />

For a given network topology, the node degrees are first counted, and nodes are then attacked selectively according to their degree. Because attacks and repairs change the degrees of the network nodes, the degrees must be recounted in real time, so this repair process changes over time. The steps of the recovery algorithm are as follows:<br />

(1) Count the degree of each node in the network and store the result in d;<br />
(2) According to the degree statistics from step (1), attack a node in the network (assuming the network nodes are numbered from 1 onwards, if a(i, 1) == M(k) or a(i, 2) == M(k) is true for an edge i, then the edge (i, j) connecting this node is deleted);<br />
(3) Recount the degree of each node in the network;<br />
(4) Find the maximum degree in the network and count the nodes that have it: p ← max (d (i, 1));<br />
(5) Remove the node M;<br />
(6) Recount the node degrees, the maximum degree and the number of nodes with the maximum degree;<br />
(7) Generate the largest connected sub-graph f, recount the degrees of the nodes in the network and count the number of nodes in f;<br />
(8) Repeat steps (1) - (7); the algorithm ends when the number of network nodes and the average degree tend to balance;<br />
(9) Calculate the repair rate.<br />
Input: a connected network with N nodes and a certain number of edges.<br />
Output: the connected network with N nodes and the repair rate s (r).<br />
Here d denotes the matrix storing the node degrees, n the number of network nodes, m the number of edges in the network, (i, j) an edge of the network, a the network before each attack or repair, M (k) a node with degree k, M ∈ (1 ... n) the attacked node, p the maximum node degree obtained in step (4), and f the largest connected sub-graph of the restored network.<br />
In the attack and repair processes of both algorithms, the Matthew Effect is used for linear preferential removal and repair in the network. After each attack or repair, the node degrees are recounted so that nodes are attacked and repaired with linear preference. The time complexity is $O(n^3)$.<br />
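One attack step of the single-node selective attack (steps (1) to (7) above, without the reconnection that repairs the network) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the network is given as an edge list, ties for the maximum degree are broken by the smallest node number, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def node_degrees(edges):
    """Count the degree of every node that appears in the edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def attack_max_degree(edges):
    """Remove the node of maximum degree and all edges attached to it."""
    deg = node_degrees(edges)
    # Assumption: ties are broken by the smallest node number.
    target = max(sorted(deg), key=lambda v: deg[v])
    remaining = [(u, v) for (u, v) in edges if u != target and v != target]
    return target, remaining


def largest_component_size(nodes, edges):
    """Size of the largest connected sub-graph, found by breadth-first search."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            x = queue.popleft()
            size += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        best = max(best, size)
    return best


# Toy network: node 1 is the hub.
toy_edges = [(1, 2), (1, 3), (1, 4), (4, 5)]
target, remaining = attack_max_degree(toy_edges)
survivors = sorted({1, 2, 3, 4, 5} - {target})
print(target, remaining, largest_component_size(survivors, remaining))
```

In the toy network the hub (node 1) is removed, leaving nodes 2 and 3 isolated and nodes 4 and 5 connected; the repair step would then reconnect the affected nodes before the next attack.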

B. Repair Algorithm under Multi-node Cluster Attack<br />

According to their degree, network nodes are divided into central nodes, sub-central nodes, intermediate-degree nodes and small-degree nodes; each attack targets a batch of nodes of one of these kinds. The specific algorithm is as follows:<br />
(1) A new node is added to the network and attached to m nodes of the network by linear preferential attachment;<br />
(2) According to node degree, the nodes are classified as central nodes, sub-central nodes, intermediate-degree nodes and small-degree nodes; then n nodes w1, w2, w3 ... wn are selected from these kinds of nodes at different rates r and deleted, in the following steps:<br />
• Remove the nodes w1, w2, w3 ... wn and all their edges; then w1, w2, w3 ... wn are each connected, as new nodes, to m nodes of the network;<br />
• Each node that was connected to w1, w2, w3 ... wn has lost an edge, so a random edge is added to it to compensate;<br />
(3) Repeat steps (1) and (2) until the numbers of network nodes and edges become balanced.<br />
Input: a connected network with N nodes and a certain number of edges.<br />
Output: the connected network with N nodes and the repair rate s (r).<br />

In the repair method based on the Matthew Effect, a node is first added and attached to m nodes of the network by linear preferential attachment. This ensures that most connection information is carried by the new nodes, which attach to network nodes with linear preference and thereby preserve the power-law property of the network. The attack and repair process is similar to a natural growth process, so in terms of the topology of the whole network, the topology after attack and attachment will change strikingly.<br />

The repair rate in the process is $s(r) = \frac{N_r}{N}$, where $N_r$ denotes the number of nodes in the network after repair and $N$ denotes the number of nodes before.<br />
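As a worked example of this formula, the short Python sketch below converts the s(r) values reported for the sampled Internet in Table I back into node counts, using N = 1028 as stated in Section V. The function name `repair_rate` is hypothetical, and rounding to the nearest integer is an assumption made because the table values are rounded to six decimals.

```python
N = 1028  # number of nodes before the attack (sampled Internet)


def repair_rate(nodes_after_repair, nodes_before):
    """s(r) = N_r / N."""
    return nodes_after_repair / nodes_before


# s(r) values from Table I, in the order of the listed ratios r.
s_values = [1.0, 0.999027, 0.998054, 0.997082, 0.990272, 0.955253]
nodes_after = [round(s * N) for s in s_values]
print(nodes_after)
```

Read this way, even the worst case in Table I (s(r) = 0.955253) keeps 982 of the 1028 nodes in the repaired network.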

IV. TWO KEY ALGORITHMS GETTING<br />

EXPERIMENT DATA<br />

Considering that Internet topology has an important influence on its resistance to destruction, and in order to study Internet topology better, the NSF (National Science Foundation of America) funds the National Laboratory for Applied Network Research to measure and analyze Internet topology. The original measurement results include the AS-level Internet topology, which truly reflects the state of Internet connections. Taking into account the authenticity of the network simulation and the constraints of the experimental hardware, the experiments that validate the effectiveness of the repair methods in this paper were run on the Matlab simulation platform. To make the simulation closer to the real Internet model, we used real network statistics. The specific steps for obtaining the experimental data are as follows:<br />

A. Algorithm <strong>of</strong> Sampling from Actual Network.<br />

b=zeros(37447,3); % columns 1-2: edge endpoints, column 3: visited flag<br />
for n=1:37447<br />
b(n,1)=data(n,1);<br />
b(n,2)=data(n,2);<br />
end<br />
a=zeros(2400,2); % sampled edges<br />
k=1;<br />
a(1,1)=b(4513,1); % start from edge 4513<br />
a(1,2)=b(4513,2);<br />
for m=1:50<br />
for i=m+1:37447<br />
for j=1:2<br />
% take an unvisited edge that shares an endpoint with the last sampled edge<br />
if (b(i,j)==a(k,1)||b(i,j)==a(k,2))&&b(i,3)==0<br />
k=k+1;<br />
a(k,1)=b(i,1);<br />
a(k,2)=b(i,2);<br />
b(i,3)=1; % mark the edge as visited<br />
end<br />
end<br />
end<br />
end<br />

Input: a network with tens of thousands of nodes.<br />
Output: a network with about one thousand nodes.<br />
First, a matrix is imported from the measured data; starting from one node, a connected sub-graph is detected linearly and kept in another matrix. Because the node numbers of this sub-graph are not continuous, its nodes must be renumbered from 1 to make the numbering continuous.<br />
The actual data come from http://moat.nlanr.net/routing/rawda-ta (the total number of edges in the network is 37,448, the total number of nodes is 26589, the number of nodes of degree zero is 13010, the maximum degree is 2637 and the average degree is 5.515576). The network is sampled by the detection method (after sampling, the number of edges is 2358, the number of nodes is 1028, the maximum degree is 191 and the average degree is 4.587549). Figure 2 shows that the result obeys the degree distribution of the real network: nodes of degree 1~5 account for about 80% of the network nodes, and nodes of large degree are few.<br />
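The average degrees quoted above follow from the relation average degree = 2E / N, since every edge contributes to the degree of two nodes. The Python sketch below recomputes both figures from the reported counts. For the full network it assumes that the 13010 nodes of degree zero are excluded from the average; this is an interpretation, not something the text states, although it does reproduce the reported value. The function name `average_degree` is hypothetical.

```python
def average_degree(num_edges, num_nodes):
    """Each edge adds one to the degree of two nodes, hence 2E / N."""
    return 2.0 * num_edges / num_nodes


# Sampled network: 2358 edges, 1028 nodes.
sampled = average_degree(2358, 1028)

# Full network: 37448 edges, 26589 nodes, of which 13010 have degree zero.
# Assumption: isolated nodes are left out of the average.
full = average_degree(37448, 26589 - 13010)

print(round(sampled, 6), round(full, 6))
```

Both results agree with the figures in the text (4.587549 and 5.515576) to six decimal places.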

B. Generation of BA Scale-free Networks<br />
I) Initializing the network<br />
nodes ← zeros (N)<br />
cii ← zeros (1, N)<br />
t ← zeros (1, N)<br />
for i ← 1: m<br />
nodes (i,m+1) ← 1<br />
nodes (m+1,i) ← 1<br />
list (i) ← i<br />
end<br />
for i ← m+1: 2*m<br />
list (i) ← m+1<br />
end<br />
II) Adding nodes and edges to the network, and adding 2m entries per step t to the auxiliary vector list<br />
for n ← m+2: N<br />
t ← 2*m*(n-m-1)<br />
for i ← 1: m<br />
list (t+i) ← n<br />
end<br />
k ← 1<br />
while k0&p(k)


Figure 2 Internet degree distribution sampling (horizontal axis: degree, 0 to 200; vertical axis: 0 to 0.5)<br />

Figure 3 BA network degree distribution<br />

Figure 4 Sampling Internet degree and BA network degree<br />

distribution in Logarithmic Coordinates<br />

V. EXPERIMENT PROCESS AND RESULT<br />

ANALYSIS<br />

To validate the effectiveness of the repair methods in this paper, the experiments were run on the Matlab simulation platform. To make the simulation closer to the real Internet model, we used real network statistics.<br />

A. Repair Process under Single-node Selective Attack<br />

Following the algorithm of the previous section, single-node selective attack and repair were applied to the sampling Internet and the BA network as follows:<br />

(1) A new node is added to each of the sampling Internet and the BA network and connected to the m (where m = 3) nodes with maximum degree in that network;<br />

(2) From the sampling Internet, nodes are selected at ratios r = 0.0125, 0.03, 0.2, 0.33, 0.5, 1: node 40 (degree 3), node 1019 (degree 6), node 632 (degree 14), node 1015 (degree 21), node 599 (degree 25) and node 457 (degree 191). Each time, one of them is attacked and all edges connecting it are removed. From the BA network, nodes are selected at ratios r = 0.0047, 0.04, 0.17, 0.33, 1, 1: node 114 (degree 3), node 85 (degree 7), node 194 (degree 10), node 127 (degree 15), node 13 (degree 24) and node 3 (degree 69). Again, each time one of them is attacked and all edges connecting it are removed;<br />

(3) Each attacked node has priority to connect to m (m = 1) nodes, while each of those nodes loses one edge;<br />

(4) Steps (1), (2) and (3) are repeated until the number of nodes in the network remains at 1027, the average degree of the sampling Internet remains at about 4.5, and the average degree of the BA network holds steady at about 3.8;<br />

(5) At this point, the connectivity rate and power-law exponent of both networks and the index sharp cut of the sampling Internet are calculated. The connectivity rate s(r) of the Internet and of the BA network equals the total number of nodes in the network after the repair divided by 1028. The results are shown in Table I, Table II and Table III.<br />
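A single attack-and-repair step and the connectivity rate described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the authors' MATLAB code: the adjacency-set representation and the function names `attack`, `repair` and `connectivity_rate` are hypothetical, the repair target is taken to be the highest-degree remaining node (ties broken by the smallest node ID), the edge that each repair target loses in step (3) is not modelled, and the connectivity rate is interpreted as the size of the largest connected component divided by the original node count.

```python
from collections import deque


def attack(adj, v):
    """Remove every edge incident to node v; the node itself stays in the graph."""
    for u in adj[v]:
        adj[u].discard(v)
    adj[v] = set()


def repair(adj, v, m=1):
    """Reconnect v to the m highest-degree other nodes (assumed Matthew Effect rule)."""
    candidates = [u for u in adj if u != v]
    # Sort by descending degree; break ties by the smallest node ID.
    candidates.sort(key=lambda u: (-len(adj[u]), u))
    for u in candidates[:m]:
        adj[v].add(u)
        adj[u].add(v)


def connectivity_rate(adj, n_total):
    """Largest connected component size divided by n_total (assumed reading of s(r))."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            x = queue.popleft()
            size += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        best = max(best, size)
    return best / n_total
```

For the 1028-node network of the experiments, a largest component of 1027 nodes would give 1027/1028 ≈ 0.999027, which matches the second entry of the s(r) row in Table I.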

Table I RATE OF CONNECTIVITY OF THE SAMPLING NETWORK s (r) AND THE POWER-LAW k<br />

r 0.0125 0.03 0.2 0.33 0.5 1<br />

s(r) 1.0 0.999027 0.998054 0.997082 0.990272 0.955253<br />

dmax 192 192 192 192 192 122<br />

dave 4.585 4.578 4.555 4.520 4.525 4.354<br />

k 2.372 2.384 2.467 2.352 2.664 2.572<br />

Table II RATE OF CONNECTIVITY OF THE BA NETWORK s (r) AND THE POWER-LAW k<br />

r 0.0047 0.04 0.17 0.33 1 1<br />

s(r) 0.995136 0.995136 0.995136 0.996109 0.995136 0.995236<br />

dmax 55 55 55 55 54 55<br />

dave 3.858 3.848 3.836 3.81 3.767 3.767<br />

k 3 3 3 3 3 3<br />

Table III INDEX SHARP CUT OF THE SAMPLING INTERNET<br />

r 0.0 0.01 0.03 0.05 0.07<br />

Kc(r) 27 20 14 12 8<br />

© 2011 ACADEMY PUBLISHER



B. Repair Process under Multi-node Cluster Attack<br />

Following the algorithm of the previous section, multi-node cluster attack and repair were applied to the sampling Internet and the BA network as follows:<br />

(1) A new node is added to each of the sampling Internet and the BA network and connected to the m (where m = 3) nodes of maximum degree;<br />

(2) In the Internet subnet, the following nodes are attacked: all central nodes, which make up 1% of the total number of nodes; 10% and 50% of the sub-central nodes, which make up 3% of the total; 10% and 50% of the nodes of intermediate degree; and 10% and 50% of the nodes of small degree, which make up 60% of the total. The same proportions of nodes were also tested in the BA model;<br />

(3) Each attacked node has priority to connect to m (m = 1) nodes in the network, while these m nodes lose edges;<br />

(4) Steps (1), (2) and (3) are repeated until the number of nodes in the network remains at 1027, the average degree of the sampling Internet remains at about 4.5, and the average degree of the BA network holds steady at about 3.8;<br />

(5) At this point, the connectivity rate and power-law exponent of both networks and the index sharp cut of the sampling Internet are calculated. The results are shown in Table IV, Table V and Table VI.<br />

C. Analysis <strong>of</strong> Experimental Results<br />

Table I and Table IV show that the repair method achieves a very high repair rate on the sampled real Internet: even when the nodes of maximum degree are attacked, or nodes are subjected to a cluster attack, a simple repair keeps the network connectivity rate above 95%, and above 99% when nodes of ordinary degree are attacked. Table II and Table V show that, for different r, the repair rate of the BA network is more than 99%. Table I, Table II, Table IV and Table V again confirm that applying the Matthew Effect to construct a BA network generates a topology very close to that of the real Internet. However, although the Internet follows the Matthew Effect as it is built, it is also affected by other factors, and studying these factors would benefit research on the real Internet. Table I and Table II also show that the average degree of the network decreases while the repair rate remains high, which indicates that the repair method tends to remove redundant edges from the network. Table I, Table II, Table IV and Table V show that, with this method, a network with a power-law degree distribution maintains both its power law and a high repair rate, and that the original topological properties are also well preserved. Table III and Table VI show that the index sharp cut appears in the sampling Internet at the different selection proportions, which again illustrates that the real network, in addition to following the Matthew Effect, is affected by other factors as it is built.<br />

Table IV RATE OF CONNECTIVITY OF THE SAMPLING NETWORK s (r) AND THE POWER-LAW k AFTER REPAIR<br />

Node class Central node Next central node Intermediate value node Small scale value node<br />

Node ratio <strong>of</strong> r 50% 10% 50% 10% 50% 10% 50%<br />

s(r) 1.0 0.999027 0.988054 0.997082 0.980272 0.9955 0.9936<br />

dmax 167 178 192 192 192 192 192<br />

dave 4.585 4.578 4.555 4.520 4.525 4.354 4.237<br />

k 2.37 2.42 2.54 2.28 2.46 2.41 2.39<br />

Table V RATE OF CONNECTIVITY OF THE BA NETWORK s (r) AND THE POWER-LAW k AFTER REPAIR<br />

Node class Central node Next central node Intermediate value node Small scale value node<br />

Node ratio <strong>of</strong> r 50% 10% 50% 10% 50% 10% 50%<br />

s(r) 0.998 0.999027 0.998054 0.997082 0.990272 0.9975 0.99<br />

dmax 173 189 192 192 192 192 0.99<br />

dave 4.585 4.578 4.555 4.520 4.525 4.354 4.37<br />

k 3 3 3 3 3 3 3<br />

Table VI INDEX SHARP CUT OF THE SAMPLING INTERNET AFTER REPAIR<br />

Node class Central node Next central node Intermediate value node Small scale value node<br />

Node ratio <strong>of</strong> r 50% 10% 50% 10% 50% 10% 50%<br />

k 14 9 12 21 23 19 17<br />

VI. THE STABILITY OF COMPLEX NETWORK IN THE REPAIR PROCESS<br />

Using the complex network repair algorithm based on the Matthew Effect, this paper studied and compared random networks, scale-free networks and small-world networks, and found that all of them can evolve into a state of equilibrium. The stability S(t) is therefore introduced to describe the extent to which the system is repaired and how easily the network can be repaired.<br />

Current international and domestic studies on the destruction of complex networks are still limited to robustness, that is, the capacity of a complex network to withstand external damage. This is the first study of the characteristics of complex networks under destruction and repair.<br />

In general, the maximal connected sub-graph of the network tends to a stable value in the process of constant attack and repair, and the network is then said to have reached a steady state.<br />

Let N(t) denote the size of the sub-graph of the network as it changes over time, giving the sequence N(t0), N(t1), N(t2), N(t3), …, N(tn). The stability S(t) is then defined as:<br />

$$S(t) = \frac{N(t_0)}{N(t_n)}, \quad n \to \infty \tag{1}$$<br />

That is to say, the stability S(t) is the ratio of the size of the network to the size of the largest connected sub-graph of the final network in the process of constant attack and repair. N(tn) denotes the size of the largest connected sub-graph of the network, and N(t0) denotes the size of the network. As can be seen from the definition, S(t) is a step-wise increasing function with initial value 1, so its value is always greater than or equal to 1.<br />

S(t) reflects, to some extent, the stability of the network topology. The greater S(t) is, the more easily the system reaches a steady state after repair. Relative to the topology at other times, the topology at this time is easier to repair.<br />
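The definition in equation (1) can be illustrated with a short Python sketch. It assumes that the sizes of the largest connected sub-graph have been recorded after each attack-and-repair round, with the first entry being the size of the intact network; the function names `stability` and `stability_series` and the sample sizes used in the usage note are hypothetical and are not taken from the experiments.

```python
def stability(sizes):
    """Return N(t0) / N(tn) for a recorded sequence of sub-graph sizes.

    sizes[0] is the size of the original network, N(t0); sizes[-1] is the
    size of the largest connected sub-graph after the last round, N(tn).
    """
    if not sizes:
        raise ValueError("at least one recorded size is required")
    if sizes[-1] <= 0:
        raise ValueError("the final sub-graph size must be positive")
    return sizes[0] / sizes[-1]


def stability_series(sizes):
    """Return the value of S after each round, i.e. N(t0) / N(tk) for every k."""
    return [stability(sizes[: k + 1]) for k in range(len(sizes))]
```

For example, `stability_series([1028, 1027, 1020, 1000])` starts at 1.0 and never drops below 1 as long as the largest connected sub-graph does not grow beyond the original network, which agrees with the property stated above.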

Figure 5 shows the evolution of the stability with the time step t for the sampled Internet of size N = 1028, with repair probability and connection probability Pr = PC = 0.02.<br />

We find that S(t) grows very fast at the beginning; as the evolution proceeds, its growth gradually slows and it eventually reaches a balance value Sb. The gradual increase of S(t) means that, through a series of attacks and repairs, the system becomes more balanced and good restoration results are achieved more easily.<br />

One point is worth noting here: the stability S(t) finally reaches the balance value Sb. In equilibrium, the value of S(t) is the largest in the system. This implies that the system reaches a vulnerable state after thousands of evolution steps.<br />

Figure 5 Stability s(t) versus time t, where t is the number of repairs and s(t) is the stability<br />

VII. CONCLUSION<br />

Single-node selective attacks and multi-node cluster attacks are the most difficult attacks to deal with in complex networks. For these two attacks, this paper proposed a repair method for complex networks based on the Matthew Effect. Experimental results show that the repair rate of the proposed method under attack reaches 95% or more on both the sampling Internet and the BA network. Applying the idea of building the BA network to the repair of power-law networks not only yields a high repair rate but also optimizes the network topology. To characterize the level of complex network repair, we also proposed the concept of stability and described it mathematically. Experimental results show that, after several steps of attack and repair, a complex network can gradually evolve into a relatively stable state, in which it is easily repaired.<br />

The Matthew Effect increases the efficiency of information exchange in the network, but it also brings problems for network security. If network nodes of large degree are attacked, the probability increases that some nodes in the network cannot connect with others. We will consider this issue in future research.<br />

ACKNOWLEDGEMENT<br />

This work was supported by Project 60572137 of the National Science Foundation, Project 10JJ9025 of the Hunan Natural Science Foundation, Project 2009GK3036 of the Hunan Science and Technology Plan and Project 10C1185 of the Hunan Province of Science Research.<br />

REFERENCES<br />

[1] Wang Xiao-fang, Li Xiang, Chen Guan-rong. Complex network theory and application[M]. Beijing: Tsinghua University Press, 2006: 11-14.<br />

[2] Carreras B A, Newman D E, Dobson I, et al. Evidence for self organized criticality in electric power system blackouts[C]. Thirty-fourth Hawaii International Conference on System Sciences. Maui, Hawaii, 2001: 705-709.<br />

[3] Wu Jun, Tan Yue-jin. Complex network anti-destroying ability estimation research[J]. System Engineering Journal, 2005(2): 128-131.<br />

[4] Chen Zhen-yi, Wang Xiao-fang. Congestion and control in scale-free network[J]. System Engineering Journal, 2005, 20(1): 132-138.<br />

[5] Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the Internet topology[J]. ACM SIGCOMM Computer Communication Review, 1999, 29(4): 251-262.<br />

[6] Albert R, Barabási A L. Statistical mechanics of complex networks[J]. Reviews of Modern Physics, 2002, 74(1): 47-97.<br />

[7] Barthelemy M, Amaral L A N. Small-world networks: Evidence for a crossover picture[J]. Phys. Rev. Lett., 1999, 82: 5180-5184.<br />

[8] Erdos P, Renyi A. On the evolution of random graphs[J]. Publ. Math. Inst. Hung. Acad. Sci., 1960, 5: 17-60.<br />

[9] Watts D J, Strogatz S H. Collective dynamics of small-world networks[J]. Nature, 1998, 393(6684): 440-442.<br />

[10] Holme P, Kim B J, Yoon C N, et al. Attack vulnerability of complex networks[J]. Phys. Rev. E, 2002, 65(5): 056109.<br />

[11] Xiao Zhong-zhe, Dong Zai-Wang. Improved GIB synchronization method for OFDM system[J]. IEEE Telecommunications, 2003, 2(8): 1417-1421.<br />

[12] Criado R, Flores J, Hernández-Bermejo B, et al. Effective measurement of network vulnerability under random and intentional attacks[J]. Journal of Mathematical Modelling and Algorithms, 2005, 4(3): 307-316.<br />

[13] Che Hong-an, Gu Ji-fa. Scale-free network and its system scientific significance[J]. System Engineering Theory and Practice, 2004(4): 11-16.<br />



Minsheng Tan was born in Hunan Province, China, in September 1965. He is a master's supervisor and graduated from the Department of Computer Science, Wuhan University. His research interests include computer networks and information security. He is a professor at the School of Computer Science and Technology, University of South China. Prof. Tan is a member of ACM, a senior member of the China Computer Society, a director of the Hunan Computer Society, an executive director of the Hunan Computer Committee of Higher Education Institute, and an executive director of the Hunan Computer Users Association.<br />

Qiang Cui was born in Shandong Province, China, in November 1981. He holds a master's degree and graduated from the School of Computer Science and Technology, University of South China. His main research interest is complex networks.<br />

Lingfeng Zhu was born in Hunan Province, China, in June 1984, and is working toward a master's degree in computer science at the University of South China. The main research interests include computer networks and information security.<br />

Hui Zhao was born in Henan Province, China, in October 1986, and is working toward a master's degree in computer science at the University of South China. The main research interest is trusted networks.<br />



Study and Design an Anycast Routing Protocol for Wireless Sensor Networks<br />

Demin Gao<br />

Department of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China<br />

Email: gdmnj@163.com<br />

Huanyan Qian, Zheng Wang, Jiguang Chen<br />

Department of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China<br />

Email: ninanan@tom.com, wangzheng@163.com, chenjiguang@163.com<br />

Abstract—In wireless sensor networks, there is usually a sink<br />

which gathers data from the battery-powered sensor nodes. As sensor nodes around the sink consume their energy faster than the other nodes, several sinks have to be deployed to increase the network lifetime. Anycast is a mechanism by which the source node sends its data to the nearest sink node. This paper studies and designs an anycast service for deploying several sinks in a wireless sensor network. A novel anycast tree-based approach is proposed to minimize the path cost. Here the nodes form a tree with a sink node as the root, while the height of the tree integrates multiple metrics to calculate path cost based on diverse selection criteria. This paper discusses and analyzes the model in depth. The experimental data demonstrate its validity and efficiency. Computer simulation shows that the proposed scheme effectively reduces and balances the energy consumption among the nodes, so it significantly extends the network lifetime compared to the existing schemes.<br />

Key words: Wireless sensor networks; Anycast; Routing<br />

protocol<br />

I. INTRODUCTION<br />

Wireless sensor networks have received a great deal of attention in recent years owing to their promising techniques and wide-ranging applications. Such a network consists of a large number of low-cost, low-power, small, multifunctional sensor nodes that can sense and process data and communicate with other nodes over short distances. In many applications of wireless sensor networks, a sink node and numerous tiny sensor nodes are deployed randomly in the monitoring area. As the scale of the network increases, nodes close to the sink consume their energy faster than nodes farther away. When all the nodes around the sink have exhausted their energy, the sink node can no longer receive any data from the sensors or stay connected to the network. When this happens, the whole network is considered to be down. In addition, sensor nodes are deployed in remote or dangerous areas where servicing a node may be impossible. A solution to these problems is to deploy several sinks and let each sensor node send its data to the nearest sink node. If the traffic is balanced among the sinks, the network lifetime can be significantly increased, since the energy consumption will be almost equal for all the nodes in the network.<br />

doi:10.4304/jnw.6.12.1726-1733<br />

Internet Protocol Version 6 (IPv6) defines a new addressing scheme called the "Anycast address", which is an identifier for a set of interfaces [1, 2]. A data packet sent to an Anycast address is routed to the "nearest" interface. Routing protocols can be roughly classified into unicast, broadcast, multicast, and anycast [3]. Anycast technology is now widely studied in wireless networks, and Anycast communication becomes quite important in a network with multiple sinks. Anycast can be an important paradigm for a wireless sensor network in terms of resources, robustness and efficiency for replicated service applications. Assuming that the sources and the sinks are uniformly distributed in the network, having each source send its data packets to the "nearest" sink around the area in which the events happen reduces the number of hops a packet travels, which saves energy, reduces the cost of routing table maintenance and extends the network lifetime. This simple strategy is assumed to balance the energy consumption. When a sensor node produces data, it has to send it to any available sink. A sink selection strategy is to choose a sink for each source arbitrarily.<br />

This paper addresses the sink discovery and routing problem in sensor networks. Generic routing protocols designed for wireless ad hoc networks fail in sensor networks primarily because they are designed for more powerful nodes, with higher transmission range and power than sensors. In addition, the packet structure, routing table sizes, implemented code size and much of the other state they maintain cannot be ported directly to tiny sensors. This paper describes a protocol implementing the anycast service. It constructs an anycast tree that is rooted at the sink and contains many sensor nodes as leaves. The objective is to select a minimum-cost path for every sensor node. The paper is organized as follows. Section II presents a number of existing Anycast solutions, Section III specifies the network model and energy model used, and Section IV presents our anycast protocol. Section V contains experimental results. Conclusions are presented in Section VI.<br />

II. RELATED WORKS<br />

The concept of anycast has been studied in multiple contexts, including network type, communication model, and purpose of usage. For example, anycast has been studied in depth in TCP/IP networks, where it is used for directing DNS queries to the closest root name server [4]. It is also used for server selection in distributed systems [5]. Anycast gained more attention when it was used to access gateways that interconnect IPv6 and IPv4 networks. Although Anycast was originally designed for Internet services, it has been applied to routing protocol design for wireless ad hoc and sensor networks. In mobile networking, some Anycast routing protocols are derived from existing routing protocols that were extended to support the Anycast service.<br />

In [6, 7], the AODV protocol is used to support the Anycast service. AODV is an on-demand reactive routing protocol designed for ad hoc networks. When there are packets to transmit, the source node initiates the route establishment process. It is suitable for mobile nodes. In addition, Anycast routing protocols based on a tree structure [8, 9, 10] build an Anycast tree, usually extending the tree in units of hop count, physical distance or time interval. A query is transported along the most suitable Anycast tree. Routing and sink discovery protocols designed for ad hoc networks are not well suited to sensor networks.<br />

Low-Energy Adaptive Clustering Hierarchy (LEACH) [11] is one of the representative clustering schemes. In LEACH, sensors are organized into clusters, and one node in each cluster acts as the cluster head, taking responsibility for collecting data, aggregating it and finally transmitting it to the distant sink. The lifetime of heterogeneous wireless sensor networks can be increased in networks with more than one data sink when access to the sinks is provided by an Anycast protocol [12]. Such a network consists of two types of devices: resource-rich devices (information sinks) and resource-constrained devices (sensors generating new data) [13]. A similar concept for improving the energy efficiency of WSNs has been proposed in the HAR protocol [14]. All the above anycast solutions differ from the one in this paper. In each of them, the set of attributes used as the anycast address is not a singleton.<br />

Usually, a node sends data to the nearest sink rather than to a specific one, which differs from TCP/IP networks and ad hoc networks. Another type of anycast found in the WSN environment is anycasting to a region. Solutions such as SPEED [15] and HLR [16] assume a situation where it is sufficient to deliver a packet to any node in a specified area. Algorithms for region-targeted anycast rely on the strong spatial correlation of the attributes used for addressing, which is not the case in this paper.<br />

To improve the performance of Anycast routing while taking the characteristics of wireless sensor networks into account, this paper puts forward a routing algorithm for wireless sensor networks based on the Anycast tree. Some protocols are simplified to suit wireless sensor network applications. The algorithm establishes an Anycast tree for each sink node, and each sensor node joins the Anycast tree nearest to it. Applications require minimizing certain cost metric(s), such as energy consumption, to optimize performance. Thus, applications require the use of multiple metrics for path cost calculation to guarantee performance. Based on the multiple-metric path cost specified by the application requirement, the path with the minimum cost value is selected as the best route. This algorithm can balance the network load well, extend the lifetime of the whole network and improve the performance of the Anycast routing algorithm.<br />

III. SYSTEM MODEL AND PATH SELECTION<br />

We first discuss the topology model, energy model and path selection metrics used in the proposed routing scheme.<br />

A. Topology Model<br />

Consider a static wireless network modelled as an undirected graph $G = (V, A)$, where V is the set of sensor nodes and sink nodes and A is the set of links. A graph is simple if it has no loops and no two of its links join the same pair of vertices. An acyclic graph is one that contains no cycles. A tree is a connected acyclic graph. A sink tree is a tree with a sink node as the root and sensor nodes as leaves. G consists of a finite nonempty vertex set V and an edge set A of ordered pairs of distinct vertices of V. A leaf is a vertex of degree 1. Two nodes i and j are connected by a link if they can transmit a packet to each other with a transmission power less than the maximum transmission power at each node. Thus all links are assumed to be bi-directional. This assumption is not necessary for the convergence of the distributed algorithms; however, it makes the presentation clearer. The set of nodes connected to node i by links is denoted as $N_i$. We assume that the network graph is connected, i.e., there always exists a path between any pair of nodes i and j in V.<br />

We consider a wireless sensor network that contains a number of sensor nodes and multiple sinks distributed randomly in a given region. The sensor nodes transmit the information they have collected to the sink node. We make the following assumptions about the sensor nodes and the underlying network model:<br />

• All sensor nodes start with the same initial energy. The sink node has no energy constraint.<br />

• Every node is aware of its own location. A sensor node can compute the approximate distance to the source based on the received location information.<br />

• The transmitting power of a sensor node is controllable, which means it can be adjusted according to the transmitting distance.<br />

• Sink and sensor nodes are static. All nodes are homogeneous and have the same capabilities. Each node except the sink nodes is assigned a unique identifier (ID), and all sink nodes form an anycast group sharing one ID.<br />

These assumptions are reasonable given the development and progress of wireless hardware technology and low-power computing technology.<br />

B. Energy Model<br />

The power consumption of a sensor node consists of four parts: sensing and generating data, idling, receiving, and transmitting. The power $e_g$ for generating one bit of data is assumed to be the same for all nodes. The idle power consumed by a node, denoted by $e_s$, is assumed to be the same for all nodes and independent of traffic. For power consumption in receiving and transmitting, the first-order radio model adopted in [17-19] is used. Specifically, a node needs $\varepsilon_{elec} = 50\ \mathrm{nJ}$ for running the circuitry and $\varepsilon_{amp} = 100\ \mathrm{pJ/bit/m^2}$ for the transmitting amplifier. Therefore, the power consumption for receiving one bit of data is given by $e_r = \varepsilon_{elec}$. The power consumption for transmitting one bit of data to a neighbor node j is given by $e_{ij} = \varepsilon_{elec} + \varepsilon_{amp} \cdot d_{ij}^{n}$, where $d_{ij}$ is the distance between nodes i and j and n is the path loss exponent, which typically ranges between 2 and 4 for free-space and short-to-medium-range radio communication. Let $E_i$ denote the initial battery energy of node i and $w_i$ denote the fraction of power consumption for one bit of data:<br />

$$w_i = e_s + e_g + e_r + e_{ij} \tag{1}$$<br />

where the first term is the idling power consumption, the second term is the power for sensing, the third term is the power consumption for receiving and the last term is the power consumption for transmitting.<br />
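As a worked illustration of the energy model, the following Python sketch evaluates the receive cost, the transmit cost and the per-bit cost $w_i$ of equation (1). It is a sketch under stated assumptions rather than the authors' simulation code: the function names are hypothetical, $\varepsilon_{elec}$ is treated as 50 nJ per bit, and the idle and sensing costs $e_s$ and $e_g$ are passed in as parameters because the text gives no numerical values for them.

```python
E_ELEC = 50e-9    # circuitry energy, assumed to be joules per bit
E_AMP = 100e-12   # amplifier energy in joules per bit per m^2 (for n = 2)


def rx_energy_per_bit():
    """Energy to receive one bit: e_r = eps_elec."""
    return E_ELEC


def tx_energy_per_bit(distance_m, path_loss_exp=2):
    """Energy to transmit one bit over distance_m: e_ij = eps_elec + eps_amp * d^n."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return E_ELEC + E_AMP * distance_m ** path_loss_exp


def per_bit_cost(e_s, e_g, distance_m, path_loss_exp=2):
    """Per-bit cost w_i = e_s + e_g + e_r + e_ij from equation (1)."""
    return (e_s + e_g + rx_energy_per_bit()
            + tx_energy_per_bit(distance_m, path_loss_exp))
```

With n = 2 and a 10 m link, the amplifier term is 100 pJ × 100 = 10 nJ, so transmitting one bit costs about 60 nJ, compared with 50 nJ for receiving it. Note that the unit of $\varepsilon_{amp}$ quoted in the text matches n = 2 only; for other exponents the constant would need a different unit.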

C. Path Selection<br />

A simple linear combination of different routing metrics is used to determine the path cost, as shown in the following equation:<br />

$$\varphi = \varphi' + \sum_{i \in V} \alpha_i \cdot metric_i \qquad (2)$$<br />

where $\varphi'$ is the accumulated cost of the previous nodes along the path, $metric_i$ is a value scaled to the interval (0, 1), and $\alpha_i$ is the weight factor (or coefficient) applied to $metric_i$ when calculating the cost. Depending on the application requirements, these weight factors can be varied to change the importance of each cost metric during route discovery. Our protocol adopts four path cost metrics: hop count, energy cost, data delay, and remaining energy. Therefore, the path<br />

cost equation becomes:<br />

$$\varphi = \varphi' + \alpha_1 \cdot hop_i + \alpha_2 \cdot w_i + \alpha_3 \cdot delay + \alpha_4 \cdot E_i \qquad (3)$$<br />

Here, $hop_i = 1$ is the hop count; the energy cost $w_i$ is the normalized energy cost of the link from the previous hop to the current node; the data delay is the time for transmitting the data from one node to the next; and $E_i$ is the surplus energy. Different applications can define their requirements by choosing different sets of weight factors. For example, an application that only wants to consider energy consumption would use $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (0, 1, 0, 0)$. To demonstrate how different requirements and path cost metrics guide route discovery and resource consumption, simulations with three different network deployments are conducted. This cost model is used to choose the next node.<br />
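As a rough illustration of Eq. (3), the per-hop cost update can be written as a small Python function. The class and parameter names below are hypothetical, and the metric values are assumed to be already scaled to (0, 1) as the text requires.<br />

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Weight factors (alpha_1..alpha_4); the field names are assumptions."""
    hop: float = 1.0
    energy: float = 1.0
    delay: float = 1.0
    remaining: float = 1.0


def path_cost(prev_cost: float, w_i: float, delay: float, e_i: float,
              weights: Weights = Weights(), hop: float = 1.0) -> float:
    """Eq. (3): add this hop's weighted metrics to the accumulated cost."""
    return (prev_cost
            + weights.hop * hop
            + weights.energy * w_i
            + weights.delay * delay
            + weights.remaining * e_i)


# An application that only considers energy consumption, as in the text.
ENERGY_ONLY = Weights(hop=0.0, energy=1.0, delay=0.0, remaining=0.0)
```

With `ENERGY_ONLY`, only the normalized energy cost of each link is added to the accumulated cost; the other three metrics drop out.<br />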

D. Problem Definition<br />

The core of an anycast routing protocol for wireless sensor networks is to select the “nearest” sink as the destination. The problem of optimal sink selection can be formulated as follows. Consider $n$ sources $\{s_1, s_2, \ldots, s_n\}$ and a group of $k$ sink nodes, where $1 \le k \le n$. The problem is to assign the $n$ sources to the $k$ sink nodes so that the total path cost of the network is minimized. This can be formulated as a 0-1 integer programming problem:<br />

$$\min \sum_{i=1}^{n} \sum_{j=1}^{k} \varphi_{ij} \lambda_{ij} \qquad (4)$$<br />

subject to<br />

$$\sum_{j=1}^{k} \lambda_{ij} = 1 \quad (1 \le i \le n) \qquad (5)$$<br />

$$\lambda_{ij} \in \{0, 1\} \quad (1 \le i \le n,\ 1 \le j \le k) \qquad (6)$$<br />

where $\varphi_{ij}$ is the path cost of the best route between node $i$ and node $j$, and $\lambda_{ij}$ is a binary variable used for sink selection: if the best sink node chosen for node $i$ is node $j$, then $\lambda_{ij} = 1$; otherwise $\lambda_{ij} = 0$. Constraint (5) states that node $i$ transmits all its packets to exactly one sink.<br />
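Because constraints (5) and (6) couple no two sources, the program in (4) separates by source: the total is minimized when each source independently picks its cheapest sink. The sketch below shows this under the assumption that the best-route costs $\varphi_{ij}$ are already available as a matrix; the function name and the data layout are illustrative only.<br />

```python
from typing import List, Sequence, Tuple


def assign_sinks(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Solve (4)-(6) when cost[i][j] holds the path cost from source i to sink j.

    Returns the chosen sink index for each source (the j with lambda_ij = 1)
    and the resulting total path cost. Ties go to the lowest sink index.
    """
    assignment = [min(range(len(row)), key=row.__getitem__) for row in cost]
    total = sum(row[j] for row, j in zip(cost, assignment))
    return assignment, total


# Hypothetical costs for 3 sources and 3 sinks.
example_cost = [
    [3.0, 1.0, 2.0],
    [0.5, 4.0, 4.0],
    [2.0, 2.0, 1.0],
]
example_assignment, example_total = assign_sinks(example_cost)
```

For the hypothetical matrix above, the sources select sinks 1, 0, and 2 respectively, giving a total cost of 2.5.<br />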

IV. ANYCAST ROUTING PROTOCOL<br />

The proposed anycast routing scheme employs a tree-based approach to distribute the energy load evenly among the sensors in the network and thus reduce data transfer time. One objective of our protocol is to establish a connection, based on multiple path selection metrics, between sensor nodes and sinks that belong to an anycast group. The selected sink can then forward packets to the destination in the core network.<br />

A. Packet Format<br />

Four types of control packets are designed for our protocol, as explained in this section. The Hello packet (HELLO) is a small, special type of packet generated only by the sink nodes. It is broadcast periodically to advertise the presence of a sink to all sensor nodes, in particular to sensor nodes whose routing tables hold no valid route to any member of the anycast group. The traditional Route Request (RREQ), Route Reply (RREP), and Route Error (RERR) packets are stripped of fields that are unnecessary for a WSN, such as the reserved fields, the multicast flags, the prefix field, and the lifetime field. The transmission range of a mobile platform covers all sensor nodes, including those more than one hop away, so sensor nodes do not need to retransmit the Hello message. Nodes that receive the Hello packet cache its information.<br />

The Route Request packet (RREQ) is generated to initiate route discovery. It differs from the corresponding packet in TCP/IP protocols such as AODV in two major ways. First, instead of a unicast address, the packet carries the anycast group ID as its destination address. Second, two more fields are added to accommodate application requirements and to use multiple metrics as the path cost. In our protocol, the RREQ comes in two forms, CRQ (Child Request) and PRQ (Parent Request): a CRQ is used to discover a child node and a PRQ to discover a parent node.<br />

The data packet format in our protocol is defined as follows:<br />

(Type, Anycast group ID, Path cost φ, Next node’s ID, Node’s address)<br />

If Type = 1, the packet is a CRQ; if Type = 2, it is a PRQ. An Anycast group ID of 0 indicates that the packet comes from a sink; otherwise it comes from a sensor node. An empty Next node’s ID indicates that the packet comes from a sensor node that has not yet discovered a route to a sink. A node does not need to remember its child nodes, because it never transmits messages to them. The sink transmission range can cover all sensor nodes. Node’s address identifies the node, for example by its node ID and position.<br />
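The five-field packet format above maps naturally onto a small record type. The Python sketch below is one possible representation; the class, field, and method names are hypothetical, and representing an empty field as `None` is an assumption made here rather than something the paper specifies.<br />

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class PacketType(IntEnum):
    CRQ = 1  # Child Request
    PRQ = 2  # Parent Request


@dataclass
class RequestPacket:
    """(Type, Anycast group ID, Path cost, Next node's ID, Node's address)."""
    ptype: PacketType
    anycast_group_id: Optional[int]   # 0 means the packet comes from a sink
    path_cost: float                  # accumulated path cost (phi)
    next_node_id: Optional[int]       # None models an empty field
    node_id: int                      # node's address: ID ...
    position: Tuple[float, float]     # ... and position

    def from_sink(self) -> bool:
        """True when the anycast group ID marks the sender as a sink."""
        return self.anycast_group_id == 0

    def sender_has_route(self) -> bool:
        """False when Next node's ID is empty, i.e. no route to a sink yet."""
        return self.next_node_id is not None


# A CRQ sent directly by a sink: path cost 0 and Next node's ID 0.
sink_crq = RequestPacket(PacketType.CRQ, 0, 0.0, 0, 7, (1.0, 2.0))
# A PRQ from a sensor node that has not found a route yet.
orphan_prq = RequestPacket(PacketType.PRQ, None, 0.0, None, 9, (0.0, 0.0))
```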

The Route Reply packet (RREP) is generated by sinks or sensors in response to the corresponding RREQ packets. Its destination anycast group ID field identifies the anycast group to which the destination node belongs, and its accumulated path cost field holds the cost accumulated along the path from the destination node to the source node. The Route Error packet (RERR) is the same as that of the AODV protocol.<br />

B. Establishing an Anycast Tree<br />

The sensor nodes are distributed randomly in the monitoring area. There are multiple sink nodes and $n$ sensor nodes. The anycast group, which contains all sink nodes, is assigned the identifier ID = 0. Every sink node can construct an anycast tree rooted at itself, and sensor nodes obtain anycast services from that tree. The protocol starts by creating a number of spanning trees. In this model, a sensor node that wants to become an anycast member must first join an anycast tree, which it does through the following process:<br />

1) Every sink node broadcasts a CRQ query to its neighbor nodes within a small range. The CRQ contains the location information of the sink node, the ID of the anycast group, and the path cost φ from a sink to the node that sent the CRQ. If the CRQ comes directly from a sink node, the value of φ is zero and Next node’s ID is zero.<br />

2) When a neighbor node that has not yet joined any anycast tree receives the CRQ, it accepts the CRQ and checks the anycast group ID to determine whether the CRQ comes from a node of a tree. If it does, the node appends the CRQ to its father node table and records the father node's relevant parameters, including the location information, the path cost φ, and the ID of the anycast group it is asked to join. If the ID in the CRQ does not belong to a node of any anycast tree, the node discards the CRQ.<br />

3) After receiving a CRQ, the node sets a timer whose interval may be chosen according to the current network status. The node may receive more than one CRQ within this interval. When the timer expires, the node compares the path costs φ in the received CRQs, selects the neighbor node with the minimum path cost φ as its father node, records the information of its father node, and returns an RREP to it. If several received CRQs share the same minimum path cost φ, the node selects one of those neighbor nodes as its father node at random.<br />

4) After the father node receives the joining message, it returns an ACK message to the child node. Owing to the characteristics of the algorithm, each node only needs to retain the information of its father node; the father node does not need to record information about the child node. This differs from TCP/IP, which records the child’s IP address. The child node replaces Next node’s ID in the CRQ with its father’s ID, recalculates the path cost φ from the sink to itself, and replaces the path cost φ in the CRQ with the new value. At the same time, it updates the relevant parameters (position, etc.) and broadcasts the CRQ to the next hop. This continues until all nodes have joined an anycast tree, as shown in Fig. 1.<br />

Figure 1. An anycast tree is established from all sensors to a sink.<br />
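The outcome of the four steps above can be approximated offline with a multi-source shortest-path search. The sketch below is not the distributed, timer-driven protocol itself: it swaps in Dijkstra's algorithm started from all sinks at once, which yields the same father choice only under the assumption that every node's timer is long enough to hear its cheapest CRQ. The graph layout, the per-link cost values, and all names are illustrative assumptions.<br />

```python
import heapq
from typing import Dict, Hashable, Iterable, Optional, Tuple

Node = Hashable
Entry = Tuple[Optional[Node], float, Node]  # (father, path cost, root sink)


def build_anycast_trees(neighbors: Dict[Node, Dict[Node, float]],
                        sinks: Iterable[Node]) -> Dict[Node, Entry]:
    """Give every reachable node a father, a path cost, and a root sink.

    neighbors[u][v] is the cost added when a CRQ travels from u to v
    (assumed non-negative). Sinks start with cost 0 and no father.
    """
    best: Dict[Node, Entry] = {s: (None, 0.0, s) for s in sinks}
    heap = [(0.0, s) for s in best]
    heapq.heapify(heap)
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node][1]:
            continue  # stale entry: a cheaper CRQ was already recorded
        root = best[node][2]
        for nbr, link in neighbors.get(node, {}).items():
            new_cost = cost + link
            if nbr not in best or new_cost < best[nbr][1]:
                best[nbr] = (node, new_cost, root)  # nbr adopts node as father
                heapq.heappush(heap, (new_cost, nbr))
    return best


# Hypothetical topology with two sinks (s1, s2) and three sensor nodes.
example_graph = {
    "s1": {"a": 1.0},
    "s2": {"b": 0.5, "c": 3.0},
    "a": {"s1": 1.0, "b": 1.0},
    "b": {"a": 1.0, "c": 1.0, "s2": 0.5},
    "c": {"b": 1.0, "s2": 3.0},
}
example_trees = build_anycast_trees(example_graph, ["s1", "s2"])
```

In this hypothetical topology, node `a` joins the tree rooted at `s1`, while `b` and `c` join the tree rooted at `s2`, with `c` choosing `b` as its father because the two-hop route is cheaper than the direct link.<br />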

C. New Node Joins the Anycast Tree<br />

A sensor node that wants to share the anycast service must join an anycast tree. To do so, it broadcasts a joining message (PRQ) to its neighbor nodes. The PRQ contains the location information of the node. A neighbor node that receives the PRQ and has already joined an anycast tree accepts the PRQ and returns a CRQ. The CRQ contains the location information of the neighbor node, the ID of the anycast group, and the path cost φ from the sink node to this neighbor node.<br />

The node that sent the joining message accepts the CRQ and appends it to its father node table together with the neighbor node's relevant parameters, including the location information and the path cost φ from the sink node to that neighbor. The node then sets a timer and waits for further CRQs during the interval. When the timer expires, the node compares the path costs φ of the CRQs in its father table, selects the node with the minimum path cost φ as its father node, recalculates the path cost φ from the sink to itself, replaces the path cost φ in the CRQ with the new value, and returns an RREP to its father node. If several received CRQs share the minimum path cost φ, the node selects one of those neighbor nodes as its father node at random. The father node receives the RREP and sends an ACK message to the child node, which completes the join.<br />
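The father selection that runs when the timer expires is a minimum search with a random tie-break. A minimal Python sketch follows, assuming the father node table is available as a list of (neighbor ID, path cost) pairs; the function name and the table layout are hypothetical.<br />

```python
import random
from typing import Hashable, List, Optional, Tuple


def select_father(father_table: List[Tuple[Hashable, float]],
                  rng: random.Random = random.Random()) -> Optional[Hashable]:
    """Pick the neighbor whose CRQ advertised the minimum path cost.

    father_table holds one (neighbor_id, path_cost) pair per CRQ received
    before the timer expired. Equal minimum costs are resolved at random,
    and an empty table returns None (no CRQ was heard).
    """
    if not father_table:
        return None
    min_cost = min(cost for _, cost in father_table)
    candidates = [nbr for nbr, cost in father_table if cost == min_cost]
    return rng.choice(candidates)
```

For instance, `select_father([("n3", 2.0), ("n6", 3.5)])` returns `"n3"`, because its advertised path cost is lower.<br />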

D. Node Departure or Failure<br />

Because sensor node power is constrained, the energy of some nodes will eventually be exhausted and those nodes become invalid. There are three cases when a node fails.<br />

1) The failed node is a leaf of the anycast tree and has no child. When the father node receives no information from the node within a preset time interval, the node is considered to have failed. This case is simple: as shown in Fig. 2, if v5 fails, v4 does not need to do anything or revise any of its own information.<br />

2) The failed node is an intermediate node with a child node. In Fig. 2, v2 is such an intermediate node. If v2 fails, v4 and v5 are disconnected from v1, and the data that v4 and v5 have collected cannot be transmitted to the sink node. In this case, v4 should broadcast a joining message (PRQ) to its neighbor nodes, such as v3 and v6. The process is the same as that of a new node joining an anycast tree, described in Section C above. In Fig. 2, v4 receives CRQs from v3 and v6 and selects v3, which has the minimum path cost φ, as its father node because $\varphi_3 < \varphi_6$.<br />

3) The failed node is the sink node, i.e., the root of the anycast tree. The data collected by the nodes of the tree can no longer be transmitted to the sink node, and the anycast tree becomes invalid. All nodes restart the tree creation process described in Section B above.<br />

Figure 2. When node v2 fails, node v4 is disconnected from the sink node s1 and should rebuild its connection through node v3.<br />

E. Anycast Tree<br />

After tree construction is complete, every node has joined an anycast tree, and many anycast trees exist, as shown in Fig. 3. In this phase, every node sends its collected data to its parent node. Every parent node receives data from its child nodes, fuses the data with its own, and forwards the result to its own parent node along the anycast tree. When the data from all member nodes of the anycast tree have been received, the sink node applies data fusion to the received data and then sends the fused data to the Internet or other devices. Sensor nodes do not know in advance which sink node their data is sent to, but the data is certain to reach some sink.<br />

A notable feature of the proposed anycast routing protocol is that several trees are constructed instead of one, which allows more distributed operation among the nodes. Building the trees according to the path cost further strengthens this effect, resulting in more balanced energy consumption and data delay among the nodes and a longer network lifetime in the long run.<br />

Data collected by sensor nodes may contain redundant information due to spatiotemporal correlation. It is therefore desirable to aggregate the data at the sink to remove the redundant information. However, correlated data may be transmitted to different sinks. If the sinks forward such redundant information to the Internet or other devices, the frequent communication is vulnerable to wiretapping and causes serious transmission interference. In our paper, data correlation is taken into account, and the data received by every sink is aggregated. All sinks form a tree, and one sink is selected at random as the root sink. The root sink gathers the data from all sinks and aggregates them. An example is shown in Fig. 3.<br />

Figure 3. Multiple anycast trees are established to cover all sensor nodes, and one sink is selected as the root sink. (Legend: source node, middle node, sink node, root sink, link, pseudo link.)<br />

V. SIMULATIONS<br />

In this section, the performance of the anycast routing protocol is evaluated via computer simulation and compared with other schemes such as AODV [6] and LEACH [11]. Assume that there are 100 nodes, comprising 5 sink nodes and 95 sensor nodes, distributed randomly in a 100×100 region. The simulation parameters are given in Table 1. The transmission power of every node is adjustable, and nodes adjust it as needed to communicate with other nodes. Any two nodes within transmission range of each other can communicate directly.<br />

Fig. 4 shows the network topologies obtained by the different schemes. The topology of LEACH is shown in Fig. 4(a). The transmission distance from every sensor node to the sink is one hop, and there are no transmissions between sensor nodes. Data collected by sensor nodes are transmitted directly from the member nodes to the cluster-head, so sensor nodes consume more energy because of the many long-distance transmissions. The topology of the anycast routing protocol of this paper is shown in Fig. 4(b). Every sensor node joins an anycast tree according to the path cost, and data is transmitted along the tree from the sensor leaves to the root sink. Observe that the proposed scheme displays a more balanced and distributed network pattern.<br />

TABLE 1. THE PARAMETERS USED IN THE SIMULATION<br />

Parameter | Value | Parameter | Value<br />
Size of target area | 100×100 | Data packet size | 512 byte<br />
Number of sink nodes | 5 | Metadata packet size | 25 byte<br />
Number of sensor nodes | 95 | Maximum radius, R | 20 m<br />
Initial energy | 10 J | α1 | 1<br />
ε_elec | 50 nJ/bit | α2 | 1<br />
ε_amp | 50 nJ/bit/m2 | α3 | 1<br />
e_s | 100 nJ/s | α4 | 1<br />

Figure 4. The network topology with different protocols: (a) LEACH; (b) anycast routing protocol.<br />

For a network flow $f$, let $f_{ij}$ denote the rate of information flow from node $i$ to node $j$. The energy spent by node $i$ to transmit a unit of information directly to node $j$ is $e_{ij}$. The lifetime of node $i$ under flow $f_{ij}$ is then given by<br />

$$T_i = \frac{E_i}{w_i \cdot f_{ij}} = \frac{E_i}{(e_s + e_g + e_r + e_{ij}) \cdot f_{ij}}$$<br />
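The lifetime expression is a single division, but a small helper makes the units explicit. This is a sketch with hypothetical names; the 10 J initial energy matches Table 1, whereas the per-bit cost and the flow rate in the example are made-up illustrative inputs, not simulation values from the paper.<br />

```python
def node_lifetime(initial_energy_j: float, per_bit_cost_j: float,
                  flow_rate_bps: float) -> float:
    """Lifetime in seconds: E_i / (w_i * f_ij).

    initial_energy_j is E_i in joules, per_bit_cost_j is w_i in joules per
    bit, and flow_rate_bps is f_ij in bits per second.
    """
    if per_bit_cost_j <= 0 or flow_rate_bps <= 0:
        raise ValueError("per-bit cost and flow rate must be positive")
    return initial_energy_j / (per_bit_cost_j * flow_rate_bps)


# Illustrative inputs: 10 J battery, 110 nJ per bit, 1000 bit/s.
example_lifetime_s = node_lifetime(10.0, 110e-9, 1000.0)
```

With these illustrative inputs the node drains 0.11 mJ per second, so its lifetime is roughly 90,900 s. Doubling the flow rate halves the lifetime, which is why balancing traffic across the trees matters.<br />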

Fig. 5 shows the network energy consumption and lifetime as the number of sensor nodes varies, with 2, 3, 4, 5, and 6 sinks deployed respectively. As the network size increases, the total energy consumption rate of the network rises and the network lifetime gradually falls. As the number of sinks increases, the total energy consumption rate decreases and the network lifetime increases. At the same time, the additional lifetime gained from each newly added sink gradually shrinks. As the number of nodes increases, the distance between nodes shortens, data correlation increases, and lower transmission power is needed, while the routing algorithm effectively balances the data traffic load among nodes, which helps extend the network lifetime. New sinks added to the network reduce the distance between nodes, so the network lifetime can be prolonged. However, each additional sink only affects the routes near the sinks, so its impact on the network becomes smaller and the lifetime gain diminishes.<br />

Figure 5. The network energy consumption and lifetime when the number of sensor nodes varies and 2, 3, 4, 5, or 6 sinks are deployed: (a) energy consumption (nJ); (b) network lifetime (s).<br />



The delay time is defined as the interval between the transmission of a packet by the source and the reception of the same packet by the sink. As Fig. 6(a) shows, the delay time of AODV is the longest. When packets collected by sensor nodes need to be transmitted to a sink node, these sensor nodes must first initiate route establishment, and the time this takes is counted in the delay. AODV also tries to create a route to a single sink, and thus spends more time than the other two methods. Our protocol and LEACH are proactive routing protocols: the route has been established before the packets are transmitted, so the packets can be sent to a sink node in the shortest time. Our protocol is slightly better than LEACH. In LEACH, data are transmitted directly from the member nodes to the cluster-head, so many long-distance transmissions are required within a cluster, and their number increases as the network grows. In our protocol, by contrast, the node with the minimum path cost is selected as the father node, so our routes are better than those of the other two protocols. In particular, when the transmission rate is high, the performance of our protocol is 5% better than that of LEACH.<br />

Fig. 6(b) compares energy consumption over time. As time passes, more and more packets are transmitted to the sink nodes and the energy consumption increases. Compared with AODV and LEACH, our protocol consumes less energy, and its advantage grows over time. As expected, our protocol has the best performance. AODV transmits more packets than the others because it sends a route request and rebuilds the route every time newly collected packets need to be sent to a sink node, and therefore consumes more energy. The gap between the protocols widens as time goes on. Both LEACH and our protocol perform better than AODV, and our protocol is slightly better than LEACH. The main reason is that the communication radius in LEACH may be very large, whereas our protocol considers multiple path cost metrics and selects the node with the minimum path cost as the father node, which reduces both the energy consumption and the data delay. As discussed previously, this is because anycast can exploit the wireless advantage when there is more than one sink node.<br />

VI. CONCLUSIONS<br />

In this paper, an anycast routing protocol based on an anycast tree scheme is proposed for wireless sensor networks, aiming at energy-efficient data transfer and reduced average delay time. A tree is formed for each sink node, and every node sends its collected data to its parent node along the anycast tree. The structure of each anycast tree is determined by the path cost from the nodes to the sink. Some protocol packets are simplified to suit wireless sensor network applications. A data packet can be sent to the nearest sink node along the anycast tree. Multiple metrics are used to guide route discovery and sink selection. The node with the minimum path cost is selected as the father node and forwards the packet, which minimizes the energy required for communication between the nodes and the sinks. Simulation results show that the proposed scheme reduces the delay time and balances the energy consumption among the nodes, and thus significantly extends the network lifetime compared with existing schemes.<br />

Figure 6. The comparison of delay and energy consumption: (a) delay versus packet transfer rate; (b) energy consumption over time.<br />

REFERENCES<br />

[1] Weber S, Cheng L. A survey of Anycast in IPv6 networks. IEEE Communications Magazine, 2004, 42(1): 127-132.<br />

[2] Doi S, Ata S, Kitamura H. Protocol design for Anycast communication in IPv6 network. Proceedings of 2003 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM’03). New York, USA: IEEE Press, 2003: 470-473.<br />

[3] Jia W, Zhou W, Kaiser J. Efficient algorithm for mobile multicast using anycast group. IEEE Proc. Communications, 2001, 148(1): 14-18.<br />

[4] Abley J. Hierarchical Anycast for Global Service Distribution (2003).<br />

[5] Freedman M J, Lakshminarayanan K, Mazieres D. OASIS: Anycast for any service. Proceedings of the 3rd Symposium on Networked Systems Design and Implementation, San Jose, CA (May 2006).<br />

[6] Swaminathan S, Huo J, Liu F. An Anycast Routing Protocol for Ad-Hoc Networks. http://www.cs.ucsb.edu/ebelding/, 2003-03.<br />

[7] Wang J, Zheng Y, Jia W. A-DSR: A DSR-based Anycast Protocol for IPv6 Flow in Mobile Ad Hoc Networks. IEEE Proc. VTC2003, 2003.<br />

[8] Thepvilojanapong N, Tobe Y, Sezaki K. HAR: hierarchy-based anycast routing protocol for wireless sensor networks. Proceedings of Symposium on Applications and the Internet Workshops. 2005: 204-212.<br />

[9] Wang Xiao-nan, et al. Routing protocol for wireless sensor networks based on Anycast. Application Research of Computers, 2009, 7(7): 2695-2697.<br />

[10] Koziuk M, Domaszewicz J. Tree-based anycast for wireless sensor/actuator networks. Proceedings of the 9th International Conference on Distributed Computing and Networking, Lecture Notes in Computer Science. Kolkata, India, 2008.<br />

[11] Heinzelman W R, Chandrakasan A, Balakrishnan H. Energy-Efficient Communication Protocol for Wireless Micro-sensor Networks. Proceedings of the Hawaii International Conference on System Science, Maui, Hawaii, 2000.<br />

[12] Hu W, Bulusu N, Jha S. A communication paradigm for hybrid sensor/actuator networks (2004).<br />

[13] Hu W, Chou C T, Jha S, Bulusu N. Deploying long-lived and cost-effective hybrid sensor networks (2004).<br />

[14] Thepvilojanapong N, Tobe Y, Sezaki K. HAR: Hierarchy-based anycast routing protocol for wireless sensor networks. SAINT 2005: Proceedings of the 2005 Symposium on Applications and the Internet (SAINT 2005), pp. 204-212. IEEE Computer Society, Los Alamitos (2005).<br />

[15] He T, Stankovic J A, Lu C, Abdelzaher T F. A spatiotemporal communication protocol for wireless sensor networks. IEEE Transactions on Parallel and Distributed Systems 16, 995-1006 (2005).<br />

[16] Bian F, Govindan R, Shenker S, Li X. Using hierarchical location names for scalable routing and rendezvous in wireless sensor networks. SenSys 2004: Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, pp. 305-306. ACM Press, New York (2004).<br />

[17] Heinzelman W R, Chandrakasan A, Balakrishnan H. Energy-Efficient Communication Protocol for Wireless Micro-sensor Networks. Proceedings of the Hawaii International Conference on System Science, Maui, Hawaii, 2000.<br />

[18] Lindsey S, Raghavendra C S. PEGASIS: Power-efficient gathering in sensor information systems. Proc. of the IEEE Aerospace Conf., Canada, March 2002, pp. 1-6.<br />

[19] Satapathy S S, Sarma N. TREEPSI: tree based energy efficient protocol for sensor information. Wireless and Optical Communications Networks, 2006 IFIP International Conference, April 2006.<br />

Demin Gao, Shandong Province, China. Birthdate: September 1980. He received the M.S. degree in computer application technology from Jingdezhen Ceramic Institute, Jingdezhen, Jiangxi, China, in 2008. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks and data aggregation in wireless sensor networks.<br />

Huanyan Qian, Jiangsu Province, China. Birthdate: October 1950. He is currently a professor in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His current research interests include sensor networks, mobile communication, and wireless communication networks.<br />

Zheng Wang, Jiangsu Province, China. Birthdate: September 1980. He received the M.S. degree in computer application technology from Nanjing University of Science and Technology, Nanjing, Jiangsu, China, in 2007. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks.<br />

Jiguang Chen, Henan Province, China. Birthdate: February 1982. He received the M.S. degree in Education from Henan Normal University, Xinxiang, Henan, China, in 2008. He is pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include routing protocols for wireless sensor networks.<br />



Management Model Research of Low-power Wireless Sensor Network<br />

LinGe Wang<br />

Ningbo Dahongying University, College of Software, Ningbo, 315175, China<br />

Email: Wanglingew@163.com<br />

YueDou Qi<br />

Ningbo Dahongying University, College of Software, Ningbo, 315175, China<br />

Email: yuedouqi@sohu.com<br />

Abstract—Nowadays most <strong>of</strong> the wireless sensor network<br />

management models have a short lifetime because their nodes transfer management information to each other, which consumes energy too fast. This paper presents a new model based on mobile agents for wireless sensor network cluster management. The model makes up for the shortcomings of current wireless sensor network management architectures: current models fail to consider that the information reports of each node can consume a lot of energy and thereby reduce the network lifetime. The mobile agent-based wireless sensor network management model inherits the merits of the traditional models and gives full consideration to the energy characteristics of the nodes. Analysis shows that the proposed model has advantages over traditional models in energy saving, data integration, topology control, and so on.<br />

Index Terms—wireless sensor network, mobile Agent, Low<br />

energy consumption<br />

I. INTRODUCTION<br />

With the rapid development and increasing sophistication of communication, embedded computing and sensor technology, sensor networks composed of thousands of micro-sensors, each capable of sensing, computing and communicating, have aroused great interest. A sensor network integrates sensor, embedded computing, networking and wireless communication technologies into a new technology for information acquisition and processing, and is widely used in national defense and the military, environmental monitoring, traffic management, medicine and health care.<br />

Agent technology developed from artificial intelligence. An agent system is a loosely coordinated system that represents the trend of distributed software development and is more flexible and intelligent. As a new software technology, agent software has made considerable progress and is used in many areas such as Internet information retrieval, information collection, e-commerce, data mining, integrated manufacturing and so on. The nodes of a wireless sensor network are usually powered by batteries whose energy is limited; the communication, computation and storage capabilities of the nodes are limited as well, which raises challenges for the hardware and software design of wireless sensor networks.<br />

Manuscript received Mar. 25, 2011; revised Apr. 15, 2011; accepted Apr. 20, 2011.<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1734-1739<br />

II. WIRELESS SENSOR NETWORK ARCHITECTURE<br />

The composition of a wireless sensor network system is shown in Figure 1. A large number of sensor nodes are randomly distributed in the monitoring area and form a network in a self-organizing way. Each node performs not only data collection but also routing. The collected data is transmitted hop by hop to the sink node and then passed to the Internet. The task manager node manages, classifies and processes the information, which is finally presented to the users.<br />

Figure 1. Wireless Sensor Network<br />

In earlier communication schemes for wireless sensor networks, the network nodes collect raw data and send it directly to a central node, which performs the signal processing tasks. This centralized approach wastes a great deal of bandwidth. At the same time, because a large amount of information has to be forwarded, the nodes near the center soon deplete their energy.<br />

A sensor network is a network system that integrates monitoring, control and wireless communication. It has a large number of nodes (thousands) with a dense distribution; nodes fail easily because of environmental effects and energy depletion; environmental interference and node failures can easily change the network topology; and most sensor nodes are typically stationary. In addition, the energy, processing, storage and communication capabilities of a sensor node are very limited. Resource allocation, power management, computing and network topology discovery therefore have to be considered together in a wireless sensor network, and energy consumption in particular. On the one hand, the energy consumption of the sensor nodes should be minimized. On the other hand, when a node runs out of energy, the network should be able to discover a new topology, isolate the dead node and generate a new path to complete data collection and processing. This shows two important elements of wireless sensor networks: topology discovery and reduced energy consumption. Taking reduced energy consumption as a precondition, this paper explores the following three aspects of wireless sensor networks in order to achieve route discovery while ensuring network reliability:<br />

The clustering model is based on mobile agents, so the first question is how to solve the clustering problem.<br />

In the cluster model, how to choose the cluster head node.<br />

In the cluster model, once the cluster head node is identified, how to solve the routing problem, that is, how the mobile agent moves between the nodes within the same cluster.<br />

III. ANALYSIS OF EXISTING MANAGEMENT MODEL<br />

A. MANNA<br />

Advantages:<br />

MANNA is groundbreaking research that provides a complete network management system specific to sensor networks. Building on the SNMP management model, it summarizes the sensor network management architecture, including its organization, functions and information.<br />

Disadvantages:<br />

Although some simulations were carried out, no detailed implementation scheme was given, and the research is still at an initial stage.<br />

B. MIADSN<br />

Advantages:<br />

The entire sensor network is divided into several subsystems with different functions, which is conducive to modularity. To conserve bandwidth, mobile agent technology and statistical methods based on fuzzy theory are introduced, so that data collection is achieved with only a few sensor nodes.<br />

Disadvantages:<br />

The main consideration is the negative effect of data fusion; management is discussed little, and specific methods are still under study.<br />

C. Other existing methods that rely on broadcast traffic<br />

Advantages:<br />

Low-level nodes are organized and managed by high-level nodes. Each task carried out by a mobile agent is assigned a certain strategy, and these strategies control how the mobile agent collects data from the wireless sensor nodes.<br />

Disadvantages:<br />

Because management information must be exchanged between nodes, the nodes consume excessive energy and the network lifetime is reduced. Other methods do not consider the energy of the nodes at all.<br />

D. Comparison of common methods<br />

Because of the characteristics of wireless sensor networks, network management models such as CMIP, SNMP and ANMP are not suited to wireless sensor network management, so researchers have put forward new network management solutions. For example, Linnyer B. Ruiz et al. described a management framework for wireless sensor networks in three aspects (management levels, management functions and management function domains) and designed the MANNA architecture, through which a wireless sensor network is configured and managed. Wang Feng et al. proposed a distributed sensor network management model based on mobile intelligent agents (Mobile Intelligent Agent-based DSN, called MIADSN), which uses a new model: data is retained locally and fused remotely. A cluster-based hierarchical self-management model for wireless sensor networks has also been proposed, in which low-level nodes are organized and managed by high-level nodes; because management information must be exchanged between nodes, the nodes consume excessive energy and the network lifetime is reduced. A policy-based mobile agent management framework for wireless sensor networks has also been proposed. In this framework, each task carried out by a mobile agent is assigned a certain strategy according to management needs, and these strategies control how the mobile agent collects data from the wireless sensor nodes. Other approaches do not consider the energy of the nodes.<br />

Based on the above analysis, the most popular model for sensor networks is a clustered network topology with multiple mobile agents. In this model, reducing energy consumption in a wireless sensor network becomes a matter of how to form clusters, how to select cluster heads, and how to route within and between clusters. According to the literature, the currently used topology is shown in Figure 2.<br />

To reduce the energy consumption of the nodes, this paper proposes a wireless sensor network management model based on clustering and multiple mobile agents. Within this model, the clustering algorithm and the agent routing algorithm are also improved. Simulation results show that the proposed management structure and algorithms reduce the power consumption of the sensor nodes.<br />

Figure 2. Clustering structure<br />

IV. WIRELESS SENSOR NETWORK MANAGEMENT MODEL<br />

A. Problems of Traditional Wireless Sensor Networks<br />

In a traditional wireless sensor network, every sensor node collects data and transfers it to the designated destination node, the sink. In this mode, the energy of the network is consumed mainly by the data transfer of the sensors. Energy is the most important resource of a wireless sensor network. The energy consumed by communication among the network nodes is much larger than that consumed by computation and sensing, and it is concentrated in the sending, receiving and idle states. In a traditional wireless sensor network, a large amount of energy is consumed in data transfer, which leads to the rapid death of the sensor nodes. Such a network is suitable for monitoring small amounts of data, but not for large-scale, long-term deployment.<br />

Because of these shortcomings of the traditional wireless sensor network, clustered management is popular at present. The whole wireless sensor network is divided into a number of regions, and a suitable node in each region is chosen as the cluster head. The cluster head performs basic processing on the data and then transfers it to the sink node. To a certain extent, this method reduces the energy consumption of the sensor nodes.<br />

B. Wireless Sensor Network Management Model Based on the LEACH Method<br />

Suppose that N sensor nodes are randomly and evenly distributed in a two-dimensional square area A, and that the sensor network has the following properties:<br />

The sensor nodes and the sink node are stationary; the sink node is unique and located far from the network area.<br />

All sensor nodes have the same initial energy, which cannot be replenished.<br />

The sink node has sufficient energy.<br />

Each sensor node can calculate its distance to the cluster head nodes and to the sink node from the strength of a transmitted test signal.<br />

Each node radiates the same amount of energy in all directions.<br />

In wireless sensor networks, differences in the energy consumed by signal transmission affect the performance of routing protocols. The energy model comprises a receiver model and a transmitter model.<br />

The assumed model is shown in Figure 3. It considers the energy consumed by the transmit electronics, the energy consumed by the power amplifier, and the energy consumed by the receive electronics when receiving signals.<br />

Figure 3. Energy consumption model for sensor networks<br />

According to the energy model, when the transmission distance is d and the data length is L bits, the energy consumption of the transmit electronics is:<br />

The energy consumption of the receive electronics is:<br />

$$E_{receive} = E_{Rx\text{-}elec}(L) = L \times E_{elec}$$<br />
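As a minimal sketch of the energy model above, the following Python fragment evaluates the receive cost E_receive = L × E_elec stated in the text. The transmit expression is not available here, so `tx_energy` assumes the first-order radio model commonly used with LEACH (an electronics term plus an amplifier term that grows with d^2). The constants `E_ELEC` and `EPS_AMP` and both function names are illustrative assumptions, not values from this paper.<br />

```python
# Illustrative constants (assumptions, not taken from the paper).
E_ELEC = 50e-9      # J/bit consumed by the transmit or receive electronics
EPS_AMP = 100e-12   # J/bit/m^2 consumed by the power amplifier (assumed d^2 law)


def rx_energy(bits: int) -> float:
    """Receive cost from the text: E_receive = L * E_elec."""
    return bits * E_ELEC


def tx_energy(bits: int, distance: float) -> float:
    """Assumed first-order transmit cost: L * E_elec + L * eps_amp * d^2."""
    return bits * E_ELEC + bits * EPS_AMP * distance ** 2


if __name__ == "__main__":
    L_bits, d = 2000, 50.0
    print(f"rx: {rx_energy(L_bits):.3e} J, tx: {tx_energy(L_bits, d):.3e} J")
```

Under these assumed constants the amplifier term dominates at longer distances, which is the usual argument for keeping cluster member links short.<br />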

C. LEACH protocol shortcomings<br />

The LEACH protocol is only applicable to homogeneous networks; it does not achieve good results in heterogeneous networks.<br />

Each node acts as cluster head with a certain probability, without considering the residual energy or the location of the node. In each cycle, every node must act as a cluster head once.<br />

When data is transferred from the cluster heads to the base station, all cluster heads send directly to the base station.<br />



V. CLUSTER HEAD ELECTION<br />

A great deal of simulation and analysis has been done on how to determine the optimal probability of cluster head election. The optimal probability is considered to be a function of the spatial density of the nodes, which are evenly distributed in the monitoring area. The best clustering achieves the optimal energy distribution in the network and thus the minimum total energy consumption.<br />

Suppose that the nodes are evenly distributed in a square area of side length 2a according to a Poisson distribution with parameter λ. The number of sensor nodes N is then a random variable with E[N] = λA, where A = 4a^2 is the area of the square. Let p be the probability of cluster head election, so that np is the expected number of cluster heads when N = n. Assuming the base station is located at the center of the square area, the average distance from a cluster head node to the base station is:<br />

$$E[D_i \mid N = n] = \int_A \sqrt{x_i^2 + y_i^2}\,\left(\frac{1}{4a^2}\right) dA = 0.765a$$<br />

Here D_i is a random variable denoting the distance from the cluster head node at (x_i, y_i) to the base station. The network has np cluster head nodes, and the positions of the cluster heads are independent of each other, so the total length of the links from all the cluster heads to the sink node is 0.765npa.<br />
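The constant 0.765 can be checked numerically. The sketch below, with hypothetical function names, averages the distance to the center over a uniform grid on the square [-a, a] × [-a, a] and compares it with the closed form a(√2 + ln(1 + √2))/3 of the same integral.<br />

```python
import math


def mean_distance_to_center(a: float, steps: int = 400) -> float:
    """Midpoint-rule estimate of E[sqrt(x^2 + y^2)] for (x, y) uniform on [-a, a]^2."""
    h = 2.0 * a / steps
    total = 0.0
    for i in range(steps):
        x = -a + (i + 0.5) * h
        for j in range(steps):
            y = -a + (j + 0.5) * h
            total += math.hypot(x, y)
    return total / (steps * steps)


def closed_form(a: float) -> float:
    """Exact value of the same integral: a * (sqrt(2) + asinh(1)) / 3."""
    return a * (math.sqrt(2.0) + math.asinh(1.0)) / 3.0


if __name__ == "__main__":
    print(mean_distance_to_center(1.0), closed_form(1.0))
```

Both values scale linearly with a, so the check for a = 1 covers any side length.<br />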

The cluster heads obey a Poisson point process pp0 with intensity λ_1 (λ_1 = pλ), and the cluster member nodes obey a Poisson point process pp1 with intensity λ_0 (λ_0 = (1 − p)λ). Define e_1 as the energy consumed by the member nodes within one cluster to transfer data to the cluster head. Then:<br />

$$E[e_1 \mid N = n] = \frac{E[L \mid N = n]}{r}$$<br />

Here L denotes the total length of the links from the member nodes of a cluster to its cluster head; the total given below corresponds to E[L | N = n] = λ_0 / (2λ_1^{3/2}).<br />

e_2 is the energy consumed by all the ordinary nodes in the network to transmit data to their respective cluster heads. Then:<br />

$$E[e_2 \mid N = n] = np\,E[e_1 \mid N = n]$$<br />

e_3 is the energy consumed by the cluster head nodes to transfer data to the sink node. Then:<br />

$$E[e_3 \mid N = n] = \frac{0.765npa}{r}$$<br />

e is the energy consumption of the entire network. Then:<br />

$$E[e \mid N = n] = E[e_2 \mid N = n] + E[e_3 \mid N = n] = \frac{np(1-p)}{2r\,p^{3/2}\sqrt{\lambda}} + \frac{0.765npa}{r}$$<br />

Since E[N] = λA:<br />

$$E[e] = E\big[E[e \mid N = n]\big] = E[N]\left[\frac{1-p}{2r\sqrt{p\lambda}} + \frac{0.765pa}{r}\right] = \lambda A\left[\frac{1-p}{2r\sqrt{p\lambda}} + \frac{0.765pa}{r}\right]$$<br />

The value of p that minimizes the above expression is the optimal probability of cluster head election for the system:<br />

$$p = \left[\frac{1}{3c} + \frac{2^{1/3}}{3c\left(2 + 27c^2 + 3\sqrt{3}\,c\sqrt{27c^2 + 4}\right)^{1/3}} + \frac{\left(2 + 27c^2 + 3\sqrt{3}\,c\sqrt{27c^2 + 4}\right)^{1/3}}{3c \cdot 2^{1/3}}\right]^2$$<br />

where c = 3.06a√λ. This p is the optimal cluster head election probability and gives the minimum energy consumption of the whole network. Substituting p into the threshold formula gives the threshold T(n) for each round, from which the optimal number of cluster heads in each round is obtained.<br />
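The closed form for p can be sanity-checked in a few lines. The sketch below is illustrative only: the function names are hypothetical, and r defaults to 1 because the text does not give it a value. Setting the derivative of E[e] with respect to p to zero gives c·p^(3/2) − p − 1 = 0 with c = 3.06a√λ, and the code evaluates the real root of that cubic in √p.<br />

```python
import math


def expected_energy(p: float, lam: float, a: float, r: float = 1.0) -> float:
    """E[e] = lam * A * ((1 - p) / (2 r sqrt(p lam)) + 0.765 p a / r), with A = 4 a^2."""
    area = 4.0 * a * a
    return lam * area * ((1.0 - p) / (2.0 * r * math.sqrt(p * lam)) + 0.765 * p * a / r)


def optimal_p(lam: float, a: float) -> float:
    """Closed-form minimiser: real root of c * p^(3/2) - p - 1 = 0, c = 3.06 a sqrt(lam)."""
    c = 3.06 * a * math.sqrt(lam)
    w = (2.0 + 27.0 * c * c
         + 3.0 * math.sqrt(3.0) * c * math.sqrt(27.0 * c * c + 4.0)) ** (1.0 / 3.0)
    two13 = 2.0 ** (1.0 / 3.0)
    s = 1.0 / (3.0 * c) + two13 / (3.0 * c * w) + w / (3.0 * c * two13)  # s = sqrt(p)
    return s * s


if __name__ == "__main__":
    lam, a = 1.0, 10.0            # assumed density and half side length
    p = optimal_p(lam, a)
    print(p, expected_energy(p, lam, a))
```

Because r only scales E[e], the minimiser does not depend on it; the optimal p is determined by a√λ alone.<br />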

As the network operates, its energy changes, so the value of p changes as well, and the number of cluster heads in the network changes dynamically with it.<br />
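The text does not reproduce the formula for T(n). As a hedged illustration, the sketch below assumes the standard LEACH threshold T(n) = p / (1 − p (r mod 1/p)) for nodes that have not yet served as cluster head in the current cycle, and 0 otherwise. The function names and the `eligible_ids` set are hypothetical; the model described here may use a different threshold.<br />

```python
import random


def leach_threshold(p: float, round_no: int, eligible: bool) -> float:
    """Standard LEACH threshold (assumed): p / (1 - p * (r mod 1/p)), or 0 if ineligible."""
    if not eligible:
        return 0.0
    period = round(1.0 / p)                 # number of rounds in one election cycle
    return p / (1.0 - p * (round_no % period))


def elect_cluster_heads(node_ids, p, round_no, eligible_ids, rng):
    """Each node draws a uniform number and becomes a head if it falls below T(n)."""
    heads = []
    for n in node_ids:
        if rng.random() < leach_threshold(p, round_no, n in eligible_ids):
            heads.append(n)
    return heads


if __name__ == "__main__":
    rng = random.Random(0)
    print(elect_cluster_heads(range(20), 0.1, 0, set(range(20)), rng))
```

With a p that changes from round to round, as described above, the same functions can simply be called with the updated value each round.<br />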

VI. MOBILE AGENT<br />

A mobile agent combines distributed technology with artificial intelligence; simply put, it is an intelligent agent with mobility. The main idea is to transfer the code of a computation module to each node, complete the computation on the node, and return the processing results to the target sink, which reduces the energy that the sensor nodes spend on transmitting data.<br />

Through the discussion above, the clusters can be established and the cluster head nodes selected at the same time. By using mobile agents, data can be transferred among the cluster head nodes, and the final results are returned to the target sink node.<br />

Assume that the migration of the mobile agent only improves the accuracy of the data, that S_MA is fixed, and that the energy consumed by idle listening is ignored, that is, E_idle = b_4 = 0. The energy a node consumes to transmit the agent can then be defined as<br />

$$E_{tx}(d) = (b_1 + b_2 d^{a})\,S_{MA},$$<br />

and the energy consumed to receive it as<br />

$$E_{rx} = b_3\,S_{MA}.$$<br />

Here b_1, b_2, b_3 and b_4 are constants related to the wireless transceiver of the sensor node; d is the transmission distance between nodes; and a, with 2 ≤ a ≤ 4, is the path attenuation factor of signal transmission. The migration cost of the agent moving from v(i) to v(j) is:<br />

$$a_{ij} = \begin{cases} (b_1 + b_2 d_{ij}^{a})\,S_{MA} + b_3\,S_{MA}, & d_{ij} \le R_{max} \\ \infty, & d_{ij} > R_{max} \end{cases}$$<br />

where d_ij is the distance between v(i) and v(j). Node v(j) can be reached when d_ij is no greater than R_max; otherwise it cannot be visited. The perception of the target by a node is treated as the reception of the target signal.<br />
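The migration cost can be written directly as a small function. This is a sketch under assumptions: S_MA is interpreted as the agent size, the parameter values in the example are arbitrary, and the function name is hypothetical.<br />

```python
import math


def migration_cost(d_ij: float, s_ma: float, b1: float, b2: float, b3: float,
                   a: float, r_max: float) -> float:
    """Cost a_ij of moving an agent (S_MA, assumed to be its size) over distance d_ij.

    Within range: transmit cost (b1 + b2 * d^a) * S_MA plus receive cost b3 * S_MA.
    Beyond R_max the next node is unreachable, so the cost is infinite.
    """
    if d_ij > r_max:
        return math.inf
    return (b1 + b2 * d_ij ** a) * s_ma + b3 * s_ma


if __name__ == "__main__":
    # Arbitrary example values, chosen only to exercise both branches.
    print(migration_cost(10.0, 100.0, 1.0, 0.01, 1.0, 2.0, 50.0))
    print(migration_cost(60.0, 100.0, 1.0, 0.01, 1.0, 2.0, 50.0))
```

Returning infinity for unreachable hops lets a route search discard them without a separate reachability test.<br />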

The information gain of v(j) is therefore:<br />

$$SE(j) = \begin{cases} b_5\,d_{jo}^{-a'}, & d_{jo} \le D_{max} \\ 0, & d_{jo} > D_{max} \end{cases}$$<br />

where d_jo is the distance between v(j) and the target, and a' is the attenuation factor on the path to the target. When d_jo is no more than D_max, the node can perceive the target, and the information gain is inversely proportional to d_jo^{a'}.<br />
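A matching sketch for the information gain is given below. The function name and the example values are assumptions; the guard against a non-positive distance is an addition for robustness and is not part of the model.<br />

```python
def information_gain(d_jo: float, b5: float, a_prime: float, d_max: float) -> float:
    """SE(j) = b5 * d_jo^(-a') when the target is within D_max, else 0."""
    if d_jo <= 0.0:
        raise ValueError("distance to the target must be positive")
    if d_jo > d_max:
        return 0.0
    return b5 * d_jo ** (-a_prime)


if __name__ == "__main__":
    for d in (1.0, 2.0, 5.0, 20.0):
        print(d, information_gain(d, b5=8.0, a_prime=2.0, d_max=10.0))
```

A node closer to the target yields a larger gain, so an agent route can weigh this gain against the migration cost of reaching the node.<br />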

VII. SIMULATION AND RESULTS<br />

Cluster head nodes are computed with the improved algorithm based on LEACH, and the routing optimization algorithm of the mobile agent is built on top of it. The simulation is implemented in NS-2. Sensor nodes are randomly distributed in a 1000 × 1000 area, and the number of nodes varies from 10 to 1000. The energy consumption of the unoptimized algorithm is compared with that of the improved mobile agent algorithm; the simulation results are shown in Figure 4.<br />

Figure 4. Comparison <strong>of</strong> energy consumption<br />

Random application: ten test scenarios are generated for each node scale, and every scenario is tested 10 times, with the average taken. The simulation results are shown in Table I.<br />

TABLE I. RESULTS OF PERFORMANCE COMPARISON<br />

Random scene | Unoptimized algorithm: energy consumption | Unoptimized algorithm: information | Improved agent method: energy consumption | Improved agent method: information<br />
1 | 545612 | 1078.33 | 216532 | 948.76<br />
2 | 432125 | 1468.41 | 154334 | 1399.78<br />
3 | 409563 | 960.12 | 126697 | 842.64<br />
4 | 896521 | 321.46 | 301682 | 355.47<br />
5 | 502184 | 1823.68 | 192360 | 1253.14<br />
6 | 57620 | 1123.17 | 221302 | 987.17<br />
7 | 496172 | 989.88 | 182780 | 1132.54<br />
8 | 457890 | 1572.91 | 155076 | 1475.21<br />
9 | 870475 | 309.24 | 270817 | 333.69<br />
10 | 429761 | 1254.71 | 131721 | 982.06<br />
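To make the comparison in Table I easier to read, the sketch below computes the relative energy reduction of the improved agent method for each scene. The energy values are copied from the table as printed; the names `TABLE_I` and `energy_reduction` are illustrative.<br />

```python
# (scene, unoptimized energy, improved-agent energy), copied from Table I as printed.
TABLE_I = [
    (1, 545612, 216532),
    (2, 432125, 154334),
    (3, 409563, 126697),
    (4, 896521, 301682),
    (5, 502184, 192360),
    (6, 57620, 221302),
    (7, 496172, 182780),
    (8, 457890, 155076),
    (9, 870475, 270817),
    (10, 429761, 131721),
]


def energy_reduction(unoptimized: float, improved: float) -> float:
    """Fraction of energy saved by the improved method relative to the unoptimized one."""
    return 1.0 - improved / unoptimized


reductions = {scene: energy_reduction(u, i) for scene, u, i in TABLE_I}

if __name__ == "__main__":
    for scene, value in reductions.items():
        print(f"scene {scene}: {value:+.1%}")
```

As printed, scene 6 is the only row in which the improved method consumes more energy than the unoptimized algorithm, so its reduction comes out negative.<br />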

VIII. CONCLUSION<br />

Building on the traditional sensor network algorithm, this work optimizes cluster head generation through energy calculation and uses an improved routing algorithm. This paper first analyzes the advantages and disadvantages of the existing routing protocols. The classical LEACH (Low Energy Adaptive Clustering Hierarchy) protocol for hierarchical sensor networks is analyzed and discussed in detail, and a new energy-efficient routing protocol for WSNs is then presented: the Multi-Hierarchical Algorithm based on Clustering (MHAC). Simulation results show that MHAC balances the energy load and prolongs the lifetime of the WSN.<br />

The prominent feature of wireless sensor networks is their limited energy. Current algorithms all fail to pay attention to the energy characteristics of the nodes: when performing topology discovery, they consume much energy and cannot guarantee the connectivity of the network. The mobile agent-based topology discovery algorithm improves on the traditional algorithm in terms of energy. Simulation shows that it consumes less energy than the traditional algorithm while retaining some of its features and achieving even better connectivity.<br />

ACKNOWLEDGMENT<br />

This work was supported by Project Y200804680 of the Research planning issues.<br />

REFERENCES<br />

[1] Shaojun Yang, Haoshan Shi, Rui Huang. Spatial-Temporal Information Integration Framework Based on Mobile-Agent in Wireless Sensor Networks. In Proc. of 16th International Conference on Computer Communication (ICCC2004), 2004, Beijing, China: 1096-1100 (ISIP:000228632800198)<br />

[2] Li N, Hou J C, Sha L. Design and analysis of an MST-based topology control algorithm[A]. In: Proceedings 12th Joint Conf on IEEE Computer and Communications Societies (INFOCOM 2003)[C]. San Francisco, 2003. 1702-1712<br />

[3] I. S. Jacobs and C. P. Bean, “Fine particles, thin films and exchange anisotropy,” in Magnetism, vol. III, G. T. Rado and H. Suhl, Eds. New York: Academic, 1963, pp. 271–350.<br />

[4] Tynan R, Marsh D, O'Kane D, O'Hare GMP. Agents for Wireless Sensor Network Power Management[A]. In: Proceedings of the 2005 International Conference on Parallel Processing Workshops[C]. June 2005. 413-418<br />

[5] Chen Min, Kwon T, Choi Y. Data Dissemination based on Mobile Agent in Wireless Sensor Networks[A]. In: Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05)[C]. IEEE Computer Society, Sydney, Australia, November 2005. 2~3<br />

[6] Kui Wu, Yong Gao, Fulu Li, Yang Xiao. Lightweight Deployment-Aware Scheduling for Wireless Sensor Networks[J]. Mobile Networks and Applications, 2005, 10(6)<br />

[7] Zhang Wenjuan, Zhu Xiangbin. Mobile Agent-based Clustering Data Fusion in WSN[J]. Computer & Digital Engineering, March 2010.<br />

[8] Wang Jietai, Yang Shaojun, Yu Haixun. Application of Mobile Agent in Wireless Sensor Networks. Computer Engineering, March 2008<br />

[9] Xiao Qing, Jiao Jian. Application for artificial bee algorithm in migration of mobile agent[J]. Application Research of Computers, June 2010.<br />

[10] Li Ming, Fan Gaojuan. An EIW-DSR Route Algorithm Based on the Energy Integrated Weight in Ad Hoc Networks[J]. Computer Engineering & Science, November 2010<br />

[11] Fdd Zhang Sheng, He Qingquan. Improved ant colony algorithm to solve mobile agent in wireless sensor networks[J]. Application Research of Computers, November 2010.<br />

[12] LinGe Wang. Management Model Research of Wireless Sensor Network Based on Mobile Agent. Intelligent Computation Technology and Automation (ICICTA), 2010, Shenzhen, China<br />

[13] FanGaoJuan. Management Model Research of Wireless Sensor Network Based on Mobile Agent, 2007, 5<br />

LinGe Wang was born in ZheJiang Province, China, in January 1979. He holds a Master's degree in computer technology from Fudan University. His research interests include network engineering, information security and wireless sensor networks. He is a senior lecturer in the Network Department, College of Software, Ningbo Dahongying University.<br />

YueDou Qi was born in ZheJiang Province, China, in February 1964. He holds a Bachelor's degree in Computer Application from QiQiHaEr University. His research interests include data mining, complex networks and business intelligence. He is a professor in the Network Department, College of Software, Ningbo Dahongying University.



Covert Flow Graph Approach to Identifying<br />

Covert Channels<br />

XiangMei Song<br />

School <strong>of</strong> Computer Science and Telecommunication Engineering,<br />

Jiangsu University, Zhenjiang, 210013, China<br />

Email: jlsxm@ujs.edu.cn<br />

ShiGuang Ju<br />

School <strong>of</strong> Computer Science and Telecommunication Engineering,<br />

Jiangsu University, Zhenjiang, 210013, China<br />

Email: jushig@ujs.edu.cn<br />

Abstract—In this paper, an approach for identifying covert channels using a graph structure called the Covert Flow Graph is introduced. First, the construction of the Covert Flow Graph, which captures the information flows of the system for covert channel detection, is proposed, and the search-and-judge algorithm used to identify covert channels in the Covert Flow Graph is given. Second, an example file system analysis using the Covert Flow Graph approach is provided, and the result is compared with those of the Shared Resource Matrix and Covert Flow Tree methods. Finally, the comparison between the Covert Flow Graph approach and the other two methods is discussed. Unlike previous methods, the Covert Flow Graph approach provides deep insight into a system's information flows and gives an effective algorithm for covert channel identification.<br />

Index Terms—multilevel security, covert channels,<br />

information flows, covert flow graph, shared resource<br />

matrix, covert flow trees<br />

I. INTRODUCTION<br />

Multilevel secure computer systems are used to protect hierarchical information by enforcing both mandatory and discretionary access controls. They can restrict the flow of information through legitimate communication channels[1]. However, covert channels are usually beyond the scope of the security model. Covert channels usually signal information through system facilities not intended for data transfer. That is, the sending process alters some system attributes, and the receiving process monitors the alteration[2]. In order to decrease the threat of covert channels, several covert channel analysis techniques have been proposed and used over the past thirty years. Among these techniques are the Shared Resource Matrix methodology (SRM)[3], the Noninterference approach[4], the Information Flow analysis[5], the Covert Flow Trees methodology (CFT)[6], the Backward Tracking approach[7], and others.<br />

Manuscript received Mar. 25, 2011; revised Apr. 17, 2011; accepted Apr. 20, 2011.<br />

project number: 60773049, 61003288, 20093227110005, BK2010192, 07JDG014, 08KJD520015<br />

doi:10.4304/jnw.6.12.1740-1746<br />

The SRM method is one of the most successful approaches for covert channel identification. The method starts by identifying shared resources. A shared resource is any object or collection of objects that may be referenced or modified by more than one subject. All identified shared resources are enumerated in a matrix structure, and each resource is then carefully examined to determine whether it can be used to transfer information covertly from one subject to another. The matrix structure makes the SRM method simple and intuitive. However, the shared resource matrix gives no help in constructing covert communication scenarios, and the amount of manual analysis is enormous. Much research has been done to improve the SRM method. The CFT method is essentially a transformation of the SRM method. Furthermore, McHugh[8] made three extensions to the matrix structure, and Shen and Qing[9-10] optimized the SRM method.<br />

The CFT method uses a tree structure instead of the shared resource matrix. Thanks to its tree structure, the CFT method can record flow paths and helps to construct covert communication scenarios. However, covert flow trees are usually huge, and the construction of the trees can fall into an infinite loop. To resolve this problem, a constraint parameter named REPEAT has to be introduced, which may cause some potential covert channels to be missed.<br />

This paper presents a graph data structure that models system information flow from one shared resource attribute to another. This data structure is referred to as the Covert Flow Graph (CFG). Constructing a covert flow graph is easy, and the graph can include almost all information flows in a system. Searching the graph for information flow paths yields operation sequences that support the analysis work of detecting covert channels. To demonstrate this technique, an example file system is analyzed, and the result is compared with those of other covert channel analysis methods.<br />


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1741<br />

II. THE COVERT FLOW GRAPH APPROACH<br />

The goal of the Covert Flow Graph approach is to identify operation sequences that support potential communication channels exploitable by two users. The Covert Flow Graph is a directed graph. Its nodes describe the information flows from one or more shared resource attributes to another attribute within an operation, while its directed edges denote the dependency relationships between two operations that share the same attribute and generate information flows. Section Ⅱ-A presents the graph notation and semantics used in the construction of the Covert Flow Graph. Section Ⅱ-B explains how to construct a covert flow graph. Section Ⅱ-C discusses the reasons for pruning the Covert Flow Graph. Section Ⅱ-D introduces an algorithm for searching information flow paths and judging potential covert channels.<br />

A. Graph notation and semantics<br />

Let $SA$ denote the collection of all shared resources (or shared attributes) in the system and $OP$ denote the collection of all primitive operations. In particular, the set $SA$ contains an attribute named $output$, whose value is returned by primitive operations. For $op_i \in OP$ ($1 \le i \le n$, where $n$ is the number of primitive operations), let $SAR_i, SAM_i, SAO_i \subset SA$, where $SAR_i$ contains all attributes referenced by $op_i$, $SAM_i$ contains all attributes modified by $op_i$, and $SAO_i$ contains all attributes output by $op_i$.<br />

The Covert Flow Graph is a directed graph. Fig. 1 shows the three kinds of nodes in Covert Flow Graphs. The triple $\langle SAR_i, op_i, v \rangle$ in Fig. 1(a), where $v \in SAM_i$, represents that $v$ is modified by $op_i$ according to the values of the shared attributes in $SAR_i$. The triple $\langle v, op_i, output \rangle$ in Fig. 1(b), where $v \in SAO_i$, indicates that $v$ is returned by $op_i$. The OUTPUT node in Fig. 1(c) is the finish node, which appears only once in a covert flow graph. Its use is discussed later.<br />

Figure 1. Nodes used in Covert Flow Graphs<br />

Definition 1. Covert Flow Graph (CFG): $CFG = \langle SA, SAO, OP, AR, OUTPUT, S, E \rangle$. $SA$ is the set of all shared attributes. $SAO \subset SA$ is the set of all attributes returned by primitive operations. $OP$ is the set of all primitive operations. $AR = \{ SAR_i \mid i = 1, \ldots, n \}$. $S = S_1 \cup S_2 \cup S_3$ is the node set, where $S_1 \subseteq AR \times OP \times SA$, $S_2 \subseteq SAO \times OP \times \{output\}$ and $S_3 = \{OUTPUT\}$. $E = E_1 \cup E_2 \cup E_3$ is the edge set, where<br />

$E_1 = \{ \langle s_i, s_j \rangle \mid s_i, s_j \in S_1 \wedge s_i = \langle SAR_i, op_i, v \rangle \wedge s_j = \langle SAR_j, op_j, v_j \rangle \wedge v \in SAR_j \}$,<br />

$E_2 = \{ \langle s_i, s_j \rangle \mid s_i \in S_1 \wedge s_j \in S_2 \wedge s_i = \langle SAR_i, op_i, v \rangle \wedge s_j = \langle v, op_j, output \rangle \wedge v \in SAO_j \}$,<br />

$E_3 = \{ \langle s_i, s_j \rangle \mid s_i \in S_2 \wedge s_j \in S_3 \}$.<br />

The directed edges in $E_1$ describe the dependency relationship between two operations, such as $op_i$ and $op_j$: a shared attribute $v$ is modified by $op_i$ and then referenced by $op_j$. The directed edges in $E_2$ represent that a shared attribute $v$ is modified by one operation $op_i$ and that its value is then returned by another operation $op_j$. Every node of the form $\langle v, op_i, output \rangle$ must be connected to the finish node by a directed edge in $E_3$, which means that the value of the attribute $v$ is returned by $op_i$.<br />

B. Construction of Covert Flow Graph<br />

As in the CFT method, a reference list, a modify list and a return list are created for each primitive operation, and these lists are then used as input to construct the Covert Flow Graph. The information for creating the three lists can be obtained from the system's description, formal specification or implementation code. Covert Flow Graphs are therefore applicable to any phase of the system life cycle.<br />

Constructing a covert flow graph has two main steps. The first step is to construct the nodes. For every $v_{ij} \in SAM_i$ ($1 \le j \le m_i$, where $m_i$ is the number of attributes modified by $op_i$), generate the triple $\langle SAR_i, op_i, v_{ij} \rangle$; and for every $v_{ik} \in SAO_i$ ($1 \le k \le o_i$, where $o_i$ is the number of attributes whose values are returned by $op_i$), generate the triple $\langle v_{ik}, op_i, output \rangle$. In addition, generate the finish node OUTPUT. The second step is to generate directed edges among the nodes. For all $op_i, op_j \in OP$ ($1 \le i, j \le n$), if there is an attribute $v$ that is modified by $op_i$ and referenced or returned by $op_j$, then $op_j$ is dependent on $op_i$ with respect to $v$. In other words, for every two distinct operations $op_i$ and $op_j$, if the first step has constructed the node $\langle SAR_i, op_i, v_1 \rangle$ together with the node $\langle SAR_j, op_j, v_2 \rangle$ ($v_1 \in SAR_j$) or the node $\langle v_1, op_j, output \rangle$, then a directed edge is generated from the former node to the latter one. Finally, every node $\langle v, op_j, output \rangle$ is connected to the finish node.<br />

Fig. 2 illustrates an example graph. The operation lists used to build this example are the same as those used in [6] and are given in Table Ⅰ.<br />

According to the method above, for the operation $op_4$ in Table Ⅰ, $SAR_4$ is $\{A\}$ and $SAM_4$ is $\{B, C\}$, so the triples $\langle \{A\}, op_4, B \rangle$ and $\langle \{A\}, op_4, C \rangle$ are generated; $SAO_4$ is $\{A\}$, so $\langle A, op_4, output \rangle$ is created. These three nodes are marked in gray in Fig. 2. The bold edge from $\langle \{D\}, op_1, A \rangle$ to $\langle A, op_4, output \rangle$ is generated because $A$ is modified by $op_1$ and referenced by $op_4$. Because the value of $A$ is returned by $op_4$, the node $\langle A, op_4, output \rangle$ is connected to the finish node.<br />

Figure 2. Example CFG representing information flows<br />

TABLE I. EXAMPLE OPERATION LISTS<br />

List | Operation 1 | Operation 2 | Operation 3 | Operation 4<br />
Reference List | D | A | B | A<br />
Modify List | A, B | B | Null | B, C<br />
Return List | Null | Null | B | A<br />

Attributes: A, B, C, D<br />
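The two construction steps can be sketched in Python for the operation lists of Table Ⅰ. This is only an illustrative sketch: the names `OPS`, `OUTPUT` and `build_cfg` and the tuple encoding of nodes are assumptions made here, not part of the original method description, and the sketch follows the construction rule that edges are generated only between two distinct operations.

```python
# Operation lists from Table I: operation -> (reference, modify, return) sets.
OPS = {
    "op1": ({"D"}, {"A", "B"}, set()),
    "op2": ({"A"}, {"B"}, set()),
    "op3": ({"B"}, set(), {"B"}),
    "op4": ({"A"}, {"B", "C"}, {"A"}),
}
OUTPUT = "OUTPUT"  # the single finish node (hypothetical encoding)


def build_cfg(ops):
    """Return (nodes, edges) of a covert flow graph built from the lists."""
    # Step 1: one node <SAR_i, op_i, v> per modified attribute (set S1),
    # one node <v, op_i, output> per returned attribute (set S2).
    s1 = [(frozenset(ref), op, v)
          for op, (ref, mod, _) in ops.items() for v in sorted(mod)]
    s2 = [(v, op, "output")
          for op, (_, _, ret) in ops.items() for v in sorted(ret)]

    # Step 2: generate the directed edges.
    edges = set()
    for src in s1:
        _, op_i, v = src
        for dst in s1:  # E1: v modified by op_i and referenced by op_j
            if dst[1] != op_i and v in dst[0]:
                edges.add((src, dst))
        for dst in s2:  # E2: v modified by op_i and returned by op_j
            if dst[1] != op_i and dst[0] == v:
                edges.add((src, dst))
    for src in s2:      # E3: every return node is connected to the finish node
        edges.add((src, OUTPUT))
    return s1 + s2 + [OUTPUT], edges


nodes, edges = build_cfg(OPS)
```

Under these assumptions the sketch produces the three nodes discussed above for $op_4$ and the edge from the $op_1$ node that modifies $A$ to the node that returns $A$ through $op_4$.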

C. Pruning of Covert Flow Graph<br />

Once a covert flow graph has been constructed, it can be pruned before the analysis. Pruning is a two-step process. First, remove every node that has incoming edges but no outgoing edges, together with the edges connected to it; the finish node is exempt. The reason is that only paths ending at the finish node of the covert flow graph can be potential covert storage channels. Second, identify and remove the starting nodes whose paths cannot occur in practice. A system often contains pairs of operations in which one operation, called the post-executed operation, must be executed after the other, called the pre-executed operation, and in which the consecutive executions of the two operations nullify each other's effects, such as Lock_File and Unlock_File (see footnote 1). In a running system, almost no operation sequence starts with a post-executed operation. Such pairs of consecutive operations should be identified when the primitive operations of a system are analyzed. In this step, if a starting node represents a post-executed operation, remove the node and the edges leaving it.<br />
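The two pruning steps can be sketched as follows. The function name `prune`, the `(operation, attribute)` node encoding and the toy graph are hypothetical, and the sketch assumes that step one is repeated until no dead-end node remains, which the description above does not state explicitly.

```python
def prune(nodes, edges, finish, post_executed, op_of):
    """Prune a covert flow graph given as a node set and a set of (src, dst) edges."""
    nodes, edges = set(nodes), set(edges)

    # Step 1: remove nodes with incoming but no outgoing edges (finish node exempt).
    # Assumption: repeat until no such node is left.
    while True:
        has_out = {s for s, _ in edges}
        has_in = {d for _, d in edges}
        dead = {n for n in nodes
                if n != finish and n in has_in and n not in has_out}
        if not dead:
            break
        nodes -= dead
        edges = {(s, d) for s, d in edges if s not in dead and d not in dead}

    # Step 2: remove starting nodes that represent a post-executed operation,
    # together with the edges leaving them.
    has_in = {d for _, d in edges}
    starts = {n for n in nodes
              if n != finish and n not in has_in and op_of(n) in post_executed}
    nodes -= starts
    edges = {(s, d) for s, d in edges if s not in starts}
    return nodes, edges


# Toy graph with made-up operation names.
FINISH = "OUTPUT"
acquire, release = ("Acquire", "x"), ("Release", "x")
dead_end, query = ("Touch", "y"), ("Query", "output")
toy_nodes = {acquire, release, dead_end, query, FINISH}
toy_edges = {(acquire, query), (release, query),
             (acquire, dead_end), (query, FINISH)}

pruned_nodes, pruned_edges = prune(
    toy_nodes, toy_edges, FINISH,
    post_executed={"Release"}, op_of=lambda node: node[0])
```

In the toy graph, the dead-end node is removed in the first step and the starting node of the post-executed operation `Release` in the second.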

1 The operations Lock_File and Unlock_File come from the file system used in [3], which is used again in the following sections.<br />

D. Search for information flow paths and identification of covert channels<br />

A pruned covert flow graph includes all information flow paths in the system, but not all of them are covert storage channels. The next task is to search for information flow paths and identify covert storage channels. According to the minimum criteria that a covert storage channel must satisfy [3], exploiting a covert channel for communication between two users has three characteristics:<br />

(1) The sending process (or user) must be able to modify the value of some shared attribute.<br />

(2) The receiving process must be able to detect the attribute change. That is, the attribute's value must be returned by an operation invoked by the receiving process.<br />

(3) The security class of the sending process must dominate, or be incomparable to, that of the receiving process.<br />

These characteristics are reflected in the operation sequences of covert channels. Because operation names are included in the nodes of covert flow graphs, operation sequences can be obtained while searching for information flow paths in the graph. From characteristics (1)-(3) above, the following rules for covert channel identification based on Covert Flow Graphs are derived.<br />

Regulation 1. If the start operation of an operation sequence for covert communication is a dependent operation, the operation sequence cannot form a covert channel.<br />

In a system, whether the read or write action of some operation can execute may be decided by the execution results of another operation. The former operation is then dependent on the latter one. According to characteristic (1), the sending process must modify the shared attribute's value independently, so this kind of dependent operation cannot be exploited by the sending process.<br />

Regulation 2. If an operation sequence can form a covert channel, its corresponding information flow path must end at the finish node of the covert flow graph.<br />

According to characteristic (2), the receiver must finally invoke an operation that outputs the attribute's value. Because every node representing a returned attribute value is connected to the finish node, Regulation 2 is valid.<br />

Regulation 3. If an operation sequence can form a covert channel, the user authority required by the start operation must not be dominated by that required by the end operation of the sequence.<br />

If Regulation 3 is not satisfied, the communication is a legal channel between the sender and the receiver, because the sender's security class is lower than or equal to the receiver's.<br />
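The three regulations can be read as a filter on candidate operation sequences. The following sketch assumes a conventional lattice of security classes, each a sensitivity level plus a set of categories; the names `SecurityClass`, `dominates` and `passes_regulations` are illustrative and do not appear in the original description.

```python
from typing import FrozenSet, NamedTuple


class SecurityClass(NamedTuple):
    level: int                  # sensitivity level, higher is more sensitive
    categories: FrozenSet[str]  # compartment set


def dominates(a: SecurityClass, b: SecurityClass) -> bool:
    """True if class a dominates class b in the assumed lattice."""
    return a.level >= b.level and a.categories >= b.categories


def passes_regulations(start_is_dependent: bool, ends_at_finish: bool,
                       sender: SecurityClass, receiver: SecurityClass) -> bool:
    """Check Regulations 1-3 for one candidate operation sequence."""
    if start_is_dependent:       # Regulation 1: start operation must be independent
        return False
    if not ends_at_finish:       # Regulation 2: path must end at the finish node
        return False
    # Regulation 3: the sender's class must not be dominated by the receiver's.
    return not dominates(receiver, sender)


high = SecurityClass(2, frozenset())
low = SecurityClass(1, frozenset())
left = SecurityClass(1, frozenset({"x"}))
right = SecurityClass(1, frozenset({"y"}))
```

With this encoding, a flow from `high` to `low` and a flow between the incomparable classes `left` and `right` remain candidates, while a flow from `low` to `high` or between equal classes is treated as a legal channel.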

According to Regulation 2, only paths that end at the finish node of the covert flow graph can be potential covert channels. The search method for information flow paths therefore consists of the following steps. First, build the converse digraph of the covert flow graph, named $CFG^{-1}$. In $CFG^{-1}$, the finish node is the only node without incoming edges. Second, use depth-first search to find all paths that begin at the finish node. While searching, determine whether each path can be exploited as a covert channel; the judgment is based on Regulations 2 and 3. To avoid endless cycles while searching, a directed edge may appear only once in a path. The search and judge algorithm is as follows:<br />

Algorithm 1: the search and judge algorithm<br />

Procedure PathSearching($CFG^{-1}$)<br />
Input: $CFG^{-1}$<br />
Output: PATH // information flow paths<br />
Begin<br />
  initial $stack_1$ // $stack_1$ is used to backtrack<br />
  initial PATH<br />
  T := Φ // T is the set of directed edges<br />
  push($stack_1$, OUTPUT) // push the OUTPUT node into the stack<br />
  while $stack_1$ != NULL do<br />
    pop($stack_1$, v)<br />
    push($stack_1$, NULL) // NULL represents the null node<br />
    flag := FALSE<br />
    while (∃ v→w ∈ $CFG^{-1}$) ∧ (v→w ∉ T) do<br />
      push($stack_1$, w)<br />
      T := T ∪ {v→w}<br />
      flag := TRUE<br />
    od<br />
    if flag = TRUE then<br />
      push($stack_2$, (v, w))<br />
    else<br />
      JudgeCovertChannel()<br />
      pop($stack_1$, v)<br />
      while v = NULL do<br />
        pop($stack_2$, (v, w))<br />
        T := T − {v→w}<br />
        pop($stack_1$, v)<br />
      od<br />
      push($stack_1$, v)<br />
    fi<br />
  od<br />
End<br />

Procedure JudgeCovertChannel()<br />
Output: CC // potential covert channels<br />
Begin<br />
  i := 0<br />
  j := 0<br />
  while PATH != NULL do<br />
    pop(PATH, (v, w))<br />
    Array[i++] := w<br />
    if the operation in w is independent then lab[j++] := i<br />
  od<br />
  v := Array[i-1]<br />
  for k := 0 to j-1 do<br />
    w := Array[lab[k]]<br />
    if sl(w) ≮ sl(v) then // sl(x) represents the security class of node x<br />
      CC ← Array[lab[k]..i-1]<br />
      output CC<br />
    fi<br />
  od<br />
End<br />


In Algorithm 1, procedure PathSearching performs a depth-first search for paths in $CFG^{-1}$. To find all paths, a stack named $stack_1$ is needed for backtracking. During backtracking, the number of steps to go back is determined by the number of NULL nodes popped from $stack_1$. $stack_2$ is another stack, used to record the nodes of a path. During the search, whenever a directed edge is found, the pair of nodes corresponding to the edge is pushed onto $stack_2$. Furthermore, a set $T$ records the edges already used in the current path, so that every edge appears only once in a path and endless cycles are avoided. Once a path has been found, procedure JudgeCovertChannel is invoked to judge whether covert channels exist in the path.<br />
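The idea behind the search and judge steps can be sketched compactly in Python. This is not a transcription of Algorithm 1: recursion replaces the two explicit stacks, the set `used` plays the role of $T$, and security classes are simplified to integer levels. The names `search_paths` and `judge` and the toy graph are assumptions made for illustration.

```python
def search_paths(reverse_adj, finish="OUTPUT"):
    """Enumerate maximal paths of the converse graph that start at the finish
    node and use every directed edge at most once; each path is returned in
    forward order, so it ends with the finish node."""
    paths = []
    used = set()  # edges on the current path (the set T)

    def dfs(node, path):
        extended = False
        for nxt in reverse_adj.get(node, ()):
            edge = (node, nxt)
            if edge in used:
                continue
            used.add(edge)
            dfs(nxt, path + [nxt])
            used.discard(edge)  # backtrack
            extended = True
        if not extended:        # no unused edge left: the path is maximal
            paths.append(list(reversed(path)))

    dfs(finish, [finish])
    return paths


def judge(path, independent, level):
    """Return the subpaths of `path` that remain candidates: the start node is
    independent (Regulation 1) and its level is not lower than or equal to the
    level of the last node before the finish node (Regulation 3)."""
    receiver = path[-2]
    return [path[i:] for i, node in enumerate(path[:-1])
            if node in independent and not level[node] <= level[receiver]]


# Toy converse graph with made-up node names.
reverse_adj = {"OUTPUT": ["recv"], "recv": ["mid"], "mid": ["send"]}
found = search_paths(reverse_adj)
candidates = judge(found[0], independent={"send", "recv"},
                   level={"send": 2, "mid": 1, "recv": 1})

# A cycle between a and b does not prevent termination.
cyclic = search_paths({"OUTPUT": ["r"], "r": ["a"], "a": ["b"], "b": ["a"]})
```

Restricting each edge to one use per path is what keeps the search finite on cyclic graphs, at the cost of revisiting nodes.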

III. EXAMPLE FILE SYSTEM ANALYSES USING COVERT<br />

FLOW GRAPH<br />

This section presents the results of an example covert channel analysis using the CFG approach described in Section Ⅱ. A brief description of the example system, taken from [6], is included to provide an overview of the basic functions of the primitive operations and attributes. For a more detailed description of the system, the reader is referred to [3], the paper from which the specification was taken. The operation description lists used in the construction of the covert flow graph are also taken from [6].<br />

A. A brief description of the file system example<br />

The attributes of the system include six file attributes, three process attributes, and one attribute associated with the global state of the system: Current_Process. Current_Process contains the ID of the process currently running on the CPU. The file attributes of the system are File_ID, Locked, Locked_By, Value, Security_Class, and In_Use. The three process attributes are Process_ID, Access_Rights and Buffer. The operations are discussed in more detail in the following paragraphs.<br />

The Write_File operation is used by a process to change the contents of a file. If the file is locked by the current process, the value of the file is modified to contain the contents of the current process's buffer.<br />

The Read_File operation is used by a process to interrogate the contents of a file. If the current process is included in the in-use set for the specified file, the value of the file is copied to the current process's buffer.<br />

The Lock_File operation is used by a process that intends to modify the contents of a file. A process must lock a file before modifying it and must unlock the file after the modification is complete. If the current process has write access for the specified file, the file is unlocked, and its in-use set is empty, then the file is locked and its Locked_By attribute is set to the ID of the current process.<br />

The Unlock_File operation makes a file accessible when a process is done modifying its contents. If the Locked_By attribute of the specified file is the current process, the file is unlocked.<br />



The Open_File operation is used by a process to initiate retrieval of the contents of a file. This primitive guarantees that no other process is modifying the contents of the file being interrogated. If the current process has read access for the specified file and the file is not locked, the current process's ID is added to the in-use set for this file.<br />

The Close_File operation is used when a process has completed interrogation of a file and wants to release it so that it can be modified. If the current process's ID is an element of the in-use set for the specified file, it is removed from that set.<br />

The File_Locked operation is used by a process to determine whether a file is locked. If the current process has write access for the specified file, the value true is returned if the file is locked and the value false is returned if the file is unlocked. If the current process lacks write access for the specified file, the result is undefined.<br />

The File_Opened operation is used by a process to determine whether a file is open for reading. If the current process has write access for the specified file, the value true is returned if the file's in-use set is nonempty and the value false is returned if it is empty. If the current process does not have write access for the specified file, the result is undefined.<br />

The View_Buf operation is introduced to explicitly state how a process is allowed to view its buffer attribute.<br />

The lists constructed from the operation descriptions are given in Table Ⅱ.<br />

TABLE II. FILE SYSTEM OPERATION DESCRIPTION LISTS<br />

Operation | Reference List | Modify List | Return List<br />
Write_File | Buffer, Current_Process, Locked_By, Locked | Value | Null<br />
Read_File | Value, Current_Process, In_Use | Buffer | Null<br />
View_Buf | Buffer | Null | Buffer<br />
Lock_File | Current_Process, Access_Rights, Locked, In_Use, Security_Class | Locked, Locked_By | Null<br />
Unlock_File | Locked, Locked_By, Current_Process | Locked | Null<br />
Open_File | Current_Process, Access_Rights, Security_Class, Locked | In_Use | Null<br />
Close_File | Current_Process, In_Use | In_Use | Null<br />
File_Opened | Access_Rights, Security_Class, In_Use | Null | In_Use<br />
File_Locked | Access_Rights, Security_Class, Locked | Null | Locked<br />

B. Example covert flow graph and scenario list for file<br />

system example<br />

Fig. 3 is the CFG constructed for the file system example. To make the analysis easier, the two nodes marked in dark grey are considered first. One of the marked triples describes the information flow produced by executing the operation Close_File. The only node connected to it is the other marked node, which describes the information flow produced by executing the operation Open_File, as shown in Fig. 3. Because Open_File and Close_File form a pair of consecutive operations, they nullify each other's effects when both occur in an operation sequence, and can therefore be removed from the sequence. In the CFG, the dotted edge between these two nodes should be deleted, which leaves the Close_File node with no incoming edge. Because Close_File is a post-executed operation, its node and the edges leaving it can also be deleted from the CFG.<br />

Similarly, Lock_File and Unlock_File form a pair of consecutive operations, but they nullify each other's effects only on the Locked attribute. Therefore only the dotted edge between their nodes should be deleted; the other edge between a Lock_File node and the Unlock_File node should not be deleted.<br />

In Fig. 3, the path drawn with bold black lines is one of the information flow paths found by Algorithm 1. Each subpath of this path that ends at the finish node may be a potential covert channel. Because Read_File, Write_File and Unlock_File depend on other operations, and the security class of Lock_File does not dominate that of View_Buf, subpaths that start with these four operations cannot be used as covert channels. Only the subpaths starting with Open_File can be exploited. The corresponding operation sequences are shown in Fig. 4.<br />

Figure 3. The covert flow graph of example system<br />

Figure 4. Potential covert communication sequences starting with<br />

Open_File



The two sequences in Fig. 4 need further analysis, by constructing covert scenarios, to determine whether they are covert channels. The result is that only sequence (2) is a covert channel. In this covert channel, the user with the high security class chooses whether to invoke Open_File, while another user with a low security class can infer the former user's action by invoking a series of operations.<br />

Fig. 5 shows the covert communication sequences found in the example system by the Covert Flow Graph method. The method finds the six covert channels that were provided by CFT in reference [6], shown as sequences (a)-(f) in Fig. 5. In addition, sequence (g) represents a new covert channel that was not found by CFT. The corresponding covert scenario can be constructed as follows: the sender affects what the receiver observes by choosing whether to invoke Open_File. If the sender invokes Open_File to open a file, the receiver cannot lock the same file. The subsequent Open_File invoked by the receiver then succeeds, and File_Opened returns TRUE to the receiver. Otherwise, the receiver gets FALSE from File_Opened. Therefore, the receiver can detect whether the sender has opened the given file.<br />
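The scenario can be traced with a small simulation. The sketch below makes several assumptions: all access-right checks succeed, only the Locked, Locked_By and In_Use attributes are modelled, and the receiver's call order (Lock_File, Open_File, File_Opened) is our reading of the scenario described above, since the sequence of Fig. 5(g) is not reproduced in the text. The Python names are illustrative.

```python
class File:
    """Minimal file state: only the attributes involved in the scenario."""

    def __init__(self):
        self.locked = False
        self.locked_by = None
        self.in_use = set()


def lock_file(f, pid):
    # Succeeds only if the file is unlocked and its in-use set is empty.
    if not f.locked and not f.in_use:
        f.locked, f.locked_by = True, pid


def open_file(f, pid):
    # Succeeds only if the file is not locked.
    if not f.locked:
        f.in_use.add(pid)


def file_opened(f):
    # Returns True when the in-use set is nonempty.
    return bool(f.in_use)


def receive_bit(sender_opens):
    """Run the scenario once and return what the receiver observes."""
    f = File()
    if sender_opens:
        open_file(f, "sender")   # sender signals by opening the file
    lock_file(f, "receiver")     # fails if the sender opened the file
    open_file(f, "receiver")     # fails if the receiver's lock succeeded
    return file_opened(f)
```

Under these assumptions the receiver observes TRUE exactly when the sender opened the file, which is the one-bit signal described in the scenario.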

Table Ⅲ lists the covert channel analysis results for the above file system obtained with the Shared Resource Matrix, Covert Flow Tree and Covert Flow Graph methods. The SRM method detects only the exploited shared resource attribute, whereas both the CFT and CFG approaches provide detailed covert communication sequences.<br />

Figure 5. Potential covert communication sequences existing in<br />

example system<br />

TABLE III. CORRESPONDENCE BETWEEN CHANNEL ANALYSIS TECHNIQUES FOR THE FILE SYSTEM EXAMPLE<br />

SRM | CFT | CFG<br />
Covert channel using File_Locked to sense changes in Locked | Covert communication sequences A | Covert communication sequences A<br />
Covert channel using Lock_File to sense changes in Locked | Covert communication sequences B | Covert communication sequences B<br />
Covert channel using Lock_File to sense changes in In_Use | Covert communication sequences C, D, E | Covert communication sequences C, D, E, G<br />
Covert channel using File_Opened to sense changes in In_Use | Covert communication sequences F | Covert communication sequences F<br />

IV. COMPARISON AMONG SRM, CFT AND CFG<br />

The Shared Resource Matrix approach has worked well since it was introduced. Its major problem is that it cannot provide the operation sequences that help the analysis of covert channels. In contrast, the CFT approach can present the operation sequences in a tree structure. Compared with CFT, the CFG has the following two advantages:<br />

(1) The CFG can provide almost all information flows of a system in one graph, while the CFT has to construct a tree for every shared attribute that can be modified by operations. The tree structure is usually quite large. For example, the CFT representing the information flow via the attribute In_Use for the file system example used 136 nodes, whereas the CFG in Fig. 3 uses only 11 nodes for all attributes that can be exploited for covert communication.<br />

(2) The CFT construction algorithm depends on a parameter called REPEAT, which bounds the construction of a CFT that would otherwise have infinite tree paths. The parameter defines the number of times any attribute may be repeated in an inference path, thus giving the analyst a way to avoid CPU or memory exhaustion by controlling the depth of the CFT paths. However, an unsuitable value of REPEAT may result in missed covert channels. For example, when REPEAT is set to 0, scenarios D and E would not be discovered. The CFG avoids this problem.<br />
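The node count of the CFG can be checked against Definition 1: one node per modified attribute of each operation, one node per returned attribute, plus the finish node. The sketch below assumes the modify and return lists are as read from Table Ⅱ; the dictionary names and the function `count_cfg_nodes` are illustrative.

```python
# Modify and return lists as read from Table II (an assumption of this sketch;
# reference lists are not needed for counting nodes).
MODIFY = {
    "Write_File": ["Value"],
    "Read_File": ["Buffer"],
    "View_Buf": [],
    "Lock_File": ["Locked", "Locked_By"],
    "Unlock_File": ["Locked"],
    "Open_File": ["In_Use"],
    "Close_File": ["In_Use"],
    "File_Opened": [],
    "File_Locked": [],
}
RETURN = {
    "View_Buf": ["Buffer"],
    "File_Opened": ["In_Use"],
    "File_Locked": ["Locked"],
}


def count_cfg_nodes(modify, returned):
    """Count |S1| + |S2| + |S3| for the given operation lists."""
    s1 = sum(len(attrs) for attrs in modify.values())    # <SAR_i, op_i, v> nodes
    s2 = sum(len(attrs) for attrs in returned.values())  # <v, op_i, output> nodes
    return s1 + s2 + 1                                   # plus the finish node


node_count = count_cfg_nodes(MODIFY, RETURN)
```

With these lists the count is 7 + 3 + 1, which agrees with the 11 nodes quoted for the file system CFG before pruning.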

Notwithstanding these advantages, CFG encounters<br />

problems similar to the CFT approach. In the CFG,<br />

pseudo communication paths still exist. One way to<br />

reduce pseudo communication paths is to consider a finer<br />

relationship between the referenced and modified<br />

attributes and to consider conditional modifies and<br />

references, on which our research group and others are<br />

working.<br />

V. CONCLUSION AND FUTURE WORK<br />

This paper introduces a technique for detecting covert channels. The approach uses covert flow graphs, which present the information flow paths and operation sequences of a system. An algorithm for searching information flow paths and judging potential covert channels is introduced. To illustrate the approach, an example file system is analyzed and the result is compared with the previous channel analysis of the same system using the CFT approach. Compared with the SRM method, the Covert Flow Graph approach can provide operation sequences. At the same time, it avoids the difficulty that the CFT method encounters with its REPEAT parameter.<br />

In future work, other example systems should be analyzed with the Covert Flow Graph approach. The emphasis will be on an automated tool for the construction of covert flow graphs.<br />

ACKNOWLEDGMENT<br />

This work was supported by the National Natural<br />
Science Foundation of China (Grant Nos. 60773049,<br />
61003288), the Ph.D. Programs Foundation of the Ministry<br />
of Education of China (Grant No. 20093227110005), the<br />
Natural Science Foundation of Jiangsu Province (Grant



No. BK2010192), the People with Ability Foundation of<br />
Jiangsu University (Grant No. 07JDG014), and the<br />
Fundamental Research Project of the Natural Science in<br />
Colleges of Jiangsu Province (Grant No.<br />
08KJD520015).<br />

REFERENCES<br />

[1] D. E. Bell, L. J. LaPadula, “Secure Computer Systems:<br />
Unified Exposition and Multics Interpretation,” Mitre<br />
Corp., Bedford, MA, Tech. Rep. ESD_TR_75_306, 1975.<br />
[2] R. A. Kemmerer, P. A. Porras, “Covert Flow Trees: a<br />
Visual Approach to Analyzing Covert Storage Channels,”<br />
IEEE Transactions on Software Engineering, vol. 17, no.<br />
11, pp. 1166-1185, Nov. 1991.<br />
[3] R. A. Kemmerer, “Shared Resource Matrix Methodology:<br />
an Approach to Identifying Storage and Timing Channels,”<br />
ACM Transactions on Computer Systems, vol. 1, no. 3, pp.<br />
256-277, Aug. 1983.<br />
[4] J. Goguen, J. Meseguer, “Security Policies and Security<br />
Models,” in Proc. 1982 Symposium on Security and<br />
Privacy, pp. 11-20, IEEE Press, New York, 1982.<br />
[5] D. E. Denning, “A Lattice Model of Secure Information<br />
Flow,” Communications of the ACM, vol. 19, no. 5, pp.<br />
236-243, May 1976.<br />
[6] P. A. Porras, R. A. Kemmerer, “Covert Flow Tree Analysis<br />
Approach to Covert Storage Channel Identification,”<br />
Comput. Sci. Dept., Univ. California, Santa Barbara, Tech.<br />
Rep. No. TRCS 90-26, Dec. 1990.<br />
[7] S. H. Qing, J. F. Zhu, “Covert Channel Analysis on<br />
ANSHENG Secure Operating System,” Journal of<br />
Software, vol. 15, no. 9, pp. 1385-1392, 2004.<br />
[8] J. McHugh, “Handbook for the Computer Security<br />
Certification of Trusted Systems - Covert Channel<br />
Analysis,” Technical Report, Naval Research Laboratory,<br />
Feb. 1996.<br />
[9] J. J. Shen, S. H. Qing, Q. N. Shen, L. P. Li, “Covert Channel<br />
Identification Founded on Information Flow Analysis,”<br />
Lecture Notes in Computer Science, vol. 3802, pp. 381-387,<br />
2005.<br />
[10] J. J. Shen, S. H. Qing, Q. N. Shen, L. P. Li, “Optimization of<br />
Covert Channel Identification,” in Proc. Third<br />
IEEE International Security in Storage Workshop<br />
(SISW’05), 13 Dec. 2005.<br />
[11] J. Zeng, S. G. Ju, X. M. Song, “Construct Information Flow<br />
Graph Based on PDG,” Computer Science and<br />
Computational Technology, vol. 1, pp. 756-759, 20-22<br />
Dec. 2008.<br />
[12] Y. J. Wang, J. Z. Wu, H. T. Zeng, L. P. Ding, X. F. Liao,<br />
“Covert Channel Research,” Journal of Software, vol. 21,<br />
no. 9, pp. 2262-2288, Sep. 2010.<br />

© 2011 ACADEMY PUBLISHER<br />

XiangMei Song was born in Jilin Province, China,<br />
in November 1979. She is a doctoral student in Computer<br />
Science at the School of Computer Science and<br />
Telecommunication Engineering, Jiangsu<br />
University. Her research interests are in<br />
information security.<br />
She is a senior lecturer in the Department of<br />
Information Security, School of Computer<br />
Science and Telecommunication Engineering, Jiangsu<br />
University.<br />

ShiGuang Ju was born in Jiangsu Province, China,<br />
in May 1955. He received his Ph.D. in Computer<br />
Science from the National<br />
Polytechnic Institute (Mexico). His<br />
research interests include information security<br />
and databases.<br />
He is a professor and Ph.D. supervisor<br />
at the School of Computer Science and<br />
Telecommunication Engineering, Jiangsu<br />
University.


JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011 1747<br />

A Novel HAVE Message of Peer-to-peer<br />
Protocol in BitTorrent Systems<br />

Jianyong Li<br />
School of Computer & Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, China<br />
Email: lijianyong@zzuli.edu.cn<br />

Jianchun Li, Daoying Huang and Qiang Wei<br />
School of Computer & Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, China<br />
Email: lijianchun@zzuli.edu.cn, dyhuang@zzuli.edu.cn, weiqiang200456@163.com<br />

Abstract—In BitTorrent systems, there are eleven types of<br />
messages for data communication between peers, among<br />
which the HAVE, REQUEST and PIECE messages are the<br />
three main contributors in terms of quantity and flow.<br />
In order to improve the efficiency of network transmission<br />
and decrease the management costs of file delivery, this<br />
paper investigates the mechanism of the HAVE message in<br />
BitTorrent systems and proposes a novel MultiHAVE<br />
message scheme, which aggregates several HAVE messages<br />
by means of a properly set timer. Experimental results show that in<br />
an environment of high bandwidth and consistent peers,<br />
with the assistance of the timer, the flow ratio of<br />
MultiHAVE messages to HAVE messages can be reduced to<br />
11%, so the MultiHAVE message can decrease the flow of<br />
messages and efficiently prevent the HAVE message storm.<br />
Furthermore, the MultiHAVE message can adapt itself to<br />
various BT systems with various bandwidths. If the behavior<br />
of network peers is inconsistent, it degenerates to the<br />
original HAVE message and preserves the high performance of<br />
BitTorrent systems.<br />

Index Terms—BitTorrent, protocol, peer-to-peer networks,<br />
MultiHAVE message, performance analysis<br />

I. INTRODUCTION<br />

BitTorrent (BT) is a peer-to-peer (P2P) protocol<br />
designed to distribute and replicate data quickly,<br />
efficiently and fairly [1-2]. It is based on technological<br />
principles similar to those of other P2P downloading<br />
software. In a BT system, each peer is a client as well as a<br />
server, so the more people download a file, the faster<br />
the download becomes. Numerous practical results have verified the<br />
flexibility, efficiency and reliability of BT systems [3].<br />

However, the wide use of BT systems may result<br />
in message storms and decrease communication<br />
efficiency. Recent studies showed that the proportion of<br />
P2P traffic on backbone links has increased from 10%<br />
to 80% [4-6], and that BitTorrent traffic increased from<br />
26% to 52% of the total P2P traffic during the first half of<br />
2004 and even reached 60% in 2005, according to the<br />
report of CacheLogic [4]. Due to the extensive use of BT<br />
systems and the congestion of local networks, many ISPs<br />
began to constrain the use of BT systems.<br />

However, some <strong>of</strong> the original file-distributing services<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1747-1753<br />

based on the central servers need to invoke the support <strong>of</strong><br />

BT systems.<br />

In order to improve the performance of BT systems,<br />
much research has been carried out to modify the<br />
existing BitTorrent mechanisms. Qureshi [7] suggested<br />
using proximity in the BitTorrent overlay network, so that<br />
peers that are close by in the real world are also<br />
close by in the overlay network. Bindal et al. [8] proposed<br />
a new algorithm based on biased neighbor selection for<br />
the cross-ISP problem. In [9], Yamazaki et al. put forward<br />
so-called Cost-Aware BitTorrent strategies to reduce<br />
ISP costs. To improve the piece exchange mechanism,<br />
Garbacki et al. [10] proposed a protocol named 2Fast,<br />
which extended the bartering model of BitTorrent, and<br />
Garbacki et al. [11] extended it further by proposing a novel<br />
mechanism in which incentives are built around<br />
bandwidth rather than content. Noting that a free-rider<br />
is a node that downloads pieces from other peers but does<br />
not upload any pieces to others, Sirivianos et al. [12]<br />
presented a new free-riding technique named the large<br />
view exploit and suggested a modification to the<br />
BitTorrent tracker and clients to address the problem.<br />

In this paper, we investigate the performance<br />
enhancement of BitTorrent systems by reducing the<br />
management costs. In BitTorrent systems,<br />
there are eleven types of messages for data<br />
communication between peers, and the management costs<br />
depend mainly on the HAVE, REQUEST<br />
and PIECE messages. In some specific<br />
applications, management costs even reach 23% [13]. A<br />
HAVE message is sent once the peer has received an<br />
entire piece and verified it against the corresponding hash value in<br />
the torrent file. The purpose of the message is to inform<br />
all the connected peers so that they can update the<br />
downloaded-piece information that was announced by the<br />
BITFIELD message in the HANDSHAKE stage. In a BT<br />
system, the tracker can return up to<br />
50 peers because numerous peers join the system.<br />
Correspondingly, the number and flow of<br />
HAVE messages will increase quickly and may result in a<br />
HAVE message storm.<br />

In fact, sending HAVE messages to all peers at a high<br />
frequency cannot improve other peers’ downloading rate.



So reducing the frequency of HAVE messages can not<br />
only relieve the burden on peers of receiving and sending<br />
HAVE messages but also reduce the network bandwidth<br />
costs. With this in mind, in this paper we<br />
propose a novel HAVE message mechanism, the<br />
MultiHAVE message, to improve the efficiency of<br />
network transmission and decrease the management costs<br />
of file delivery. The proposed MultiHAVE message<br />
aggregates several HAVE messages by means of a properly set<br />
timer. The regular sending scheme of the MultiHAVE<br />
message is analyzed as well. In order to show the<br />
effectiveness of the proposed mechanism, we compare<br />
the performance of the MultiHAVE message and the HAVE<br />
message. Experimental results show that in an<br />
environment of high bandwidth and consistent peers, the<br />
flow ratio of MultiHAVE messages to HAVE messages<br />
is reduced to 11%. The proposed MultiHAVE message<br />
can therefore effectively decrease the number of HAVE messages<br />
and reduce the management costs of a BT system.<br />

The rest of this paper is organized as follows. In<br />
Section 2, we propose a novel structure for the MultiHAVE<br />
message and describe the regular sending scheme of the<br />
MultiHAVE message. In Section 3, we compare the<br />
performance of the MultiHAVE message and the HAVE message.<br />
Experimental results are given in Section 4 to verify the<br />
efficiency of the proposed scheme. Section 5 summarizes<br />
the paper and draws the conclusion.<br />

II. STRUCTURE OF MULTIHAVE MESSAGE AND REGULAR<br />

SENDING SCHEME OF MULTIHAVE MESSAGE<br />

A. Structure of MultiHAVE Message<br />

The purpose of the HAVE message is to inform all the<br />
connected peers so that they can update the downloaded-piece<br />
information that was announced by the BITFIELD<br />
message in the HANDSHAKE stage. Sending HAVE<br />
messages to all peers at a high frequency cannot improve<br />
other peers’ downloading rate. In this subsection we<br />
propose a new HAVE message mechanism, which<br />
aggregates several HAVE messages by means of a properly set timer.<br />
Noting that, apart from HANDSHAKE and KEEP ALIVE,<br />
the other 9 messages have a 4B message prefix and a 1B<br />
message ID, the structure of the MultiHAVE message can be<br />
formulated as follows:<br />
(1) A 4B message prefix. The message prefix gives the<br />
size in bytes of the message ID and the payload of the MultiHAVE<br />
message. Its value is n × 4 + 1, where n is the<br />
number of piece indices in the payload.<br />
(2) A 1B message ID. The largest message ID in the current<br />
BT system is 8. Here the value is declared as 9.<br />
(3) Payload. The length is n × 4 B, where n is the<br />
number of pieces. Each 4B field represents the index of a<br />
piece.<br />
The comparison between the MultiHAVE message and the<br />
HAVE message is shown in TABLE I.<br />
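The three fields listed above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names `encode_multihave` and `decode_multihave` are hypothetical, and big-endian byte order is assumed because the existing BitTorrent peer messages use it.

```python
import struct

MULTIHAVE_ID = 9  # message ID declared for MultiHAVE in the text


def encode_multihave(piece_indices):
    """Pack piece indices into one MultiHAVE message (prefix + ID + payload)."""
    n = len(piece_indices)
    if n == 0:
        raise ValueError("a MultiHAVE message needs at least one piece index")
    length_prefix = 4 * n + 1  # 1B message ID plus n 4B piece indices
    # ">" selects big-endian without padding (byte order is an assumption)
    return struct.pack(f">IB{n}I", length_prefix, MULTIHAVE_ID, *piece_indices)


def decode_multihave(message):
    """Return the list of piece indices carried by a MultiHAVE message."""
    length_prefix, message_id = struct.unpack_from(">IB", message, 0)
    if message_id != MULTIHAVE_ID:
        raise ValueError("not a MultiHAVE message")
    if length_prefix != len(message) - 4 or (length_prefix - 1) % 4 != 0:
        raise ValueError("malformed length prefix")
    n = (length_prefix - 1) // 4
    return list(struct.unpack_from(f">{n}I", message, 5))
```

With three piece indices the message is 4 + 1 + 12 = 17 bytes long and its length prefix is 13; with a single index it has the same 9-byte size as an ordinary HAVE message.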

B. Regular Sending Scheme of MultiHAVE Message<br />

The purpose of sending a HAVE message is to notify<br />
other peers of the local peer’s downloaded-piece state, so that<br />
they can update the downloaded-piece information<br />
that was announced by the BITFIELD message in the<br />
HANDSHAKE stage.<br />

TABLE I.<br />
THE COMPARISON OF MULTIHAVE MESSAGE AND HAVE MESSAGE<br />
Message name | Length prefix | Message ID | Payload<br />
HAVE | 0005 | 4 | Integer/4B<br />
MultiHAVE | Payload + 1 | 9 | Variable length<br />

Conventionally, when a peer receives a piece, a<br />
HAVE message is sent to tell all the connected peers that<br />
it has the piece. As the number of connected peers<br />
increases, the number of HAVE messages grows as fast as<br />
$O(n^2)$, where n is the number of connected peers. In<br />
particular, in a high-bandwidth network<br />
environment, during a choking cycle (10s) or an<br />
optimistic unchoking cycle (30s), a high-speed peer may<br />
receive hundreds of MB of data. With a typical<br />
piece size of 256KB, the receiving peer will send<br />
400 or 1200 HAVE messages to each of its connected peers<br />
in 10s or 30s. If the default number of connections is 50,<br />
the peer sends a total of 20000 or 60000 HAVE<br />
messages, 2000 per second on average. Obviously, under<br />
these circumstances, a serious HAVE message storm will<br />
appear at the receiving peer. The above<br />
calculation covers only the HAVE messages that one receiving<br />
peer sends. In fact, each peer behaves similarly<br />
because the relationship between them is symmetrical. If<br />
each peer sends and receives equal amounts of<br />
data in a period, the entire bandwidth is shared by<br />
uploading and downloading. Then the data that each peer<br />
receives is halved and the frequency at which a peer<br />
sends HAVE messages is consequently reduced to 1000 per<br />
second. It should be noted that, due to the<br />
symmetry of peer behavior (called peer action consistency),<br />
in this period each peer also receives a total of 1000<br />
HAVE messages per second from the other 50 peers, which<br />
amounts to 51000 HAVE messages per second among<br />
51 peers. Clearly, such dense HAVE message<br />
transmission will seriously affect the performance of the entire<br />
network.<br />
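The arithmetic behind these figures can be retraced with a few lines of Python. The sketch below only restates the numbers quoted in the paragraph; the function name `have_messages_per_second` is a hypothetical helper introduced for illustration.

```python
def have_messages_per_second(pieces_received, cycle_seconds, connections):
    """HAVE messages a peer sends per second when every received piece
    is announced to every connected peer."""
    return pieces_received * connections / cycle_seconds


# Figures quoted in the text: 400 pieces in a 10 s choking cycle or
# 1200 pieces in a 30 s optimistic unchoking cycle, with 50 connections.
sent_10s = have_messages_per_second(400, 10, 50)
sent_30s = have_messages_per_second(1200, 30, 50)

# 400 pieces of 256 KB correspond to 100 MB received within the cycle.
data_mb = 400 * 256 / 1024

# With bandwidth split evenly between upload and download, each peer
# receives half as many pieces, so it sends half as many HAVE messages.
sent_balanced = sent_10s / 2

# Under peer action consistency, each of the 51 peers (one peer plus its
# 50 neighbours) sends at that rate, which gives the system-wide total.
system_total = sent_balanced * 51
```

Both cycle lengths give the same rate of 2000 messages per second, which is why the text quotes a single average.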

When a small number of low-bandwidth peers and a<br />
large number of high-bandwidth peers coexist in a BT<br />
system, the high-bandwidth peers may send a mass of<br />
HAVE messages in a period. The low-bandwidth<br />
peers must receive and handle these HAVE<br />
messages. The large number of HAVE messages<br />
will occupy their valuable bandwidth and block<br />
the PIECE messages, which carry the real data. In<br />
serious cases, the low-bandwidth peers may not<br />
download any data for a long time. In other words, in<br />
a network where large numbers of high-bandwidth peers<br />
are constantly joining, the low-bandwidth peers are<br />
likely to be hit by the HAVE message storm.<br />

In order to avoid creating a new MultiHAVE<br />
message storm, the frequency of sending MultiHAVE<br />
messages should be taken into consideration when<br />
deciding the payload of the MultiHAVE message. In<br />
practice, it can be managed by a timer. When the timer



times out, the peer aggregates all the HAVE messages<br />
produced by the newly received pieces, composes them<br />
into one MultiHAVE message and sends it. At the<br />
same time the timer starts the next round of timing.<br />
In contrast to the 10s choking cycle and the 30s<br />
optimistic unchoking cycle, if a long<br />
MultiHAVE cycle (such as 30s) is chosen, two connected<br />
high-bandwidth peers are likely to send NOT INTERESTED<br />
messages and choke each other, because they may not<br />
learn of each other’s new pieces in time within the 10s<br />
cycle. Low-bandwidth peers (56k modem), with a piece size<br />
of 256KB, cannot get a complete piece within<br />
one cycle; once a complete piece has been obtained, the<br />
next timer expiry sends a MultiHAVE message.<br />
Given the length of the interval, the MultiHAVE<br />
messages that a low-bandwidth peer sends always<br />
carry only one piece index as payload. In this case, the<br />
MultiHAVE message degenerates to the original HAVE<br />
message and does not affect downloading performance.<br />
To summarize, the principles for choosing the timer<br />
value are as follows:<br />
(1) It must prevent a new MultiHAVE message storm;<br />
(2) It must not exceed the choking cycle.<br />
Based on these two principles, a 5s (less than<br />
10s) interval is chosen for the MultiHAVE scheme.<br />
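The timer-driven aggregation described above might look like the following Python sketch. It is a simplified illustration rather than the authors' client code: the class `HaveAggregator` and its method names are hypothetical, time is passed in explicitly instead of using a real timer, and the byte layout assumes big-endian fields as in existing BitTorrent messages.

```python
import struct

MULTIHAVE_ID = 9        # message ID given in the text for MultiHAVE
TIMER_INTERVAL = 5.0    # seconds, the interval chosen in the text


class HaveAggregator:
    """Collects piece indices and emits one MultiHAVE per timer interval."""

    def __init__(self, interval=TIMER_INTERVAL, start_time=0.0):
        self.interval = interval
        self.next_flush = start_time + interval
        self.pending = []  # indices of pieces verified since the last flush

    def on_piece_verified(self, piece_index):
        """Queue the index instead of sending a HAVE immediately."""
        self.pending.append(piece_index)

    def on_tick(self, now):
        """Return an encoded MultiHAVE if the timer expired, else None."""
        if now < self.next_flush:
            return None
        self.next_flush = now + self.interval  # restart the timer
        if not self.pending:
            return None  # nothing new to announce in this round
        indices, self.pending = self.pending, []
        n = len(indices)
        return struct.pack(f">IB{n}I", 4 * n + 1, MULTIHAVE_ID, *indices)
```

A caller would invoke `on_piece_verified` after each hash check and `on_tick` from its event loop, sending any returned bytes to every connected peer. A peer that completes at most one piece per interval then emits 9-byte messages, the same size as an ordinary HAVE.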

III. PERFORMANCE COMPARISON OF MULTIHAVE<br />

MESSAGE AND HAVE MESSAGE<br />

In this section, we calculate the frequency and flow of MultiHAVE<br />
and HAVE messages and compare their performance.<br />

First, let n be the number of peers<br />
connected with peer A. The frequency of MultiHAVE<br />
messages sent by peer A is<br />

$$f_{MS}(n) = n \cdot \frac{1}{T_I}, \qquad (1)$$<br />

where $T_I$ is the timer interval of the MultiHAVE message.<br />
The frequency of MultiHAVE messages received by peer<br />
A can be formulated as<br />

$$f_{MR}(n) = \sum_{i=1}^{n} F_{M_i}, \qquad (2)$$<br />

where $F_{M_i}$ is the frequency at which the i-th peer connected<br />
with peer A sends MultiHAVE messages to peer<br />
A. The flow of MultiHAVE messages sent by peer A is<br />

$$fl_{MS}(n) = \frac{4 B_d}{S_p} n + \frac{n}{T_I}(O_P + 5), \qquad (3)$$<br />

with $B_d$ being the download bandwidth of peer A, $S_p$<br />
being the size of a piece, and $O_P$ being the size of the<br />
TCP/IP header, which is 40B.<br />

The flow of MultiHAVE messages received by peer A<br />
is<br />

$$fl_{MR}(n) = \sum_{i=1}^{n} FL_{M_i}, \qquad (4)$$<br />

where $FL_{M_i}$ is the flow of MultiHAVE messages that the i-th peer connected with<br />
peer A sends to peer A.<br />

Similarly, the frequency of HAVE messages sent by peer<br />
A is<br />

$$f_{HS}(n) = \frac{B_d}{S_p} n. \qquad (5)$$<br />

The frequency of HAVE messages received by peer A is<br />

$$f_{HR}(n) = \sum_{i=1}^{n} F_{H_i}, \qquad (6)$$<br />

where $F_{H_i}$ is the frequency at which the i-th peer connected<br />
with peer A sends HAVE messages to peer A.<br />

The flow of HAVE messages sent by peer A is<br />

$$fl_{HS}(n) = f_{HS}(n) \times (4 + O_P + 5). \qquad (7)$$<br />

The flow of HAVE messages received by peer A can be<br />
presented as<br />

$$fl_{HR}(n) = \sum_{i=1}^{n} FL_{H_i}, \qquad (8)$$<br />

where $FL_{H_i}$ is the flow of HAVE messages that the i-th peer connected with<br />
peer A sends to it.<br />

Supposing that the peer actions are consistent, we have<br />

$$f_{MS}(n) = f_{MR}(n) = n \cdot \frac{1}{T_I}, \qquad (9)$$<br />

$$fl_{MS}(n) = fl_{MR}(n) = \frac{4 B_d}{S_p} n + \frac{n}{T_I}(O_P + 5), \qquad (10)$$<br />

$$f_{HS}(n) = f_{HR}(n) = \frac{B_d n}{S_p}, \qquad (11)$$<br />

$$fl_{HS}(n) = fl_{HR}(n) = f_{HS}(n) \times (4 + O_P + 5). \qquad (12)$$<br />

The ratio of the frequency of sending HAVE messages<br />
to the frequency of sending MultiHAVE messages is as<br />
follows:<br />

$$\frac{f_{HS}(n)}{f_{MS}(n)} = \frac{B_d n / S_p}{n / T_I} = \frac{B_d T_I}{S_p}. \qquad (13)$$<br />

The ratio of the flow of sending HAVE messages to the<br />
flow of sending MultiHAVE messages is as follows:



$$\frac{fl_{HS}(n)}{fl_{MS}(n)} = \frac{f_{HS}(n) \times (4 + O_P + 5)}{\frac{n}{T_I}\left(\frac{4 T_I B_d}{S_p} + O_P + 5\right)} = \frac{\frac{B_d n}{S_p} \times (O_P + 9)}{\frac{n}{T_I}\left(\frac{4 T_I B_d}{S_p} + O_P + 5\right)} = \frac{T_I B_d (O_P + 9)}{4 T_I B_d + S_p (O_P + 5)}. \qquad (14)$$<br />

According to (13) and (14), if the peers’ actions are<br />
consistent, the improvement of the MultiHAVE message over the<br />
HAVE message depends on the download bandwidth<br />
$B_d$, the piece size $S_p$ and the timer interval of the<br />
MultiHAVE message $T_I$. Once these parameters have<br />
been fixed, the ratio of the frequency of sending HAVE<br />
messages to the frequency of sending MultiHAVE messages is<br />
constant. The ratio of the flow of sending HAVE messages<br />
to the flow of sending MultiHAVE messages is constant,<br />
too.<br />

For example, suppose that each peer has a maximum<br />
upload and download speed of 5MB/s, the piece size is<br />
256KB, the timer interval is 5s and the downloaded files<br />
are big enough. Further suppose that the peers’ actions are<br />
consistent and ignore the seed peers. Then the frequency<br />
of sending and receiving MultiHAVE messages and<br />
HAVE messages and the flow of MultiHAVE and HAVE<br />
messages are as shown in Figure 1 and Figure 2,<br />
respectively.<br />

As can be seen from Figure 1 and Figure 2, when the<br />
download bandwidth $B_d$ = 5MB/s, the piece size<br />
$S_p$ = 256KB and the MultiHAVE timer interval<br />
$T_I$ = 5s, the ratio of the frequency of sending HAVE<br />
messages to that of sending MultiHAVE messages is 100, and<br />
the ratio of the flow of sending HAVE messages to that of<br />
sending MultiHAVE messages is 11.01. The frequency<br />
and flow of sending MultiHAVE messages are thus<br />
much lower than those of HAVE messages. The<br />
improvement in frequency is mainly due to the MultiHAVE<br />
message sending the payloads of HAVE messages in<br />
aggregate, and the improvement in flow comes from saving<br />
the 40B TCP/IP header overhead of the HAVE messages<br />
that would otherwise be sent repeatedly.<br />
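The two quoted ratios can be reproduced from the closed forms (13) and (14). The Python sketch below assumes binary units (1MB = 1024KB, 1KB = 1024B), since that reading yields exactly the quoted frequency ratio of 100; the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def frequency_ratio(b_d, s_p, t_i):
    """Equation (13): HAVE send frequency over MultiHAVE send frequency."""
    return b_d * t_i / s_p


def flow_ratio(b_d, s_p, t_i, o_p=40):
    """Equation (14): HAVE send flow over MultiHAVE send flow."""
    return t_i * b_d * (o_p + 9) / (4 * t_i * b_d + s_p * (o_p + 5))


# Parameters of the example: B_d = 5 MB/s, S_p = 256 KB, T_I = 5 s.
freq = frequency_ratio(5 * MB, 256 * KB, 5)
flow = flow_ratio(5 * MB, 256 * KB, 5)
print(freq, round(flow, 2))  # prints: 100.0 11.01
```

Neither ratio depends on n, the number of connected peers, because n cancels in both (13) and (14).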

It needs to be pointed out that these conclusions are<br />
based on the assumption that the peers are high-bandwidth<br />
peers and their actions are consistent. In an actual<br />
network environment, peers often have different<br />
bandwidths, that is, high-bandwidth peers and low-bandwidth<br />
peers coexist, and the time when each peer<br />
joins the BT system also differs. In such cases, the peers<br />
lose coherence and show diversity. For high-bandwidth<br />
peers, because each peer joins the BT<br />
system at a different time, the download bandwidth may not be fully<br />
used, which reduces the<br />
payload of the MultiHAVE message; thus the ratio of the<br />
frequency of sending HAVE messages to that of<br />
MultiHAVE messages will drop considerably, and the ratio of the<br />
flow will also drop. When the two ratios are reduced to<br />
1, the MultiHAVE message degenerates to the HAVE message. In<br />
addition, for low-bandwidth peers, the time needed to<br />
download a piece is often longer than the timer<br />
interval of the MultiHAVE message, so the MultiHAVE<br />
message again degenerates to the HAVE message. But whatever<br />
the circumstances, the HAVE message storm in the BT<br />
system will be prevented. In fact, along with the<br />
continuous improvement of the network environment,<br />
more and more peers will have high bandwidth,<br />
so the MultiHAVE message scheme will<br />
play an increasingly effective role.<br />
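The low-bandwidth case can be checked with a short calculation. The Python sketch below assumes that a "56k modem" delivers 56 kbit/s with no protocol overhead, which is an idealization; the constant and function names are introduced here for illustration only.

```python
PIECE_SIZE = 256 * 1024      # bytes, piece size used in the text
TIMER_INTERVAL = 5.0         # seconds, MultiHAVE timer interval from the text
MODEM_RATE = 56_000 / 8      # bytes/s, assuming 56 kbit/s and no overhead

# Time a low-bandwidth peer needs to download one complete piece.
seconds_per_piece = PIECE_SIZE / MODEM_RATE

HAVE_SIZE = 4 + 1 + 4        # length prefix + message ID + one piece index


def multihave_size(n):
    """Size in bytes of a MultiHAVE message carrying n piece indices."""
    return 4 + 1 + 4 * n
```

Under these assumptions one piece takes roughly 37 seconds, far longer than the 5s timer, so each MultiHAVE message from such a peer carries a single index and has the same 9-byte size as a HAVE message.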

[Figure 1: logarithmic-scale plot of FSR (times/s) versus n (6 to 15) for the MultiHAVE message and the HAVE message]<br />
Figure 1. Frequency of sending and receiving (FSR)<br />
of MultiHAVE and HAVE message<br />

[Figure 2: logarithmic-scale plot of Flow (B/s) versus n (6 to 15) for the MultiHAVE message and the HAVE message]<br />
Figure 2. Flow of MultiHAVE and HAVE message<br />

IV. EXPERIMENT<br />

In this section, some experiments are carried out to<br />
illustrate the effectiveness of the proposed MultiHAVE<br />
message scheme. For the experiment parameters we<br />
refer to the first BitTorrent client developed by Bram<br />
Cohen, the inventor of the protocol [2]. The main<br />
parameters and their default values are as follows:<br />
(1) The maximum upload rate: no limitation;<br />
(2) The minimum number of peers in the peer set<br />
before requesting more peers from the tracker: 20 by<br />
default;<br />
(3) The maximum number of connections the local<br />
peer can initiate: 50 by default;<br />
(4) The maximum number of peers in the peer set:<br />
80 by default;<br />
(5) The number of peers in the active peer set<br />
including the optimistic unchokes: 4 by default;<br />
(6) The block size: set to 2MB;<br />
(7) The number of pieces downloaded before<br />
switching from random to rarest-first piece selection:<br />
4 by default.<br />
In addition, the downloaded file is 2.15GB in size, the<br />
torrent file is 43.1KB and the downloaded file is<br />
divided into 2205 pieces.<br />

The experimental evaluation of the BitTorrent protocol<br />
is very complex and an experiment is not exactly reproducible,<br />
as it heavily depends on the behavior of peers, the<br />
number of seeds and leechers in the torrent, and the<br />
subset of peers randomly returned by the tracker.<br />
However, by choosing a large variety of peers and<br />
designing the experiment process carefully, we can<br />
identify the fundamental behaviors of the BitTorrent<br />
protocol.<br />

During the experiment, ten kinds of messages are exchanged<br />
among the peers of the BT system. All the messages are sent over TCP.<br />
The size of each message is given including the TCP/IP header<br />
overhead of 40B. The details of each message are shown<br />
in TABLE II.<br />

TABLE II.<br />
THE COMPARISON OF MULTIHAVE MESSAGE AND HAVE MESSAGE<br />
Message name | Message size/B | Function<br />
HANDSHAKE | 108 | Initiate a connection<br />
CHOKE | 45 | Choke the remote peer<br />
UNCHOKE | 45 | Unchoke the remote peer<br />
INTERESTED | 45 | Interested in the remote peer<br />
NOT INTERESTED | 45 | Not interested in the remote peer<br />
HAVE | 49 | Announce to each remote peer when the local peer has received a new piece<br />
BITFIELD | $\lceil Number_{piece} / 8 \rceil + 45$ | Notify the remote peer of the pieces the local peer already has<br />
REQUEST | 47 | Request data from the remote peer<br />
PIECE | $Length_{piece} + 53$ | Send data to the remote peer<br />
CANCEL | 47 | Cancel request message<br />

The BT systems used in the experiment consist of 1 seed<br />
and 5 downloaders, 1 seed and 10 downloaders, and 1<br />
seed and 20 downloaders, each run with the classical HAVE<br />
message and with the MultiHAVE message proposed in this<br />
paper. Experimental results are<br />
shown in Figures 3 to 5, where the “Before Extension”<br />
and “After Extension” columns describe the message<br />
flow of the BT system with the conventional HAVE message and the<br />
MultiHAVE message, respectively.<br />

Figure 3. Bytes per type of message with 1 seed and 5 downloaders. [Bar chart; vertical axis: bytes on a logarithmic scale from 1E+00 to 1E+10; series: After Extension, Before Extension; horizontal axis: message types HS, C, UC, I, NI, H, BF, R, P, CA.]<br />

Figure 4. Bytes per type of message with 1 seed and 10 downloaders. [Bar chart; vertical axis: bytes on a logarithmic scale from 1E+00 to 1E+10; series: After Extension, Before Extension; horizontal axis: message types HS, C, UC, I, NI, H, BF, R, P, CA.]<br />

Figure 5. Bytes per type of message with 1 seed and 20 downloaders. [Bar chart; vertical axis: bytes on a logarithmic scale from 1E+00 to 1E+10; series: After Extension, Before Extension; horizontal axis: message types HS, C, UC, I, NI, H, BF, R, P, CA.]<br />

It can be seen that in each case, although the flux of NOT INTERESTED, BITFIELD and other messages changes little in the BT systems with the proposed MultiHAVE message, the flux of HAVE messages is reduced by approximately 89%. Furthermore, the flux of BITFIELD messages is reduced to about half of that in the BT systems with the original HAVE message. The proposed MultiHAVE message scheme can therefore reduce the total message amount in the BT systems and hence decrease the management costs.<br />

It should be pointed out that the above experiments are carried out in BT systems with high-bandwidth, consistent peers, and that timers are used to assemble the MultiHAVE message. In a real network environment, peers often have different bandwidths; that is, high-bandwidth and low-bandwidth peers coexist in the system. Furthermore, we cannot require all peers in the network to join the BT system at the same time; in practice they join at random times. In such cases the peers lose coherence. For high-bandwidth peers, because each peer joins the BT system at a different time, the download flow may not be full. This reduces the payload of the MultiHAVE message, so the ratio of the sending frequency of HAVE messages to that of MultiHAVE messages drops considerably, and so does the ratio of the flows. When the two ratios fall to 1, the MultiHAVE message degenerates to the original HAVE message. For low-bandwidth peers, the time needed to download each piece is often longer than the timer interval of the MultiHAVE message, so the MultiHAVE message again degenerates to the HAVE message. So, whatever the circumstances, the HAVE message storm in the BT system can be considerably mitigated.<br />
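The batching and degeneration behavior described above can be illustrated with a small sketch. The HAVE size of 49 B and the 40 B TCP/IP overhead come from TABLE II; the MultiHAVE wire layout used here (a 4-byte length prefix, a 1-byte message id and one 4-byte piece index per batched HAVE) is an assumption made only for illustration, as are all function names.

```python
HAVE_SIZE = 49        # bytes per HAVE message, from TABLE II
TCP_IP_HEADER = 40    # bytes of TCP/IP overhead, from TABLE II


def multihave_size(num_indices):
    # ASSUMED layout (not from the paper): 4-byte length prefix,
    # 1-byte message id, then one 4-byte piece index per batched HAVE.
    return TCP_IP_HEADER + 4 + 1 + 4 * num_indices


def batch_by_timer(completion_times_ms, interval_ms):
    """Group piece-completion times into consecutive timer windows."""
    windows = {}
    for t in completion_times_ms:
        windows.setdefault(t // interval_ms, []).append(t)
    return [windows[w] for w in sorted(windows)]


def announce_bytes(completion_times_ms, interval_ms):
    """Bytes sent to one neighbour with plain HAVE vs. batched MultiHAVE."""
    have = HAVE_SIZE * len(completion_times_ms)
    batches = batch_by_timer(completion_times_ms, interval_ms)
    multi = sum(multihave_size(len(b)) for b in batches)
    return have, multi


fast_peer = list(range(0, 1000, 100))   # 10 pieces within one 1000 ms window
slow_peer = [0, 2000, 4000]             # one piece every 2000 ms
print(announce_bytes(fast_peer, 1000))
print(announce_bytes(slow_peer, 1000))
```

Under these assumptions, the fast peer sends 85 B instead of 490 B per neighbour, while the slow peer, whose pieces arrive more slowly than the timer fires, sends exactly the same 147 B either way. This is the degeneration to the plain HAVE message.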

In fact, as the network environment continues to improve, more and more peers will have high bandwidth, so the MultiHAVE message scheme can work effectively to prevent the HAVE message storm in BT systems.<br />

V. CONCLUSION AND FURTHER WORK<br />

In this paper we propose a novel HAVE message scheme, the MultiHAVE message, to prevent possible message storms in BT systems. A MultiHAVE message combines several HAVE messages collected over a suitably set timer interval. By adjusting the timer interval, we can change the size of the MultiHAVE message. We compare the performance of the proposed MultiHAVE message with that of the conventional HAVE message to illustrate its effectiveness. Experiments on BT systems with high-bandwidth, consistent peers show that the proposed MultiHAVE message scheme can significantly reduce the flow of HAVE messages, thus reducing the management costs of the BT system and effectively preventing the HAVE message storm. When the behavior of network peers is diverse, for example for low-bandwidth peers, the MultiHAVE message degenerates to the original HAVE message, thus retaining the high performance of the BT system.<br />

Further work remains to be carried out. For instance, how a BT client that supports the MultiHAVE message should interoperate with a BT client that does not is still an unsolved problem.<br />


ACKNOWLEDGMENT<br />

This work was supported by the National Natural<br />

Science Foundation <strong>of</strong> China under Grant 60974005, the<br />

Specialized Research Fund for the Doctoral Program <strong>of</strong><br />

Higher Education under Grant 20094101120008, the<br />

Natural Science Foundation <strong>of</strong> Henan Province under<br />

Grant 092300410201, Zhengzhou Science and<br />

Technology Research Program under Grant<br />

0910SGYN12301-6 and the Science Fund for<br />

Distinguished Young Scholars of Henan Province under<br />

Grant 0612000600. The authors would like to thank Dr<br />

Yanhong Liu for her invaluable suggestions.<br />

REFERENCES<br />

[1] “Bittorrent,” http://www.bittorrent.com/.<br />

[2] B. Cohen, “Incentives build robustness in BitTorrent,” in<br />

First Workshop on Economics <strong>of</strong> Peer-to-peer Systems,<br />

Berkeley, USA, June 2003.<br />

[3] R. L. Xia and J. K. Muppala, “A survey <strong>of</strong> BitTorrent<br />

performance,” IEEE Communications Surveys & Tutorials,<br />

2010, vol. 12, no 2, pp. 140-158.<br />

[4] Andrew Parker. “The True Picture of Peer-to-Peer<br />

Filesharing”. http://www.cachelogic.com/research/slide9.php, May 2005.<br />

[5] T. Karagiannis, A. Broido, M. Faloutsos, and K. C. Claffy.<br />

“Transport Layer Identification <strong>of</strong> P2P Traffic”. In<br />

Proceedings <strong>of</strong> ACM IMC, Taormina, Sicily, Italy,<br />

October 2004.<br />

[6] T. Karagiannis, A. Broido, N. Brownlee, and K. C. Claffy.<br />

“Is P2P Dying or Just Hiding?”. In Proceedings <strong>of</strong> IEEE<br />

GLOBECOM, Dallas, Texas, USA, Nov. 29 - Dec. 3, 2004.<br />

[7] A. Qureshi, “Exploring proximity based peer selection in a<br />

BitTorrent-like protocol,” MIT 6.824 student project, 2004.<br />

[8] R. Bindal, P. Cao, W. Chan, J. Medved, G. Suwala, T.<br />

Bates, and A. Zhang, “Improving traffic locality in<br />

BitTorrent via biased neighbor selection,” in ICDCS ’06:<br />

Proc. 26th IEEE International Conference on Distributed<br />

Computing Systems. Washington, DC, USA: IEEE<br />

Computer Society, 2006, p. 66.<br />

[9] S. Yamazaki, H. Tode, and K. Murakami, “CAT: A cost-aware<br />

BitTorrent,” in 32nd IEEE Conference on Local<br />

Computer Networks (LCN 2007), Oct 2007, pp. 226–227.<br />

[10] P. Garbacki, A. Iosup, D. Epema, and M. van Steen, “2fast:<br />

Collaborative downloads in p2p networks,” in P2P ’06:<br />

Proc. Sixth IEEE International Conference on Peer-to-Peer<br />

Computing. Washington, DC, USA: IEEE Computer<br />

Society, 2006, pp. 23–30.<br />

[11] P. Garbacki, D. Epema, and M. van Steen, “An amortized<br />

tit-for-tat protocol for exchanging bandwidth instead <strong>of</strong><br />

content in p2p networks,” Self-Adaptive and Self-<br />

Organizing Systems, 2007. SASO ’07. First International<br />

Conference on, pp. 119–128, July 2007.<br />

[12] M. Sirivianos, J. Park, R. Chen, and X. Yang, “Freeriding<br />

in BitTorrent networks with the large view exploit,” in<br />

IPTPS’07, 2007.<br />

[13] Arnaud Legout, Guillaume Urvoy-Keller, and Pietro<br />

Michiardi. “Understanding BitTorrent: An Experimental<br />

Perspective”. Technical Report, INRIA, Sophia Antipolis,<br />

July 2005.



Jianyong Li received his master's degree from the Department of Computer, Huazhong University of Science and Technology in 2001. He is currently an associate professor with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interest covers peer-to-peer networks and network control.<br />

Jianchun Li received his master's degree<br />

from the Department <strong>of</strong> Computer,<br />

Zhengzhou University in 2005. He is<br />

currently a lecturer with the School <strong>of</strong><br />

Computer and Communication<br />

Engineering, Zhengzhou University <strong>of</strong><br />

Light Industry. His research interest<br />

covers computer networks and<br />

distributed computing systems.<br />

Daoying Huang received his Ph.D. degree from the PLA Information Engineering University in 2001. Since 2006, he has been a professor with the School of Computer and Communication Engineering, Zhengzhou University of Light Industry. His research interest covers computer networks and distributed computational systems. He is the corresponding author of this paper.<br />

Qiang Wei is currently a master<br />

candidate with the School <strong>of</strong> Computer<br />

and Communication Engineering,<br />

Zhengzhou University <strong>of</strong> Light Industry.<br />

His research interest covers computer<br />

networks.


1754 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />

Image-based Position Estimation and Adaptive<br />

Modulation Coding in Vehicular Communication<br />

Hao Yang 1 , Qingmin Meng 1,2 , Xiong Gu 1 , and Baoyu Zheng 1<br />

1 School <strong>of</strong> Geography and Biological Information,<br />

Key Lab <strong>of</strong> Broadband Wireless Communication and Sensor Network Technology (Ministry <strong>of</strong> Education)<br />

Nanjing University <strong>of</strong> Posts and Telecommunications, Nanjing, 210003, China<br />

2 National Mobile Communications Research Lab, Southeast University, Nanjing, 210096, China<br />

Email: {yanghao, mengqm, zby}@njupt.edu.cn, guxiong108@gmail.com<br />

Abstract—Vehicle position estimation is a key technology for Inter-Vehicle Communications, and template matching can be used to obtain vehicle position information. In this paper, a simplified form of template matching, namely area-based template matching, is considered. A vehicular communication system designed for wireless data applications is proposed, in which a camera is fixed in a vehicle that serves as a base station. By comparing the outline area of the vehicle image with reference templates, the base station can estimate the position of the vehicle. The reference templates can be pre-calculated from a set of field experiment data. Based on supervised learning, we develop an image-based vehicle position estimation method and evaluate its effect on an adaptive modulation and coding scheme. Computer simulation results show that, in a wireless fading channel with the OFDM physical model, the studied adaptive modulation and coding (AMC) scheme that takes the position estimate into account achieves greater throughput than a fixed modulation and coding scheme.<br />

Index Terms—Inter-Vehicle Communications, supervised<br />

learning, template matching, OFDM, adaptive modulation<br />

and coding<br />

I. INTRODUCTION<br />

In recent years, Inter-Vehicle Communications (IVC) has become a focus of research and application. It is emerging as a key part of Intelligent Transportation Systems (ITS), enabling ITS to realize short-distance wideband wireless communication without expensive infrastructure. IVC has attracted research attention from both academia and industry, notably in the US, the EU and Japan [1]. According to [2], IVC applications can be broadly divided into two categories: Safety Applications, which mainly address traffic safety, and User Applications, which mainly provide value-added services such as meeting passengers' needs for business, entertainment and information functions in the car. In other words, IVC can provide various road traffic applications ranging from traffic safety to pleasant driving. In [3], IVC is simplified into a three-layer model consisting of the physical layer, the data link layer and the application layer. The specification of Dedicated Short Range Communications (DSRC), a type of high-speed mobile broadband, is given in [4]. Recently, many automobile manufacturers have come to regard DSRC as a vehicle communication platform, called the DSRC Vehicle Ad Hoc Network (VANET). In particular, IEEE 802.11 adds Wireless Access in Vehicular Environments (WAVE) [5] to form IEEE 802.11p, which is very closely related to the IEEE 802.11a standard [6].<br />

Manuscript received March 1, 2011; revised April 10, 2011; accepted April 20, 2011.<br />

Project number: 2010ZX03003-003-02, 60972039, 61001077, 20090451239.<br />

doi:10.4304/jnw.6.12.1754-1759<br />

Orthogonal Frequency Division Multiplexing (OFDM)<br />

is a multiplexing technique that divides a channel with a<br />

higher data rate into multiple orthogonal sub-channels<br />

with a lower data rate. OFDM has been adopted in<br />

several wireless standards such as digital audio<br />

broadcasting (DAB), digital video broadcasting (DVB-T),<br />

the IEEE 802.11a local area network (LAN) standard<br />

and the IEEE 802.16a metropolitan area network (MAN)<br />

standard [7]. OFDM is also being pursued for the above-mentioned<br />

DSRC for roadside-to-vehicle communications.<br />

The contribution of the paper is an image-based IVC design. In order to improve the performance of AMC in OFDM transmission, we use supervised machine learning to estimate the position of the vehicle.<br />

The remainder <strong>of</strong> the paper is organized as follows:<br />

Section 2 introduces some relevant research work about<br />

image processing; Section 3 describes the system model<br />

and vehicle position estimation; Section 4 gives the signal<br />

model and the AMC selection; Section 5 presents the simulation and results; and Section 6 concludes the paper.<br />

II. RELATED WORK OF IMAGE PROCESSING<br />

Digital image processing refers to handling digital<br />

images or video frames by means <strong>of</strong> a digital computer.<br />

The results <strong>of</strong> digital image processing are generally<br />

images or a set <strong>of</strong> characteristics and parameters related<br />

to the images [8].<br />

Image processing techniques can be used to measure<br />

distance. In [9], Lu et al. proposed a novel measuring<br />

system using a scan-counter method via a CCD camera.<br />

The system can be used to measure the distance between



a CCD camera and an object. Set on either side <strong>of</strong> a CCD<br />

camera, two laser projectors in the system produced two<br />

parallel rays that projected two bright spots on the object<br />

and the CCD. The interval between the two bright spots<br />

in the video image was calculated. As there is a linear<br />

relationship between the actual distance and the interval<br />

<strong>of</strong> the two bright spots, the actual distance from the CCD<br />

camera to the object can be obtained from a simple<br />

formula. Later, Hsu et al. [10] brought forward a new<br />

method for calculating the distance. The proposed<br />

scheme counted pixel number variation <strong>of</strong> reference<br />

points in the images to acquire the displacement <strong>of</strong> the<br />

camera movement along the photographing direction.<br />

In [11], Chang et al. proposed a method to use images<br />

to measure the relative distance between vehicles. The<br />

procedures <strong>of</strong> the method were divided into two parts.<br />

First, the location <strong>of</strong> the license plate in the image was<br />

found by several image processing techniques. Second,<br />

the image size <strong>of</strong> the plate was obtained by the region<br />

growing technique, then the relative distance was<br />

computed by using the geometric relation.<br />

In [12], Lü et al. put forward an efficient method for measuring the area of live plant leaves. The proposed method is composed of four steps. First, geometric distortions in the image are corrected using a mapping function. Then, the image is segmented using a threshold method to obtain the leaf region. Next, the leaf contour is extracted and the contour region is filled. Finally, the leaf area is calculated by counting pixels.<br />

The size of an object in an image can be obtained from the result of contour extraction, and many papers focus on this topic. The active contour model, known as “snakes”, is a framework for delineating an object outline from a noisy image [13]. Snakes have been used successfully for segmentation, matching and tracking of a target of interest. In [14], Dubuisson proposed a specific method for extracting the contour of a moving object. The method is based on the fusion of a motion segmentation technique, which uses image subtraction, with color segmentation based on the split-and-merge paradigm and with edge information. The edge information can be obtained using the Canny edge detector. The author also applied object matching in an intelligent vehicle/highway system.<br />

III. SYSTEM MODEL AND VEHICLE POSITION<br />

ESTIMATION<br />

The IVC scenario considered in the paper is shown in Fig. 1. The three vehicles form a linear ad hoc network topology, and each vehicle is regarded as a communication node. Each vehicle is assumed to be equipped with a complete communication system consisting of three main components: the wireless transceiver, the microcomputer and the camera. The main function of each part is as follows:<br />

1) The Wireless Transceiver: It sends and receives information between vehicles over short distances.<br />

2) The Microcomputer: On the one hand, it receives image information from the camera through a specific interface, processes it and displays the results on the screen; on the other hand, it communicates with the wireless transceiver through a specific interface.<br />

3) The Camera: It is the main sensing component of the system and is used to capture the surrounding environment. In the paper, it captures snapshots of the vehicle so as to track its position.<br />

Figure 1. The scene <strong>of</strong> the IVC<br />

A. Assumptions <strong>of</strong> the model<br />

Before introducing the specific design, for the sake of simplicity, we make the following assumptions about the system.<br />

1) To make it easier for the camera to capture snapshots of the vehicle, we assume that the vehicles travel in a queue, that is, in a straight line.<br />

2) Since the driver has a good view in front of the vehicle, we assume that the camera is installed at the tail of the vehicle and only captures snapshots of the following vehicle.<br />

3) As the communication and control processes between a vehicle and its preceding vehicle are similar to those with its following vehicle, we only consider a vehicle communicating with its following vehicle. Hereinafter, a vehicle that transmits data is called the active vehicle; otherwise it is called the inactive vehicle.<br />

4) We assume that the camera has an ordinary optical lens with a fixed focal length.<br />

5) We assume that all vehicles are of the same type and size.<br />

B. Vehicle Position Estimation<br />

Because the position of the vehicle changes rapidly in IVC, it is difficult to achieve exact matching of the vehicle. The paper therefore performs a fast matching based on the contour area of the vehicle. The active vehicle selects the appropriate modulation and coding scheme for OFDM transmission, assuming ideal OFDM channel estimation. The key part of the process is to determine the distance between vehicles from snapshots of the vehicle, for which we use machine learning algorithms. Machine learning is generally divided into supervised learning, unsupervised learning and reinforcement learning [15]. For our scenario, supervised learning is adopted. In this method, a training set is given, and a learning algorithm is used to identify the relationship between input and output, yielding a function $h$, called a hypothesis. When a new input $x$ is given, we obtain the predicted output $y$ through the function $h$. The process is shown in Fig. 2. Supervised learning comprises two important problem types, namely regression and classification, which differ in whether the predicted output is continuous or discrete. If the predicted output is continuous, the problem is a regression problem; otherwise it is a classification problem. As the output in the paper is continuous, we consider the former.<br />

Figure 2. The supervised learning process<br />

In the paper we consider an n-dimensional linear regression, in which the relationship between the input features $x$ and the predicted output $y$ is linear, as illustrated in equation (1),<br />

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x \tag{1}$$<br />

where $\theta$ and $x$ are both vectors, and we set $x_0 = 1$. In order to work out the value of $\theta$, we first introduce the cost function,<br />

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^i) - y^i \right)^2 = \frac{1}{2} (X\theta - Y)^T (X\theta - Y) \tag{2}$$<br />

which indicates the difference between the predicted output and the actual output. Here $h_\theta$ denotes the hypothesis $h$ with parameter $\theta$, $x^i = [x_0^i, x_1^i, \ldots, x_n^i]^T$ is the $i$-th input feature vector, $y^i$ is the corresponding output, and $m$ is the number of training samples. $X = [x^1, x^2, \ldots, x^m]^T$ is the matrix of all input features, and $Y = [y^1, y^2, \ldots, y^m]^T$ is the vector of all corresponding outputs.<br />

After defining the cost function, all we need to do is choose $\theta$ so as to minimize $J(\theta)$. The intuitive approach is to take the derivative with respect to each $\theta_i$, as illustrated in formula (3),<br />

$$\nabla_\theta J(\theta) = \nabla_\theta \left\{ \frac{1}{2} (X\theta - Y)^T (X\theta - Y) \right\} = X^T X \theta - X^T Y \tag{3}$$<br />

in which $\nabla_\theta J(\theta)$ denotes the derivative of $J(\theta)$ with respect to $\theta$ in matrix form. Setting the derivatives to zero gives the following standard expression.<br />

$$X^T X \theta = X^T Y \tag{4}$$<br />

Solving the above equation, we get the $\theta$ that minimizes $J(\theta)$. The final expression is<br />

$$\theta = (X^T X)^{-1} X^T Y \tag{5}$$<br />
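The step from the cost function (2) to the gradient (3) is stated without intermediate working. A standard expansion, given here as a reader's aid rather than as part of the original derivation, uses the fact that $\theta^T X^T Y$ is a scalar and therefore equals $Y^T X \theta$, and that $X^T X$ is symmetric:

```latex
\begin{align*}
J(\theta) &= \tfrac{1}{2}(X\theta - Y)^T (X\theta - Y) \\
          &= \tfrac{1}{2}\left(\theta^T X^T X \theta - 2\,\theta^T X^T Y + Y^T Y\right), \\
\nabla_\theta J(\theta) &= \tfrac{1}{2}\left(2\,X^T X \theta - 2\,X^T Y\right)
                         = X^T X \theta - X^T Y .
\end{align*}
```

Expression (5) additionally requires $X^T X$ to be invertible, which holds when the columns of $X$ are linearly independent.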

IV. SIGNAL MODEL AND ADAPTIVE MODULATION AND<br />

CODING SELECTION<br />

A. Signal model<br />

See [16-19]. A wireless channel including path loss, shadow fading, small-scale fading and additive background noise is considered. The channel impulse response for the small-scale fading can be modelled as<br />

$$h_{ij}(t) = \sum_{l=0}^{L_p - 1} \alpha_{l,ij}\, \delta(t - \tau_l) \tag{6}$$<br />

where $\alpha_{l,ij}$ represents the discrete time-domain channel coefficient, which is an independent and identically distributed (i.i.d.) complex Gaussian variable, $L_p$ denotes the number of paths in a frequency-selective fading channel and $\tau_l$ denotes the path delay.<br />

The transmission parameters of OFDM are: total subcarrier number $N$, subchannel number $K$, subcarrier spacing $B$ (kHz) and channel spacing $W$ (MHz). The signal between transmit node $i$ and receive node $j$ can be represented as<br />

$$r_{ij}(t) = P_t\, d_{ij}^{-\alpha}\, \beta_i \sum_{l=0}^{L_p - 1} \alpha_{l,ij}\, s_i(t - \tau_l) + n_{ij}(t) \tag{7}$$<br />

In equation (7), $P_t$ is the transmission power and $d_{ij}$ is the distance between nodes $i$ and $j$. $\alpha$ denotes the path loss exponent and $\beta_i$ is the log-normal shadowing term, i.e., $10\log_{10}\beta_i \sim N(0, \sigma_{dB}^2)$. $s_i(t)$ is the signal transmitted from node $i$ and $n_{ij}(t)$ denotes additive white Gaussian noise with zero mean and power spectral density $N_0$.<br />

In the studied OFDM transmission scheme, different Quadrature Amplitude Modulation (QAM) schemes are used according to the spacing between the two communicating vehicles. A cyclic prefix is used in the OFDM signal as a guard interval, whose length needs to be larger than the maximum excess delay in order to mitigate the Intersymbol Interference (ISI) caused by multipath propagation. The cyclic prefix is added after the IFFT at the transmitter and is removed at the receiver in order to recover the original signal.<br />

The average frequency response of subchannel $k$ in an OFDM receiver is $H_{ij}(k)$. For simplicity, we ignore the subchannel index and define the gain of the subchannel as $G_i = d_{ij}^{-\alpha} \beta_i |H_{ij}|^2$; therefore the signal-to-noise ratio (SNR) is<br />

$$\gamma_i = \frac{P_i \cdot G_i}{N_0 \cdot B} \tag{8}$$<br />
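The dependence of the SNR in equation (8) on the vehicle spacing can be made concrete with a short numerical sketch. The function names are illustrative, and the transmit power, path loss exponent, shadowing value and noise density below are assumed values chosen only for the example. Note that $B$ must be expressed in Hz for the ratio to be dimensionless.

```python
import math


def subchannel_gain(distance_m, path_loss_exp, shadowing, h_abs_sq):
    """G_i = d_ij^(-alpha) * beta_i * |H_ij|^2, as defined before equation (8)."""
    return distance_m ** (-path_loss_exp) * shadowing * h_abs_sq


def snr(tx_power_w, gain, noise_psd_w_per_hz, subcarrier_spacing_hz):
    """gamma_i = P_i * G_i / (N_0 * B), equation (8)."""
    return tx_power_w * gain / (noise_psd_w_per_hz * subcarrier_spacing_hz)


def to_db(x):
    return 10.0 * math.log10(x)


# Illustrative values only (assumptions, not taken from the paper).
B_HZ = 312.5e3          # subcarrier spacing in Hz
for d in (15.0, 30.0, 60.0):
    g = subchannel_gain(d, path_loss_exp=3.0, shadowing=1.0, h_abs_sq=1.0)
    print(d, round(to_db(snr(0.1, g, 4e-21, B_HZ)), 1))
```

With every other term held fixed, doubling the distance lowers the SNR by $10\,\alpha \log_{10} 2$ dB, which is why an estimate of the spacing is informative for choosing a modulation and coding scheme.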

B. Selection <strong>of</strong> adaptive modulation and coding scheme<br />

The selection principle of AMC is to choose the scheme that maximizes the throughput of the vehicle's OFDM transmission. Considering M-ary quadrature amplitude modulation (M-QAM), the modulation level and coding rate of node $i$ are $M_i$ and $C_i$, respectively. As seen in [20], practical modulation and coding schemes (MCS) incur an SNR loss when the bit error rate $p_b$ is considered. We then consider the rate formula<br />

$$b_i = \log_2(1 + \phi \gamma_i), \quad \phi = -1.5 / \ln(5 p_b) \tag{9}$$<br />

Assume that the packet length is $L$, and define the throughput of node $i$ as $R_f$, the data rate as $R_m$ and the packet error rate (PER) as $P_e$; then we have:<br />

$$R_f = R_m (1 - P_e) \tag{10}$$<br />

$$R_m = N \cdot B \cdot C_i \cdot \log_2 M_i \tag{11}$$<br />

$$P_e = 1 - (1 - p_b)^L \tag{12}$$<br />
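Equations (9) to (12) can be combined into a simple selection rule, sketched below. This is not the authors' simulator. Setting $b_i = \log_2 M_i$ in equation (9) and solving for $p_b$ gives $p_b = 0.2\exp(-1.5\gamma_i/(M_i - 1))$, which is used here as the bit error rate of each modulation level. Coding gain is ignored (an assumption), so the coding rate only scales the data rate and the rate-1/2 entries can never win in this simplified form. The values $N = 52$ and $B = 312.5$ kHz and the five candidate MCS are those listed in the simulation section; all function names are illustrative.

```python
import math

N = 52            # number of subcarriers
B_HZ = 312.5e3    # subcarrier spacing in Hz

# (name, modulation level M, coding rate C)
MCS_TABLE = [
    ("QPSK-1/2", 4, 1 / 2), ("QPSK-3/4", 4, 3 / 4),
    ("16QAM-1/2", 16, 1 / 2), ("16QAM-3/4", 16, 3 / 4),
    ("64QAM-3/4", 64, 3 / 4),
]


def bit_error_rate(snr, M):
    # Equation (9) with b_i = log2(M), solved for p_b:
    # M = 1 + phi*snr and phi = -1.5/ln(5*p_b)
    #   =>  p_b = 0.2 * exp(-1.5*snr / (M - 1)).
    # ASSUMPTION: coding gain is ignored.
    return 0.2 * math.exp(-1.5 * snr / (M - 1))


def throughput(snr, M, C, packet_bits):
    p_b = bit_error_rate(snr, M)
    p_e = 1.0 - (1.0 - p_b) ** packet_bits      # equation (12)
    r_m = N * B_HZ * C * math.log2(M)           # equation (11)
    return r_m * (1.0 - p_e)                    # equation (10)


def select_mcs(snr, packet_bits):
    """Return the (name, throughput) pair with the largest throughput."""
    candidates = ((name, throughput(snr, M, C, packet_bits))
                  for name, M, C in MCS_TABLE)
    return max(candidates, key=lambda item: item[1])


for snr_db in (10, 20, 30):
    print(snr_db, select_mcs(10 ** (snr_db / 10), packet_bits=1000)[0])
```

A more faithful model would replace `bit_error_rate` with per-MCS error curves that account for the channel code, at which point the lower coding rates become useful at low SNR.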

V. SIMULATION AND RESULTS<br />

The training set used in the simulation is obtained from practical measurements. First, we fix a vehicle V1 (the base station) and let another vehicle V2 drive in a straight line towards V1. Then we use the camera installed in V1 to capture snapshots of V2 and record the distance $d$ between them. Finally, in order to reduce the dimension of the input, we take the area of the vehicle in the image as the only input feature and the corresponding distance between the vehicles as the training set output. The vehicle is a Peugeot 307, the camera uses the PAL format, the lens focal length is 12 millimeters and the image resolution is 720*576 pixels. When the vehicles are too far apart, the vehicle in the image is too small for the camera to capture. On the other hand, when they are too close, the vehicle is too large and occupies the whole image. Taking both into consideration, we choose distances ranging from 15 meters to 70 meters. The measured data are shown in TABLE I. As observed from TABLE I, when the vehicle spacing is small, the outline of V2 is large; as the vehicle spacing gradually increases, the contour gradually becomes smaller. After getting the training set, curve fitting can be performed using the linear regression method described above.<br />

For plane curve fitting, n points on the plane can in general be fitted exactly by a polynomial of order n-1. However, even though the fitted curve passes through all the points, we cannot say that it gives the best prediction. In the studied process, the prediction is the vehicle spacing for different outline areas of the vehicle. Two major issues in curve fitting are over-fitting and under-fitting. In general, under-fitting occurs when the order is lower than that of the actual model, so that most of the data are poorly fitted, as shown in Fig. 3, while over-fitting occurs when the order is higher than that of the actual model, so that the training data are fitted closely but the curve does not predict new data well, as shown in Fig. 4. The selection of the order therefore plays a decisive role in curve fitting. We employ a third-order fit, and the fitting result for the area is shown in Fig. 5.<br />
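The effect of the polynomial order can be explored on the measured data with a minimal sketch that applies the normal equation (5) to polynomial features of the area. The data are the thirteen pairs of TABLE I; the rescaling of the area to units of $10^4$ pixels, the small Gaussian-elimination solver and all function names are choices made for this illustration and are not taken from the paper.

```python
# Training set from TABLE I.
DISTANCE_M = [15, 16, 18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70]
AREA_PX = [37395, 30820, 25314, 19725, 13150, 9780, 6903,
           4931, 4068, 2958, 2588, 2301, 1725]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def fit_polynomial(xs, ys, order):
    """Least-squares fit via the normal equation X^T X theta = X^T Y."""
    X = [[x ** k for k in range(order + 1)] for x in xs]
    n = order + 1
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(n)]
           for a in range(n)]
    XtY = [sum(row[a] * y for row, y in zip(X, ys)) for a in range(n)]
    return solve(XtX, XtY)


def predict(theta, x):
    return sum(coef * x ** k for k, coef in enumerate(theta))


def sum_squared_error(theta, xs, ys):
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


# Rescale the area to units of 10^4 pixels to keep the powers well conditioned.
xs = [a / 1e4 for a in AREA_PX]
for order in (1, 2, 3):
    theta = fit_polynomial(xs, DISTANCE_M, order)
    print(order, round(sum_squared_error(theta, xs, DISTANCE_M), 2))
```

The training error can only decrease as the order grows, so it cannot by itself reveal over-fitting; holding out some of the measurements and comparing prediction errors on them would be one way to choose the order.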

TABLE I.<br />
THE TRAINING SET OF LINEAR REGRESSION<br />
Distance (m) | 15 | 16 | 18 | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 55 | 60 | 70<br />
Area (pixels) | 37395 | 30820 | 25314 | 19725 | 13150 | 9780 | 6903 | 4931 | 4068 | 2958 | 2588 | 2301 | 1725<br />

Figure 3. The under-fitting for the training set with a 2nd-order polynomial (x-axis: area of the vehicle in the image, in pixels, ×10^4; y-axis: distance between two vehicles, in m).<br />

Figure 4. The over-fitting for the training set with a 7th-order polynomial (x-axis: area of the vehicle in the image, in pixels, ×10^4; y-axis: distance between two vehicles, in m).


1758 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />

The system simulation parameters are partially based on IEEE 802.11a. The number of subcarriers, the number of subchannels, the subcarrier spacing and the channel spacing are N = 52, K = 4, B = 312.5 kHz and W = 20 MHz, respectively. The carrier frequency ranges from 5.850 GHz to 5.925 GHz. A quasi-static six-path fading channel model is considered, with a Rician coefficient of 4 and a log-normal shadowing standard deviation of 8 dB. The power gains of the taps are [0.8084, 0.462, 0.253, 0.259, 0.0447, 0.01], and the tap delays, in units of T = 1/W, are [0, 2, 4, 6, 9, 13]. In the simplified path-loss model [19], a reference distance $d_0 = 15$ m is defined, and the corresponding normalized distance is $d/d_0$. To keep the results simple, five MCSs are considered: QPSK-1/2, QPSK-3/4, 16QAM-1/2, 16QAM-3/4 and 64QAM-3/4. We assume that all subcarriers are treated equally and use the same modulation scheme within one fading block. For a packet length of 1000 bytes, L = 8000 bits. According to Eq. 10, we calculate the objective function and select the MCS that maximizes it. The throughput under different SNR values is compared in Fig. 6. One curve shows the nearly constant performance of the fixed QPSK-1/2 mode; the other shows the performance of the AMC mode, which takes the position information into account. With the proposed AMC, the system throughput is clearly improved.<br />
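The selection step can be sketched as an argmax over the five MCSs. Eq. 10 is not reproduced in this section, so the objective below is an assumption made for this example (information bits per subcarrier symbol multiplied by the probability that all L bits of a packet are received correctly), and the bit error rates passed to select_mcs are hypothetical inputs rather than simulation results.<br />

```python
# Information bits per subcarrier symbol for the five MCSs named in the text
# (bits per constellation point times code rate).
MCS_RATE = {
    "QPSK-1/2": 2 * 1 / 2,
    "QPSK-3/4": 2 * 3 / 4,
    "16QAM-1/2": 4 * 1 / 2,
    "16QAM-3/4": 4 * 3 / 4,
    "64QAM-3/4": 6 * 3 / 4,
}

PACKET_BITS = 8000  # L = 8000 bits for a 1000-byte packet


def objective(rate, ber, packet_bits=PACKET_BITS):
    """Assumed objective (not the paper's Eq. 10): rate times packet success probability."""
    return rate * (1.0 - ber) ** packet_bits


def select_mcs(ber_by_mcs):
    """Return the MCS with the largest objective value, given a predicted BER per MCS."""
    return max(MCS_RATE, key=lambda m: objective(MCS_RATE[m], ber_by_mcs[m]))


# Hypothetical BER predictions for a short and a long vehicle spacing.
near = {m: 1e-7 for m in MCS_RATE}
far = {"QPSK-1/2": 1e-6, "QPSK-3/4": 1e-4, "16QAM-1/2": 1e-3,
       "16QAM-3/4": 1e-2, "64QAM-3/4": 1e-1}
print(select_mcs(near))
print(select_mcs(far))
```

With these illustrative inputs the highest-rate scheme wins when every scheme is almost error free, and the most robust scheme wins when the higher-order schemes lose most packets.<br />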

VI. CONCLUSIONS<br />

Position estimation has a significant effect on the choice of transmission parameters in wireless vehicular communications, yet little research has addressed this topic. This paper presents a preliminary design, namely vehicle-location-aware OFDM transmission. Using supervised machine learning algorithms, the base station first performs identification and area matching and then predicts the spacing between the two communicating vehicles. The spacing information is then used to select the modulation and coding scheme. As a result, the throughput of the vehicular communication system is significantly improved.<br />

ACKNOWLEDGMENT<br />

The authors wish to thank the National Mobile Communications Research Laboratory, Southeast University. This work was supported by the National Science and Technology Major Special Projects (2010ZX03003-003-02), the National Natural Science Foundation of China (60972039 and 61001077) and the National Postdoctoral Research Funding (20090451239).<br />


Figure 5. Third-order fitting for the training set (x-axis: area of the vehicle in the image, in pixels, ×10^4; y-axis: distance between two vehicles, in m).<br />

Figure 6. The throughput comparison between fixed MCS and adaptive MCS (x-axis: SNR in dB; y-axis: throughput in bit/s, ×10^7; curves: Adaptive MCS, Fixed MCS).<br />

REFERENCES<br />

[1] J. Luo and J. P. Hubaux, “A Survey of Inter-Vehicle Communication,” Tech. Rep., 2004.<br />
[2] M. Rudack, M. Meincke, K. Jobmann, and M. Lott, “On traffic dynamical aspects inter vehicle communication (IVC),” in Proc. of the 57th IEEE Semiannual Vehicular Technology Conference (VTC’03 Spring), 2003.<br />
[3] Ugur Keskin, “In-Vehicle Communication Networks: A Literature Survey,” July 28, 2009.<br />
[4] ASTM International, ASTM E2213-03 Standard Specification for Telecommunications and Exchange Between Roadside and Vehicle Systems - 5GHz Band Dedicated Short Range Communications (DSRC) Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 2003.<br />
[5] IEEE 1609 - Family of Standards for Wireless Access in Vehicular Environments (WAVE), U.S. Department of Transportation, January 9, 2006.<br />
[6] IEEE Standard 802.11a-1999, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications: High-speed Physical Layer in the 5 GHz Band.<br />
[7] IEEE Standard IEEE 802.16a, for Local and Metropolitan Area Networks Part 16, Air Interface for Fixed Broadband Wireless Access Systems: http://grouper.ieee.org/groups/802/16/.<br />
[8] Rafael C. Gonzalez and Richard E. Woods, “Digital image processing,” Second Edition, Beijing, Publishing House of Electronics Industry, September, 2007.<br />
[9] Ming-Chih Lu, Wei-Yen Wang and Chun-Yen Chu, “Image-Based Distance and Area Measuring Systems,” IEEE Sensors Journal, Vol. 6, No. 2, April 2006, pp. 495-503.<br />
[10] Chen-Chien Hsu, Ming-Chih Lu, Wei-Yen Wang and Yin-Yu Lu, “Distance measurement based on pixel variation of CCD images,” ISA Transactions, Vol. 48, No. 4, October 2009, pp. 389-395.<br />
[11] Tang-Hsien Chang, Chun-hung Lin, Chih-sheng Hsu, and Yao-jan Wu, “A Vision-Based Vehicle Behavior Monitoring and Warning System,” in Proc. of Intelligent Transportation Systems, 2003.<br />
[12] Chaohui Lü, Hui Ren, Yibin Zhang, and Yinhua Shen, “Leaf Area Measurement Based on Image Processing,” in 2010 International Conference on Measuring Technology and Mechatronics Automation.<br />
[13] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active contour models,” Internat. J. Comput. Vision 1 (1987), pp. 321–331.<br />
[14] Marie-Pierre Dubuisson and A. K. Jain, “Object Contour Extraction using Color and Motion,” in Computer Vision and Pattern Recognition, Proceedings CVPR '93, IEEE Computer Society Conference, 1993.<br />
[15] CS 229: Machine Learning. http://www.stanford.edu/class/cs229/. Autumn 2010.<br />
[16] R. C. Daniels, C. Caramanis, and R. W. Heath, “A Supervised Learning Approach to Adaptation in Practical MIMO-OFDM Wireless Systems,” in Global Telecommunications Conference, New Orleans, LA, Nov. 2008, pp. 1-5.<br />
[17] Qingmin Meng, Xiong Gu, Feng Tian, Baoyu Zheng, “k-NN Based MCS Selection in Distributed OFDM Wireless Networks,” in 2011 International Conference on Automation, Communication, Architectonics, and Materials (ACAM2011), to be published, June 18-19, Wuhan, China.<br />
[18] T. S. Rappaport, Wireless Communications: Principles and Practice, 2nd ed. NJ: Prentice-Hall, 2001.<br />
[19] Andrea Goldsmith, Wireless Communications. Cambridge University Press, 2005.<br />
[20] Koji Yamamoto, “Tradeoff between Area Spectral Efficiency and End-to-End Throughput in Rate-Adaptive Multihop Radio Networks,” IEICE Trans. Commu., Vol. E88-B, No. 9, 2005.<br />


Hao Yang was born in Jiangsu Province, China, in November 1969. He received the Ph.D. degree in signal and information processing from the School of Information Science and Engineering, Southeast University. His research interests include image processing. He is a senior lecturer at the School of Geography and Biological Information, Nanjing University of Posts and Telecommunications.<br />

Qingmin Meng was born in Jiangsu Province, China, in September 1965. He received the Ph.D. degree in radio engineering from Southeast University, Nanjing, China, in 2007. He then joined the faculty of the School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications. His current research interests include multihop relaying in next-generation broadband wireless communication, the application of machine learning to resource allocation in cognitive radio networks, and vehicular opportunistic communication.<br />

Xiong Gu was born in Hubei Province, China, in May 1988. He is working toward the master’s degree at the School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications. His current research interests include machine learning and its application to resource allocation in cognitive radio networks and vehicular opportunistic communication.


1760 JOURNAL OF NETWORKS, VOL. 6, NO. 12, DECEMBER 2011<br />

A Request Distribution Algorithm for Web Server Cluster<br />

Wei Zhang<br />

School <strong>of</strong> Computer Science and Engineering, Beihang University, Beijing, 100191, China<br />

State Key Laboratory <strong>of</strong> Rail Traffic Control and Safety,Beijing Jiaotong University,Beijing 100044, China<br />

zhangwqh@cse.buaa.edu.cn<br />

Huan Wang, Binbin Yu, Wei Xu, Mingfa Zhu, Limin Xiao, Li Ruan<br />

School <strong>of</strong> Computer Science and Engineering, Beihang University, Beijing, 100191, China<br />

{zhumf, xiaolm,ruanli}@buaa.edu.cn<br />

Abstract: With the explosive growth of web-based application workloads, web server clusters face challenges in request response time. Distributing requests among the servers of a web server cluster is the key to addressing this challenge, especially under heavy workloads. In this paper, we propose a new request distribution algorithm named llac (least load active cache) for the load balancing switch of a web server cluster. The goal of llac is to improve the cache hit rate and reduce response time. Packets are parsed at the IP level, and back-end servers are notified to cache hot files using link change technology, which changes neither the URL information nor the service program. This avoids the overhead of switching between user mode and kernel mode. The load balancing switch creates the connection directly with the selected server, which avoids connection migration overhead. The policy estimates the current composite load of each server and selects the server with the least load to serve the request, which also improves the resource utilization of the web servers. Experimental results show that llac achieves better performance for web applications than wrr (weighted round robin), a popular request distribution algorithm.<br />

Index Terms—Web Cluster, Request Distribution, LLAC<br />

I. INTRODUCTION<br />

The enormous growth of the internet industry has made web-based applications popular and demanding programs. Users increasingly rely on the web for daily activities such as electronic commerce, online banking, stock trading, reservations and product merchandising. The performance of a web server system therefore plays an important role in the success of many internet-related companies. A single server machine can handle only a limited number of requests and cannot scale with demand. A better way to cope with growing processing demands is to add more hardware resources rather than replace one server with a faster one [1]. More and more web sites use a web cluster, composed of a front-end request dispatching server, also called the load balancing switch, and several back-end servers that handle requests. By distributing client requests to separate servers for load balancing or load sharing, web clusters have proved to be a better solution than an overloaded single server.<br />

© 2011 ACADEMY PUBLISHER<br />

doi:10.4304/jnw.6.12.1760-1766<br />

Among the various technical issues in managing a web server cluster, request distribution algorithms, which are implemented in the load balancing switch, are particularly important for the performance of cluster web servers [2]. The ratio of peak load to light load for internet applications is usually on the order of 300% [19]. According to J. C. Mongul [3], web sites collapse mostly because of accesses to popular, hot events. In a well-known example, the normally well-provisioned Amazon.com site suffered forty minutes of downtime due to overload during the holiday season in November 2000. Popular web sites often have to deal with a huge number of requests in a short time.<br />

This paper addresses the problem of request distribution so that a web server cluster can serve its peak workload. We use client-side and server-side information together to select a server, and we avoid both the overhead of switching between user mode and kernel mode and the overhead of connection migration. Our request distribution algorithm makes the following contributions. First, we design a combined load model based on a collection of typical load information. The model is used with online measurements of load information to estimate the processing capacity of the web servers, which gives reliable load descriptors for the decision-making process of the request distribution algorithm. Second, to speed up access to popular (hot) files, our approach uses active caching. Packets are captured and analyzed at the IP level with the netfilter mechanism, which avoids the overhead of user/kernel mode switching and of connection migration. Active caching modifies neither the URL nor the server program; it uses link change technology to place hot files in a memory file system. Finally, we propose and implement a novel request distribution algorithm that works on the basis of the composite load and the file access frequency. We call this algorithm llac (least load and active caching).<br />



The rest of the paper is organized as follows. Section II discusses related work. Section III presents the request distribution architecture and algorithm and discusses each module separately. Section IV describes the experimental results of a web cluster prototype using llac. Section V concludes the paper.<br />

II. RELATED WORKS<br />

Numerous dispatching algorithms have been proposed for web server clusters. They can be classified as layer-4 and layer-7 algorithms.<br />

A layer-4 algorithm considers only server-side information and does not use client-side information. In this approach, clients create connections directly with the selected server. Such algorithms are easy to implement but cannot match server resources to the content of client requests. They include the random policy [4], round-robin [4], weighted round-robin (wrr) [5], least connection [6], fast response time [6] and so on. Random and round-robin are easy to implement, but they do not consider server capacity, which can easily lead to load imbalance. Wrr associates with each server node a weight that is proportional to the server’s capacity. The initial weights are set by the administrator and are therefore subject to human error. Least connection does not consider that requests may have different response times and different resource demands. Fast response time is influenced by the network environment, so it cannot evaluate the performance of a web server reliably.<br />
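As a point of reference for the wrr baseline, the idea of weighted round-robin can be sketched in a few lines. This is a simplified illustration, not the IPVS implementation; the function name and the example weights are hypothetical, and integer weights are assumed.<br />

```python
import itertools


def weighted_round_robin(weights):
    """Return an endless iterator of server names, each appearing in proportion to its weight.

    weights: non-empty dict mapping server name to a positive integer weight.
    """
    # Interleave servers round by round instead of emitting one server's share in a burst.
    schedule = []
    for round_no in range(max(weights.values())):
        for server, weight in weights.items():
            if round_no < weight:
                schedule.append(server)
    return itertools.cycle(schedule)


# Hypothetical cluster in which s1 has three times the capacity of s2.
dispatcher = weighted_round_robin({"s1": 3, "s2": 1})
print([next(dispatcher) for _ in range(8)])
```

Because the weights are fixed in advance, this policy cannot react to the load that individual requests actually create, which is the limitation noted above.<br />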

A layer-7 algorithm considers not only server information but also the client’s user-level information, such as session identifiers, URL type and cookies. However, clients must establish a TCP connection with the load balancing switch so that it can analyze the request, which involves two copies of each packet between user space and kernel space. Because clients first establish connections with the load balancing switch, the connections must then be migrated to the selected server. Connection migration is time-consuming and consumes large amounts of system resources. Layer-7 algorithms can thus use more information when deciding which server should respond to a request and can make good use of server resources, in particular the cache, but connection migration and packet copying between user space and kernel space cost some performance. Examples of layer-7 algorithms include LARD (locality-aware request distribution) [7], WARD (workload-aware request distribution) [8] and CAP (client-aware policy) [9]. LARD is a well-known dispatching policy that aims to improve the cache hit rate of the web servers. Under LARD, the load balancing switch dispatches requests for the same web object to the same back-end server. However, LARD may lead to load imbalance because web pages differ in popularity. WARD is a static partitioning scheme that assigns dedicated servers to specific groups of requests. Although this policy is useful from the system management point of view and achieves a higher cache hit rate, it has poor server utilization, because resources that are idle cannot be shared among all clients. The main goal of CAP is to improve load sharing in web clusters that provide multiple types of services. The load balancing switch classifies client requests into four classes: normal, CPU bound, disk bound, and CPU and disk bound. However, requests of the same class may consume different amounts of resources.<br />

Considering the shortcomings of the above methods, we propose llac. The load balancing switch uses the netfilter mechanism to intercept packets and analyze the URL information at the IP level, notifies the back-end servers to cache hot files, and selects the least-loaded server to process each request. We use both client-side and server-side information while avoiding the overhead of user/kernel mode switching and of TCP connection migration. For hot files, the cache is used to improve response time.<br />

III. LLAC FOR WEB CLUSTER<br />

Most web site crashes occur during bursts of visits to hot content. Speeding up access to hot files therefore eases the pressure on a site to a large extent. The main objectives of the algorithm are to minimize the average response time of popular (hot) requests, by proactively caching hot files and reading them from a memory file system, which reduces disk accesses, and to make good use of server resources so as to increase the utilization and throughput of the cluster. To this end, we use a linear model to compute each server’s composite load, which helps to decide which server should serve a request. The llac module uses both client and server information to distribute requests. The load balancing switch parses packets at the IP layer using the netfilter mechanism, records file access frequencies and informs the web servers to cache popular or hot files. Fig. 1 shows the system components.<br />

A. Load Collection<br />

The load collection module tracks the processor, network, memory and disk usage of each web server. We gather resource utilization traces by running a set of microbenchmarks. The full list of metrics is shown in TABLE I. These statistics can all be gathered easily in Linux with the sysstat monitoring package [10]. We focus on this set of resource measurements because they can be gathered with low overhead and are representative for estimating the performance of the web server [11]. Since these traces must also be gathered from the live application, it is crucial that a lightweight monitoring system is used. To improve performance, we create one thread per load indicator so that the indicators are collected in parallel. The load collection module tracks the usage of each resource over a measurement interval and reports the statistics to the load calculation module at the end of each interval.<br />

B. Load Calculation<br />

This section describes how to create a model that characterizes the relationship between the composite load and a set of resource utilization metrics gathered from an application running on the web server. The model is a linear equation. Using values from the load collection module, we calculate the total load as a linear combination of the different metrics:<br />

$$T = \alpha_0 + \alpha_1 U_1 + \alpha_2 U_2 + \cdots + \alpha_9 U_9 \qquad (1)$$<br />

where<br />

• $U_i$ is the value of the i-th metric collected for a benchmark executed on the web server;<br />

• the set of coefficients $\alpha_0, \alpha_1, \ldots, \alpha_n$ is the model that describes the relationship between the total load and the resource utilization metrics. Unfortunately, finding a good set of parameters is a rather empirical job with very little support from theory. The main objective is to tune the parameters to achieve good system performance, without asking too many questions about why it works well. Often, it is just a matter of “let’s try this approach and see what happens”;<br />

• $T$ is the total load of the web server.<br />

The load calculation module passes its result to the load management module located in the load balancing switch. Each web server actively pushes its composite load to the switch. Compared with polling by the switch, pushing further reduces the burden on the load balancing switch. In addition, transmitting the data by UDP unicast reduces the burden on the network bandwidth.<br />

Figure 1. System Architecture<br />

TABLE I. RESOURCE UTILIZATION METRICS<br />

CPU | Memory | Network | Disk<br />
User Space % | Memory Used % | Rx packets/sec | Read req/sec<br />
Kernel Space % | Swapper Used % | Tx packets/sec | Write req/sec<br />
IO Wait % | | Rx bytes/sec | Read blocks/sec<br />
 | | Tx bytes/sec | Write blocks/sec<br />
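The linear load model and the least-load choice can be illustrated as follows. This is a minimal sketch under stated assumptions: the coefficient values, the three metrics used and the server names are hypothetical, since the paper tunes its coefficients empirically and does not list them.<br />

```python
def composite_load(metrics, coeffs):
    """Evaluate T = a0 + a1*U1 + ... + an*Un for one server.

    metrics: list [U1, ..., Un]; coeffs: list [a0, a1, ..., an].
    """
    if len(coeffs) != len(metrics) + 1:
        raise ValueError("need one intercept plus one coefficient per metric")
    return coeffs[0] + sum(a * u for a, u in zip(coeffs[1:], metrics))


def least_loaded(loads):
    """Return the name of the server whose reported composite load is smallest."""
    return min(loads, key=loads.get)


# Hypothetical weights for three metrics: user CPU %, memory used %, IO wait %.
coeffs = [0.0, 0.5, 0.3, 0.2]
loads = {
    "web1": composite_load([80.0, 60.0, 10.0], coeffs),
    "web2": composite_load([30.0, 70.0, 5.0], coeffs),
}
print(loads, least_loaded(loads))
```

In the architecture described above, each web server would evaluate its own load and push the value to the switch, which only needs the final comparison.<br />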

C. Load Management<br />

While in the listening state, the load management module creates a child thread whenever it receives a server’s composite load information and passes the current load value to the llac module in kernel space using the IPVSadm management tool. The frequency statistics module is mounted on the IP_LOCAL_IN hook of Netfilter [12] in order to parse request packets and count access frequencies. The priority of the frequency statistics function must be higher than that of IPVS; otherwise the request would be forwarded at this point and would never reach the frequency statistics function. We use hash lists to speed up access and lookup. The entries are linked together through a general list head pointer inside the structure. We sort the files in descending order of access frequency, with sorted_list pointing to the sorted list (Fig. 2), so that hot file information can be obtained easily.<br />
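The bookkeeping described above, a hash table of per-file counters plus a list ordered by frequency, can be illustrated in user space. The sketch below is not the kernel module: the class and method names are hypothetical, and Python's Counter stands in for the hash lists and sorted_list.<br />

```python
from collections import Counter


class FrequencyStats:
    """Counts requests per file and lists the most frequently requested files."""

    def __init__(self):
        self.counts = Counter()  # hash table: file path -> access count

    def record(self, path):
        """Register one request for the given file path."""
        self.counts[path] += 1

    def hot_files(self, k):
        """Return the k most requested paths, most frequent first."""
        return [path for path, _ in self.counts.most_common(k)]


stats = FrequencyStats()
for path in ["/a.html", "/b.html", "/a.html", "/c.html", "/a.html", "/c.html"]:
    stats.record(path)
print(stats.hot_files(2))
```

The switch would use such a ranking to decide which files the back-end servers are asked to cache.<br />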

D. Cache Replacement<br />

We use a memory file system, carved out of memory, as the hot file cache. Our novelty lies in using link change technology: the file’s location on disk is replaced by a symbolic link that points to the cached copy in the memory file system. This brings several benefits. We do not need to modify the URL information, and the service program requires no modification to access the cached file from memory. Whether a file resides on disk or in the memory file system is transparent to users and service programs.<br />

Because RAM is limited, when the memory file system does not have enough space for a file that needs to be cached, cached files must be evicted according to a replacement policy. We use the following cache replacement strategy. We sort the files according to the ratio of file access frequency to file size. When files need to be replaced, those with the smallest ratio are evicted first. In this way, files with a low access frequency and a large size are replaced first. To a certain extent, this improves both the cache hit rate and the cache utilization.<br />

Figure 2. File Frequency Statistics<br />

IV. EXPERIMENTAL RESULTS<br />

To analyze the proposed dispatching algorithm, we implemented it on a web server cluster. The experimental testbed has the hardware and software configurations described below.<br />

A. Hardware setup<br />

The web server cluster consists of a load balancing switch node connected to the web server nodes. All nodes are connected through a high-speed gigabit LAN switch. The distributed architecture of the cluster is hidden from the clients behind a unique virtual IP address. The hardware environment is shown in TABLE II.<br />

TABLE II. HARDWARE ENVIRONMENT<br />

Node | CPU | Memory (GB) | HD | NIC<br />
Front-end | Intel(R) E5345 2.33GHz 8cores | DDR2 4 | 60GB | 80003ES2LAN<br />
Back-ends (1-4) | Intel(R) E5345 2.33GHz 8cores | DDR2 4 | 60GB | 80003ES2LAN<br />

B. Software Setup<br />

TABLE III. SOFTWARE ENVIRONMENT<br />

Node | OS | Kernel | IPVS | Web server | Benchmarks<br />
LB switch | Red Hat Linux5.0 | 2.6.18 | 1.0.4 | - | -<br />
Web server | Red Hat Linux5.0 | 2.6.18 | - | Apache 2.0.40 | -<br />
Client | Red Hat Linux5.0 | 2.6.18 | - | - | WebBench 5.0<br />

TABLE III shows the experimental software environment. All machines in the cluster run Linux kernel 2.6.18, and the load balancing switch uses IPVS for request dispatching. The web servers run Apache 2.0.40 for HTTP service, and HTTP/1.1 connections are used. In addition, all clients have WebBench [13] installed for generating requests.<br />

WebBench is performance testing software for web servers and includes both a controller and clients. The controller directs the clients to issue requests, records and summarizes the experimental data, and outputs the results. In addition, WebBench can control the mix of request types sent by the clients through a programmable workload.<br />

We perform all experiments to analyze system performance under different ratios of request types (e.g. different localities of hot web pages). We also create a workload generator to generate synthetic workloads for various ratios of request types. The performance metrics are requests per second (req/s), megabits per second (Mbps) and the number of successful requests, as summarized and reported by WebBench.<br />

C. Experimental evaluation<br />

In this section, we present a performance evaluation of<br />

our proposed llac request distribution algorithm. In this<br />

test, WebBench is used, and the hot web pages are built from<br />

the requested web pages of the default workload.<br />

Furthermore, we set the click-through rate (CTR)<br />

to 20%, 40%, 60% and 80%, and vary the percentage of<br />

hot web pages among the requested web pages over 10%, 20%,<br />

30% and 40%. We compare the experimental results with<br />

those of wrr. We also<br />

compare llac with using only ll or only ac. Figs. 3, 4, and 5<br />

show that llac outperforms wrr, ll, and ac.<br />
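The parameter settings listed above can be enumerated with a short Python sketch. It assumes that every policy is run against every combination of CTR and hot-page percentage, which the text does not state explicitly; the constant and function names are hypothetical.<br />

```python
from itertools import product

# Values taken from the text; the full cross product is an assumption.
CTR_LEVELS = (0.20, 0.40, 0.60, 0.80)
HOT_PAGE_SHARES = (0.10, 0.20, 0.30, 0.40)
POLICIES = ("wrr", "ll", "ac", "llac")


def experiment_grid():
    """Return one configuration dict per (policy, CTR, hot-page share)."""
    return [
        {"policy": policy, "ctr": ctr, "hot_share": hot_share}
        for policy, ctr, hot_share in product(POLICIES, CTR_LEVELS, HOT_PAGE_SHARES)
    ]
```

Under this assumption the grid contains 4 x 4 x 4 = 64 runs.<br />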

Figure 5. Number of successful requests<br />

The llac policy performs better than wrr and ll<br />

because it uses a frequency-based<br />

mechanism to achieve high cache hit rates on the servers. The<br />

llac policy performs better than ac because<br />

it also considers each server’s current load, assessing<br />

that load and selecting an appropriate node to<br />

respond to the request.<br />
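The reasoning above can be illustrated with a small Python sketch of a dispatcher that combines request-frequency tracking with a load check. This is not the authors' llac implementation: the class name, the hot_threshold and max_load parameters, and the rule of binding a hot URL to the server that last received it are all hypothetical choices made only to show how cache affinity and load awareness can interact.<br />

```python
from collections import Counter


class HotAwareDispatcher:
    """Illustrative only: cache affinity for hot URLs, least load otherwise."""

    def __init__(self, servers, hot_threshold=3, max_load=10):
        self.load = {s: 0 for s in servers}  # active requests per server
        self.hits = Counter()                # request count per URL
        self.owner = {}                      # hot URL -> server assumed to cache it
        self.hot_threshold = hot_threshold   # hypothetical hotness cutoff
        self.max_load = max_load             # hypothetical overload cutoff

    def _least_loaded(self):
        # Ties resolve to the first server in insertion order.
        return min(self.load, key=self.load.get)

    def dispatch(self, url):
        self.hits[url] += 1
        server = self.owner.get(url)
        # Reuse the caching server only while it is not overloaded.
        if server is None or self.load[server] >= self.max_load:
            server = self._least_loaded()
            if self.hits[url] >= self.hot_threshold:
                self.owner[url] = server
        self.load[server] += 1
        return server

    def finish(self, server):
        """Call when a request completes to release its load slot."""
        self.load[server] -= 1
```

A frequency-only policy would keep sending a hot URL to its caching server even when that server is saturated, while a load-only policy would scatter hot URLs across servers and lose cache hits; the sketch falls back to the least-loaded node only when the caching server reaches max_load.<br />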

Experimental results demonstrate that when the web<br />

server cluster is under heavy load, the llac policy can<br />

handle more requests and show better performance.<br />

Figure 3. Number of requests per minute<br />

Figure 4. Data transfer per second<br />

V. SUMMARY<br />

This paper presents a new request distribution<br />

algorithm for web server clusters, called llac. The<br />

algorithm focuses on reducing access time for hot files and<br />

on using web server resources more efficiently. Our<br />

experimental results show that the proposed llac policy<br />

performs better than wrr, ll, and ac under<br />

heavy load. The policy reduces response time,<br />

especially for hot files, because hot files are retrieved<br />

from memory. The node with the least load is selected to<br />

serve each request, which improves resource utilization.<br />

In future work we plan to experiment<br />

with more benchmarks to further verify the effectiveness of<br />

llac.<br />

ACKNOWLEDGMENT<br />

This study is sponsored by the fund of the State Key<br />

Laboratory of Software Development Environment under<br />

Grant No. SKLSDE-2009ZX-01, the National Natural<br />

Science Foundation of China under Grant No. 61003015,<br />

the Doctoral Fund of Ministry of Education of China<br />

under Grant No. 20101102110018 and the Fundamental<br />

Research for the Central Universities under Grant No.<br />

YWF-10-02-058.


REFERENCES<br />

[1] A. Chandra, P. Pradhan, R. Tewari, S. Sahu, P. Shenoy.<br />

"An observation-based approach towards self-managing<br />

web servers", Computer Communications, 2006, pp1174-<br />

1188.<br />

[2] V. Cardellini, E. Casalicchio, M. Colajanni, S. Tucci.<br />

"Mechanisms for quality of service in web clusters",<br />

Computer Networks, vol.37, No.6, 2001, pp761-771.<br />

[3] M.E. Crovella, A. Bestavros. "Self-Similarity in World<br />

Wide Web Traffic: Evidence and Possible Causes",<br />

IEEE/ACM Transactions on Networking, vol.5, No.6,<br />

1997, pp835-846.<br />

[4] V. Cardellini, E. Casalicchio, M. Colajanni, and P.S. Yu.<br />

"The State of the Art in Locally Distributed Web-Server<br />

Systems", ACM Computing Surveys, vol.34, No.2, 2002,<br />

pp 263-311.<br />

[5] M. Andreolini, E. Casalicchio. "A cluster-based web<br />

system providing differentiated and guaranteed services",<br />

Cluster Computing, vol.7, No.1, 2004, pp7-19.<br />

[6] E. Choi. "Performance test and analysis for an adaptive<br />

load balancing mechanism on distributed server cluster<br />

systems", Future Generation Computer Systems, No.20,<br />

2004, pp 237-247.<br />

[7] V.S. Pai, M. Aron, G. Banga. "Locality-Aware Request<br />

Distribution in Cluster-based Network Servers", ACM<br />

SIGOPS Operating Systems Review, USA:ACM, 1998,<br />

pp205-216.<br />

[8] L. Cherkasova, M. Karlsson. "Scalable Web Server<br />

Cluster Design with Workload-Aware Request<br />

Distribution Strategy WARD", Advanced Issues of E-Commerce<br />

and Web-Based Information Systems,<br />

Washington:IEEE Computer Society, 2001, pp212-221.<br />

[9] E. Casalicchio, M. Colajanni. "A client-aware dispatching<br />

algorithm for Web clusters providing multiple services",<br />

The International World Wide Web Conference<br />

Committee (IW3C2), 2001, pp535-544.<br />

[10] Sysstat-7.0.4. http://perso.orange.fr/sebastien.godard/<br />

[11] M. Andreolini, S. Casolari, M. Colajanni. "Models<br />

and framework for supporting runtime decisions in Web-based<br />

systems", ACM Transactions on the Web (TWEB),<br />

vol.2, No.3, 2008, pp1-43.<br />

[12] CHRISTIAN BENVENUTI. Understanding LINUX<br />

NETWORK INTERNALS. 2006.<br />

[13] http://linux.softpedia.com/get/System/Benchmarks/Webbench-1378.shtml<br />

[14] M.L. Chiang, Y.C. Lin, L.F. Guo. "Design and<br />

implementation of an efficient web cluster with content-based<br />

request distribution and file caching", Journal of<br />

Systems and Software, vol.81, No.11, 2008, pp 2044-2058.<br />

[15] S. Sharifian, S.A. Motamedi, M.K. Akbari. "A content-based<br />

load balancing algorithm with admission control for<br />

cluster web servers", Future Generation Computer<br />

Systems, vol.24, No.8, 2008, pp775-787.<br />

[16] M.L. Chiang, C.H. Wu, Y.J. Liao, Y.F. Chen. "New<br />

Content-aware Request Distribution Policies in Web<br />

Clusters Providing Multiple Services", Proceedings of the<br />

2009 ACM symposium on Applied Computing,<br />

USA:ACM, 2009, pp79-83.<br />

[17] Z.Y. Xu, J.Z. Han, L. Bhuyan. "Scalable and<br />

Decentralized Content-Aware Dispatching in Web<br />

Clusters", IEEE International Performance, Computing,<br />

and Communications, Washington:IEEE Computer<br />

Society, 2007, pp202-209.<br />

[18] Y.K. Chang. "Fully Pre-Splicing TCP for Web Switches",<br />

Proceedings of the First International Conference on<br />

Innovative Computing, Information and Control,<br />

Washington:IEEE Computer Society, 2006, pp737-740.<br />

[19] S. Chase, D.C. Anderson. "Managing energy and server<br />

resources in hosting centers", In Proc. of the eighteenth<br />

ACM symposium on Operating systems principles, 2001,<br />

pp103-116.<br />

[20] Tarek F. Abdelzaher, Kang G. Shin, and Nina Bhatti.<br />

"Performance Guarantees for Web Server End-Systems: A<br />

Control-Theoretical Approach", IEEE Transactions on<br />

Parallel and Distributed Systems, June 2001.<br />

[21] Yasushi Saito, Brian N. Bershad, and Henry M. Levy. "An<br />

approximation-based load-balancing algorithm with<br />

admission control for cluster web servers with dynamic<br />

workloads", Journal of Supercomputing, vol.53, No.3,<br />

2010, pp 440-463.<br />

Wei Zhang was born in Hebei Province, China, in December 1983. She is a PhD<br />

candidate in the Department of Computer Science and Technology at<br />

Beihang University. She received her master’s degree in 2008. Her research<br />

interests include virtualization, load balancing and cloud computing.<br />

Huan Wang was born in Hunan Province, China, in October 1986, and holds a<br />

master’s degree in Computer Science and Engineering from the Department of<br />

Computer Science, Beihang University. Research interests include operating systems,<br />

load balancing, parallel computing and massive data processing.<br />

Binbin Yu was born in July 1987 and is currently studying for a master’s<br />

degree in the Computer Science College at Beihang University, concentrating<br />

mainly on load balancing and cloud computing.<br />

Wei Xu was born in Fujian Province, China, in November 1986, and holds a<br />

master’s degree in Computer Science and Technology from the Department of<br />

Computer Science and Engineering, Beihang University. Research interests include<br />

virtualization, operating systems, high performance computing and cloud computing.


Mingfa Zhu was born in 1945. He holds a Ph.D. and is a Professor and a<br />

senior member of the China Computer Federation. His main research<br />

areas are computer architecture, computer system software, high<br />

performance computing, virtualization and cloud computing.<br />

Limin Xiao was born in 1970. He holds a Ph.D. and is a Professor and a<br />

senior member of the China Computer Federation. His main research<br />

areas are computer architecture, computer system software, high<br />

performance computing, virtualization and cloud computing.<br />

Li Ruan was born in 1978. She holds a Ph.D. and is a Lecturer and a<br />

member of the China Computer Federation. Her main research areas are<br />

computer architecture, computer system software, high performance computing,<br />

virtualization and cloud computing.


Aims and Scope.<br />

Call for Papers and Special Issues<br />

Journal of Networks (JNW, ISSN 1796-2056) is a scholarly peer-reviewed international scientific journal published monthly, focusing on theories,<br />

methods, and applications in networks. It provides a high profile, leading edge forum for academic researchers, industrial professionals, engineers,<br />

consultants, managers, educators and policy makers working in the field to contribute and disseminate innovative new work on networks.<br />

The Journal of Networks reflects the multidisciplinary nature of communications networks. It is committed to the timely publication of high-quality<br />

papers that advance the state-of-the-art and practical applications of communication networks. Theoretical research contributions<br />

(presenting new techniques, concepts, or analyses), applied contributions (reporting on experiences and experiments with actual systems), and<br />

tutorial expositions of permanent reference value are all published. The topics covered by this journal include, but are not limited to, the following:<br />

• Network Technologies, Services and Applications, Network Operations and Management, Network Architecture and Design<br />

• Next Generation Networks, Next Generation Mobile Networks<br />

• Communication Protocols and Theory, Signal Processing for Communications, Formal Methods in Communication Protocols<br />

• Multimedia Communications, Communications QoS<br />

• Information, Communications and Network Security, Reliability and Performance Modeling<br />

• Network Access, Error Recovery, Routing, Congestion, and Flow Control<br />

• BAN, PAN, LAN, MAN, WAN, Internet, Network Interconnections, Broadband and Very High Rate Networks<br />

• Wireless Communications & Networking, Bluetooth, IrDA, RFID, WLAN, WiMAX, 3G, Wireless Ad Hoc and Sensor Networks<br />

• Data Networks and Telephone Networks, Optical Systems and Networks, Satellite and Space Communications<br />

Special Issue Guidelines<br />

Special issues feature specifically aimed and targeted topics of interest contributed by authors responding to a particular Call for Papers or by<br />

invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are of interest to the Journal.<br />

Preference will be given to proposals that cover some unique aspect of the technology and ones that include subjects that are timely and useful to the<br />

readers of the Journal. A Special Issue typically consists of 10 to 15 papers, each 8 to 12 pages in length.<br />

The following information should be included as part of the proposal:<br />

• Proposed title for the Special Issue<br />

• Description of the topic area to be focused upon and justification<br />

• Review process for the selection and rejection of papers<br />

• Name, contact, position, affiliation, and biography of the Guest Editor(s)<br />

• List of potential reviewers<br />

• Potential authors to the issue<br />

• Tentative time-table for the call for papers and reviews<br />

If a proposal is accepted, the guest editor will be responsible for:<br />

• Preparing the “Call for Papers” to be included on the Journal’s Web site.<br />

• Distributing the Call for Papers broadly to various mailing lists and sites.<br />

• Getting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be<br />

informed of the Instructions for Authors.<br />

• Providing us the completed and approved final versions of the papers formatted in the Journal’s style, together with all authors’ contact<br />

information.<br />

• Writing a one- or two-page introductory editorial to be published in the Special Issue.<br />

Special Issue for a Conference/Workshop<br />

A special issue for a Conference/Workshop is usually released in association with committee members of the Conference/Workshop, such as<br />

general chairs and/or program chairs, who are appointed as the Guest Editors of the Special Issue. A Special Issue for a Conference/Workshop<br />

typically consists of 10 to 15 papers, each 8 to 12 pages in length.<br />

Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop:<br />

• Selecting a Title for the Special Issue, e.g. “Special Issue: Selected Best Papers of XYZ Conference”.<br />

• Sending us a formal “Letter of Intent” for the Special Issue.<br />

• Creating a “Call for Papers” for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees.<br />

Information about the Journal and Academy Publisher can be included in the Call for Papers.<br />

• Establishing criteria for paper selection/rejection. The papers can be nominated based on multiple criteria, e.g. rank in review process plus<br />

the evaluation from the Session Chairs and the feedback from the Conference attendees.<br />

• Selecting and inviting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors.<br />

Authors should be informed of the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.<br />

• Providing us the completed and approved final versions of the papers formatted in the Journal’s style, together with all authors’ contact<br />

information.<br />

• Writing a one- or two-page introductory editorial to be published in the Special Issue.<br />

More information is available on the web site at http://www.academypublisher.com/jnw/.


(Contents Continued from Back Cover)<br />

Covert Flow Graph Approach to Identifying Covert Channels 1740<br />

XiangMei Song and ShiGuang Ju<br />

A Novel HAVE Message of Peer-to-peer Protocol in BitTorrent Systems 1747<br />

Jianyong Li, Jianchun Li, Daoying Huang, and Qiang Wei<br />

Image-based Position Estimation and Adaptive Modulation Coding in Vehicular Communication 1754<br />

Hao Yang, Qingmin Meng, Xiong Gu, and Baoyu Zheng<br />

A Request Distribution Algorithm for Web Server Cluster 1760<br />

Wei Zhang, Huan Wang, Binbin Yu, Wei Xu, Mingfa Zhu, Limin Xiao, and Li Ruan<br />
