Online Boosting Based Intrusion Detection in Changing Environments

too high to be used in practice. Third, the variety of attributes of network data is also a difficult issue. There are various types of attributes for network data, including both categorical and continuous ones. Furthermore, the value ranges of different attributes differ greatly, from [0, 1] to [0, 10^7]. This brings more difficulties for many detection methods and limits their performance.

In this paper, an online boosting based method is proposed for intrusion detection. The carefully designed strategies for training weak classifiers make the learning efficient, and the detection rate balance scheme produces strong detection performance for the final ensemble classifier. Furthermore, the online learning framework provides the ability of quick adaptation to changing environments: only one pass needs to be made over the training data of the new intrusion types. The other training data do not have to be revisited, and the detector is efficiently updated online. Experimental results show that the proposed method quickly adapts to changing environments, and can perform real-time intrusion detection with high detection accuracy.

The rest of the paper is organized as follows. In Section 2 we introduce the online boosting based intrusion detection algorithm, and describe its relation to the batch boosting based detection scheme. In Section 3 experimental results are presented that show the advantage of the proposed method. We draw conclusions in the last section.

2 Online boosting based ID

2.1 Problem formulation

In network-based IDSs, training and detection are performed at network nodes such as switches and routers. Three groups of features are extracted from each network connection:

basic features of individual TCP connections;

content features within a connection suggested by domain knowledge;

traffic features computed using a two-second time window.

The above features are commonly used in intrusion detection. The framework for constructing these features can be found in [2].
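One way to picture such a connection record is a small typed structure covering the three feature groups. The field names below are invented for illustration; the actual attribute set is the one defined in [2].

```python
from dataclasses import dataclass

@dataclass
class ConnectionFeatures:
    # basic features of the individual TCP connection
    duration: float         # continuous
    protocol_type: str      # categorical, e.g. "tcp"
    src_bytes: int
    # content features suggested by domain knowledge
    num_failed_logins: int
    root_shell: bool
    # traffic features over a two-second time window
    count: int              # connections to the same host in the window
    serror_rate: float      # fraction of connections with SYN errors

# one hypothetical connection record
conn = ConnectionFeatures(0.0, "tcp", 181, 0, False, 9, 0.0)
```

Note the mix of categorical and continuous fields and the very different value ranges, which is exactly the difficulty described earlier.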

For each network connection, the feature values extracted above form a vector x = [x_1, x_2, ..., x_d], where d is the number of features extracted. The label y indicates the binary class of the network connection:

y = +1 (normal connection), y = -1 (network intrusion).    (1)

The intrusion detection algorithm is expected to train a classifier H from the labeled training data set, then use the classifier H to predict the binary label ỹ = H(x) for a new network connection. Note that there are many types of intrusions, such as password guessing, port scanning, neptune, and satan, and new intrusion types constantly emerge in the network. However, they are all expected to be classified into the network intrusion class, no matter which type of intrusion they are.
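In code, the formulation of Eq. (1) amounts to a function from a feature vector to a label in {+1, -1}. The decision rule below is a throwaway placeholder used only to show the interface, not the paper's classifier.

```python
NORMAL, INTRUSION = 1, -1

def H(x):
    """Placeholder classifier: flags connections whose mean feature
    value exceeds an arbitrary threshold. Illustrative only."""
    return INTRUSION if sum(x) / len(x) > 0.5 else NORMAL

x = [0.1, 0.9, 0.8]   # feature vector with d = 3
y_hat = H(x)          # predicted binary label, +1 or -1
```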

2.2 Adaboost algorithm

Adaboost [13] is one of the most popular machine learning algorithms developed in recent years, and has been successfully used in many applications, such as face detection [14] and image retrieval [15].

The classical Adaboost algorithm is trained in batch mode, as shown in Table 1. Note that N is the number of training samples, M is the number of weak classifiers to generate, and L_b is the base model learning algorithm, such as Naive Bayes or decision stumps. A sequence of weak classifiers is learned based on the evolving sampling distribution of the training data set. The final strong classifier is an ensemble of the weak classifiers, and the voting weights are derived from the classification errors of these weak classifiers.
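The batch procedure summarized above (Table 1 is not reproduced in this excerpt) can be sketched as follows. This is an illustrative reconstruction assuming one-dimensional decision stumps as the base learner L_b; it uses the common exponential reweighting form, which after normalization matches the halving rule of Eq. (2).

```python
import math

def train_adaboost(X, y, M=5):
    """Batch Adaboost sketch with 1-D decision stumps as base learner."""
    N = len(X)
    w = [1.0 / N] * N                    # uniform initial sample weights
    ensemble = []                        # list of (voting_weight, stump)
    for _ in range(M):
        # pick the stump (threshold, polarity) with the lowest
        # weighted error on the current distribution
        best = None
        for thr in sorted(set(X)):
            for pol in (1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if (pol if xi > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # voting weight from error
        stump = lambda x, t=thr, p=pol: p if x > t else -p
        ensemble.append((alpha, stump))
        # reweight: raise weights of misclassified samples, lower the rest
        w = [wi * math.exp(-alpha * yi * stump(xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# toy data: the first two samples are intrusions (-1), the rest normal (+1)
X = [0.0, 1.0, 2.0, 3.0]
y = [-1, -1, 1, 1]
ensemble = train_adaboost(X, y)
```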

The evolving weight w_n^(m) plays a key role in Adaboost. It indicates the importance of the n-th training sample while generating the m-th weak classifier. The weight w_n^(m) is updated in the following way:

w_n^(m+1) = w_n^(m) / (2 ε^(m))           if h^(m)(x_n) ≠ y_n
w_n^(m+1) = w_n^(m) / (2 (1 - ε^(m)))     if h^(m)(x_n) = y_n      (2)

where ε^(m) is the weighted classification error of the m-th weak classifier h^(m). The weights of samples that are wrongly classified by the current weak classifier are increased, and the others decreased, so that more attention is paid to the samples that are difficult to classify while generating the next weak classifier.
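A quick numerical illustration of this update (the function and values below are illustrative, not from the paper): after applying the rule, the misclassified samples carry exactly half of the total weight, so the next weak classifier pays equal attention to hard and easy samples.

```python
def update_weights(w, correct, eps):
    """Apply the Eq. (2) update: w/(2*eps) for misclassified samples,
    w/(2*(1-eps)) for correctly classified ones."""
    return [wi / (2 * eps) if not ok else wi / (2 * (1 - eps))
            for wi, ok in zip(w, correct)]

# four samples with uniform weight; the last one is misclassified,
# so the weighted error eps equals its weight, 0.25
w = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
w_next = update_weights(w, correct, eps=0.25)
# w_next sums to 1, and the single misclassified sample now holds
# weight 0.5, i.e. half of the total mass
```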

2.3 Online boosting


In many applications, the learning process needs to be performed in online mode. For example, when the training data are generated as data streams, or the size of the training data set is too large for memory resources, it is impractical for the learning algorithm to make multiple passes over the entire training data set. Training a classifier in online mode is necessary in these cases. This opens a hot research field called "online learning" or "incremental learning". Online learning means we process a training sample, then discard it after updating the classifier. There is likely some difference between the two classifiers trained in batch and online modes, since the online learning algorithm can only make one pass over the training data.

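The one-pass constraint can be sketched as a generic training loop in which each labeled sample updates the model once and is then discarded. `OnlineModel` and its trivial majority-class rule are assumptions for illustration, not the paper's detector.

```python
class OnlineModel:
    """Trivial stand-in for an online learner: tracks class counts."""
    def __init__(self):
        self.pos = 0
        self.neg = 0

    def update(self, x, y):
        # learn from one sample; the sample is never stored or revisited
        if y == 1:
            self.pos += 1
        else:
            self.neg += 1

    def predict(self, x):
        # predict the majority class seen so far
        return 1 if self.pos >= self.neg else -1

def train_online(model, stream):
    """Single pass over a stream of (x, y) pairs."""
    for x, y in stream:
        model.update(x, y)     # update, then move on; no second pass
    return model
```

In contrast to the batch setting, memory usage here is independent of the number of training samples, which is what makes stream-scale intrusion detection feasible.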
