
3. Machine Learning

tree learning algorithms, developed independently between the late 1970s and early 1980s: Iterative Dichotomiser 3 (ID3) by J. Ross Quinlan [Quinlan, 1986] and Classification and Regression Trees (CART) by L. Breiman, J. Friedman, R. Olshen, and C. Stone [Breiman et al., 1984]. The idea is quite simple and straightforward: at each step, the set of examples is divided into smaller subsets by splitting it on the values (or value ranges) of one feature. The feature is selected such that it optimally discriminates the set with respect to the classes. Hence, an ideal feature would split the example set into subsets that are pure (i.e., after the partition, all examples in each subset belong to the same class). Since most likely no feature will split the example set into pure subsets, different measures for the quality of a split have been proposed.

**Information Gain** The information gain criterion uses the entropy of a set of examples to find the optimal feature for splitting the set into subsets. The entropy of a set S measures the impurity of S and is defined as

$$E(S) = -\sum_{j=1}^{k} P(C_j) \log_2(P(C_j))$$

where P(C_j), j ∈ {1, ..., k}, denotes the probability that an example in S belongs to class C_j.
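As a minimal sketch (the function name and the plain-Python representation are my own, not from the text), the entropy can be estimated from the relative class frequencies of the examples in S:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy E(S) of an example set S, given the class label of each example."""
    n = len(labels)
    # P(C_j) is estimated as the relative frequency of class C_j in S.
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())
```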

We now assume that we split S into v subsets {S_1, S_2, ..., S_v} based on the value of a feature A in S. Afterwards, we calculate how helpful this split would be by computing the accumulated weighted entropy over all resulting subsets S_m, m ∈ {1, ..., v}:

$$E_A(S) = \sum_{m=1}^{v} \frac{|S_m|}{|S|} \times E(S_m)$$

The information gain Gain(A) of feature A is then defined as

$$\mathrm{Gain}(A) = E(S) - E_A(S)$$

and the feature with the highest information gain Gain(A) is chosen to split the example set S. For nominal features, the number v of subsets is usually the number of possible outcomes of the feature. For numerical features, any v can be chosen and v value ranges for the numerical feature can be defined. The information gain criterion was introduced in ID3.
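Building on the entropy sketch above, and assuming (my convention, not the text's) that each example is a dict mapping feature names to nominal values, Gain(A) could be computed as follows:

```python
def information_gain(examples, labels, feature):
    """Gain(A) = E(S) - E_A(S) for a nominal feature A."""
    n = len(labels)
    # Partition S into the subsets S_1, ..., S_v, one per value of the feature.
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[feature], []).append(y)
    # Accumulated weighted entropy E_A(S) over all resulting subsets S_m.
    e_a = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - e_a
```

A tree learner would evaluate this score for every candidate feature and split on the one with the highest gain.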

**Gain Ratio** With the information gain criterion, features with a large number of different values are more likely to be chosen than other features. Especially in a setting where one feature holds a unique value for each example (e.g., an ID), this feature would be selected, as it splits the example set in a way that each subset is pure. Hence, it makes sense to penalize features by their number of possible values v. This is done by the gain ratio criterion. The gain ratio GainRatio(A) of a feature A is defined as

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{SE_A(S)}$$

with

$$SE_A(S) = -\sum_{m=1}^{v} \frac{|S_m|}{|S|} \times \log_2\left(\frac{|S_m|}{|S|}\right)$$
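Again as a hedged sketch reusing the helpers above (Counter, log2, entropy, and information_gain), the gain ratio divides Gain(A) by the split entropy SE_A(S):

```python
def gain_ratio(examples, labels, feature):
    """GainRatio(A) = Gain(A) / SE_A(S) for a nominal feature A."""
    n = len(labels)
    counts = Counter(x[feature] for x in examples)
    # Split entropy SE_A(S): the entropy of the partition itself. It grows with
    # the number of values v, which penalizes many-valued features such as IDs.
    se_a = -sum((c / n) * log2(c / n) for c in counts.values())
    if se_a == 0.0:
        return 0.0  # the feature has only one value and yields no split
    return information_gain(examples, labels, feature) / se_a
```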
