CLASSIFICATION AND PREDICTION - Universität Wien

Peter Brezany Institut für Softwarewissenschaft, WS 2002 1 

CLASSIFICATION AND PREDICTION 

Slide 1 

Peter Brezany 

Institut für Softwarewissenschaft 

Universität Wien 

Introduction 

Artificial neural systems represent the promising new generation of information processing 

systems. 

They have a proven track record in many data mining and decision support applications. 

Slide 2 

Neural networks (NN) - the "artificial" is usually dropped - are a class of very powerful, general-purpose tools readily applied to prediction, classification and clustering.

Successful application across a broad range of industries: 

predicting financial series 

diagnosing medical conditions 

identifying clusters of valuable customers 

identifying fraudulent credit card transactions 

recognizing numbers written on checks 

predicting the failure rates of engines


Introduction (2) 

People are good at generalizing from experience. 

Computers usually excel at following explicit instructions over and over. 

Slide 3 

NN bridge this gap by modeling, on a computer, the neural connections in human brains.

Their ability to generalize and learn from data mimics our own ability to learn from 

experience. 

This ability is useful for data mining. 

Drawback: The results of training a NN are internal weights distributed throughout the 

network. These weights provide no more insight into why the solution is valid than asking 

many human experts why a particular decision is the right decision. They just know that it 

is. 

A Bit of History 

1943 (neurologist Warren McCulloch and logician Walter Pitts) - the original work on how neurons work (no digital computers available at that time)

Slide 4 

1950s - computer scientists implemented models called perceptrons based on the work of McCulloch and Pitts. There were some limited successes with perceptrons in the laboratory, but the results were disappointing for general problem-solving. One of the reasons: there were no powerful computers available.

1970s - the study of NN implementations on computers slowed down drastically. 

1982 - John Hopfield's work on neural networks, followed in the mid-1980s by the popularization of backpropagation, a way of training NN, started a renaissance in NN research.

1980s - research moved from the labs into the commercial world. 

NN have been applied in virtually every industry field.


Real Estate Appraisal 

Slide 5 

NN are used to replace the appraiser who estimates the market value of a home based on a set of features: location (town, country), garage, house style and a lot of other factors that figure into her mental calculation. She is not applying some formula, but balancing her experience and knowledge of the sales prices of similar homes. And her knowledge about housing prices is not static. She is aware of recent sale prices for homes throughout the region and can recognize trends in prices over time - she fine-tunes her calculation to fit the latest data.

In 1992, researchers at IBM recognized this as a good problem for NN. The figure on the 

next slide illustrates why. 

A Neural Network as an Opaque Block 

[Figure: the inputs (living space, size of garage, age of house, etc.) enter the Neural Network Model, which produces one output, the appraised value.]

Slide 6 

A neural network model calculates the appraised value (the output) from the inputs. 

The calculation is a complex process that we do not need to understand to use the 

appraised values. 

During the first phase, we need to train the network using examples of previous sales. An example from the training set is shown in Tab. 1.

A Technical Detail: NN work best when all the input and output values are between 0 and 1. This requires massaging (rescaling) all the values, both continuous and categorical, to get new values between 0 and 1.

The training set example with massaged values is in Tab. 2.


Tab. 1: Training Set Example 

Slide 7

Feature            Description                                 Value
Sales_Price        The sales price of the house                $171,000
Months_Ago         How many months ago it was sold             4
Num_Apartments     Number of dwelling units                    1
Year_Built         Year built                                  1923
Plumbing_Fixtures  Number of plumbing fixtures                 9
Heating_Type       Heating system type                         A
Basement_Garage    Basement garage (number of cars)            0
Attached_Garage    Attached frame garage area (square feet)    120
Living_Area        Total living area (square feet)             1,614
Deck_Area          Deck / open porch area (square feet)        0
Porch_Area         Enclosed porch area (square feet)           210
Recroom_Area       Recreation room area (square feet)          0
Basement_Area      Finished basement area (square feet)        175

Tab. 2: Massaged Values for Training Set Example 

Slide 8

Feature            Range of Values      Original Value  Massaged Value
Sales_Price        $103,000-$250,000    $171,000        0.4626
Months_Ago         0-23                 4               0.1739
Num_Apartments     1-3                  1               0.0000
Year_Built         1850-1986            1923            0.5328
Plumbing_Fixtures  5-17                 9               0.3333
Heating_Type       Coded as A or B      B               1.0000
Basement_Garage    0-2                  0               0.0000
Attached_Garage    0-228                120             0.5263
Living_Area        714-4185             1,614           0.2593
Deck_Area          0-738                0               0.0000
Porch_Area         0-452                210             0.4646
Recroom_Area       0-672                0               0.0000
Basement_Area      0-810                175             0.2160
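The massaging step can be sketched as min-max scaling. This is only an illustration: the function name `massage` and the mapping of the heating codes to 0 and 1 are assumptions of this sketch, not something the slides prescribe. The ranges and original values are the ones listed in Tab. 2.

```python
def massage(value, low, high):
    """Min-max scale a value from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)


# Ranges and original values taken from Tab. 2.
sales_price = massage(171_000, 103_000, 250_000)
months_ago = massage(4, 0, 23)
living_area = massage(1_614, 714, 4_185)

# A two-valued category can be coded as 0 or 1. Which letter maps to
# which number is an assumption made for this sketch.
heating_codes = {"A": 0.0, "B": 1.0}
heating_type = heating_codes["B"]

print(round(sales_price, 4), round(months_ago, 4), round(living_area, 4))
```

The three printed values agree with the massaged values given for Sales_Price, Months_Ago, and Living_Area in Tab. 2.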


How to use a NN? 

1. Training 

The NN is ready to be trained when the training examples have all been massaged. 

Slide 9 

During the training phase, we repeatedly feed the examples in the training set through the NN. The NN compares its predicted output value to the actual sales price and adjusts all its internal weights to improve the prediction.

By going through all the training examples (sometimes many times), the NN calculates a good set of weights. Training is complete when the weights no longer change very much or when the network has gone through the training set a maximum number of times.

2. Test (Evaluating) 

We run the network on a test set that it has never seen before. When the performance is satisfactory, we have a neural network model. The model is ready for use.

How to use a NN? (2) 

3. Production runs 

The NN model takes descriptive information about a house, suitably massaged, and produces an output.

Slide 10 

One problem: the output is a number between 0 and 1, so we need to unmassage the value to 

turn it back into a sales price. 

If we get a value like 0.75, then we multiply it by the size of the range ($147,000) and then add the base number in the range ($103,000) to get an appraisal value of $213,250.
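The unmassaging step is the inverse of the scaling applied to the inputs. A minimal sketch, assuming the Sales_Price range of $103,000 to $250,000 from Tab. 2 (the function name `unmassage` is chosen here for illustration):

```python
def unmassage(scaled, low, high):
    """Map a network output in [0, 1] back to the original range."""
    return low + scaled * (high - low)


# Sales_Price range from Tab. 2: $103,000 to $250,000.
appraisal = unmassage(0.75, 103_000, 250_000)
print(appraisal)  # 213250.0
```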

WARNING 

A neural network is only as good as the training set used to generate it. The model is static 

and must be explicitly updated by adding more recent examples into the training set and 

retraining the network (or training a new network) in order to keep it up-to-date and useful.


What Is a Neural Net? 

A NN consists of basic units modeled on the principles of biological neurons. These units are connected together as shown in the next figures.

Slide 11 

[Figure: a network with four inputs (input 1 to input 4) and three outputs (output 1 to output 3).]

A neural network can produce multiple output values.

What Is a Neural Net? (2) 

Slide 12

[Figure: four inputs connected directly to one output.] A very simple neural network takes four inputs and produces an output. The result of training this network is exactly equivalent to the statistical technique called logistic regression.

[Figure: four inputs, a hidden layer, and one output.] This network has a middle layer called the hidden layer. The hidden layer makes the network more powerful by enabling it to recognize more patterns.

[Figure: four inputs, a larger hidden layer, and one output.] Increasing the size of the hidden layer makes the network more powerful but introduces the risk of overfitting. Usually, only one hidden layer is needed.


What Is the Unit of a Neural Network? 

Artificial NNs are composed of basic units to model the behavior of biological neurons (see the next figure). The unit combines its inputs into a single output value. This combination is called the unit's transfer function.

Slide 13 

The output remains very low until the combined inputs reach a threshold value - then, the 

unit is activated and the output is high. 

Small changes in the inputs can have large effects on the output, and, conversely, large 

changes in the inputs of the unit may have little effect on the output. This property is called 

non-linear behavior. 

The most common combination function is the weighted sum. Other combination functions 

are sometimes useful, e.g., MAX, MIN, logical AND or OR of the weighted values. 

The following diagram (after the figure) compares 3 typical activation functions. By far, the most common activation function is the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$.

The Unit of an Artificial Neural Network 

Slide 14

[Figure: a unit with weighted inputs (w1, w2, w3), a transfer function made up of a combination function and an activation function, and one output.]

Each input has its own weight.

The combination function combines all the inputs into a single value, usually as a weighted summation.

The activation function calculates the output value from the result of the combination function.

The result is exactly one output value, usually between 0 and 1.
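A single unit can be sketched in a few lines. The sketch assumes the most common choices named above, a weighted sum as the combination function and the sigmoid as the activation function; the input values and weights are hypothetical and serve only as an illustration.

```python
import math


def sigmoid(x):
    """Sigmoid (logistic) activation: squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def unit_output(inputs, weights, bias=0.0):
    """Transfer function of one unit: combination, then activation."""
    # Combination function: weighted sum of the inputs (plus a bias).
    combined = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation function: turn the combined value into the output.
    return sigmoid(combined)


# Hypothetical inputs and weights, chosen only for this example.
out = unit_output([0.0, 0.5, 1.0], [0.2, -0.4, 0.1])
print(round(out, 3))
```

Swapping `sigmoid` for another function (for example `math.tanh`) changes only the activation step; the combination step stays the same.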


Three Common Activation Functions 

Slide 15

[Figure: plot of the sigmoid (logistic), linear, and exponential (tanh) activation functions for inputs from -10 to 10; the vertical axis runs from -1.5 to 1.5.]

The real estate training example in a NN 

Slide 16

[Figure: the massaged values of the training example (Num_Apartments 0.0000, Year_Built 0.5328, Plumbing_Fixtures 0.3333, Heating_Type 1.0000, Basement_Garage 0.0000, Attached_Garage 0.5263, Living_Area 0.2593, Deck_Area 0.0000, Porch_Area 0.4646, Recroom_Area 0.0000, Basement_Area 0.2160) enter the input layer. Each connection carries a weight, and each unit also has a constant input. The values pass through a hidden layer to the output unit, which produces 0.49815, corresponding to an appraised value of $176,228.]


Multiple Output 

Sometimes the output layer has more than one unit. For example, a department store chain wants to predict the likelihood that customers will be purchasing products from various departments, like women's apparel, furniture, and entertainment. The stores want to use this information to plan promotions and direct target mailings.

Slide 17 

[Figure: a network with the inputs last purchase, age, gender, and avg balance and three outputs giving the propensity to purchase women's apparel, furniture, and entertainment.]

How does the NN learn Using Backpropagation? 

Slide 18 

The necessary theoretical background is explained 

in the next part.


Formal Introduction 

A neural network is a set of connected input/output units where each connection has a weight associated with it.

Slide 19 

During the learning phase, the network learns by adjusting the weights so as to be able to 

predict the correct class label of the input samples. Neural network learning is also referred 

to as connectionist learning due to the connections between units. 

Neural networks involve long training times and are therefore more suitable for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology.

Advantages of neural networks include their high tolerance to noisy data as well as their 

ability to classify patterns on which they have not been trained. 

The most popular network algorithm is the backpropagation algorithm which performs 

learning on a multilayer feed-forward neural network. 

A Multilayer Feed-Forward Neural Network 

An example of such network is in Fig. 1. 

The inputs correspond to the attributes measured for each training sample. The inputs are 

fed simultaneously into a layer of units making up the input layer. 

Slide 20 

The weighted outputs of the units of the input layer are fed simultaneously to a second layer 

of “neuronlike” units, known as a hidden layer. 

The hidden layer’s weighted outputs can be input to another hidden layer, and so on. The 

number of hidden layers is arbitrary, although in practice, usually only one is used. 

The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given samples.

The units in the hidden layers and output layer are sometimes referred to as neurodes due to their symbolic biological basis, or as output units. The network shown in Fig. 1 is a two-layer neural network. Similarly, a network containing 2 hidden layers is called a three-layer neural network, and so on.


A Multilayer Feed-Forward Neural Network (2) 

Slide 21 

The network is feed-forward in that none of the weights cycles back to an input unit or to 

an output unit of a previous layer. 

It is fully connected in that each unit provides input to each unit in the next forward layer. 
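The forward flow through a fully connected feed-forward network can be sketched as follows. This is an illustration under stated assumptions: each unit applies a sigmoid to its weighted sum plus a bias, and the 3-2-1 network with its weights and biases is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one fully connected layer.

    weights[i][j] is the weight from input i to unit j of this layer.
    """
    outputs = []
    for j, bias in enumerate(biases):
        net = sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + bias
        outputs.append(sigmoid(net))
    return outputs


def network_forward(sample, layers):
    """Feed a sample through the layers in order; nothing cycles back."""
    activations = sample
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 inputs, 2 hidden units, 1 output unit.
hidden = ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1])
output = ([[0.3], [-0.6]], [0.05])
prediction = network_forward([1.0, 0.0, 1.0], [hidden, output])
print(prediction)
```

Adding another hidden layer only means adding another `(weights, biases)` pair to the list passed to `network_forward`.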

A Multilayer Feed-Forward Neural Network (3) 

Slide 22

[Figure 1: network diagram with an input layer (x1, x2, ..., xi), a hidden layer with outputs Oj, and an output layer with outputs Ok; the weights wij connect the input layer to the hidden layer and the weights wjk connect the hidden layer to the output layer.]

Figure 1: A multilayer feed-forward neural network. A training sample, X = (x1, x2, ..., xi), is fed to the input layer. Weighted connections exist between each layer, where wij denotes the weight from a unit j in one layer to a unit i in the previous layer.


Backpropagation Algorithm 

Algorithm: Backpropagation. Neural network learning for classification, 

using the backpropagation algorithm. 

Input: The training samples, samples; the learning rate, l; a multilayer feed-forward network, network.

Slide 23

Output: A neural network trained to classify the samples.

Method:

(1) Initialize all weights and biases in network;
(2) while terminating condition is not satisfied
(3)   for each training sample X in samples
(4)     // Propagate the inputs forward:
(5)     for each hidden or output layer unit j
(6)       $I_j = \sum_i w_{ij} O_i + \theta_j$; // compute the net input of unit j with respect to the previous layer, i
(7)       $O_j = \frac{1}{1 + e^{-I_j}}$; // compute the output of each unit j

Backpropagation Algorithm (2) 

Slide 24

(8)     // Backpropagate the errors:
(9)     for each unit j in the output layer
(10)      $Err_j = O_j (1 - O_j)(T_j - O_j)$; // compute the error
(11)    for each unit j in the hidden layers, from the last to the first hidden layer
(12)      $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$; // compute the error with respect to the next higher layer, k
(13)    for each weight $w_{ij}$ in network
(14)      $\Delta w_{ij} = (l) Err_j O_i$; // weight increment
(15)      $w_{ij} = w_{ij} + \Delta w_{ij}$; // weight update
(16)    for each bias $\theta_j$ in network
(17)      $\Delta \theta_j = (l) Err_j$; // bias increment
(18)      $\theta_j = \theta_j + \Delta \theta_j$; // bias update


A hidden or output layer unit j 

Slide 25

[Figure 2: the inputs x0, x1, ..., xn (outputs from the previous layer) are multiplied by the weights w0j, w1j, ..., wnj and combined into a weighted sum together with the bias of unit j; the activation function f is applied to the result to give the output.]

Figure 2: A hidden or output layer unit j. The inputs to unit j are outputs from the previous layer. These are multiplied by their corresponding weights in order to form a weighted sum, which is added to the bias associated with unit j. A nonlinear activation function is applied to the net input.

Backpropagation Algorithm - Additional Explanation 

The weights are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 

to 0.5). 

Slide 26 

Each unit has a bias associated with it. The bias acts as a threshold in that it serves to vary 

the activity of the unit. 

Error of the output unit j: $O_j$ is the actual output of unit j, and $T_j$ is the true output, based on the known class label of the given training sample.

To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered.

l is the learning rate, a constant typically having a value between 0.0 and 1.0. A rule of thumb is to set l to 1/t, where t is the number of iterations through the training set so far.

Example Calculations for Backpropagation Learning 

The figure below shows a NN. Let the learning rate be 0.9. The initial weight and bias 

values of the network are given in the table on the next slide, along with the first training 

sample, X = (1, 0, 1) whose class label is 1. 

Slide 27 

[Figure: an example of a multilayer feed-forward neural network. The input units 1, 2, and 3 receive x1, x2, and x3. They feed the hidden units 4 and 5 through the weights w14, w15, w24, w25, w34, and w35. Units 4 and 5 feed the output unit 6 through the weights w46 and w56.]

Example (2)

Initial input, weight, and bias values

x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

Slide 28 

The sample X is fed into the network, and the input and output of each unit are computed. These values are shown in the table below.

Unit j  Net input, I_j                                  Output, O_j
4       0.2 + 0 - 0.5 - 0.4 = -0.7                      1/(1 + e^0.7) = 0.332
5       -0.3 + 0 + 0.2 + 0.2 = 0.1                      1/(1 + e^-0.1) = 0.525
6       (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105     1/(1 + e^0.105) = 0.474

Example (2) 

Slide 29 

The error of each unit is computed and propagated backwards - the error values are shown in the table below.

Unit j  Err_j
6       (0.474)(1 - 0.474)(1 - 0.474) = 0.1311
5       (0.525)(1 - 0.525)(0.1311)(-0.2) = -0.0065
4       (0.332)(1 - 0.332)(0.1311)(-0.3) = -0.0087

The weight and bias updates are shown in the table on the next slide.


Example (3) 

Calculations of weight and bias updating. 

Slide 30

Weight or bias  New value
w46             -0.3 + (0.9)(0.1311)(0.332) = -0.261
w56             -0.2 + (0.9)(0.1311)(0.525) = -0.138
w14             0.2 + (0.9)(-0.0087)(1) = 0.192
w15             -0.3 + (0.9)(-0.0065)(1) = -0.306
w24             0.4 + (0.9)(-0.0087)(0) = 0.4
w25             0.1 + (0.9)(-0.0065)(0) = 0.1
w34             -0.5 + (0.9)(-0.0087)(1) = -0.508
w35             0.2 + (0.9)(-0.0065)(1) = 0.194
θ6              0.1 + (0.9)(0.1311) = 0.218
θ5              0.2 + (0.9)(-0.0065) = 0.194
θ4              -0.4 + (0.9)(-0.0087) = -0.408
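The whole worked example can be replayed in code. The sketch below is one possible reading of the backpropagation steps for this particular network (inputs 1 to 3, hidden units 4 and 5, output unit 6), using the initial values, the sample X = (1, 0, 1) with class label 1, and the learning rate 0.9 from the example. The dictionary layout and variable names are choices made here for illustration. Because the code does not round intermediate results, its numbers can differ from the tables in the last decimal place.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


rate = 0.9
target = 1.0
x = {1: 1.0, 2: 0.0, 3: 1.0}
# w[(i, j)] is the weight from unit i to unit j; theta[j] is the bias of unit j.
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}

# Propagate the inputs forward (input units pass their values through).
out = dict(x)
for j, sources in ((4, (1, 2, 3)), (5, (1, 2, 3)), (6, (4, 5))):
    net = sum(w[i, j] * out[i] for i in sources) + theta[j]
    out[j] = sigmoid(net)

# Backpropagate the errors: output unit first, then the hidden units.
err = {6: out[6] * (1 - out[6]) * (target - out[6])}
for j in (5, 4):
    err[j] = out[j] * (1 - out[j]) * err[6] * w[j, 6]

# Update every weight and bias after all errors are known.
for (i, j) in w:
    w[i, j] += rate * err[j] * out[i]
for j in theta:
    theta[j] += rate * err[j]

print({k: round(v, 3) for k, v in w.items()})
print({k: round(v, 3) for k, v in theta.items()})
```

Rounded to three decimals, the printed weights and biases agree with the new values in the table above.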

Estimating Classifier Accuracy 

Slide 31 

[Figure 3: the data is split into a training set and a test set; the training set is used to derive the classifier, and the test set is used to estimate its accuracy.]


Neural Networks for Time Series 

In many data mining problems, the data naturally falls into a time series - e.g., the price of 

IBM stock, the daily value of the Swiss Franc to U.S. Dollar exchange rate. 

Slide 32 

Someone who is able to predict the next value, or even whether the series is heading up or 

down, has a tremendous advantage over other investors. 

NN are easily adapted for time-series analysis. The next figure illustrates how this is done.

The network is trained on the time-series data, starting at the oldest point in the data. The 

training then moves to the second oldest point and the oldest point goes to the next set of 

units in the input layer, and so on. The network trains like a feed-forward, backpropagation 

network trying to predict the next value in the series in each step. 

Neural Networks for Time Series (2) 

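Slide 33

[Figure: a time-series network. The input layer holds value 1 and value 2 at time t together with historical units for the time-lagged values (value 1 at time t-1 through value 2 at time t-2). These feed a hidden layer, and the output is value 1 at time t+1.]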



Slide 34 

Notice that the time-series network is not limited to data from just a single time series. It can 

take multiple inputs. For instance, if we were trying to predict the value of the Swiss Franc 

to U.S. Dollar exchange rate, we might include other time-series information, such as the 

U.S. Dollar to Deutsche Mark exchange rate, the closing value of the stock exchange, etc.

The number of historical units controls the length of the patterns that the network can 

recognize. For instance, keeping 10 historical units on a network predicting the closing price 

of a favorite stock will allow the network to recognize patterns that occur within two-week 

time periods. 


Actually, we can get the same effect of a time-series NN using a regular feed-forward, 

backpropagation network by modifying the input data. 

Say that we have the time-series, shown in the table below with 10 data elements and we are 

interested in two features: the day of the week and the closing price. 

Slide 35

Data Element  Day-of-Week  Closing Price
1             1            $40.25
2             2            $41.00
3             3            $39.25
4             4            $39.75
5             5            $40.50
6             1            $40.50
7             2            $40.75
8             3            $41.25
9             4            $42.00
10            5            $41.50



To create a time series with a time lag of three, we just add new features for the previous 

values - see the table below. This data can now be input into a feed-forward, 

backpropagation network without any special support for time series. 

Slide 36

Data Element  Day-of-Week  Closing Price  Previous Closing Price  Previous-1 Closing Price
1             1            $40.25
2             2            $41.00         $40.25
3             3            $39.25         $41.00                  $40.25
4             4            $39.75         $39.25                  $41.00
5             5            $40.50         $39.75                  $39.25
6             1            $40.50         $40.50                  $39.75
7             2            $40.75         $40.50                  $40.50
8             3            $41.25         $40.75                  $40.50
9             4            $42.00         $41.25                  $40.75
10            5            $41.50         $42.00                  $41.25
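Building the lagged table can be sketched as a small helper. The closing prices and days of the week are the ones from the example; the function name `add_lags` and the choice to drop the first rows (which lack a full history) are assumptions of this sketch.

```python
closing = [40.25, 41.00, 39.25, 39.75, 40.50, 40.50, 40.75, 41.25, 42.00, 41.50]
day_of_week = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]


def add_lags(days, prices, n_lags=2):
    """Turn a series into rows with the current value plus n_lags previous values.

    Rows without a complete history (the first n_lags elements) are skipped.
    """
    rows = []
    for t in range(n_lags, len(prices)):
        # lags[0] is the previous price, lags[1] the one before that, and so on.
        lags = [prices[t - k] for k in range(1, n_lags + 1)]
        rows.append({"day": days[t], "price": prices[t], "lags": lags})
    return rows


rows = add_lags(day_of_week, closing)
for row in rows:
    print(row)
```

Each row can then be massaged and fed to an ordinary feed-forward network, for example with the lagged prices and the day of the week as inputs and the next price as the target.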

Heuristics for Using Neural Networks 

Even with sophisticated NN packages, getting the best results from a NN takes some effort. 

1. The number of units in the hidden layer - probably the biggest decision 

Slide 37 

The more units, the more patterns the network can recognize. However, a very large layer 

might end up memorizing the training set instead of generalizing from it. In this case, more 

is not better. 

Fortunately, we can detect when a network is overtrained. If the network performs very well on the training set, but does much worse on the test set, then this is an indication that it has memorized the training set.

How large should the hidden layer be? It should never be more than twice as large as the 

input layer. A good place to start is to make the hidden layer the same size as the input layer. 

If the network is overtraining, reduce the size of the layer. If it is not sufficiently accurate, 

increase its size. When using a network for classification, the network should start with one 

hidden unit for each class.

Heuristics for Using Neural Networks (2)

2. The size of the training set

Slide 38

The training set must be sufficiently large to cover the ranges of inputs available for each feature. In addition, we want several training examples for each weight in the network. For a network with s input units, h hidden units, and 1 output, there are $h(s + 1) + h + 1$ weights in the network. We want at least 5 to 10 examples in the training set for each weight.

3. The learning rate 

Initially, the learning rate should be set high to make large adjustments to the weights. As the training proceeds, the learning rate should decrease in order to fine-tune the network.
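The sizing heuristics can be turned into a quick calculation. The sketch assumes a network with s inputs, h hidden units, and one output, where each hidden unit has one weight per input plus a bias and the output unit has one weight per hidden unit plus a bias, giving h(s + 1) + h + 1 weights. The function names and the simple schedule of dividing an initial rate by the number of passes are illustrative assumptions, not the only option.

```python
def weight_count(s, h):
    """Number of weights (including biases) for s inputs, h hidden units, 1 output."""
    return h * (s + 1) + h + 1


def examples_needed(s, h, per_weight=5):
    """Training examples suggested by the 5-to-10-per-weight heuristic."""
    return per_weight * weight_count(s, h)


def learning_rate(t, initial=1.0):
    """Decreasing learning rate: high at first, smaller on later passes t = 1, 2, ..."""
    return initial / t


# Hypothetical sizing: 12 inputs and a hidden layer of the same size.
n_weights = weight_count(12, 12)
print(n_weights, examples_needed(12, 12), examples_needed(12, 12, per_weight=10))
print([round(learning_rate(t), 3) for t in (1, 2, 5, 10)])
```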

Slide 39
