TDT4171 Artificial Intelligence Methods - Lecture 8 ... - NTNU
TDT4171 Artificial Intelligence Methods
Lecture 8 – Learning from Observations
Norwegian University of Science and Technology
Helge Langseth
IT-VEST 310
helgel@idi.ntnu.no

Outline
1 Summary from last time
2 Chapter 18: Learning from observations
  Learning agents
  Inductive learning
  Decision tree learning
  Measuring learning performance
  Overfitting
3 Summary

Summary from last time
Sequential decision problems
Assumptions: stationarity, Markov assumption, additive rewards, infinite horizon with discounting
Model classes: Markov Decision Processes / Partially Observable Markov Decision Processes
Algorithms: value iteration / policy iteration
Intuitively, MDPs combine probabilistic models over time (filtering, prediction) with the maximum expected utility principle.

Chapter 18: Learning from observations

Chapter 18 – Learning goals
Being familiar with:
Motivation for learning
The decision tree formalism
Decision tree learning
Information gain for structuring model learning
Overfitting – and what to do to avoid it

Learning
This is the second part of the course:
We have learned about representations for uncertain knowledge
Inference in these representations
Making decisions based on the inferences
Now we will talk about learning the representations:
Supervised learning:
  Decision trees
  Instance-based learning / case-based reasoning
  Artificial neural networks
Reinforcement learning

Why do learning?
Learning modifies the agent's decision mechanisms to improve performance
Learning is essential for unknown environments, i.e., when the designer lacks omniscience
Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
There is an interesting link on the webpage to the article The End of Theory (from Wired Magazine). Well worth a read!

Learning agents
[Figure: a general learning agent. The critic observes the world through the sensors, compares performance against a performance standard, and sends feedback to the learning element. The learning element sets learning goals, makes changes to the knowledge of the performance element, and asks the problem generator to propose experiments. The performance element maps percepts from the sensors to actions carried out by the effectors in the environment.]

Learning element
Design of the learning element is dictated by...
what type of performance element is used
which functional component is to be learned
how that functional component is represented
what kind of feedback is available

Example scenarios:

Performance element | Component         | Representation           | Feedback
Alpha-beta search   | Eval. fn.         | Weighted linear function | Win/loss
Logical agent       | Transition model  | Successor-state axioms   | Outcome
Utility-based agent | Transition model  | Dynamic Bayes net        | Outcome
Simple reflex agent | Percept-action fn | Neural net               | Correct action

Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards

Inductive learning
Simplest form: learn a function from examples
f is the target function
An example is a pair {x, f(x)}, e.g., a tic-tac-toe position together with its value, such as (board with rows O O X / X · · / X · ·, +1)
Problem: find a hypothesis h ∈ H such that h ≈ f, given a training set of examples
This is a highly simplified model of real learning:
Ignores prior knowledge
Assumes a deterministic, observable "environment"
Assumes examples are given

Inductive learning method
Construct/adjust h to agree with f on the training set
h is consistent if it agrees with f on all examples
Example – curve fitting:
[Figure: a sequence of hypotheses h(x) fitted to the same data points in the (x, f(x)) plane, ranging from simple curves to ones that pass exactly through every point.]
Which curve is better? – and WHY??
Ockham's razor: maximize consistency and simplicity!

Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where the authors will/won't wait for a table:

Example  Alt Bar Fri Hun Pat   Price Rain Res Type    Est   WillWait
X1       T   F   F   T   Some  $$$   F    T   French  0–10  T
X2       T   F   F   T   Full  $     F    F   Thai    30–60 F
X3       F   T   F   F   Some  $     F    F   Burger  0–10  T
X4       T   F   T   T   Full  $     T    F   Thai    10–30 T
X5       T   F   T   F   Full  $$$   F    T   French  >60   F
X6       F   T   F   T   Some  $$    T    T   Italian 0–10  T
X7       F   T   F   F   None  $     T    F   Burger  0–10  F
X8       F   F   F   T   Some  $$    T    T   Thai    0–10  T
X9       F   T   T   F   Full  $     T    F   Burger  >60   F
X10      T   T   T   T   Full  $$$   F    T   Italian 10–30 F
X11      F   F   F   F   None  $     F    F   Thai    0–10  F
X12      T   T   T   T   Full  $     F    F   Burger  30–60 T

Classification of examples is positive (T) or negative (F)

Decision trees
One possible representation for hypotheses: decision trees

Patrons?
├─ None: F
├─ Some: T
└─ Full: WaitEstimate?
   ├─ >60: F
   ├─ 30–60: Alternate?
   │  ├─ No: Reservation?
   │  │  ├─ No: Bar?
   │  │  │  ├─ No: F
   │  │  │  └─ Yes: T
   │  │  └─ Yes: T
   │  └─ Yes: Fri/Sat?
   │     ├─ No: F
   │     └─ Yes: T
   ├─ 10–30: Hungry?
   │  ├─ No: T
   │  └─ Yes: Alternate?
   │     ├─ No: T
   │     └─ Yes: Raining?
   │        ├─ No: F
   │        └─ Yes: T
   └─ 0–10: T

E.g., example X3 (Pat=Some) follows Patrons?=Some to the leaf T, and X12 follows Patrons?=Full → WaitEstimate?=30–60 → Alternate?=Yes → Fri/Sat?=Yes to the leaf T.

Converting a decision tree to rules
[Figure: the restaurant decision tree from the previous slide.]
IF (Patrons?=Full) ∧ (WaitEstimate?=0–10) THEN Wait? = True
IF (Patrons?=Full) ∧ (WaitEstimate?=30–60) ∧ (Alternate?=Yes) ∧ (Fri/Sat?=No) THEN Wait? = False
...

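The path-to-rule conversion above is mechanical, so it is easy to automate. Below is a hedged sketch, not from the slides: the nested-dict tree encoding (`{attribute: {value: subtree}}`, anything else a leaf), the `tree_to_rules` helper, and the tree fragment are all assumptions for illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, classification) rule per root-to-leaf path."""
    if not isinstance(tree, dict):                 # leaf: emit accumulated rule
        return [(list(conditions), tree)]
    (attribute, branches), = tree.items()          # one test per internal node
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

# A fragment of the restaurant tree, enough to reproduce the first rule:
fragment = {"Patrons?": {
    "None": False,
    "Some": True,
    "Full": {"WaitEstimate?": {"0-10": True, ">60": False}},
}}

for conds, leaf in tree_to_rules(fragment):
    print("IF", " AND ".join(f"{a}={v}" for a, v in conds), "THEN Wait? =", leaf)
```

Each root-to-leaf path becomes the conjunction of the attribute tests along it, exactly as in the IF–THEN rules on the slide.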
When to consider decision trees
Instances describable by attribute–value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
Examples:
Diagnosis
Credit risk analysis
Classifying email as spam or ham

Decision tree representation
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
How can we represent these functions using a decision tree?
A∧B, A∨B, A XOR B
(A∧B) ∨ (A∧¬B∧C)
m of n: at least m of A1, A2, ..., An (try n = 3, m = 2)
Discuss with your neighbour for a couple of minutes.

Expressiveness
A decision tree can represent ANY function of the inputs.
E.g., for Boolean functions, truth table row → path to leaf:

A B | A xor B
F F | F
F T | T
T F | T
T T | F

A
├─ F: B
│  ├─ F: F
│  └─ T: T
└─ T: B
   ├─ F: T
   └─ T: F

There is a consistent decision tree for any training set:
just add one path to a leaf for each example (works whenever f is deterministic in x)
Note!
The tree will probably not generalize to new examples
→ Prefer to find more compact decision trees.

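The truth-table-to-tree construction can be sanity-checked in code. A minimal sketch, assuming a nested-dict tree encoding (`{attribute: {value: subtree}}`) that is not from the slides:

```python
# The XOR tree above: test A first, then B on each branch, one leaf per row.
xor_tree = {"A": {
    False: {"B": {False: False, True: True}},
    True:  {"B": {False: True,  True: False}},
}}

def classify(tree, example):
    """Follow the attribute tests down to a leaf classification."""
    while isinstance(tree, dict):
        (attribute, branches), = tree.items()
        tree = branches[example[attribute]]
    return tree

# The tree reproduces every row of the XOR truth table:
for a in (False, True):
    for b in (False, True):
        assert classify(xor_tree, {"A": a, "B": b}) == (a != b)
```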
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
E.g., with 6 Boolean attributes, there are 18 446 744 073 709 551 616 (≈ 2·10^19) trees
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
E.g., with 6 Boolean attributes, there are 729 rules of this kind
A more expressive hypothesis space:
increases the chance that the target function can be expressed
increases the number of hypotheses consistent with the training set
⇒ may get worse predictions

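The two counts are easy to verify directly for n = 6:

```python
n = 6
trees = 2 ** (2 ** n)    # one Boolean function per truth table with 2^n rows
conjunctions = 3 ** n    # each attribute: positive, negated, or left out
print(trees)             # 18446744073709551616
print(conjunctions)      # 729
```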
Decision tree learning
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose the "best" attribute as root of each (sub)tree

function DTL(examples, attributes, default) returns a decision tree
  if examples is empty then return default
  else if all examples have the same class then return the class
  else if attributes is empty then return Mode(examples)
  else
    best ← Choose-Attribute(attributes, examples)
    tree ← a new decision tree with root test best
    for each value v_i of best do
      ex_i ← {elements of examples with best = v_i}
      subtree ← DTL(ex_i, attributes − best, Mode(examples))
      add a branch to tree with label v_i and subtree subtree
    return tree

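A hedged Python sketch of the DTL pseudocode above. Choose-Attribute is passed in as a parameter (the information-gain heuristic comes later in the lecture); the dict-based example and tree encodings, and the trivial first-attribute chooser in the demo, are assumptions for illustration.

```python
from collections import Counter

def mode(examples):
    """Mode(examples): the most common classification."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, choose_attribute, values):
    if not examples:
        return default
    classes = {e["class"] for e in examples}
    if len(classes) == 1:               # all examples have the same class
        return classes.pop()
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}                   # new tree with root test on best
    for v in values[best]:
        ex_v = [e for e in examples if e[best] == v]
        tree[best][v] = dtl(ex_v, [a for a in attributes if a != best],
                            mode(examples), choose_attribute, values)
    return tree

# Tiny demo: learn A AND B from its four truth-table rows.
values = {"A": [False, True], "B": [False, True]}
examples = [{"A": a, "B": b, "class": a and b}
            for a in (False, True) for b in (False, True)]
tree = dtl(examples, ["A", "B"], False, lambda attrs, ex: attrs[0], values)
print(tree)
```

With the naive first-attribute chooser, the demo yields `{'A': {False: False, True: {'B': {False: False, True: True}}}}`, i.e., exactly A∧B.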
Search in the hypothesis space
[Figure: greedy search through the space of decision trees – from the empty tree, each step picks an attribute (A1, A2, A3, A4, ...) to split the current set of positive (+) and negative (–) examples on, and recurses on the resulting subsets.]
The Big Question: how to choose the "split attribute"?
DEMO: Random selection

Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
[Figure: splitting the 12 examples on Patrons? (None/Some/Full) gives two pure subsets and one mixed one, while splitting on Type? (French/Italian/Thai/Burger) leaves every subset half positive, half negative.]
Patrons? is a better choice – it gives information about the classification

Information
Information answers questions!
The more clueless we are about the answer initially, the more information is contained in the answer.
Scale: 1 bit = answer to a Boolean question with prior ⟨0.5, 0.5⟩.
Information in an answer when the prior is ⟨p1, ..., pn⟩ is

H(⟨p1, ..., pn⟩) = − Σ_{i=1}^{n} p_i log2(p_i)

(also called the entropy of the prior ⟨p1, ..., pn⟩)

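The formula transcribes directly into code; the convention 0·log2(0) = 0 (handled by skipping zero probabilities) is the only subtlety:

```python
from math import log2

def entropy(prior):
    """H(<p1, ..., pn>) = -sum_i p_i * log2(p_i), with 0*log2(0) taken as 0."""
    return sum(-p * log2(p) for p in prior if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 – one bit for a fair Boolean question
print(entropy([1.0, 0.0]))   # 0.0 – a certain answer carries no information
```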
Information contd.
Suppose we have p positive and n negative examples at the root
H(⟨p/(p+n), n/(p+n)⟩) bits are needed to classify a new example
E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit.
An attribute splits the examples E into subsets E_i; we hope each needs less information to classify...
Let E_i have p_i positive and n_i negative examples:
H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩) bits are needed to classify a new example in E_i
⇒ the expected number of bits per example over all branches is

Σ_i (p_i+n_i)/(p+n) · H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩)

For Patrons?, this is 0.459 bits; for Type? this is (still) 1 bit
Heuristic: choose the attribute that minimizes the remaining information needed to classify a new example

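The 0.459 vs. 1 bit comparison can be reproduced by reading the (p_i, n_i) counts for each branch off the table of 12 examples. A hedged sketch – the per-branch counts and helper names are derived from the table, not given in this form on the slide:

```python
from math import log2

def entropy2(p, n):
    """Bits needed for a set with p positive and n negative examples."""
    total = p + n
    return sum(-x / total * log2(x / total) for x in (p, n) if x > 0)

def remainder(branches):
    """Expected bits per example after the split; branches = [(p_i, n_i), ...]."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy2(p, n) for p, n in branches)

patrons = [(0, 2), (4, 0), (2, 4)]        # None, Some, Full
type_ = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(round(remainder(patrons), 3))       # 0.459
print(round(remainder(type_), 3))         # 1.0
```

Two of the three Patrons? branches are pure (zero entropy), which is exactly why its remainder is so low.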
Example contd.
Decision tree learned from the 12 examples:

Patrons?
├─ None: F
├─ Some: T
└─ Full: Hungry?
   ├─ No: F
   └─ Yes: Type?
      ├─ French: T
      ├─ Italian: F
      ├─ Thai: Fri/Sat?
      │  ├─ No: F
      │  └─ Yes: T
      └─ Burger: T

Substantially simpler than the "true" tree – a more complex hypothesis isn't justified by the small amount of data
DEMO: Gain selection

Performance measurement
Question: How do we know that h ≈ f?
Answer: Try h on a new test set of examples (use the same distribution over the example space as for the training set)
Learning curve = % correct on the test set as a function of training set size
[Figure: learning curve on the restaurant data – % correct on the test set rises from roughly 0.4 towards 1 as the training set grows from 0 to 100 examples.]

Performance measurement contd.
The learning curve depends on...
realizable (the hypothesis space can express the target function) vs. non-realizable
non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., a thresholded linear function)
redundant expressiveness (e.g., loads of irrelevant attributes)
[Figure: % correct vs. number of examples – the realizable curve approaches 1, the redundant curve climbs more slowly, and the non-realizable curve plateaus below 1.]

Overfitting in decision trees
Consider adding "noisy" training examples X13 and X14:

Example  Alt Bar Fri Hun Pat   Price Rain Res Type   Est   WillWait
X13      F   T   T   T   Some  $$    T    T   French 0–10  F
X14      F   T   T   T   Some  $     T    T   Thai   0–10  F

Reality check: what is the effect on the tree we learned earlier?
[Figure: the decision tree learned from the 12 examples, which sends every example with Patrons?=Some straight to the leaf T – so it now misclassifies both X13 and X14.]

Overfitting
Consider the error of hypothesis h over
the training data: error_t(h)
the entire distribution D of data (often approximated by measurement on a test set): error_D(h)

Overfitting: hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
error_t(h) < error_t(h′) and error_D(h) > error_D(h′)

Overfitting (cont'd)
[Figure: accuracy vs. size of tree (number of nodes), from 0 to 100 nodes – accuracy on the training data keeps climbing towards 0.9 as the tree grows, while accuracy on the test data peaks early and then falls off as the tree overfits.]

Avoiding overfitting
How can we avoid overfitting?
Stop growing when a data split is not statistically significant
Grow the full tree, then post-prune
How to select the "best" tree:
Measure performance over the training data (statistical tests needed)
Measure performance over a separate validation data set

Reduced-error pruning
Split the data into a training set and a validation set
Do until further pruning is harmful:
1 Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2 Greedily remove the one that most improves validation set accuracy
⇒ Produces the smallest version of the most accurate subtree

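The loop above can be sketched in code. This is a hedged illustration, not the course's implementation: the nested-dict tree encoding (`{attribute: {value: subtree}}`, anything else a leaf) and all helper names are assumptions, and each candidate prune replaces one internal node by the majority class of the training examples that reach it.

```python
from collections import Counter

def classify(tree, example):
    while isinstance(tree, dict):
        (attr, branches), = tree.items()
        tree = branches[example[attr]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["class"] for e in examples) / len(examples)

def candidate_prunes(tree, train):
    """Yield every tree obtainable by replacing one internal node with the
    majority class of the training examples that reach it."""
    if not isinstance(tree, dict):
        return
    (attr, branches), = tree.items()
    yield Counter(e["class"] for e in train).most_common(1)[0][0]  # prune here
    for v, sub in branches.items():                                # or deeper
        sub_train = [e for e in train if e[attr] == v]
        if sub_train:
            for pruned_sub in candidate_prunes(sub, sub_train):
                yield {attr: {**branches, v: pruned_sub}}

def reduced_error_prune(tree, train, validation):
    """Greedily apply the best non-harmful prune until none remains."""
    while True:
        base = accuracy(tree, validation)
        best = max(candidate_prunes(tree, train),
                   key=lambda t: accuracy(t, validation), default=None)
        if best is None or accuracy(best, validation) < base:
            return tree                  # further pruning would be harmful
        tree = best

# Demo: a tree that overfits one noisy training example of f(A, B) = A.
overfit = {"A": {False: False, True: {"B": {False: True, True: False}}}}
train = ([{"A": False, "B": b, "class": False} for b in (False, True)]
         + [{"A": True, "B": False, "class": True}] * 2
         + [{"A": True, "B": True, "class": False}])      # the noisy example
validation = [{"A": a, "B": b, "class": a}
              for a in (False, True) for b in (False, True)]
print(reduced_error_prune(overfit, train, validation))    # the B-split is gone
```

Because every prune strictly shrinks the tree and is only accepted when validation accuracy does not drop, the loop terminates with a smaller tree that does at least as well on the validation set.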
Effect of reduced-error pruning
[Figure: accuracy vs. size of tree (number of nodes) – training accuracy rises with tree size, test accuracy peaks and then declines, and a third curve, "on test data (during pruning)", shows accuracy recovering as nodes are pruned away.]

Summary
Learning is needed for unknown environments (and lazy designers)
Learning agent = performance element + learning element
The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation
For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples
Decision tree learning using information gain
Learning performance = prediction accuracy measured on a test set
Next week: Agnar Aamodt talks about case-based reasoning. (His paper is on the webpage, and is part of the curriculum.)