TDT4171 Artificial Intelligence Methods - Lecture 8 ... - NTNU


TDT4171 Artificial Intelligence Methods<br />

Lecture 8 – Learning from Observations<br />

Norwegian University of Science and Technology<br />

Helge Langseth<br />

IT-VEST 310<br />

helgel@idi.ntnu.no<br />

1 TDT4171 Artificial Intelligence Methods


Outline<br />

1 Summary from last time<br />

2 Chapter 18: Learning from observations<br />

Learning agents<br />

Inductive learning<br />

Decision tree learning<br />

Measuring learning performance<br />

Overfitting<br />

3 Summary<br />



Summary from last time<br />

Sequential decision problems<br />

Assumptions: stationarity, the Markov assumption, additive rewards, an infinite horizon with discounting<br />

Model classes: Markov Decision Processes / Partially Observable Markov Decision Processes<br />

Algorithms: value iteration / policy iteration<br />

Intuitively, MDPs combine probabilistic models over time (filtering, prediction) with the maximum expected utility principle.<br />


Chapter 18: Learning from observations<br />

Chapter 18 – Learning goals<br />

Being familiar with:<br />

Motivation for learning<br />

Decision tree formalism<br />

Decision tree learning<br />

Information Gain for structuring model learning<br />

Overfitting – and what to do to avoid it<br />



Learning<br />

This is the second part of the course:<br />

We have learned about representations for uncertain knowledge<br />

Inference in these representations<br />

Making decisions based on the inferences<br />

Now we will talk about learning the representations:<br />

Supervised learning:<br />

Decision trees<br />

Instance-based learning/case-based reasoning<br />

Artificial neural networks<br />

Reinforcement learning<br />


Why do Learning?<br />

Learning modifies the agent’s decision mechanisms to improve performance<br />

Learning is essential for unknown environments, i.e., when the designer lacks omniscience<br />

Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down<br />

Interesting link on the webpage to the article The End of Theory (from Wired Magazine). Well worth a read!<br />


Learning agents<br />

[Figure: architecture of a learning agent. The critic compares sensor feedback against a performance standard and sends feedback to the learning element; the learning element makes changes to the performance element and receives knowledge from it; the problem generator, guided by learning goals, proposes experiments; the performance element maps percepts from the sensors to actions via the effectors, interacting with the environment.]<br />




Learning element<br />

Design of the learning element is dictated by ...<br />

what type of performance element is used<br />

which functional component is to be learned<br />

how that functional component is represented<br />

what kind of feedback is available<br />

Example scenarios:<br />

Performance element | Component         | Representation           | Feedback<br />
Alpha-beta search   | Eval. fn.         | Weighted linear function | Win/loss<br />
Logical agent       | Transition model  | Successor-state axioms   | Outcome<br />
Utility-based agent | Transition model  | Dynamic Bayes net        | Outcome<br />
Simple reflex agent | Percept-action fn | Neural net               | Correct action<br />

Supervised learning: correct answers for each instance<br />

Reinforcement learning: occasional rewards<br />


Inductive learning<br />

Simplest form: learn a function from examples<br />

f is the target function<br />

An example is a pair 〈x, f(x)〉, e.g., a tic-tac-toe position (O O X / · X · / X · ·) paired with the value +1<br />

Problem:<br />

Find a hypothesis h ∈ H s.t. h ≈ f, given a training set of examples<br />

This is a highly simplified model of real learning:<br />

Ignores prior knowledge<br />

Assumes a deterministic, observable “environment”<br />

Assumes examples are given<br />


Inductive learning method<br />

Construct/adjust h to agree with f on the training set<br />

h is consistent if it agrees with f on all examples<br />

Example – curve fitting: [Figure: the same data points (x, f(x)) fitted, across successive slides, by a straight line, higher-degree polynomials passing through every point, and other candidate curves.]<br />

Which curve is better? – and WHY??<br />

Ockham’s razor: maximize consistency and simplicity!<br />
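Ockham’s razor can be made concrete on a toy curve-fitting task. The sketch below (illustrative only, not from the slides; the data points and helper names are invented) fits four nearly collinear points with both a least-squares line and the exact interpolating cubic: the cubic is perfectly consistent with the training set, yet the simpler line extrapolates better to a held-out point on the same trend.

```python
# Ockham's razor on a toy curve-fitting task (illustrative sketch).

def lagrange_eval(xs, ys, x):
    """Evaluate the unique interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def linear_fit(xs, ys):
    """Ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

xs, ys = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.1]   # almost on a line
a, b = linear_fit(xs, ys)

x_new, y_true = 4.0, 4.1                               # held-out point, same trend
err_line  = abs(a * x_new + b - y_true)
err_cubic = abs(lagrange_eval(xs, ys, x_new) - y_true)
print(err_line < err_cubic)   # True: the simpler hypothesis generalizes better here
```

The cubic passes through every training point (zero training error) but predicts 4.4 at x = 4, while the line predicts 4.1.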


Attribute-based representations<br />

Examples described by attribute values (Boolean, discrete, continuous, etc.)<br />

E.g., situations where the authors will/won’t wait for a table:<br />

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait<br />
X1      | T   | F   | F   | T   | Some | $$$   | F    | T   | French  | 0–10  | T<br />
X2      | T   | F   | F   | T   | Full | $     | F    | F   | Thai    | 30–60 | F<br />
X3      | F   | T   | F   | F   | Some | $     | F    | F   | Burger  | 0–10  | T<br />
X4      | T   | F   | T   | T   | Full | $     | T    | F   | Thai    | 10–30 | T<br />
X5      | T   | F   | T   | F   | Full | $$$   | F    | T   | French  | >60   | F<br />
X6      | F   | T   | F   | T   | Some | $$    | T    | T   | Italian | 0–10  | T<br />
X7      | F   | T   | F   | F   | None | $     | T    | F   | Burger  | 0–10  | F<br />
X8      | F   | F   | F   | T   | Some | $$    | T    | T   | Thai    | 0–10  | T<br />
X9      | F   | T   | T   | F   | Full | $     | T    | F   | Burger  | >60   | F<br />
X10     | T   | T   | T   | T   | Full | $$$   | F    | T   | Italian | 10–30 | F<br />
X11     | F   | F   | F   | F   | None | $     | F    | F   | Thai    | 0–10  | F<br />
X12     | T   | T   | T   | T   | Full | $     | F    | F   | Burger  | 30–60 | T<br />

Classification of examples is positive (T) or negative (F)<br />


Decision trees<br />

One possible representation for hypotheses: decision trees<br />

[Figure: the “true” restaurant decision tree. The root tests Patrons? (None → F, Some → T, Full → test WaitEstimate?); depending on the wait estimate, further tests on Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining? lead to T/F leaves.]<br />

E.g., examples classified by this tree:<br />

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type   | Est   | WillWait<br />
X3      | F   | T   | F   | F   | Some | $     | F    | F   | Burger | 0–10  | T<br />
X12     | T   | T   | T   | T   | Full | $     | F    | F   | Burger | 30–60 | T<br />


Converting a decision tree to rules<br />

[Figure: the same restaurant decision tree as on the previous slide; each root-to-leaf path becomes one rule.]<br />

IF (Patrons?=Full) ∧ (WaitEstimate?=0-10)<br />
THEN Wait? = True<br />

IF (Patrons?=Full) ∧ (WaitEstimate?=30-60) ∧ (Alternate?=Yes) ∧ (Fri/Sat?=No)<br />
THEN Wait? = False<br />

...<br />
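The conversion above is mechanical: every root-to-leaf path yields one IF-THEN rule. A small sketch (the nested-tuple tree encoding is a hypothetical format chosen for illustration, not one defined in the slides):

```python
# Turning a decision tree into IF-THEN rules by walking every root-to-leaf path.
# Tree encoding (hypothetical): ('node', attribute, {value: subtree}) or ('leaf', label).

def tree_to_rules(tree, conditions=()):
    if tree[0] == 'leaf':
        cond = ' ∧ '.join(f'({a}={v})' for a, v in conditions) or 'TRUE'
        return [f'IF {cond} THEN Wait? = {tree[1]}']
    _, attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

# A fragment of the restaurant tree:
tree = ('node', 'Patrons?', {
    'None': ('leaf', 'False'),
    'Some': ('leaf', 'True'),
    'Full': ('node', 'WaitEstimate?', {
        '0-10': ('leaf', 'True'),
        '>60':  ('leaf', 'False'),
    }),
})

for rule in tree_to_rules(tree):
    print(rule)
```

Each rule’s condition is the conjunction of the attribute tests along its path, exactly as in the two rules shown above.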


When to consider decision trees<br />

Instances describable by attribute–value pairs<br />

Target function is discrete valued<br />

Disjunctive hypothesis may be required<br />

Possibly noisy training data<br />

Examples:<br />

Diagnosis<br />

Credit risk analysis<br />

Classifying email as spam or ham<br />



Decision tree representation<br />

Decision tree representation:<br />

Each internal node tests an attribute<br />

Each branch corresponds to an attribute value<br />

Each leaf node assigns a classification<br />

How can we represent these functions using a decision tree?<br />

A∧B, A∨B, A XOR B<br />

(A∧B)∨(A∧¬B∧C)<br />

m of n: at least m of A1, A2, ..., An (try n = 3, m = 2)<br />

Discuss with your neighbour for a couple of minutes.<br />




Expressiveness<br />

A decision tree can represent ANY function of the inputs.<br />

E.g., for Boolean functions, truth table row → path to leaf:<br />

A | B | A xor B<br />
F | F | F<br />
F | T | T<br />
T | F | T<br />
T | T | F<br />

[Figure: the corresponding tree – the root tests A, each A-branch tests B, and the four leaves carry the truth-table outputs.]<br />

There is a consistent decision tree for any training set: just add one path to a leaf for each example (works whenever f is deterministic in x)<br />

Note!<br />

The tree will probably not generalize to new examples<br />

→ Prefer to find more compact decision trees.<br />




Hypothesis spaces<br />

How many distinct decision trees with n Boolean attributes?<br />

= number of Boolean functions<br />

= number of distinct truth tables with 2^n rows = 2^(2^n)<br />

E.g., with 6 Boolean attributes, there are 18 446 744 073 709 551 616 (≈ 2·10^19) trees<br />

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?<br />

Each attribute can be in (positive), in (negative), or out<br />

⇒ 3^n distinct conjunctive hypotheses<br />

E.g., with 6 Boolean attributes, there are 729 rules of this kind<br />

A more expressive hypothesis space:<br />

increases the chance that the target function can be expressed<br />

increases the number of hypotheses consistent with the training set<br />

⇒ may get worse predictions<br />

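The two counts are easy to check directly (a quick sketch for n = 6):

```python
# Checking the counting arguments above for n = 6 Boolean attributes.

n = 6

# Distinct Boolean functions = distinct truth tables with 2^n rows, each row T or F.
num_trees = 2 ** (2 ** n)
print(num_trees)          # 18446744073709551616

# Purely conjunctive hypotheses: each attribute is positive, negative, or absent.
num_conjunctions = 3 ** n
print(num_conjunctions)   # 729
```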


Decision tree learning<br />

Aim: find a small tree consistent with the training examples<br />

Idea: (recursively) choose the “best” attribute as root of each (sub)tree<br />

function DTL(examples, attributes, default) returns a decision tree<br />
  if examples is empty then return default<br />
  else if all examples have the same class then return the class<br />
  else if attributes is empty then return Mode(examples)<br />
  else<br />
    best ← Choose-Attribute(attributes, examples)<br />
    tree ← a new decision tree with root test best<br />
    for each value v_i of best do<br />
      ex_i ← {elements of examples with best = v_i}<br />
      subtree ← DTL(ex_i, attributes − best, Mode(examples))<br />
      add a branch to tree with label v_i and subtree subtree<br />
    return tree<br />

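The DTL pseudocode above can be sketched as runnable Python. Assumptions not in the slides: examples are (attribute-dict, label) pairs, trees are nested dicts, and Choose-Attribute is a pluggable function (a trivial “first attribute” placeholder stands in here; the information-gain chooser comes later in the lecture).

```python
# A runnable sketch of the DTL pseudocode (data format and tree encoding are
# illustrative choices, not prescribed by the slides).
from collections import Counter

def mode(examples):
    """Most common label among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, choose=None):
    if not examples:
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # all examples have the same class
        return labels.pop()
    if not attributes:
        return mode(examples)
    best = (choose or (lambda attrs, ex: attrs[0]))(attributes, examples)
    rest = [a for a in attributes if a != best]
    tree = {best: {}}                         # root test on `best`
    for v in {x[best] for x, _ in examples}:
        ex_v = [(x, y) for x, y in examples if x[best] == v]
        tree[best][v] = dtl(ex_v, rest, mode(examples), choose)
    return tree

# Tiny usage example: learn A AND B from its full truth table.
data = [({'A': a, 'B': b}, a and b) for a in (False, True) for b in (False, True)]
tree = dtl(data, ['A', 'B'], default=False)
print(tree)   # {'A': {False: False, True: {'B': {False: False, True: True}}}}
```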


Search in the hypothesis space<br />

[Figure: greedy search through the space of trees – starting from the empty tree, candidate root attributes A1, A2, ... are tried, and each choice is refined by further splits on A2, A3, A4, ...]<br />

The Big Question: how to choose the “split-attribute”?<br />


Search in the hypothesis space – DEMO: Random selection<br />


Choosing an attribute<br />

Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”<br />

[Figure: splitting the 12 restaurant examples on Patrons? (None/Some/Full) yields nearly pure subsets, while splitting on Type? (French/Italian/Thai/Burger) leaves each subset half positive, half negative.]<br />

Patrons? is a better choice – it gives information about the classification<br />


Information<br />

Information answers questions!<br />

The more clueless we are about the answer initially, the more information is contained in the answer.<br />

Scale: 1 bit = answer to a Boolean question with prior 〈0.5, 0.5〉<br />

Information in an answer when the prior is 〈p_1, ..., p_n〉 is<br />

H(〈p_1, ..., p_n〉) = − Σ_{i=1}^{n} p_i log2 p_i<br />

(also called the entropy of the prior 〈p_1, ..., p_n〉)<br />

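The entropy formula above is a one-liner in code (a minimal sketch; the 0·log 0 = 0 convention is the standard one):

```python
# Entropy of a prior <p_1, ..., p_n>, as defined on this slide.
from math import log2

def entropy(priors):
    """H(<p_1, ..., p_n>) = -sum_i p_i log2 p_i  (0 log 0 taken as 0)."""
    return -sum(p * log2(p) for p in priors if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 – a fair Boolean question is worth 1 bit
print(entropy([1.0]))        # 0.0 – a known answer carries no information
```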




Information contd.<br />

Suppose we have p positive and n negative examples at the root:<br />

H(〈p/(p+n), n/(p+n)〉) bits needed to classify a new example<br />

E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit.<br />

An attribute splits the examples E into subsets E_i; we hope each needs less information to classify...<br />

Let E_i have p_i positive and n_i negative examples:<br />

H(〈p_i/(p_i+n_i), n_i/(p_i+n_i)〉) bits needed to classify<br />

⇒ the expected number of bits per example over all branches is<br />

Σ_i ((p_i+n_i)/(p+n)) · H(〈p_i/(p_i+n_i), n_i/(p_i+n_i)〉)<br />

For Patrons?, this is 0.459 bits; for Type? it is (still) 1 bit<br />

Heuristic: choose the attribute that minimizes the remaining information needed to classify a new example<br />

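The 0.459-bit and 1-bit figures can be reproduced directly from the formula above (a sketch; the per-branch counts are read off the restaurant table earlier in the slides):

```python
# Computing the expected remaining information ("remainder") for a split.
from math import log2

def entropy2(p, n):
    """H(<p/(p+n), n/(p+n)>) for p positive and n negative examples."""
    bits = 0.0
    for q in (p / (p + n), n / (p + n)):
        if q > 0:
            bits -= q * log2(q)
    return bits

def remainder(splits, p, n):
    """splits: list of (p_i, n_i) per branch; (p, n): totals at the node."""
    return sum((pi + ni) / (p + n) * entropy2(pi, ni) for pi, ni in splits)

# Patrons?: None -> (0+, 2-), Some -> (4+, 0-), Full -> (2+, 4-)
print(round(remainder([(0, 2), (4, 0), (2, 4)], 6, 6), 3))          # 0.459
# Type?: French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-)
print(round(remainder([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6), 3))  # 1.0
```

The gain of an attribute is the 1 bit needed at the root minus its remainder, so minimizing the remainder maximizes information gain.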


Example contd.<br />

Decision tree learned from the 12 examples:<br />

[Figure: the learned tree. Patrons? None → F, Some → T, Full → test Hungry?; Hungry? No → F, Yes → test Type?; Type? French → T, Italian → F, Burger → T, Thai → test Fri/Sat? (No → F, Yes → T).]<br />

Substantially simpler than the “true” tree – a more complex hypothesis isn’t justified by the small amount of data<br />

DEMO: Gain selection<br />


Performance measurement<br />

Question: How do we know that h ≈ f?<br />

Answer: Try h on a new test set of examples (use the same distribution over the example space as for the training set)<br />

Learning curve = % correct on the test set as a function of training set size<br />

[Figure: learning curve for the restaurant data – % correct on the test set rises from about 0.4 towards 1 as the training set grows from 0 to 100 examples.]<br />


Performance measurement contd.<br />

The learning curve depends on ...<br />

realizable (can express the target function) vs. non-realizable<br />

non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., thresholded linear function)<br />

redundant expressiveness (e.g., loads of irrelevant attributes)<br />

[Figure: % correct vs. number of examples – the realizable curve approaches 1, the redundant curve rises more slowly, and the non-realizable curve flattens out below 1.]<br />


Overfitting in decision trees<br />

Consider adding “noisy” training examples X13 and X14:<br />

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type   | Est  | WillWait<br />
X13     | F   | T   | T   | T   | Some | $$    | T    | T   | French | 0–10 | F<br />
X14     | F   | T   | T   | T   | Some | $     | T    | T   | Thai   | 0–10 | F<br />

Reality check: what is the effect on the tree we learned earlier?<br />

[Figure: the learned tree again – Patrons? None → F, Some → T, Full → Hungry? → Type? → Fri/Sat?, as on the previous slide. Both noisy examples have Patrons?=Some, a branch that currently returns T.]<br />


Overfitting<br />

Consider the error of hypothesis h over<br />

the training data: error_t(h)<br />

the entire distribution D of data (often approximated by measurement on a test set): error_D(h)<br />

Overfitting<br />

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that<br />

error_t(h) < error_t(h′) and error_D(h) > error_D(h′)<br />



Overfitting (cont’d)<br />

[Figure: accuracy vs. size of tree (number of nodes, 0–100). Accuracy on the training data keeps climbing towards 0.9, while accuracy on the test data peaks at a moderate tree size and then declines – the signature of overfitting.]<br />


Avoiding Overfitting<br />

How can we avoid overfitting?<br />

stop growing when data split not statistically significant<br />

grow full tree, then post-prune<br />

How to select “best” tree:<br />

Measure performance over training data (statistical tests<br />

needed)<br />

Measure performance over separate validation data set<br />



Reduced-Error Pruning<br />

Split data into a training set and a validation set<br />

Do until further pruning is harmful:<br />

1 Evaluate the impact on the validation set of pruning each possible node (plus those below it)<br />

2 Greedily remove the one that most improves validation set accuracy<br />

⇒ Produces the smallest version of the most accurate subtree<br />

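The pruning loop above can be sketched compactly. Assumptions not in the slides: the ('leaf', label) / ('node', attribute, branches) tree encoding is a hypothetical format, and each collapsed node is replaced by the majority label of the validation examples reaching it.

```python
# A compact sketch of reduced-error pruning (illustrative assumptions noted above).
from collections import Counter

def classify(tree, x):
    while tree[0] == 'node':
        _, attr, branches = tree
        tree = branches[x[attr]]
    return tree[1]

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def majority_leaf(data):
    return ('leaf', Counter(y for _, y in data).most_common(1)[0][0])

def collapsed_variants(tree, data):
    """Yield every tree obtained by replacing one internal node with a
    majority leaf over the validation examples that reach it."""
    if tree[0] == 'leaf' or not data:
        return
    _, attr, branches = tree
    yield majority_leaf(data)                      # collapse this node itself
    for v, sub in branches.items():
        reaching = [(x, y) for x, y in data if x[attr] == v]
        for new_sub in collapsed_variants(sub, reaching):
            new_branches = dict(branches)
            new_branches[v] = new_sub
            yield ('node', attr, new_branches)

def reduced_error_prune(tree, validation):
    improved = True
    while improved:                                # until further pruning is harmful
        improved = False
        best_acc = accuracy(tree, validation)
        for candidate in collapsed_variants(tree, validation):
            acc = accuracy(candidate, validation)
            if acc > best_acc:                     # greedily keep the best collapse
                tree, best_acc, improved = candidate, acc, True
    return tree

# An overfit tree: the inner test on B only fits noise (true concept is just A).
tree = ('node', 'A', {0: ('node', 'B', {0: ('leaf', False), 1: ('leaf', True)}),
                      1: ('leaf', True)})
val = [({'A': 0, 'B': 0}, False), ({'A': 0, 'B': 1}, False),
       ({'A': 1, 'B': 0}, True),  ({'A': 1, 'B': 1}, True)]
pruned = reduced_error_prune(tree, val)
print(accuracy(tree, val), accuracy(pruned, val))   # 0.75 1.0
```

Pruning the noisy B-subtree raises validation accuracy from 0.75 to 1.0 and leaves the smaller tree that tests only A.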


Effect of Reduced-Error Pruning<br />

[Figure: accuracy vs. size of tree (number of nodes, 0–100). Training-data accuracy climbs towards 0.9; test-data accuracy declines as the tree overfits; a third curve, test accuracy during pruning, stays high as nodes are removed.]<br />


Summary<br />

Learning is needed for unknown environments (and for lazy designers)<br />

Learning agent = performance element + learning element<br />

The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation<br />

For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples<br />

Decision tree learning using information gain<br />

Learning performance = prediction accuracy measured on a test set<br />

Next week: Agnar Aamodt talks about case-based reasoning. (His paper is on the webpage, and is part of the curriculum.)<br />
