
[Figure 1: Several students (s1 to s7) evaluated in a fixed time τ.]

2.3, 2.26, 1.6 and 1 for students s1, s2, s3 and s4.

The problem is that there is no formula which is valid in general, for different tasks and kinds of agents. Consequently, in this setting, the way in which performance is measured is always task-dependent. But worse, the compensation between v, n and τ is typically non-linear, so the measure makes different choices when units change, or gives too much weight to speed. Additionally, when τ → ∞ the measure goes to 0 (or diverges), against the intuition that the larger the time given, the better the evaluation. But the main problem of using time is that, for every function which is increasing on speed (n/τ), there is always a very fast agent with a very small average reward such that it gets better and better scores. Consider, for instance, a student s5 who does 4,000,000 exercises at random in the half hour and is able to score 1 in 4,000 of them and 0 in the rest. The value would be (1/1000) × (2000/0.5) = 4. With a very low average performance (1/1000), this student gets the best result.
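The excerpt does not state the exact aggregation being used here, but a measure of the form v = (Σr/n) × √n/τ reproduces the value of 4 for s5. The following is a minimal sketch under that assumption; the function name and the explicit reward list are illustrative only.

```python
import math

def value(rewards, tau):
    """Average reward times sqrt(n), divided by the allotted time tau (hypothetical measure)."""
    n = len(rewards)
    return (sum(rewards) / n) * math.sqrt(n) / tau

tau = 0.5  # half an hour, in hours

# s5: 4,000,000 random exercises, of which 4,000 score 1 and the rest score 0.
s5_rewards = [1] * 4_000 + [0] * 3_996_000
print(value(s5_rewards, tau))  # 4.0: the best score despite a 1/1000 average reward
```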

To make things still worse, compare s3 with s6 as shown in Figure 1. The speed of s6 is more than six times greater than s3's, but s3 reaches a state where results are always 1 in about 10 minutes, while s6 requires about 17 minutes. Yet if we consider speed, s6 has a value v′ = (16/25) × (5/0.5) = 5.2 (while it was 1.6 for s3).

But in order to realise that this apparently trivial problem is a challenging one, consider another case. Student s7 acts randomly but she or he modulates time in the following way: whenever the result is 1, she or he stops doing exercises; if the result is 0, more exercises are performed very quickly until a 1 is obtained. Note that this strategy scores much better than random in the long term. This means that an opportunistic use of the times could mangle the measurement and convey wrong results.
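As a rough illustration of why s7's strategy pays off, the sketch below assumes a toy environment where each exercise is scored 1 with probability p, and a time-weighted average reward in which a result counts for as long as it stands; both the environment and this particular measure are assumptions made only for this example.

```python
import random

def time_weighted_avg(events, tau):
    """events: list of (timestamp, reward); each reward stands until the next event (or tau)."""
    next_times = [t for t, _ in events[1:]] + [tau]
    return sum(r * (nxt - t) for (t, r), nxt in zip(events, next_times)) / tau

def constant_rate_random(p, rate, tau, rng):
    """Answer one exercise every 1/rate minutes until the time tau is over."""
    return [(i / rate, 1 if rng.random() < p else 0) for i in range(int(rate * tau))]

def opportunistic_s7(p, fast_rate, tau, rng):
    """Answer very quickly until a 1 is obtained, then stop acting."""
    t, events = 0.0, []
    while t < tau:
        r = 1 if rng.random() < p else 0
        events.append((t, r))
        if r == 1:
            break
        t += 1.0 / fast_rate
    return events

rng = random.Random(0)
p, tau = 0.1, 30.0  # success probability and half an hour (in minutes)
print(time_weighted_avg(constant_rate_random(p, 2, tau, rng), tau))  # stays near p
print(time_weighted_avg(opportunistic_s7(p, 600, tau, rng), tau))    # close to 1
```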

The previous example tries to informally illustrate the goal and the many problems which arise around agent evaluation in a finite time τ. Simple alternatives such as using fixed time slots are not reasonable, since we want to evaluate agents of virtually any speed without making them wait. A similar (and simpler) approach is to set a maximum number of cycles n instead of a time τ, but this makes testing almost unfeasible if we do not know the speed of the agent in advance (the test could last milliseconds or years).

As apparently there is no trivial solution, in this paper we want to address the general problem of measuring performance in a time τ under the following setting:

• The overall allotted evaluation time τ is variable and independent of the environment and agent.
• Agents can take a variable time to make an action, and this time can also be part of their policy.
• The environment must react immediately (no delay time is computed on its side).
• The larger the time τ, the better the assessment should be (in terms of reliability). This would allow the evaluation to be anytime.
• A constant-rate random agent π_rand^r should have the same expected value for every τ and rate r (see the sketch after this list).
• The evaluation must be fair, avoiding opportunistic agents which start with low performance to show an impressive improvement later on, or which stop acting when they get good results (by chance or not).
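The constant-rate requirement can be made concrete with a small check. The sketch below assumes a toy environment that rewards each random action with 1 with probability p (an assumption for illustration only): the per-cycle average reward of a constant-rate random agent is invariant to τ and r, whereas the raw sum of rewards is not and therefore cannot serve directly as the value.

```python
import random

def run_constant_rate_random(p, rate, tau, rng):
    """Rewards of a random agent acting every 1/rate seconds until tau seconds have passed."""
    return [1 if rng.random() < p else 0 for _ in range(int(rate * tau))]

rng = random.Random(42)
p = 0.25
for tau, rate in [(100, 1), (100, 50), (1000, 1), (1000, 50)]:
    rewards = run_constant_rate_random(p, rate, tau, rng)
    print(f"tau={tau:5}  rate={rate:3}  average={sum(rewards)/len(rewards):.3f}  sum={sum(rewards)}")
# The average stays close to p = 0.25 in every configuration; the sum scales with rate*tau.
```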

The main contribution of this work is that we revisit the classical reward aggregation (payoff) functions which are commonly used in reinforcement learning and related areas for our setting (continuous time on the agent side, discrete time on the environment side), we analyse the problems of each of them, and we propose a new modification of the average reward to obtain a consistent measurement for this case, where the agent not only decides which action to perform but also decides how much time the decision is going to take.

Setting Definition and Notation

An environment is a world where an agent can interact through actions, rewards and observations. The set of interactions between the agent and the environment is a decision process. Decision processes can be considered discrete or continuous, and stochastic or deterministic. In our case, the sequence of events is exactly the same as in a discrete-time decision process. Actions are limited to a finite set of symbols A (e.g. {left, right, up, down}), rewards are taken from any subset R of the rational numbers, and observations are also limited to a finite set O of possibilities. We will use a_i, r_i and o_i to denote (respectively) the action, reward and observation at interaction or cycle (or, more loosely, state) i, with i being a positive natural number. The order of events is always: reward, observation and action. A sequence of k interactions is then a string such as r_1 o_1 a_1 r_2 o_2 a_2 ... r_k o_k a_k. We call these sequences histories, and we will use the notation ˜roa_≤k, ˜roa′_≤k, ... to refer to any of these sequences of k interactions, and ˜ro_≤k, ˜ro′_≤k, ... to refer to any of these sequences just before the action, i.e. r_1 o_1 a_1 r_2 o_2 a_2 ... r_k o_k. Physical time is measured in seconds. We denote by t_i the total physical time elapsed until a_i is performed by the agent.
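As a concrete picture of this setting, the sketch below records such a history together with the elapsed times t_i during an evaluation of length τ. The Agent and Environment interfaces (act, step, apply) are hypothetical placeholders, not part of the paper.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Interaction:
    reward: float      # r_i, from a subset R of the rationals
    observation: str   # o_i, from a finite set O
    action: str        # a_i, from a finite set A
    t: float           # total physical time (seconds) elapsed until a_i is performed

@dataclass
class History:
    interactions: list = field(default_factory=list)

    def record(self, reward, observation, action, t):
        self.interactions.append(Interaction(reward, observation, action, t))

def evaluate(agent, environment, tau):
    """Reward-observation-action loop: the environment reacts immediately,
    while the agent may spend any amount of time deciding its action."""
    history, start = History(), time.monotonic()
    while True:
        reward, observation = environment.step()  # r_i, o_i (immediate)
        action = agent.act(reward, observation)    # a_i (agent-controlled delay)
        t = time.monotonic() - start
        if t > tau:
            break                                  # interactions after tau do not count
        history.record(reward, observation, action, t)
        environment.apply(action)
    return history
```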
