Advances in Intelligent Systems Research
Figure 1: Several students evaluated in a fixed time.
2.3, 2.26, 1.6 and 1 for students s1, s2, s3 and s4.
The problem is that there is no formula which is valid in general, for different tasks and kinds of agents. Consequently, in this setting, the way in which performance is measured is always task-dependent. Worse still, the compensation between v, n and τ is typically non-linear, so the measure makes different choices when units change, or gives too much weight to speed. Additionally, when τ → ∞ the measure goes to 0 (or diverges), against the intuition that the larger the time given, the better the evaluation. But the main problem of using time is that for every function which is increasing in speed (n/τ), there is always a very fast agent with a very small average reward that gets better and better scores. Consider, for instance, a student s5 who does 4,000,000 exercises at random in the half an hour, and is able to score 1 in 4,000 of them and 0 for the rest. The value would be (1/1000) × (2000/0.5) = 4. With a very low average performance (1/1000), this student gets the best result.
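The arithmetic above can be checked directly. The payoff v·√n/τ (with v the average reward, n the number of exercises and τ the time in hours) is inferred here from the numbers shown; the exact formula is introduced before this excerpt:

```python
from math import sqrt

# Payoff inferred from the arithmetic shown in the text (an assumption
# here): v * sqrt(n) / tau, with v the average reward, n the number of
# exercises and tau the elapsed time in hours.
def payoff(v, n, tau):
    return v * sqrt(n) / tau

v5 = 4_000 / 4_000_000             # s5 scores 1 on 4,000 of 4,000,000 exercises
print(payoff(v5, 4_000_000, 0.5))  # (1/1000) * (2000/0.5) = 4.0
```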
To make things still worse, compare s3 with s6 as shown in Figure 1. The speed of s6 is more than six times greater than s3's, but s3 reaches a state where results are always 1 in about 10 minutes, while s6 requires about 17 minutes. But if we consider speed, s6 has a value v′ = (13/25) × (5/0.5) = 5.2 (while it was 1.6 for s3).
But in order to realise that this apparently trivial problem is a challenging one, consider another case. Student s7 acts randomly but she or he modulates time in the following way: whenever the result is 1, she or he stops doing exercises. If the result is 0, then more exercises are performed very quickly until a 1 is obtained. Note that this strategy scores much better than random in the long term. This means that an opportunistic use of the times could distort the measurement and convey wrong results.
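A quick simulation makes s7's advantage concrete. The sketch below (illustrative parameters and choice of measure, not from the paper) weights each result by the physical time it stands, so stopping on a 1 lets that 1 dominate the whole remaining period:

```python
import random

def time_weighted_score(tau=1000.0, fast=0.01, seed=0):
    """Toy model of student s7: answer exercises at random (1 with
    probability 1/2) every `fast` seconds; on the first 1, stop and
    keep that result for the remaining time.  The score weights each
    result by how long it stands (an illustrative measure)."""
    rng = random.Random(seed)
    t = 0.0
    weighted = 0.0
    while t < tau:
        if rng.randint(0, 1) == 1:  # got a 1: stop doing exercises
            weighted += tau - t     # the 1 stands until the end of tau
            break
        t += fast                   # got a 0: retry very quickly
    return weighted / tau

print(time_weighted_score())  # near 1.0; a constant-rate random student averages 0.5
```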
The previous example informally illustrates the goal and the many problems which arise around agent evaluation in a finite time τ. Simple alternatives such as using fixed time slots are not reasonable, since we want to evaluate agents of virtually any speed, without making them wait. A similar (and simpler) approach is to set a maximum number of cycles n instead of a time τ, but this makes testing almost infeasible if we do not know the speed of the agent in advance (the test could last milliseconds or years).
As there is apparently no trivial solution, in this paper we address the general problem of measuring performance in a time τ under the following setting:
• The overall allotted evaluation time τ is variable and independent of the environment and agent.
• Agents can take a variable time to make an action, which can also be part of their policy.
• The environment must react immediately (no delay time computed on its side).
• The larger the time τ, the better the assessment should be (in terms of reliability). This would allow the evaluation to be anytime.
• A constant-rate random agent π_rand^r should have the same expected value for every τ and rate r.
• The evaluation must be fair, avoiding opportunistic agents, which start with low performance to show an impressive improvement later on, or which stop acting when they get good results (by chance or not).
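The constant-rate requirement, for instance, already holds for the plain (untimed) average reward; a quick empirical check, with binary rewards drawn uniformly at random (a purely illustrative assumption):

```python
import random

def expected_score(rate, tau, trials=2000):
    """Empirical expected average reward of a constant-rate random
    agent performing `rate` actions per second for `tau` seconds,
    each rewarded 0 or 1 uniformly at random."""
    n = max(1, int(rate * tau))
    total = 0.0
    for _ in range(trials):
        total += sum(random.randint(0, 1) for _ in range(n)) / n
    return total / trials

random.seed(1)
scores = [expected_score(r, tau=2.0) for r in (1, 10, 100)]
print(scores)  # all close to 0.5, independently of the rate r
```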
The main contribution of this work is that we revisit the classical reward aggregation (payoff) functions which are commonly used in reinforcement learning and related areas for our setting (continuous time on the agent's side, discrete on the environment's), we analyse the problems of each of them, and we propose a new modification of the average reward to get a consistent measurement for this case, where the agent not only decides which action to perform but also decides the time the decision is going to take.
Setting Definition and Notation
An environment is a world with which an agent can interact through actions, rewards and observations. The set of interactions between the agent and the environment is a decision process. Decision processes can be discrete or continuous, and stochastic or deterministic. In our case, the sequence of events is exactly the same as in a discrete-time decision process. Actions are limited to a finite set of symbols A (e.g. {left, right, up, down}), rewards are taken from any subset R of the rational numbers, and observations are also limited to a finite set O of possibilities. We will use a_i, r_i and o_i to (respectively) denote the action, reward and observation at interaction or cycle (or, more loosely, state) i, with i being a positive natural number. The order of events is always: reward, observation and action. A sequence of k interactions is then a string such as r_1 o_1 a_1 r_2 o_2 a_2 … r_k o_k a_k. We call these sequences histories, and we will use the notation ˜roa_{≤k}, ˜roa′_{≤k}, …, to refer to any of these sequences of k interactions, and ˜ro_{≤k}, ˜ro′_{≤k}, …, to refer to any of these sequences just before the action, i.e. r_1 o_1 a_1 r_2 o_2 a_2 … r_k o_k. Physical time is measured in seconds. We denote by t_i the total physical time elapsed until a_i is performed by the agent.
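The interaction protocol above can be sketched as a loop. The `Env`/`Agent` interfaces below are hypothetical (the paper defines no API); the point is that the environment replies instantly while all elapsed time is attributed to the agent's deliberation:

```python
import time

def run_episode(env, agent, tau):
    """Run the timed interaction loop for at most tau seconds of
    physical time.  Events per cycle: reward, observation, action."""
    history = []
    start = time.monotonic()
    r, o = env.reset()                  # first reward and observation
    while True:
        a = agent.act(r, o)             # the agent chooses how long to "think"
        t_i = time.monotonic() - start  # total physical time until a_i is performed
        if t_i > tau:                   # a_i falls outside the allotted time
            break
        history.append((r, o, a, t_i))
        r, o = env.step(a)              # immediate reaction, no delay on this side
    return history
```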