Advances in Intelligent Systems Research
Figure 1: Several students evaluated in a fixed time.
2.3, 2.26, 1.6 and 1 for students s1, s2, s3 and s4.
The problem is that there is no formula which is valid in general, for different tasks and kinds of agents. Consequently, in this setting, the way in which performance is measured is always task-dependent. Worse still, the compensation between v, n and τ is typically non-linear, so the measure makes different choices when units change, or gives too much weight to speed. Additionally, when τ → ∞ the measure goes to 0 (or diverges), against the intuition that the larger the time given, the better the evaluation. But the main problem of using time is that for every function which is increasing in speed (n/τ), there is always a very fast agent with a very small average reward that gets better and better scores. Consider, for instance, a student s5 who does 4,000,000 exercises at random in the half an hour, and is able to score 1 in 4,000 of them and 0 for the rest. The value would be (1/1000) × (2000/0.5) = 4. With a very low average performance (1/1000), this student gets the best result.
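The arithmetic above can be checked directly. The payoff v·√n/τ (with v the average reward, n the number of exercises and τ the time in hours) is inferred here from the numbers shown; the exact formula is introduced before this excerpt:

```python
from math import sqrt

# Payoff inferred from the arithmetic shown in the text (an assumption
# here): v * sqrt(n) / tau, with v the average reward, n the number of
# exercises and tau the elapsed time in hours.
def payoff(v, n, tau):
    return v * sqrt(n) / tau

v5 = 4_000 / 4_000_000             # s5 scores 1 on 4,000 of 4,000,000 exercises
print(payoff(v5, 4_000_000, 0.5))  # (1/1000) * (2000/0.5) = 4.0
```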
To make things still worse, compare s3 with s6 as shown in Figure 1. The speed of s6 is more than six times greater than s3's, but s3 reaches a state where results are always 1 in about 10 minutes, while s6 requires about 17 minutes. But if we consider speed, s6 has a value v′ = (13/25) × (5/0.5) = 5.2 (while it was 1.6 for s3).
But in order to realise that this apparently trivial problem is a challenging one, consider another case. Student s7 acts randomly but she or he modulates time in the following way: whenever the result is 1, she or he stops doing exercises. If the result is 0, then more exercises are performed very quickly until a 1 is obtained. Note that this strategy scores much better than random in the long term. This means that an opportunistic use of the times could distort the measurement and convey wrong results.
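A quick simulation makes s7's advantage concrete. The sketch below (illustrative parameters and choice of measure, not from the paper) weights each result by the physical time it stands, so stopping on a 1 lets that 1 dominate the whole remaining period:

```python
import random

def time_weighted_score(tau=1000.0, fast=0.01, seed=0):
    """Toy model of student s7: answer exercises at random (1 with
    probability 1/2) every `fast` seconds; on the first 1, stop and
    keep that result for the remaining time.  The score weights each
    result by how long it stands (an illustrative measure)."""
    rng = random.Random(seed)
    t = 0.0
    weighted = 0.0
    while t < tau:
        if rng.randint(0, 1) == 1:  # got a 1: stop doing exercises
            weighted += tau - t     # the 1 stands until the end of tau
            break
        t += fast                   # got a 0: retry very quickly
    return weighted / tau

print(time_weighted_score())  # near 1.0; a constant-rate random student averages 0.5
```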
The previous example informally illustrates the goal and the many problems which arise around agent evaluation in a finite time τ. Simple alternatives such as using fixed time slots are not reasonable, since we want to evaluate agents of virtually any speed, without making them wait. A similar (and simpler) approach is to set a maximum number of cycles n instead of a time τ, but this makes testing almost infeasible if we do not know the speed of the agent in advance (the test could last milliseconds or years).
As there is apparently no trivial solution, in this paper we address the general problem of measuring performance in a time τ under the following setting:
• The overall allotted evaluation time τ is variable and independent of the environment and agent.
• Agents can take a variable time to make an action, which can also be part of their policy.
• The environment must react immediately (no delay time computed on its side).
• The larger the time τ, the better the assessment should be (in terms of reliability). This would allow the evaluation to be anytime.
• A constant-rate random agent π_rand^r should have the same expected value for every τ and rate r.
• The evaluation must be fair, avoiding opportunistic agents, which start with low performance to show an impressive improvement later on, or which stop acting when they get good results (by chance or not).
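The constant-rate requirement, for instance, already holds for the plain (untimed) average reward; a quick empirical check, with binary rewards drawn uniformly at random (a purely illustrative assumption):

```python
import random

def expected_score(rate, tau, trials=2000):
    """Empirical expected average reward of a constant-rate random
    agent performing `rate` actions per second for `tau` seconds,
    each rewarded 0 or 1 uniformly at random."""
    n = max(1, int(rate * tau))
    total = 0.0
    for _ in range(trials):
        total += sum(random.randint(0, 1) for _ in range(n)) / n
    return total / trials

random.seed(1)
scores = [expected_score(r, tau=2.0) for r in (1, 10, 100)]
print(scores)  # all close to 0.5, independently of the rate r
```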
The main contribution of this work is that we revisit the classical reward aggregation (payoff) functions which are commonly used in reinforcement learning and related areas for our setting (continuous time on the agent's side, discrete on the environment's), we analyse the problems of each of them, and we propose a new modification of the average reward to get a consistent measurement for this case, where the agent not only decides which action to perform but also decides the time the decision is going to take.
Setting Definition and Notation
An environment is a world with which an agent can interact through actions, rewards and observations. The set of interactions between the agent and the environment is a decision process. Decision processes can be discrete or continuous, and stochastic or deterministic. In our case, the sequence of events is exactly the same as in a discrete-time decision process. Actions are limited to a finite set of symbols A (e.g. {left, right, up, down}), rewards are taken from any subset R of the rational numbers, and observations are also limited to a finite set O of possibilities. We will use a_i, r_i and o_i to (respectively) denote the action, reward and observation at interaction or cycle (or, more loosely, state) i, with i being a positive natural number. The order of events is always: reward, observation and action. A sequence of k interactions is then a string such as r_1 o_1 a_1 r_2 o_2 a_2 … r_k o_k a_k. We call these sequences histories, and we will use the notation ˜roa_{≤k}, ˜roa′_{≤k}, …, to refer to any of these sequences of k interactions, and ˜ro_{≤k}, ˜ro′_{≤k}, …, to refer to any of these sequences just before the action, i.e. r_1 o_1 a_1 r_2 o_2 a_2 … r_k o_k. Physical time is measured in seconds. We denote by t_i the total physical time elapsed until a_i is performed by the agent.
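The interaction protocol above can be sketched as a loop. The `Env`/`Agent` interfaces below are hypothetical (the paper defines no API); the point is that the environment replies instantly while all elapsed time is attributed to the agent's deliberation:

```python
import time

def run_episode(env, agent, tau):
    """Run the timed interaction loop for at most tau seconds of
    physical time.  Events per cycle: reward, observation, action."""
    history = []
    start = time.monotonic()
    r, o = env.reset()                  # first reward and observation
    while True:
        a = agent.act(r, o)             # the agent chooses how long to "think"
        t_i = time.monotonic() - start  # total physical time until a_i is performed
        if t_i > tau:                   # a_i falls outside the allotted time
            break
        history.append((r, o, a, t_i))
        r, o = env.step(a)              # immediate reaction, no delay on this side
    return history
```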