chooses an action a ∈ A_s that causes the environment to transition to state s′ in the next time step t+1 and return a reward with the expected value r ∈ R(s, a). Here S is the set of all possible states, A_s is the set of all possible actions in a given state s, and R(s, a) is the function that determines the reward for taking an action a in state s. The probability that the process advances to state s′ is given by the state transition function P(s′|s, a). The agent's behaviour is described by a policy π that determines how the agent chooses an action in any given state. The optimal value in this setup can be obtained using dynamic programming and is given by Bellman's equation (Equation 1) [19], which relates the value function V^∗(s) in one time step to the value function V^∗(s′) in the next time step. Here, γ ∈ [0, 1) is the discount factor that determines the importance of future rewards.

V^{*}(s) = \max_{a \in A_s} \Bigl[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{*}(s') \Bigr].   (1)

In the reinforcement learning setting both P(s′|s, a) and R(s, a) are unknown. So the agent has little choice but to physically act in the environment to observe the immediate reward, and use the samples over time to build estimates of the expected return in each state, in the hope of obtaining a good approximation of the optimal policy. Typically, the agent tries to maximise some cumulative function of the immediate rewards, such as the expected discounted return R^π(s) (Equation 2) at each time step t. R^π(s) captures the infinite-horizon discounted (by γ) sum of the rewards that the agent may expect (denoted by E) to receive when starting in state s and following the policy π.

R^{\pi}(s) = E\{ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots \}.   (2)

One way to maximise this function is to evaluate all policies by simply following each one, sampling the rewards obtained, and then choosing the policy that gave the best return. The obvious problem with such a brute-force method is that the number of possible policies is often too large to be practical. Furthermore, if rewards are stochastic, then even more samples are required in order to estimate the expected return. A practical solution, based on Bellman's work on value iteration, is Watkins' Q-Learning algorithm [20], given by the action-value function update (Equation 3). The Q-function gives the expected discounted return for taking action a in state s and following the policy π thereafter. Here α is the learning rate that determines to what extent the existing Q-value (i.e., Q^π(s, a)) will be corrected by the new update (i.e., R(s, a) + γ max_{a′} Q(s′, a′)), and max_{a′} Q(s′, a′) is the maximum possible reward in the following state, i.e., the reward for taking the optimal action thereafter.

Q^{\pi}(s, a) \leftarrow Q^{\pi}(s, a) + \alpha \Bigl[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q^{\pi}(s, a) \Bigr].   (3)

In order to learn the Q-values, the agent must try out the available actions in each state and learn from these experiences over time. Given that acting and learning are interleaved and ongoing performance is important, a key challenge when choosing actions is to find a good balance between exploiting current knowledge to get the best reward known so far, and exploring new actions in the hope of finding better rewards. A simple way to achieve this is to select the best known action most of the time, but every once in a while choose a random action with a small probability, say ɛ. This strategy is well known as ɛ-greedy and is the one we use in this work. In future work we plan to experiment with more advanced strategies. For instance, in the so-called Boltzmann selection strategy, instead of picking actions randomly, weights are assigned to the available actions based on their existing action-value estimates, so that actions that perform well have a higher chance of being selected in the exploration phase.
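To make the preceding discussion concrete, the following is a minimal sketch, written in Python purely for illustration (the integration described in this paper targets the Goal agent programming language), of the tabular Q-learning update of Equation 3 combined with ɛ-greedy action selection. The environment interface (env.reset, env.step, env.actions) and the hyper-parameter values are assumptions made for the example, not part of the framework presented here.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon explore a random action,
    # otherwise exploit the best known action in this state.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Run one episode, applying the update of Equation 3 after every step.
    # 'env' is a hypothetical environment with reset/step/actions methods.
    state = env.reset()
    done = False
    while not done:
        actions = env.actions(state)          # A_s: actions available in state s
        action = epsilon_greedy(Q, state, actions, epsilon)
        next_state, reward, done = env.step(action)
        # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
        best_next = max((Q[(next_state, a)] for a in env.actions(next_state)),
                        default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q

# The action-value function is kept explicitly in memory as a table,
# mirroring the basic Q-Learning implementation used in this study.
Q = defaultdict(float)
```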
In this study we use a Q-Learning implementation where the precise action-value function is maintained in memory. It should be noted here that this implementation does not scale well to large state spaces. Of course, we could use an approximation of the action-value function, such as a neural network, to store this information in a compact manner. However, our focus here is not so much to use an efficient reinforcement learning technology as it is to see how such learning can be integrated into agent programming in a seamless manner. For this reason, in this version of the work, we have kept the basic Q-Learning implementation.

3 Related Work

In most languages for partial reinforcement learning programs, the programmer specifies a program containing choice points [21]. Because of the underspecification present in agent programming languages, there is no need to add such choice points, as multiple options are generated automatically by the agent program itself. There is little existing work on integrating learning capabilities within agent programming languages. In PRS-like cognitive architectures [2, 4, 22, 3] that are based in the BDI tradition, standard operating knowledge is programmed as abstract recipes or plans, often in a hierarchical manner. Plans whose preconditions hold in any runtime situation are considered applicable in that situation and may be chosen for execution. While such frameworks do not typically support learning, there has been recent work in this area. For instance, in [23] the learning process that decides when and how learning should proceed is itself described within plans that can be invoked in the usual manner. Our own previous investigations in this area include [24–26], where decision tree learning was used to improve hierarchical plan selection in the JACK [3] agent programming language. That work bears some resemblance to the present one in that the aim was to improve the choice of instantiated plans, as we do for bound action options in this study. In [17] we integrated Goal and reinforcement learning as we do in this paper, with the key difference that now (i) a learning primitive has been added to the Goal language to explicitly support adaptive behaviours, and (ii) a much richer state representation is used, i.e., the mental state of the agent.

Among other rule-based systems, ACT-R [27, 28] is a cognitive architecture primarily concerned with modelling human behaviour, where programming consists of writing production rules [29] that are condition-action pairs describing possible responses to various situations. Learning in ACT-R consists of forming