of a module. For example, to change the regular program module in the agent of Figure 1 to an adaptive one, we only have to change order=random to order=adaptive as follows:

    main module {
        program[order=adaptive] {
            if bel(true) then move(X,Y).
        }
    }

With this specification, all possible action bindings will be evaluated by the underlying learning mechanism, and action selection will be governed by the Q-values derived over repeated runs, rather than being random as it was before.

The benefit is that the agent programmer does not have to explicitly think about learning as being separate from programming, bar adhering to some basic guidelines. The only recommendation we have for the programmer is to not use the belief base of the agent as a long-term memory store if learning is to be used, that is, not to keep adding belief facts that only serve to keep a history of events. For example, a programmer may choose to store the history of every choice the agent made in a maze-solving problem. If the programmer then enables learning so as to optimise the maze exploration strategy, learning will likely not deliver any useful results quickly, due to the very large state space created by the belief base. A similar argument applies to adding new goals to the mental state, but this is generally less of a problem since programs do not add new goals during execution to the same extent as they do beliefs. We must add here that in some problems this representation is unavoidable. In future work, we hope to address such cases by allowing learning to decouple the mental state into relevant and irrelevant parts using a dependency graph that is already part of the Goal implementation. For instance, in the maze example, if the exploration module code does not depend on the history of beliefs being added, then it should be possible to automatically isolate those beliefs from the state representation for learning purposes.

The final decision was on how rewards should be specified for reinforcement learning within the Goal programming framework. We do this using the existing Environment Interface Standard (EIS) that Goal uses to connect to external environments. The addition is a new "reward" query from Goal that, if implemented by the environment, returns the reward for the last executed action. If, however, the plugged-in environment does not support this query, then the reward is simply an evaluation of the goals of the agent: if all the goals have been achieved, the reward is 1.0, otherwise it is 0.0. The idea is that learning can be enabled regardless of whether a reward signal is available from the environment; in the latter case the agent effectively tries to optimise the number of steps it takes to achieve its goals. A future extension might be to give partial rewards between 0.0 and 1.0 based on how many independent goals have been satisfied. However, it is unclear whether rewards based solely on the agent's goals are always useful in learning, for example in programs that add or remove goals from the mental state.
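As a minimal sketch, the fallback just described could take the following shape on the Java side; the Environment and MentalState views and all method names here are illustrative assumptions and do not reproduce the actual Goal or EIS classes.

    // Sketch of the reward fallback described above. All names are
    // illustrative assumptions, not the actual Goal/EIS API.
    public final class RewardProvider {

        /** Minimal view of an EIS-connected environment, assumed for this sketch. */
        public interface Environment {
            boolean supportsRewardQuery();
            double lastActionReward();
        }

        /** Minimal view of the agent's mental state, assumed for this sketch. */
        public interface MentalState {
            boolean allGoalsAchieved();
        }

        /**
         * Reward for the last executed action: the environment's own reward if it
         * implements the "reward" query, otherwise a binary goal-based reward.
         */
        public double reward(Environment env, MentalState state) {
            if (env.supportsRewardQuery()) {
                return env.lastActionReward();             // reward signal provided via EIS
            }
            return state.allGoalsAchieved() ? 1.0 : 0.0;   // goal-based fallback
        }
    }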
Under the hood we have implemented a Java-based interface that allows us to plug a generic reinforcement learning algorithm into Goal. The idea is to be able to offload the task of providing and maintaining the machine learning technology to the relevant expert. It will also allow us to easily replace the default Q-Learning implementation with a more efficient one in the future.
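One possible shape for such a pluggable learner is sketched below; the interface name and method signatures are our assumptions for illustration only and do not reproduce the actual Goal source code.

    // Hypothetical shape of a pluggable learner; names and signatures are
    // assumptions for illustration, not the actual Goal interface.
    import java.util.List;

    public interface AdaptiveLearner<S, A> {

        /** Select an action for the current state from the currently enabled action bindings. */
        A nextAction(S state, List<A> enabledActions);

        /** Feed back the observed reward and successor state for the last executed action. */
        void update(S state, A action, double reward, S nextState);
    }

Under such a scheme, the default Q-Learning algorithm would be just one implementation of the interface, and swapping in a different learner would require no changes to the agent program itself.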
