A Framework for Evaluating Early-Stage Human - of Marcus Hutter

More documents

Recommendations

Info

Claims and Challenges in Evaluating Human-Level Intelligent Systems John E. Laird * , Robert E. Wray III ** , Robert P. Marinier III * , Pat Langley *** * Division of Computer Science and Engineering, University of Michigan, Ann Arbor, MI 48109-2121 ** Soar Technology, Inc., 3600 Green Court, Suite 600, Ann Arbor, MI 48105 *** School of Computing and Information, Arizona State University, Tempe, AZ 85287-8809 Abstract This paper represents a first step in attempting to engage the research community in discussions about evaluation of human-level intelligent systems. First, we discuss the challenges of evaluating human-level intelligent systems. Second, we explore the different types of claims that are made about HLI systems, which are the basis for confirmatory evaluations. Finally, we briefly discuss a range of experimental designs that support the evaluation of claims. Introduction One of the original goals of Artificial Intelligence (AI) was to create systems that had general intelligence, able to approach the breadth and depth of human-level intelligence (HLI). In the last five years, there has been a renewed interest in this pursuit with a significant increase in research in cognitive architectures and general intelligence as indicated by the first conference on Artificial General Intelligence. Although there is significant enthusiasm and activity, to date, evaluation of HLI systems has been weak, with few comparisons or evaluations of specific claims, making it difficult to determine when progress has been made. Moreover, shared evaluation procedures, testbeds, and infrastructure are missing. Establishing these elements could bring together the existing community and attract additional researchers interested in HLI who are currently inhibited by the difficulty of breaking into the field. To confront the issue of evaluation, the first in a series of workshops was held in October 2008 at the University of Michigan, to discuss issues related to evaluation and comparison of human-level intelligent systems. This paper is a summarization of some of the discussions and conclusions of that workshop. The emphasis of the workshop was to explore issues related to the evaluation of HLI, but to stop short of making proposals for specific evaluation methodologies or testbeds. That is our ultimate goal and it will be pursued at future workshops. In this first workshop, we explored the challenges in HLI evaluation, the claims that are typically made about HLI, and how those claims can be evaluated. 1 1 For an in depth and more complete discussion of evaluation of AI systems in general, see Cohen (1995). 91 Challenges in Evaluating HLI Systems Defining the goal for HLI One of the first steps in determining how to evaluate research in a field is to develop a crisp definition its goals, and if possible, what the requirements are for achieving those goals. Legg and Hutter (2007) review a wide variety of informal and formal definitions and tests of intelligence. Unfortunately, none of these definitions provide practical guidance in how to evaluate and compare the current state of the art in HLI systems. Over fifty years ago, Turing (1950) tried to finesse the issue of defining HLI by creating a test that involved comparison to human behavior, the Turing Test. In this test, no analysis of the components of intelligence was necessary; the only question was whether or not a system behaved in a way that was indistinguishable from humans. Although widely known and popular with the press, the Turing Test has failed as a scientific tool because of its many flaws: it is informal, imprecise, and is not designed for easy replication. Moreover, it tests only a subset of characteristics normally associated with intelligence, and it does not have a set of incremental challenges that can pull science forward (Cohen, 2005). As a result, none of the major research projects pursuing HLI use the Turing Test as an evaluation tool, and none of the major competitors in the Loebner Prize (an annual competition based on the Turing Test) appear to be pursuing HLI. One alternative to the Turing Test is the approach taken in cognitive modeling, where researchers attempt to develop computational models that think and learn similar to humans. In cognitive modeling, the goal is not only to build intelligent systems, but also to better understand human intelligence from a computational perspective. For this goal, matching the details of human performance in terms of reaction times, error rates, and similar metrics is an appropriate approach to evaluation. In contrast, the goal of HLI research is to create systems, possibly inspired by humans, but using that as a tactic instead of a necessity. Thus, HLI is not defined in terms of matching human
eaction times, error rates, or exact responses, but instead, the goal is to build computer systems that exhibit the full range of the cognitive capabilities we find in humans. Primacy of Generality One of the defining characteristics of HLI is that there is no single domain or task that defines it. Instead, it involves the ability to pursue tasks across a broad range of domains, in complex physical and social environments. An HLI system needs broad competence. It needs to successfully work on a wide variety of problems, using different types of knowledge and learning in different situations, but it does not need to generate optimal behavior; in fact, the expectation is it rarely will. This will have a significant impact on evaluation, as defining and evaluating broad competency is more difficult than evaluating narrow optimality. Another aspect of generality is that, within the context of a domain, an HLI system can perform a variety of related tasks. For example, a system that has a degree of competence in chess should be able to play chess, teach chess, provide commentary for a chess game, or even develop and play variants of chess (such as Kriegspeil chess). Thus, evaluation should not be limited to a single task within a domain. Integrated Structure of HLI Systems Much of the success of AI has been not only in single tasks, but also in specific cognitive capabilities, such as planning, language understanding, specific types of reasoning, or learning. To achieve HLI, it is widely accepted that a system must integrate many capabilities to create coherent end-to-end behavior, with non-trivial interactions between the capabilities. Not only is this challenging from the standpoint of research and development, but it complicates evaluation because it is often difficult to identify which aspects of a system are responsible for specific aspects of behavior. A further complication is that many HLI systems are developed not by integrating separate implementations of the cognitive capabilities listed earlier, but instead by further decomposing functionality into more primitive structures and process, such as short-term and long-term memories, primitive decision making and learning, representations of knowledge, and interfaces between components, such as shown in Figure 1. In this approach, higher-level cognitive capabilities, such as language processing or planning are implemented in a fixed substrate, differing in knowledge, but not in primitive structures and processes. This is the cognitive architecture approach to HLI development (Langley, Laird, & Rogers, in press), exemplified by Soar (Laird, 2008) and ICARUS (Langley & Choi, 2006). 92 Programmed Content Notional HLI System Learned Content Fixed Processes and Representational Primitives Perception Actions Figure 1: Structure of a Notional HLI System. This issue is clear in research on cognitive architectures because they make the following distinctions: • The architecture: the fixed structure that is shared across all higher-level cognitive capabilities and tasks. • The initial knowledge/content that is encoded in the architecture to achieve capabilities and support the pursuit of a range of tasks. • The knowledge/content that is learned through experience. For any HLI system, it is often difficult to disentangle the contributions of the fixed processes and primitives of the systems to the system’s behavior from any initial, domainspecific content and the learned knowledge, further complicating evaluation. There is a concern that when evaluating an agent’s performance, the quality of behavior can be more a reflection of clever engineering of the content than properties of the HLI systems itself. Thus, an important aspect of evaluation for HLI is to recognize the role of a prior content in task performance and attempt to control for such differences in represented knowledge. Long-term Existence Although not a major problem in today’s implementations, which typically focus on tasks of relatively short duration, HLI inherently involves long-term learning and long-term existence. It is one thing to evaluate behavior that is produced over seconds and minutes – it is possible to run many different scenarios, testing for the effects of variation. When a trial involves cumulative behavior over days, weeks, or months, such evaluation becomes extremely challenging due to the temporal extents of the experiments, and the fact that behavior becomes more and more a function of experience. Claims about HLI We are primarily interested in constructive approaches to HLI; thus, our claims are related to the functionality of our systems, and not their psychological or neurological realism. Achieving such realism is an important scientific goal, but one of the primary claims made by many
Page 2 and 3:
Ben Goertzel, Pascal Hitzler, Marcu
Page 4 and 5:
Artificial General Intelligence Vol
Page 6:
Preface Artificial General Intellig
Page 9 and 10:
Organizing Committee Tsvi Achler U.
Page 11 and 12:
A formal framework for the symbol g
Page 13 and 14:
Importing Space-time Concepts Into
Page 15 and 16:
one structure, everything else adap
Page 17 and 18:
generalize is essential to understa
Page 19 and 20:
First Application The above framewo
Page 21 and 22:
problem solving, use of knowledge,
Page 23 and 24:
Of particular note is the separatio
Page 25 and 26:
Conclusions and Future Work While p
Page 27 and 28:
Conceptual Spaces The conceptual ar
Page 29 and 30:
perceptions and actions. The lingui
Page 31 and 32:
means of the knowledge stored in th
Page 33 and 34:
grams given some (syntactically res
Page 35 and 36:
For example, let reverse([]) = [] r
Page 37 and 38:
Igor2 produced solutions with auxil
Page 39 and 40:
evolve neural net modules as quickl
Page 41 and 42:
ain”, consisting of some dozen or
Page 43 and 44:
Population Chr NListPtr m Chrm is a
Page 45 and 46:
and shaping, we feel that exploring
Page 47 and 48:
Table 2 reviews the key capabilitie
Page 49 and 50:
One way to achieve this goal would
Page 51 and 52:
question. Our approach of staying a
Page 53 and 54: H0 δB(H0) δF(H0) H1 δB(H1) δF(H
Page 55 and 56: Discussion and future work Testing
Page 57 and 58: Types of Reasoning Corresponding Fo
Page 59 and 60: Integrating Different Types of Reas
Page 61 and 62: good overview of analogy models can
Page 63 and 64: ��
Page 65 and 66: � ��
Page 67 and 68: ��
Page 69 and 70: A Unified Framework for IP Conditio
Page 71 and 72: negative evidence when added to a r
Page 73 and 74: ing strategy. Due to GOLEM’s rand
Page 75 and 76: any state index, n∈IN the current
Page 77 and 78: is the best observation summary, wh
Page 79 and 80: to a code for the rewards only, whi
Page 81 and 82: have exponentially many states (2O(
Page 83 and 84: where ˆσ 2 =Loss( ˆw)/n. Given
Page 85 and 86: • The cost function can be improv
Page 87 and 88: decides to introduce inflation or d
Page 89 and 90: € € The diffusion matrix is the
Page 91 and 92: tuning may be done adaptively by te
Page 93 and 94: Of course, Harnad’s argument is n
Page 95 and 96: By exploring variations of Harnad
Page 97 and 98: Discussion The representational sys
Page 99 and 100: the cognitive map is less of a map
Page 101 and 102: models of cognition except in cases
Page 103: 7(b), we can see that the straight
Page 107 and 108: comparing behavior with and without
Page 109 and 110: Doorenbos, R. B. 1994. Combining le
Page 111 and 112: egion, we leave the type of object
Page 113 and 114: extract features in support of reco
Page 115 and 116: with a minimum score of -1.0. The
Page 117 and 118: the NASA Human Error Modeling compa
Page 119 and 120: whether their models are capable of
Page 121 and 122: Incorporating Planning and Reasonin
Page 123 and 124: see an unclaimed ten-dollar bill al
Page 125 and 126: swering user queries) and the state
Page 127 and 128: Program Representation for General
Page 129 and 130: distribution P corresponding to the
Page 131 and 132: with all Ei replaced by the unbound
Page 133 and 134: Consciousness in Human and Machine:
Page 135 and 136: at the sensory receptors, is pre-pr
Page 137 and 138: The second choice is to take a deta
Page 139 and 140: Hebbian Constraint on the Resolutio
Page 141 and 142: The Case of Inputs that Change by T
Page 143 and 144: This route is relevant if the noise
Page 145 and 146: 1. Oculus Info. Inc. Toronto, Ont.
Page 147 and 148: Our implemented Turing judge used a
Page 149 and 150: 2.7; Human vs. JabberWacky t(157.6)
Page 151 and 152: UnsupervisedSegmentationofAudioSpee
Page 153 and 154: FinallyweusedthediscreteFourierTran
Page 155 and 156:
Secondly,thereexistsmorethanonelogi
Page 157 and 158:
Parsing PCFG within a General Proba
Page 159 and 160:
known in advance, although many imp
Page 161 and 162:
Figure 2: Shows the underlying weig
Page 163 and 164:
Self-Programming: Operationalizing
Page 165 and 166:
expressivity. More over, self-progr
Page 167 and 168:
existence of a distance function ov
Page 169 and 170:
Bootstrap Dialog: A Conversational
Page 171 and 172:
elation is typically a verb. In the
Page 173 and 174:
node RDF proposition N1 N2 N3 N4 N5
Page 175 and 176:
Analytical Inductive Programming as
Page 177 and 178:
Problem domain: puttable(x) PRE: cl
Page 179 and 180:
eq Hanoi(0, Src, Aux, Dst, S) = mov
Page 181 and 182:
Human and Machine Understanding Of
Page 183 and 184:
case orderings t correspond to the
Page 185 and 186:
in (1) received a different denotat
Page 187 and 188:
Abstract This paper analyzes the di
Page 189 and 190:
Having grounded meaning: In an inte
Page 191 and 192:
to experience” can be argued to b
Page 193 and 194:
Abstract Case-by-case Problem Solvi
Page 195 and 196:
First, since NARS is designed in th
Page 197 and 198:
typical response is to find such an
Page 199 and 200:
What Is Artificial General Intellig
Page 201 and 202:
Finally, the facts that both seem t
Page 203 and 204:
differ. NARS uses Narsese, the fair
Page 205 and 206:
Integrating Action and Reasoning th
Page 207 and 208:
states, with the condition that the
Page 209 and 210:
action model. The system is able to
Page 211 and 212:
Neuroscience and AI Share the Same
Page 213 and 214:
Relevance Based Planning: Why Its a
Page 215 and 216:
General Intelligence and Hypercompu
Page 217 and 218:
To appear, AGI-09 1 Stimulus proces
Page 219 and 220:
Distribution of Environments in For
Page 221 and 222:
The Importance of Being Neural-Symb
Page 223 and 224:
Improving the Believability of Non-
Page 225 and 226:
Understanding the Brain’s Emergen
Page 227 and 228:
Abstract The challenge of creating
Page 229 and 230:
Importing Space-time Concepts Into
Page 231 and 232:
HELEN: Using Brain Regions and Mech
Page 233 and 234:
Holistic Intelligence: Transversal
Page 235 and 236:
Achieving Artificial General Intell
Page 237:
Achler, Tsvi . . . . . . . . . . .
show all

A Framework for Evaluating Early-Stage Human - of Marcus Hutter

Create successful ePaper yourself

Delete template?

Save as template?