Investigating the Effects of Visual Saliency on Deictic Gesture Production by a Humanoid Robot

Aaron St. Clair, Ross Mead, and Maja J Matarić, Fellow, IEEE

Abstract—In many collocated human-robot interaction scenarios, robots are required to accurately and unambiguously indicate an object or point of interest in the environment. Realistic, cluttered environments containing many visually salient targets can present a challenge for the observer of such pointing behavior. In this paper, we describe an experiment and results detailing the effects of visual saliency and pointing modality on human perceptual accuracy of a robot's deictic gestures (head and arm pointing), and compare the results to the perception of human pointing.

I. INTRODUCTION

To carry on sustained interactions with people, autonomous robots must be able to effectively and naturally communicate in many different interaction contexts. Besides natural language, people employ both coverbal modalities (e.g., beat gestures) and nonverbal modalities (facial expression, proxemics, eye gaze, head orientation, and arm gestures, among others) to signal their intentions and to attribute intentions to the actions of others. Prior work has demonstrated that robots can successfully employ the same communication channels [1], [2], [3], [4]. Our aim is to develop a general, empirical understanding of the design factors involved in multimodal communication with a robot.
In this paper, we limit our focus to a study of deictic gestures since: (1) their use in human communication has been widely studied [6], [7], [8]; (2) they are relatively simple to map to intentional constructs in context [10]; and (3) they are generally useful to robots interacting in shared environments, since they serve to focus attention and refer to objects. To achieve robust deixis via gesture in a human-robot context, it is necessary to validate the perceived referent. This paper presents results from an experimental study of human perception of a robot's deictic gestures under a set of different environmental visual saliency conditions and pointing modalities, using our upper-torso humanoid robot, Bandit.

Manuscript received February 1, 2011. This work was supported in part by National Science Foundation (NSF) grants CNS-0709296, IIS-0803565, and IIS-0713697, and ONR MURI grant N00014-09-1-1031. R. Mead was supported by an NSF Graduate Research Fellowship.

A. St. Clair (corresponding author), R. Mead, and M. J. Matarić are with the Computer Science Department at the University of Southern California, Los Angeles, CA 90089-0781 USA (e-mail: astclair@usc.edu; rossmead@usc.edu; mataric@usc.edu).

II. EMBODIED DEICTIC GESTURE

Multi-disciplinary research from neuroscience and psychology has demonstrated that human gesture production is tightly coupled with language processing and production [11], [12]. There is also evidence that gestures are adapted by a speaker to account for the relative position of a listener and can, in some instances, substitute for speech functions [8], [10]. Bangerter [8] and Louwerse & Bangerter [10] demonstrated that deictic speech combined with deictic gesture offered no additional performance gain compared to one or the other used separately.

These findings have important implications for the field of human-robot interaction.
Robots interacting with humans in a shared physical environment should be able to take advantage of other social channels to both monitor and communicate intent during the course of an interaction, without complete reliance on speech. To make this possible, it is necessary to gain an empirical understanding of how to map well-studied human gestures to robots of varying capabilities and embodiments. Specifically, we are interested in identifying variables for proper production of robot gestures to best realize some fixed interpretation by a human observer. In general, this is difficult, for the same reasons that processing natural language is difficult: many gestures are context-dependent and rely on accurately estimating a mental model of the scope of attention and the possible intentions behind people's actions given only low-level perceptual input.

Deictic gestures, however, are largely consistent in their mapping to linguistic constructs, such as "that" and "there", and serve to focus the attention of observers on a specific object or location in the environment, or perhaps to indicate an intended effect involving such an object (e.g., "I will pick up that one"). These characteristics, while simplifying their interpretation and production, also make the gestures useful for referring to objects and grounding attention. Intentional analysis and timing are still nontrivial, except in the context of performing a specific, pre-determined task.

Both recognition [13], [14], [15], [16], [17] and production [18], [19], [20], [21] of deictic gestures have been studied in human-human, human-computer, and human-robot interaction (HRI) settings.
Our work adds to this field a step toward obtaining an empirically grounded HRI model of deictic gestural accuracy between people and robots, with implications for the design of robot embodiments and control systems that perform situated distal pointing.


A study of the literature on human deictic pointing behavior suggests a number of possible variables that could affect the robustness of referent selection when a robot employs deictic gestures, including physical appearance [25], timing, relative position, and orientation [6], [22], [23]. Assuming mobility, a robot could relocate or reorient itself relative to the viewer or to the referent target to improve the viewer's interpretation accuracy [24]. Most robots, however, point without accounting for distance to the target, by using a more or less constant arm extension or head gaze that is reoriented appropriately [18], [5]. Finally, since a gesture is grounded with respect to a specific referent in the environment, the robot must be able to correctly segment and localize visually salient objects in the environment at a granularity similar to that of the people with whom it is interacting. Studies of how people read robot gaze exist [31], as do biologically inspired methods for assessing and mapping visually salient features and objects in an environment [26], [27] and models of human visual attention selection [28]; however, the role of visual saliency during deictic reference by a robot is largely uninvestigated.

III. EXPERIMENTAL DESIGN

Given the large number of possible variables involved in optimizing a pointing gesture with a particular robot embodiment, we conducted an initial pilot study with our upper-torso humanoid robot, Bandit (Figure 1), sitting face-to-face with participants and gesturing to locations on a transparent screen between them. We varied distance and angle to target, distance and angle to viewer, and pointing modality, but no strong correlations emerged in early testing, so we narrowed the set of conditions and hypotheses.
A. Hypotheses

We conducted a factorized experiment over three robot pointing modalities (the head with 2 degrees of freedom (DOF), the arm with 7 DOF, and both together, i.e., head+arm) and two saliency conditions: a blank (non-salient) environment and an environment with several highly and equally visually salient targets. Since our results, particularly for the modality conditions, may be specific to Bandit, we also conducted a similar, but smaller, test with a human performing the pointing gestures, for comparison.

1) Modality

The conditions tested include the head (Figure 2a), the away-from-body, straight-arm point (Figure 2b), the cross-body, bent-arm point (Figure 2c), and the combined head and arm (Figure 2d). We hypothesized that the arm modality would lead to more accurate perception since, when fully extended, it is the most expressive and is easily interpreted as a vector from the robot to the screen. Since Bandit's head does not have moveable eyes, its point of reference is somewhat ambiguous and could lead to pointing error. Our kinematic calculations solved for the midpoint of the eyes to align to the target; this information was not shared with experiment participants.

Additionally, we expected to see an effect between away-from-body, straight-arm points (Figure 2b), which occur on one side of the screen, and the cross-body, bent-arm gesture (Figure 2c) used for the other side, with the bent arm being more difficult to interpret since it is staged in front of the robot's body rather than laterally [22], [23], [24].

Figure 1. The experimental setup, with Bandit indicating a cross-body point to the study participant in the foreground.

A similar effect was seen in human pointing [9], showing that people are capable of estimating pointing vectors accurately from body pose. Finally, we hypothesized that using both modalities together would reduce error relative to a single modality, since participants would have two gestures on which to base their estimate.
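
To make the alignment computation mentioned above concrete (solving for a gaze or pointing direction that passes from an origin, such as the midpoint of the eyes, through the target), a minimal sketch follows. The coordinate frame, function name, and numbers are illustrative assumptions; this is not Bandit's actual control code, which used the closed-form inverse kinematic solution described in the Implementation subsection below.

```python
import math

def pan_tilt_to_target(origin, target):
    """Pan (yaw) and tilt (pitch), in degrees, that aim a ray from `origin`
    (e.g., the midpoint of the robot's eyes) at `target`.
    Assumed frame: x to the robot's left, y toward the screen, z up; meters."""
    dx, dy, dz = (t - o for t, o in zip(target, origin))
    pan = math.degrees(math.atan2(dx, dy))                   # rotation about the vertical axis
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # elevation above the horizontal
    return pan, tilt

# Hypothetical example: eye midpoint 1.2 m above the floor, target 0.5 m to the
# side and 1.0 m high on a screen 0.9 m in front of the robot (made-up numbers).
print(pan_tilt_to_target((0.0, 0.0, 1.2), (0.5, 0.9, 1.0)))
```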

2) Saliency

For the two saliency conditions, we hypothesized that the salient objects would affect people's interpretations of the points. Specifically, we anticipated that people would "snap to" the salient objects, thus reducing error for points whose targets were on or near markers, whereas in the non-salient condition there were no points of reference to bias estimates. We did not expect to see any difference in the performance of each pointing modality when comparing the salient and non-salient conditions.

Figure 2. (a) Bandit pointing with its head; (b) straight-arm; (c) bent-arm; and (d) head+arm.

B. Implementation


In the experiments, the participant is seated facing Bandit at a distance of 6 feet (1.8 meters). The robot and the participant are separated by a transparent acrylic screen measuring 12 feet by 8 feet (3.6 by 2.4 meters); see Figure 1. The screen covers a horizontal field of view from approximately -60 to 60 degrees and a vertical field of view from approximately -45 to 60 degrees. The robot performs a series of deictic gestures, and the participant is asked to estimate their referent locations on the screen. The robot is posed using a closed-form inverse kinematic solution; however, a small firmware "dead-band" in each joint sometimes introduces error in reaching a desired pose. To monitor where the robot actually pointed, we computed forward kinematics using angles from encoder feedback, which we verified were accurate in separate controlled testing. All gestures were static and held indefinitely until the participant estimated a location, after which the robot returned to a home location (looking straight forward with its hands at its sides) before performing the next gesture. Participants were given a laser pointer to mark their estimated location for each gesture. These locations were recorded using a laser rangefinder placed facing upwards at the base of the screen. For each gesture, an experimenter placed a fiducial marker over the indicated location, which was subsequently localized to within approximately 1 cm using the rangefinder data. The entire experiment was controlled via a single Nintendo Wiimote, with which the experimenter could record marked locations and advance the robot to point to the next referent target.

The face-to-face nature of the experiment was chosen intentionally, although other work in gesture perception [8] has tested human deictic pointing accuracy when the pointer and the observer were situated more or less side-by-side, observing a scene. In our work, and in most HRI settings, the robot is facing the participant; side-by-side interaction, in terms of proxemics, is more likely to occur when coordinated motion and interaction are concurrent (e.g., the robot and human walking together while an interaction is taking place [29]). Our design tests the face-to-face scenario that is more applicable to the types of proxemic HRI configurations we have encountered.
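
For illustration, the two geometric steps implied by this setup, recovering the location a pointing ray actually indicates on the screen plane (e.g., from the encoder-based forward-kinematics pose) and measuring the angle subtended at the participant between two screen locations (the error metric used in Section IV), could look like the sketch below. The coordinate frame, variable names, and numbers are assumptions for the example, not values or code from the study.

```python
import numpy as np

def ray_plane_intersection(origin, direction, plane_point, plane_normal):
    """Intersect a pointing ray with the screen plane; returns the hit point or None."""
    direction = direction / np.linalg.norm(direction)
    denom = np.dot(plane_normal, direction)
    if abs(denom) < 1e-9:   # ray parallel to the screen
        return None
    t = np.dot(plane_normal, plane_point - origin) / denom
    if t < 0:               # screen is behind the pointing origin
        return None
    return origin + t * direction

def angular_error_deg(observer, point_a, point_b):
    """Angle (degrees) subtended at the observer between two points on the screen."""
    va, vb = point_a - observer, point_b - observer
    cos_angle = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Illustrative numbers (meters): screen plane between the robot and the participant.
screen_point = np.array([0.0, 0.9, 0.0])        # any point on the screen plane
screen_normal = np.array([0.0, 1.0, 0.0])       # plane normal, facing the robot
gesture_origin = np.array([0.1, 0.0, 1.2])      # e.g., eye midpoint or gripper, from forward kinematics
gesture_direction = np.array([0.2, 1.0, -0.1])  # pointing direction from encoder feedback

actual = ray_plane_intersection(gesture_origin, gesture_direction, screen_point, screen_normal)
observer = np.array([0.0, 1.8, 1.1])            # participant's approximate eye position
desired = np.array([0.5, 0.9, 1.0])             # commanded target on the screen
perceived = np.array([0.6, 0.9, 1.1])           # location marked with the laser pointer

print("perceived-desired error (deg):", angular_error_deg(observer, perceived, desired))
print("perceived-actual error (deg):", angular_error_deg(observer, perceived, actual))
```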

3) Modality

The robot gestured to locations by moving its head, its arm, or both together. A single arm was used, resulting in cross-body, bent-arm gestures for points on one side of the screen and away-from-body, straight-arm gestures for points on the other side. The arm was not modified for pointing; thus, the end-effector was simply the 1-DOF gripper in a closed position. We presume that using a pointer-like object as an end-effector would increase accuracy, since Bandit's hand has several sloped surfaces that make estimation challenging, but our goal was to establish baseline measures for unmodified hardware. All the gestures were static, meaning the robot left the home position, reached the gesture position, and held it indefinitely until the participant chose a point, after which it returned to the home position for the next gesture. This was intended to minimize any possible timing effects by giving participants as long as they needed to estimate a given gesture.
Participants were also asked to turn on the laser pointer only after they had visually selected an estimated (perceived) target location, to prevent them from using the laser pointer to line up the robot's arm and head with the actual target location.

4) Saliency

The screen itself was presented under two visual saliency conditions: one in which it was completely empty (i.e., non-salient) and one in which it was affixed with eight round markers distributed at random (i.e., salient). In the salient case, the markers were all 6 inches (15 cm) in diameter and were identical in shape and color. Experiments were conducted in two phases, reflecting the two saliency conditions. In the salient condition, the robot's gestures included 60 points toward salient targets and 60 points chosen to be on the screen, but not necessarily at a salient target. In the non-salient condition, 74 of the points within the bounds of the screen were chosen pseudo-randomly and the remaining 36 were chosen to sample a set of 4 calibration locations with each pointing modality. Randomization was performed such that each participant in a given condition saw the same set of points. The calibration points were used to assess the consistency and normality of the error in the robot's actual pointing and in the participants' perception, to determine whether a between-subjects comparison was possible.
All three pointing modalities (head-only, arm-only, and head+arm) were used in both the salient and non-salient cases.

5) Human Pointing

The human-human pointing condition was conducted by replacing Bandit with an experimenter, with the intention of anecdotally comparing robot pointing with typical human pointing in the same scenario. Since people point by aligning a chosen end-effector with the referent target using their dominant eye [8], conducting the experiment with a human pointer introduces the confound that people cannot point with the arm-only modality. We also found the head-only modality difficult to measure accurately; for these reasons, only the head+arm modality was tested, which consequently also conveys eye gaze. These experiments were conducted by replacing the robot with an experimenter who held a Nintendo Wiimote in his or her non-pointing hand. Two different vibration patterns signaled whether to point to a target location or to a location selected arbitrarily. The experimenter pointed with a clenched-fist grip similar to Bandit's, while holding a laser pointer concealed in the palm of the hand, and held the pose. A second experimenter then marked both the participant's and the experimenter's points, as before. As with the robot, the human pointer only pointed with the right hand.

6) Surveys

In addition to the pointing task, we also administered a survey asking participants to estimate their average error with respect to modality and location on the screen, and to rate each modality on a Likert scale in terms of preference. The surveys also collected background information such as handedness and level of prior experience with robots.


IV. RESULTS

A. Participants and Data

A total of 40 runs of the experiment were conducted as described, with 20 participants (12 female, 8 male) in the non-salient condition and 20 participants (11 female, 9 male) in the salient condition. In total, around 4500 points were estimated and recorded. The conditions were close to equally weighted, with the exception of the non-salient arm-only and head+arm conditions; this was done initially to allow for comparison of the cross-body versus away-from-body arm gestures. The number of points collected for each condition is presented in Table 1. Participants were recruited from on-campus sources and all were undergraduate or graduate students at USC from various majors. The participants were roughly age- and gender-matched, with an average age of 20.

The data collected for each run included the desired target on the screen (i.e., the location the robot should have pointed to), the actual target on the screen for each modality (i.e., the location the robot actually pointed to), and the perceived point as indicated by the participant and recorded by the laser rangefinder. We also captured timing data for each point, as well as video of the sessions taken from a camera mounted behind and to the side of the robot.

B. Perceived Error Analysis

We conducted a two-way analysis of variance (Type III SS ANOVA) of various angular error measures with modality and saliency as the independent factors.
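
The paper does not include analysis code; a minimal sketch of how such a two-way Type III ANOVA and a Tukey HSD post-hoc test could be run with the statsmodels library, assuming hypothetical column names for the per-trial data, is shown below.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-trial table: one row per gesture with columns
# angular_error_deg, modality in {head, arm, both}, saliency in {salient, nonsalient}.
df = pd.read_csv("pointing_trials.csv")

# Two-way ANOVA with interaction; Type III sums of squares call for
# sum-to-zero contrast coding of the categorical factors.
model = smf.ols("angular_error_deg ~ C(modality, Sum) * C(saliency, Sum)", data=df).fit()
print(sm.stats.anova_lm(model, typ=3))

# Post-hoc pairwise comparison of the modality levels (Tukey HSD).
print(pairwise_tukeyhsd(df["angular_error_deg"], df["modality"]).summary())
```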

Both factors were found to have significant effects on the angular error between perceived and desired target points, as well as on the error between perceived and actual target points (Table 2). Additionally, the interaction effects between the modality and saliency factors were found not to be significant. Mean angular error, computed from the perspective of the person, and confidence intervals are shown in Figures 3-5. We used angular error as a metric to effectively normalize for different distances to target. For comparison purposes, human perceptual error when estimating human pointing gestures (arm or eye gaze) has been measured to be approximately 2-3 degrees for people up to 2.7 meters apart [8], [30]. We conducted post-hoc analysis using Tukey's honestly significant difference (HSD) test, which revealed that mean error tends to be on average 1 degree larger for arm points with p


directed between the center and the periphery. Overall, the error in estimating human-produced points appears to have a similar profile to that of the robot-produced points; however, more investigation is necessary.

D. Survey Responses

In the responses to the survey, participants in the non-salient condition estimated that their points were within an average of 28 centimeters (11 inches) of the target; this is very close to the mean error of 27 centimeters we found in practice. There was no significant difference between participants' estimated error when comparing across the two conditions. Pointing with the head-only and with the head+arm modalities was preferred by the majority of the participants, with only 6 (15%) stating a preference for the arm-only modality. When asked whether there was a noticeable difference between straight-arm and bent-arm points, 68% said there was, with the remainder not seeing a difference. Fourteen of the 20 participants in the salient condition said that the markers would have an effect on their estimate of the referent target.

V. DISCUSSION
A. Visual Saliency

The mean error, as computed using the perceived point and the desired target point, tells us how close to a desired target (either a randomly chosen one in the non-salient case or one of the markers in the salient case) the robot was actually able to indicate. The performance of the head-only, arm-only, and head+arm modalities in the salient condition is improved by approximately 1 degree. This suggests that the snap-to-target effect that we expected to see when salient objects were introduced is modest, resulting in a best-case improvement of approximately 1 degree. This is also seen when we consider the mean perceived-actual error, which is slightly lower for nearly every condition. This suggests that participants estimate the referent to be closer to the actual point the robot is physically indicating than to the nearest salient object. This could be a useful property, because it allows us to consider pointing without having to assess scene saliency beforehand. That is, if the referential target of a point has not been specified a priori, through some other means such as verbal communication or previous activity, people tend to evaluate the point in an ad hoc manner by taking a guess. When disambiguating referents, if there are unknown salient objects in the environment, their effects on the perception of a given gesture can be expected to be small enough in most cases that a precise point to the actual target should suffice to communicate the referent.

B. Modality

As we hypothesized, the modalities did result in different pointing accuracy profiles.

Figure 3. Mean angular pointing error (actual-desired), in degrees, by pointing modality (head, arm, both) and saliency condition.
Figure 4. Mean angular error between perceived and actual targets, by pointing modality and saliency condition.
Figure 5. Mean angular error between perceived and desired targets, by pointing modality and saliency condition.

When considering modalities, pointing with the head+arm does appear to perform appreciably better than either the arm-only or the head-only modality in most cases.


Figure 6. Mean angular perceived-actual error for cross-body versus straight-arm gestures, and mean perceived-actual error by saliency condition with a human pointer.

One possible explanation is that head+arm pointing more closely emulates typical human pointing, in which people tend to align an end-effector with their dominant eye [8]; another is that multiple modalities provide more diverse cues indicating the referential target, resulting in better priming of the viewer to interpret the gesture. The poor performance of the arm in the salient condition was somewhat unexpected. This might be due to its higher actual error compared to the head. Another source of the error could be the use of the cross-body arm gesture, which, while equally weighted, resulted in nearly twice the perceptual error compared to the away-from-body arm. This might be a result of the reduced length of the arm, which forces people to estimate the vector based on only the forearm, versus the entire arm as in the away-from-body case. Another explanation is that the gesture is staged against the body, that is, with minimal silhouette, and is thus more difficult to see. In either case, roughly one-third of the participants did not notice a difference between the arm gestures, while their performance was, in fact, affected.
This illustrates the impact that gesture and embodiment design can have on interpretation, and underscores the need to validate gestural meaning with people.

When considering the horizontal and vertical target position analyses, we see that people are best at estimating points directly between themselves and the robot. Performance then drops off when the target is located laterally, above, or below. This effect could be due to a field-of-view restriction preventing the viewer from seeing both the robot's gesture and the target at the same time in high-acuity, foveal vision. Estimating these points then requires the viewer to saccade the head between the two points of interest. We believe the slight improvement at the far periphery for some of the modalities is due to the fact that we informed participants that the points would be on the screen, thus creating a bound for points near the screen edges.

C. Human Pointing

The results of our smaller-scale investigation of human pointing did find that the salient condition resulted in approximately 1.5 degrees less error than the non-salient condition, which is consistent with our finding using the robot pointer. Also, the 2-degree perceptual accuracy that we found when testing a human pointer seems to agree with prior studies of human pointing in the literature. It is also worth noting that, although the deictic pointing performance of the robot is several times worse than what we saw in the human experiment or would expect from the literature, we can use the estimate of our resolving power (i.e., the minimum angle between referents that we could hope to convey) to inform controller design and ensure that the robot repositions itself or gets close enough to prevent these effects. The salient condition also resulted in a 16% increase (approximately 1 second) in the time needed to estimate the gesture. This is intuitive, as the participants were presented with more stimuli in the form of the salient objects and thus took some extra time to ground the point, possibly checking first to see whether it was coincident with any objects. This information could be useful in developing methods for effective timing control.
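
The resolving-power estimate above lends itself to a simple geometric check that a pointing controller could perform before gesturing: given the smallest angle a gesture can reliably convey to the observer and the observer's distance to the candidate referents, decide whether the referents are spaced widely enough to disambiguate, and reposition or rely on another channel if they are not. The helper below is a hedged sketch with made-up numbers; the function names and values are not from the study.

```python
import math

def min_resolvable_separation(distance_m, resolving_power_deg):
    """Smallest separation between two referents that remains distinguishable
    at `distance_m` from the observer, given an angular resolving power."""
    half_angle = math.radians(resolving_power_deg) / 2.0
    return 2.0 * distance_m * math.tan(half_angle)

def max_viewing_distance(separation_m, resolving_power_deg):
    """Farthest observer-to-referent distance at which two referents
    `separation_m` apart still subtend more than the resolving power."""
    half_angle = math.radians(resolving_power_deg) / 2.0
    return (separation_m / 2.0) / math.tan(half_angle)

# Hypothetical example: with a 5-degree resolving power, referents 2 m away
# would need to be at least ~0.17 m apart, and referents 0.3 m apart are only
# resolvable out to roughly 3.4 m.
print(min_resolvable_separation(2.0, 5.0))
print(max_viewing_distance(0.3, 5.0))
```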

Figure 7. Mean time, in seconds, from the start of a pose to the marking of the estimated point, by pointing modality and saliency condition.

VI. FUTURE WORK

One obvious next step is to conduct a similar experiment with a robot of a different embodiment, to evaluate whether the same general conclusions hold or whether they are tightly coupled to the specific appearance of Bandit. Since the project was developed using Willow Garage's Robot Operating System (ROS), substituting different robots into the experiment will involve minimal changes to the codebase. We are currently developing a deictic gesture (pointing) package for the PR2 robot and other robots that have a URDF specification, and we plan on using it to run the experiment with the PR2.
Formal studies of other relevant variables (such as angle to target and timing), as well as comparable studies with human pointing (our human-pointing results were only anecdotal in nature), are also necessary to develop a better understanding of human perception of robot deictic gestures.

We also plan to analyze head movements as participants perform the task.


We observed a saccade effect in which people transitioned back and forth between an estimated point and the robot before finally glancing at the experimenter to mark the point. This is similar in nature to the regressive eye movements used to measure gesture clarity in [8].

We are seeking ways to automatically measure estimated points, thereby allowing us to remove the potentially distracting experimenter from the room. Finally, we plan on using the presented data to construct a parameterized error model to allow Bandit to perform effective deixis to objects in a mapped environment.

VII. CONCLUSION

In this paper, we presented the results of a study of human perception of robot gestures, intended to test whether visual saliency and embodied pointing modality have an effect on the performance of human referent resolution. Our results suggest that environmental saliency, when deictic gesture alone is employed to indicate a target, results in only a modest bias effect. We also demonstrated that pointing with two combined and synchronized modalities, such as head+arm, consistently outperforms either one used individually. Additionally, we found that the physical instantiation of the gesture (i.e., how it is presented to the observer) can have drastic effects on perceptual accuracy, as noted in comparing bent-arm and straight-arm performance.

ACKNOWLEDGMENT

The authors would like to thank Mary Ermitanio, Hieu Minh Nguyen, and Karie Lau for their help with data collection.

REFERENCES

[1] B. Scassellati, "Investigating models of social development using a humanoid robot," vol. 4, pp. 2704-2709, Jul. 2003.
[2] A. Brooks and C. Breazeal, "Working with robots and objects: Revisiting deictic reference for achieving spatial common ground," in Proc. of the 1st Conf. on Human-Robot Interaction, p. 304, ACM, 2006.
Merlin,“<str<strong>on</strong>g>Effects</str<strong>on</strong>g> <str<strong>on</strong>g>of</str<strong>on</strong>g> n<strong>on</strong>verbal communicati<strong>on</strong> <strong>on</strong> efficiency and robustness inhuman-robot teamwork, ieee/rsj int,” in IEEE/RSJ Int. C<strong>on</strong>f. <strong>on</strong>Intelligent Robots and Systems (IROS2005), pp. 383–389, 2005.[4] C. Sidner, C. Kidd, C. Lee, and N. Lesh, “Where to look: a study <str<strong>on</strong>g>of</str<strong>on</strong>g>human-robot engagement,” in Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> 9th Internati<strong>on</strong>alC<strong>on</strong>ference <strong>on</strong> Intelligent User Interfaces, p. 84, ACM, 2004.[5] B. Mutlu, T. Shiwa, T. Kanda, H. Ishiguro, and N. Hagita, “Footing inhuman-robot c<strong>on</strong>versati<strong>on</strong>s: how robots might shape participant rolesusing gaze cues,” in Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> 4th ACM/IEEE Internati<strong>on</strong>alC<strong>on</strong>ference <strong>on</strong> Human Robot Interacti<strong>on</strong>, (HRI’09), 2009.[6] A. Ozyurek, “Do speakers design <str<strong>on</strong>g>the</str<strong>on</strong>g>ir co-speech gestures for <str<strong>on</strong>g>the</str<strong>on</strong>g>iraddresees? <str<strong>on</strong>g>the</str<strong>on</strong>g> effects <str<strong>on</strong>g>of</str<strong>on</strong>g> addressee locati<strong>on</strong> <strong>on</strong> representati<strong>on</strong>algestures,” Journal <str<strong>on</strong>g>of</str<strong>on</strong>g> Memory and Language, vol. 46, no. 4, pp. 688–704, 2002.[7] N. Nishitani, M. Schurmann, K. Amunts, and R. Hari, “Broca’sregi<strong>on</strong>: From acti<strong>on</strong> to language,” Physiology, vol. 20, no. 1, p. 60,2005.[8] A. Bangerter, “Accuracy in detecting referents <str<strong>on</strong>g>of</str<strong>on</strong>g> pointing gesturesunaccompanied by language,” <strong>Gesture</strong>, vol. 6, no. 1, pp. 85–102,2006.[9] A. Bangerter, “Using pointing and describing to achieve joint focus <str<strong>on</strong>g>of</str<strong>on</strong>g>attenti<strong>on</strong> in dialogue,” Psychological Sci., vol. 15, no. 6, p. 415, 2004.[10] M. Louwerse and A. Bangerter, “Focusing attenti<strong>on</strong> with deicticgestures and linguistic expressi<strong>on</strong>s,” in Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> 27thAnnual Meeting <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> Cognitive Science Society, 2005.[11] R. Mayberry and J. Jaques, <strong>Gesture</strong> producti<strong>on</strong> during stutteredspeech: Insights into <str<strong>on</strong>g>the</str<strong>on</strong>g> nature <str<strong>on</strong>g>of</str<strong>on</strong>g> gesture-speech integrati<strong>on</strong>, ch. 10,pp. 199–214. Cambridge University Press, 2000.[12] S. Kelly, A. Ozurek, and E. Maris, “Two sides <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> same coin:Speech and gesture mutually interact to enhance comprehensi<strong>on</strong>,”Psychological Science, vol. 21, no. 2, pp. 260–267, 2009.[13] R. Cipolla and N. 
Hollinghurst, “Human-robot interface by pointingwith uncalibrated stereo visi<strong>on</strong>,” Image and Visi<strong>on</strong> Computing,vol. 14, no. 3, pp. 171–178, 1996.[14] D. Kortenkamp, E. Huber, and R. B<strong>on</strong>asso, “Recognizing andinterpreting gestures <strong>on</strong> a mobile robot,” in Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g>Nati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> Artificial Intelligence, pp. 915–921, 1996.[15] K. Nickel and R. Stiefelhagen, “<str<strong>on</strong>g>Visual</str<strong>on</strong>g> recogniti<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> pointinggestures for human-robot interacti<strong>on</strong>,” Image and Visi<strong>on</strong> Computing,vol. 25, no. 12, pp. 1875–1884, 2007.[16] P. Pook and D. Ballard, “<strong>Deictic</strong> human/robot interacti<strong>on</strong>,” Roboticsand Aut<strong>on</strong>omous Systems, vol. 18, no. 1-2, pp. 259–269, 1996.[17] N. W<strong>on</strong>g and C. Gutwin, “Where are you pointing? : <str<strong>on</strong>g>the</str<strong>on</strong>g> accuracy <str<strong>on</strong>g>of</str<strong>on</strong>g>deictic pointing in cves,” in Proc. <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> 28th Intl. C<strong>on</strong>ference <strong>on</strong>Human Factors in Computing Systems., pp. 1029–1038, ACM, 2010.[18] M. Marjanovic, B. Scassellati, and M. Williams<strong>on</strong>, “Self-taughtvisually guided pointing for a humanoid robot,” in From Animals toAnimats 4: Proc. <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> 4th Intl, C<strong>on</strong>f, Simulati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> AdaptiveBehavior, pp. 35–44.[19] J. Traft<strong>on</strong>, N. Cassimatis, M. Bugajska, D. Brock, F. Mintz, andA. Schultz, “Enabling effective human–robot interacti<strong>on</strong> usingperspective-taking in robots,” IEEE Transacti<strong>on</strong>s <strong>on</strong> Systems, Man,and Cybernetics, vol. 35, no. 4, pp. 460–470, 2005.[20] O. Sugiyama, T. Kanda, M. Imai, H. Ishiguro, N. Hagita, andY. Anzai, “Humanlike c<strong>on</strong>versati<strong>on</strong> with gestures and verbal cuesbased <strong>on</strong> a three-layer attenti<strong>on</strong>-drawing model,” C<strong>on</strong>necti<strong>on</strong> science,vol. 18, no. 4, pp. 379–402, 2006.[21] Y. Hato, S. Satake, T. Kanda, M. Imai, and N. Hagita, “Pointing tospace: modeling <str<strong>on</strong>g>of</str<strong>on</strong>g> deictic interacti<strong>on</strong> referring to regi<strong>on</strong>s,” in Proc. <str<strong>on</strong>g>of</str<strong>on</strong>g><str<strong>on</strong>g>the</str<strong>on</strong>g> 5th ACM/IEEE Intl. C<strong>on</strong>f. <strong>on</strong> Human-Robot Interacti<strong>on</strong>, pp. 301–308, ACM, 2010.[22] F. Thomas and O. Johnst<strong>on</strong>, The Illusi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> Life: Disney Animati<strong>on</strong>.Hyperi<strong>on</strong>, 1981.[23] J. Lasseter, “Principles <str<strong>on</strong>g>of</str<strong>on</strong>g> traditi<strong>on</strong>al animati<strong>on</strong> applied to 3Dcomputer animati<strong>on</strong>,” in ACM Computer Graphics, vol. 21, no. 4, pp.35-44, July 1987.[24] R. Mead and M.J. 
Matarić, “Automated caricature <str<strong>on</strong>g>of</str<strong>on</strong>g> robotexpressi<strong>on</strong>s in socially assistive human-robot interacti<strong>on</strong>, “ in The 5thACM/IEEE Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> Human-Robot Interacti<strong>on</strong>(HRI2010) Workshop <strong>on</strong> What Do Collaborati<strong>on</strong>s with <str<strong>on</strong>g>the</str<strong>on</strong>g> Arts Haveto Say about HRI?, Osaka, Japan, March 2010.[25] A. Bangerter, “Using pointing and describing to achieve joint focus <str<strong>on</strong>g>of</str<strong>on</strong>g>attenti<strong>on</strong> in dialogue,” Psychological ScI., vol. 15, no. 6, p. 415, 2004.[26] L. Itti, C. Koch, and E. Niebur, “A model <str<strong>on</strong>g>of</str<strong>on</strong>g> saliency-based visualattenti<strong>on</strong> for rapid scene analysis,” IEEE Transacti<strong>on</strong>s <strong>on</strong> PatternAnalysis and Machine Intell., vol. 20, no. 11, pp. 1254–1259, 1998.[27] D. Wal<str<strong>on</strong>g>the</str<strong>on</strong>g>r, L. Itti, M. Riesenhuber, T. Poggio, and C. Koch,“Attenti<strong>on</strong>al selecti<strong>on</strong> for object recogniti<strong>on</strong>: a gentle way,” inBiologically Motivated Comp. Visi<strong>on</strong>, pp. 251–267, Springer, 2010.[28] R. Desim<strong>on</strong>e and J. Duncan, “Neural mechanisms <str<strong>on</strong>g>of</str<strong>on</strong>g> selective visualattenti<strong>on</strong>,” Annual review <str<strong>on</strong>g>of</str<strong>on</strong>g> neuroscience, vol. 18, no. 1, pp. 193–222,1995.[29] G. Butterworth and S. Itakura, “How <str<strong>on</strong>g>the</str<strong>on</strong>g> eyes, head and hand servedefinite reference,” British Journal <str<strong>on</strong>g>of</str<strong>on</strong>g> Developmental Psychology,vol. 18, no. 1, pp. 25–50, 2000.[30] R. Mead, “Space: a social fr<strong>on</strong>tier,” poster presented at <str<strong>on</strong>g>the</str<strong>on</strong>g> Workshop<strong>on</strong> Predictive Models <str<strong>on</strong>g>of</str<strong>on</strong>g> Human Communicati<strong>on</strong> Dynamics, LosAngeles, California, August 2010.[31] F. Delaunay, J. de Greeff, and T. Belpaeme, “A study <str<strong>on</strong>g>of</str<strong>on</strong>g> a retroprojectedrobotic face and its effectiveness for gaze reading byhumans,” in Proc. <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> 5th ACM/IEEE Intl C<strong>on</strong>f. <strong>on</strong> Human-RobotInteracti<strong>on</strong>, pp. 39–44, ACM, 2010.
