
UNIVERSITY OF CALIFORNIA
Santa Barbara

Vision Based Hand Gesture Interfaces for
Wearable Computing and Virtual Environments

A Dissertation submitted in partial satisfaction
of the requirements for the degree of
Doctor of Philosophy
in
Computer Science

by

Mathias Kölsch

Committee in Charge:
Professor Matthew Turk, Chair
Ronald T. Azuma, Ph.D.
Professor Andrew C. Beall
Professor Keith C. Clarke
Professor Tobias Höllerer
Professor Yuan-Fang Wang

September 2004


The Dissertation of Mathias Kölsch is approved:

Ronald T. Azuma, Ph.D.
Professor Andrew C. Beall
Professor Keith C. Clarke
Professor Tobias Höllerer
Professor Yuan-Fang Wang
Professor Matthew Turk, Committee Chairperson

August 2004


Vision Based Hand Gesture Interfaces for
Wearable Computing and Virtual Environments

Copyright © 2004
by
Mathias Kölsch


To my parents Ute and Otto Kölsch,
as a humble sign of my gratitude for their
love, their sacrifices for me, and for
supporting my sometimes inscrutable paths.


Acknowledgements

You would not hold this document in your hands had it not been for the direct and indirect help of many, many people who shared their love, inspiration, time, advice, and many other essentials with me. It would fill countless more pages than the technical content of this dissertation to name all of them and their contributions. However, I would like to thank in particular:

My advisor Matthew for the freedom he allowed me in my studies and the opportunities he presented me with, for his advice in uncountable respects and situations; my committee members Andy for critical thinking, his humor, and the awesome hot chocolate on a windy night; Keith for his excitement for my work; Ron for his time, the many tips for graduate life, and for sharing his thoughts on how (not) to write dissertation acknowledgements; Tobias for his enthusiasm, involvement, and dedication; and Yuan-Fang Wang for his insights, especially at an early stage of my thesis work. Further, I want to thank Professor Tichy, Klaus, and Urs for their inspiring teaching; Jerry for an awesome internship and how it shaped my career in computer science; Juli and the entire Computer Science staff for super-friendly assistance, especially when I was forgetful or burst into the office at the last minute; our equally friendly and helpful computer support staff Richard, Andy, Andreas, and Jeff; Anurag for supporting me through my first years at UCSB; my lab colleagues Arun, Changbo, Haiying, James, Jason, Jeff, Lihua, Rogerio, Ryan B., Ryan G., Seb, Steve, Vineet, and Ya for much help and many a good research discussion.

My deep and sincere gratitude goes to:

My parents and their unconditional love and support; Steffen for putting up with his brother and still baking him delicious cookies; Schorsch, Markus, and Frank for their continued friendship; Kai for being the first computer geek I met and for helping me with my German; Chris for all that she taught me; S. Schwenk for telling me about remote islands; Simone for the motivation to cross the Atlantic; Christoph B. for a diverse time in Karlsruhe; Corneliu for his wit and wisdom; Kris for lots of computer help, many trips, delicious Italian dinners, and our special friendship; Alex, Claudia, and Gunnar for infinite hospitality; Elizabeth and Mark for all I learned from them; Wine Wednesday and its followers for weekly joys since October 1999 and Matthew Allen for taking the reins after three years; Radu and Karin for being my friends; Leonie for her inspiration, unconventionalism, and mostly for being herself; Mike L. for winning the most-tolerant-neighbor prize; Todd and Krista for the comfort of their company; Christa-Lynn for happy camping and all those other Canadian qualities; my hiking and mountaineering friends for allowing me to experience the "freedom of the hills;" Egle for her honesty and musical pleasures; Eric for his incredible energy and motivation; and last but not least, to the East Beach volleyball players for great workouts with such a welcoming group of people!

Thank you! Mathias.


Curriculum Vitæ
Mathias Kölsch

Education

2003  Master of Science in Computer Science, University of California, Santa Barbara.
1997  Bachelor of Science in Informatik (Vordiplom and one year of graduate studies), Universität Karlsruhe, Germany.
1993  Abitur, Kurfürst Ruprecht Gymnasium, Neustadt an der Weinstraße, Germany.

Experience

1999 – 2004  Graduate Research Assistant, University of California, Santa Barbara.
1997 – 2001  Teaching Assistant, University of California, Santa Barbara.
1996 – 1997  Teaching Assistant, Universität Karlsruhe, Germany.

Selected Publications

Mathias Kölsch, Matthew Turk, and Tobias Höllerer: "Vision-Based Interfaces for Mobility," In Proc. IEEE Intl. Conference on Mobile and Ubiquitous Systems (Mobiquitous), August 2004.

Mathias Kölsch and Matthew Turk: "Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration," In Proc. IEEE Workshop on Real-Time Vision for Human-Computer Interaction (at CVPR), July 2004.

Mathias Kölsch and Matthew Turk: "Robust Hand Detection," In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, May 2004.

Mathias Kölsch, Andrew C. Beall, and Matthew Turk: "An Objective Measure for Postural Comfort," In Human Factors and Ergonomics Society's Annual Meeting Notes, October 2003.


Abstract

Vision Based Hand Gesture Interfaces for
Wearable Computing and Virtual Environments

by

Mathias Kölsch

Current user interfaces are unsuited to harness the full power of computers. Mobile devices like cell phones and technologies such as virtual reality demand a richer set of interaction modalities to overcome situational constraints and to fully leverage human expressiveness. Hand gesture recognition lets humans use their most versatile instrument – their hands – in more natural and effective ways than currently possible. While most gesture recognition gear is cumbersome and expensive, gesture recognition with computer vision is non-invasive and more flexible. Yet, it faces difficulties due to the hand's complexity, lighting conditions, background artifacts, and user differences.

The contributions of this dissertation have helped to make computer vision a viable technology to implement hand gesture recognition for user interface purposes. To begin with, we investigated arm postures in front of the human body in order to avoid anthropometrically unfavorable gestures and to establish a "comfort zone" in which humans prefer to operate their hands.

The dissertation's main contribution is "HandVu," a computer vision system that recognizes hand gestures in real time. To achieve this, it was necessary to advance the reliability of hand detection to allow for robust system initialization in most environments and lighting conditions. After initialization, a "Flock of Features" exploits optical flow and color information to track the hand's location despite rapid movements and concurrent finger articulations. Lastly, robust appearance-based recognition of key hand configurations completes HandVu and facilitates input of discrete commands to applications.

We demonstrate the feasibility of computer vision as the sole input modality to a wearable computer, providing "deviceless" interaction capabilities. We also present new and improved interaction techniques in the context of a multimodal interface to a mobile augmented reality system. HandVu allows us to exploit hand gesture capabilities that have previously been untapped, for example, in areas where data gloves are not a viable option.

This dissertation's goal is to contribute to the mosaic of available interface modalities and to widen the human-computer interface channel. Leveraging more of our expressiveness and our physical abilities offers new and advantageous ways to communicate with machines.


Zusammenfassung

Visual Hand Gesture Recognition as an Interface for
Wearable Computers and Virtual Worlds

Mathias Kölsch

With today's prevailing human-machine interfaces, neither human expressiveness nor the potential of modern computers can fully unfold. Wearable computers and unconventional technologies in particular, such as mobile phones and virtual realities, demand a more diverse set of interaction modalities in order to overcome situational constraints and to exploit the full bandwidth of human expressiveness.

Computer-based hand gesture recognition gives people the ability to use their most flexible tool – the hand – in a more natural and effective way than has so far been possible. Most devices with which gesture recognition can be realized, however, are cumbersome to use and expensive. Gesture recognition by means of computer vision, in contrast, is more flexible and non-invasive. It faces, however, the difficulties inherent to vision methods, above all owing to interpersonal variance, the complexity of hand shape and hand motion, varying lighting conditions, and complex backgrounds.

The contributions of this dissertation serve to make computer vision a practicable implementation technology for hand-based user interfaces. First, manipulative arm postures in front of the body were investigated in order to avoid anthropometrically unfavorable hand gestures and to establish a comfortable radius of action.

As the main contribution, "HandVu" was developed: a computer vision system that is able to recognize hand gestures in real time. To achieve this, it was necessary to improve the reliability of hand detection so that robust initialization of the system is guaranteed in the most diverse environments and lighting situations. The "Flock of Features" is a novel method for tracking hands. After initialization, color information and optical flow are used to follow the position of the hand despite fast global movements and simultaneous finger motion. Robust, appearance-based recognition of certain key gestures completes HandVu and thus permits the interpretation of discrete hand commands.

Using the techniques named above, this work further demonstrates ways to employ hand gestures, recognized by computer vision, as the sole input modality for a wearable computer, thereby providing the user with a "deviceless" means of input. In combination with a multimodal interface, new and improved techniques are also presented for interacting with a system for mixed real-virtual reality. HandVu thus grants access to aspects of hand gestures that were previously inaccessible – for example, in areas where data gloves are unsuitable.

The goal of this dissertation is to add further modalities to the mosaic available for user interfaces and thereby to widen the human-machine communication channel. The prospect of being able to employ more of one's expressiveness and physical abilities opens up new and advantageous ways for humans to communicate with machines.


Contents

Acknowledgements
Curriculum Vitæ
Abstract
Zusammenfassung
List of Figures
List of Tables

1 Introduction
  1.1 User interfaces: bottleneck of proliferation
  1.2 Hand gesture interfaces
  1.3 Problem statement
  1.4 Key contributions
  1.5 Dissertation overview

2 Literature Review
  2.1 Hand gestures
  2.2 Human factors and biomechanics
    2.2.1 Postural comfort
  2.3 Computer vision
    2.3.1 Detection
    2.3.2 Tracking
    2.3.3 Recognition
    2.3.4 Skin color
    2.3.5 Shapes and contours
    2.3.6 Motion flow
    2.3.7 Texture and appearance based methods
    2.3.8 Viola-Jones detection method
    2.3.9 Temporal tracking and filtering
    2.3.10 Higher-level models
    2.3.11 Temporal gesture recognition
  2.4 User interfaces and gestures
    2.4.1 Gesture-based user interfaces
    2.4.2 Vision-based interfaces
  2.5 Virtual environments and applications
    2.5.1 Virtual environments and GISs
    2.5.2 Vision-based interfaces for virtual environments
    2.5.3 Mobile interfaces

3 Hand Gestures in the Human Context
  3.1 Postural comfort
    3.1.1 Operational definition of comfort
  3.2 The comfort zone for reaching gestures
    3.2.1 Method and design
    3.2.2 Participants
    3.2.3 Materials and apparatus
    3.2.4 Procedure
    3.2.5 Instructions to participants
    3.2.6 Results
  3.3 Discussion
    3.3.1 The meaning of comfort
    3.3.2 Comfort results and related work
    3.3.3 Miscellaneous
    3.3.4 Open issues
  3.4 Conclusions

4 HandVu: A Computer Vision System for Hand Interfaces
  4.1 Hardware setup
  4.2 Vision system overview
    4.2.1 Core gesture recognition module
    4.2.2 Area-selective exposure control
    4.2.3 Speed and size scalability
    4.2.4 Correction for camera lens distortion
    4.2.5 Application programming interface
    4.2.6 Verbosity overlays
    4.2.7 HandVu WinTk: video pipeline and toolkit
    4.2.8 Recognition state distribution
    4.2.9 The vision conductor configuration file
  4.3 Vision system performance
  4.4 Delimitation

5 Hand Detection
  5.1 Data collection
  5.2 Parallel training with MPI
  5.3 Classification potential of various postures
    5.3.1 Estimation with frequency spectrum analysis
    5.3.2 Predictor accuracy
  5.4 Effect of template resolution
  5.5 Rotational robustness
    5.5.1 Rotation baseline
    5.5.2 Problem: rotational sensitivity
    5.5.3 Rotation bounds for undiminished performance
    5.5.4 Rotation density of training data
    5.5.5 Rotations of other postures
    5.5.6 Discussion
  5.6 A new feature type
    5.6.1 Four Box feature instance generation
    5.6.2 Four Box Same feature type
    5.6.3 Results
  5.7 Fixed color histogram
  5.8 Hand pixel probability maps
  5.9 Learned color distribution
  5.10 Discussion

6 Tracking of Articulated Objects
  6.1 Preliminary studies
  6.2 Flocks of Features
    6.2.1 KLT features and tracking initialization
    6.2.2 Flocking behavior
    6.2.3 Color modality and multi-cue integration
  6.3 Experiments
    6.3.1 Video sequences
  6.4 Results
    6.4.1 Comparison to CamShift
    6.4.2 Parameter optimizations
  6.5 Discussion
  6.6 Two-handed tracking and temporal filters

7 Posture Recognition
  7.1 Fanned detection for classification
  7.2 Data collection for evaluation
  7.3 Results
    7.3.1 Accuracy
    7.3.2 Speed
    7.3.3 Questionnaire
  7.4 Discussion and conclusions

8 Hand Gestures in Application
  8.1 Application overview and contributions
  8.2 The case for external interfaces
  8.3 Hand gesture interaction techniques
  8.4 Feedback
  8.5 Battuta: a wearable GIS
    8.5.1 The gesture interface
    8.5.2 Benefits of HandVu for Battuta
  8.6 Vision-only interface for mobility
    8.6.1 Functionality of the Maintenance Application
    8.6.2 Benefits of HandVu for mobility
  8.7 A multimodal augmented reality interface
    8.7.1 System description
    8.7.2 The Tunnel Tool and other visualizations
    8.7.3 Speech recognition
    8.7.4 Interacting with the visualized invisible
    8.7.5 Multimodal integration
    8.7.6 Benefits of HandVu for powerful interfaces
  8.8 Conclusions

9 The Future in Your Hands
  9.1 Recapitulation
  9.2 Limitations
    9.2.1 Limits of hand gesture interfaces
    9.2.2 Limits of vision
  9.3 Next-generation computer interfaces
  9.4 Conclusions

Bibliography


List of Figures

2.1 The rectangular feature types for Viola-Jones detectors
2.2 The structure of the hand
3.1 Plan view of the experiment setup
3.2 A participant performing the skill task
3.3 Mean and standard deviation of body movement
3.4 Hand-to-shoulder distance over body movement
3.5 The comfort ratings in front of the human body
4.1 Our mobile user interface in action
4.2 Arrangement of the computer vision methods
4.3 A screen capture with verbose output turned on
4.4 The algorithm for area-selective software exposure control
4.5 The vision module in the application context
5.1 Sample areas of the six hand postures
5.2 Mean hand appearances and their Fourier transforms
5.3 ROC curves for monolithic classifiers
5.4 ROC curves for different template resolutions
5.5 ROC curves for various training data rotations
5.6 ROC curves showing the rotational sensitivity
5.7 ROC curves for detection of randomly rotated images
5.8 ROC curves for detection of discrete-rotated images
5.9 ROC curves for the bounds of training with rotated images
5.10 ROC curves for different rotation steps
5.11 The six hand postures and rotated images
5.12 Overall gain of training with rotated images
5.13 The Four Box and Four Box Same feature types
5.14 Example instances of the Four Box Same feature type
5.15 ROC curves for detectors with Four Box Same features
5.16 The probability maps for six hand postures
5.17 The areas for learning the skin color model
6.1 Tracking a hand with an Active Shape Model
6.2 The Flock of Features in action
6.3 The Flock of Features tracking algorithm
6.4 Images taken during tracking
6.5 Results of tracking with Flocks of Features
6.6 Contributors towards the Flock of Features' performance
6.7 Tracking with different numbers of features
6.8 Tracking with different search window sizes
7.1 Fanned arrangement of partial detectors
7.2 The data collection for evaluating the posture recognition
7.3 Sample images for evaluation of the posture recognition
8.1 Interaction area size versus interface device size
8.2 The map display of the Battuta wearable GIS
8.3 Image of pointer-based interaction
8.4 Image of two-handed interaction
8.5 Image of location-independent interaction
8.6 Selecting from many items with registered manipulation
8.7 An overview of the hardware components
8.8 Schematic view of the Tunnel Tool
8.9 A hand-worn trackball


List of Tables

2.1 Measuring various degrees of comfort
5.1 The hand image data collection
6.1 The video sequences and their characteristics
7.1 Summary of the recognition results
8.1 The mapping of input to effect


Chapter 1

Introduction

First we thought the PC was a calculator. Then we found out how to turn numbers into letters with ASCII – we thought it was a typewriter. Then we discovered graphics, and we thought it was a television. With the World Wide Web, we've realized it's a brochure.

Douglas Adams (1952–2001)

What else can computers do for us? How can we tap into more of their vast resources? Which applications are we overlooking because of a limiting view of computers?

1.1 User interfaces: bottleneck of proliferation

Current user interfaces are unsuited to harness the full power of computers. Keyboards and desktop displays – the most prevalent interfaces – offer only a narrow bridge across the barrier between brain and circuitry. 3D applications and wearable computers such as cell phones are in particular need of more natural and effective means of interaction. In fact, the limitations of current human-computer interfaces hinder expansion of the computer's abilities to serve humans in many aspects of life. The goal of this dissertation is to add to the mosaic of available interface modalities and, thereby, to widen the human-computer interface channel. Leveraging more of our expressiveness and our physical abilities in particular offers new and advantageous ways to communicate with machines.

1.2 Hand gesture interfaces

Hand gesture interfaces are computer input methods that utilize the 3-dimensional location of a person's hand, its orientation, and its posture (the finger configuration). These perceptual interfaces promise to blend input and output spaces, allowing for unmodulated interactions between real and virtual objects. Gesture interfaces have many uses, for example, for live manipulation of objects in virtual environments, for automated video transcription of user studies, as an aid for people with special requirements, for contactless interfaces in antiseptic environments, or for character animation through motion capture.
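One way to picture the data such an interface delivers per video frame is a small record holding exactly these three quantities. The sketch below is only an illustration with hypothetical names and units; it is not part of any particular system's API.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Tuple

class Posture(Enum):
    """Illustrative finger configurations; the actual set is application-defined."""
    OPEN_PALM = auto()
    FIST = auto()
    POINTING = auto()
    UNKNOWN = auto()

@dataclass
class HandObservation:
    position: Tuple[float, float, float]     # 3D hand location, e.g. in camera coordinates (meters)
    orientation: Tuple[float, float, float]  # e.g. roll, pitch, yaw of the palm (radians)
    posture: Posture                         # discrete finger configuration
    timestamp: float                         # capture time in seconds
```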

Hand gesture interfaces through means of computer vision are particularly advantageous from a user's point of view because they are untethered, no gloves need be worn, and they can be deployed anywhere a camera can be taken. Also, for example, while the interaction surface of a conventional keyboard or touch screen cannot be larger than the hardware, the input space observable with a camera is not limited by the camera's form factor. This property of "device-external interfaces" is of increasing importance to ever-shrinking portable electronics. However, dealing with person-specific variations, cluttered backgrounds, changing lighting conditions, camera specifics, rapid relative motion, and hard real-time requirements has so far prevented vision-based interfaces (VBIs) from achieving robustness and usability in settings other than the lab environment. Recently, Sony's EyeToy (a full-body VBI for the PlayStation 2) and Canesta's virtual keyboard (type on a keyboard projected onto any flat surface) have proven that consumer-grade applications have become feasible. Yet, they are still very limited in their interaction capabilities, and the keyboard requires customized hardware.

3


Chapter 1. Introduction<br />

Thesis Statement<br />

This dissertation introduces <strong>and</strong> evaluates novel <strong>and</strong> improved computer<br />

vision methods that facilitate robust, user-independent h<strong>and</strong> detection,<br />

h<strong>and</strong> tracking, <strong>and</strong> posture recognition in real-time. It demonstrates<br />

that computer vision is a feasible means to provide h<strong>and</strong> gesture<br />

interfaces. <strong>Wearable</strong> computers <strong>and</strong> non-traditional environments such<br />

as augmented reality benefit from the enriched interaction modalities.<br />

The remainder of this chapter motivates the problem setting further, states our<br />

main contributions to overcome those problems, <strong>and</strong> finally provides an overview<br />

of the dissertation organization.<br />

1.3 Problem statement

Hand gestures are very powerful human interface components, yet their expressiveness has not been leveraged for computer input. No currently available technology is able to recognize these gestures in a convenient and easily accessible manner. Computer vision promises to achieve that, but even its most advanced methods are either too fragile or deliver too coarse-grained output to be of universal use for hand gesture recognition. In particular, methods for hand gesture interfaces must surpass current performance in terms of speed and robustness to achieve interactivity and usability. No reliable hand detection methods exist, and recognition of finger configurations fails due to the high degrees of freedom of the hand, presenting insurmountable complexity if approached with traditional modeling means.

1.4 Key contributions<br />

This dissertation’s first contribution is the establishment of a com<strong>for</strong>table in-<br />

teraction range <strong>for</strong> h<strong>and</strong> gesture interfaces. Since the field of human factors lacked<br />

the means to objectively assess the subtle feeling of postural discom<strong>for</strong>t, a defini-<br />

tion of postural com<strong>for</strong>t had to be introduced that allows precise quantification,<br />

without the need <strong>for</strong> subjective questionnaires. Equipped with this definition, a<br />

user study was conducted to chart the range of com<strong>for</strong>table h<strong>and</strong> positions in<br />

front of a st<strong>and</strong>ing person at about stomach height, an important consideration<br />

<strong>for</strong> free-h<strong>and</strong> gesture interfaces. Chapter 3 covers com<strong>for</strong>t in greater detail <strong>and</strong><br />

explains how these results contribute towards the remainder of this dissertation.<br />

Next, "HandVu" (pronounced "hand-view") was built, the first vision-based interface (VBI) that allows real-time detection of the unmarked hand, hand tracking, and posture recognition while being mostly invariant to changes in background, lighting, camera, and person. The very fast texture-based hand detection and posture classification methods operate on unconstrained monocular grey-level images. Together with color-based verification, the detector has a very low false positive rate of a few false matches per hour of live processing, indoors and outdoors.
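HandVu's detector is not reproduced here, but the two-stage idea can be sketched with off-the-shelf pieces: a boosted cascade classifier (in the spirit of Viola-Jones) scans the grey-level image, and candidate windows are accepted only if enough of their pixels pass a coarse skin-color test. In the sketch below the cascade file name, the HSV thresholds, and the 40% skin criterion are assumptions for illustration only, not values from this dissertation.

```python
import cv2
import numpy as np

# Hypothetical cascade trained on a hand posture (not shipped with OpenCV).
cascade = cv2.CascadeClassifier("hand_posture_cascade.xml")

def skin_fraction(bgr_roi, lo=(0, 40, 60), hi=(25, 180, 255)):
    """Fraction of ROI pixels falling into a coarse HSV skin range (assumed thresholds)."""
    hsv = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lo, np.uint8), np.array(hi, np.uint8))
    return float(np.count_nonzero(mask)) / mask.size

def detect_hand(frame_bgr, min_skin=0.4):
    """Texture-based detection followed by color-based verification."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4):
        if skin_fraction(frame_bgr[y:y+h, x:x+w]) >= min_skin:
            return (x, y, w, h)  # accept only texture hits that also look skin-colored
    return None
```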

The "Flock of Features," a novel hand tracking method, also employs multiple image cues – globally constrained optical flow in combination with a dynamic skin color probability function – to surpass the performance of single-modality trackers such as CamShift on deformable objects. The method is robust to most distractions: it operates indoors and outdoors, with different people, and despite dynamic backgrounds and camera motion. It requires no object model and thus might be applicable to tracking other very deformable and articulated objects such as human bodies. Its computation time of 2-18 ms per 720x480 RGB frame leaves room for the subsequent hand posture recognition.
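As a rough illustration of the multi-cue idea (not HandVu's implementation), the sketch below tracks a small flock of features with OpenCV's pyramidal Lucas-Kanade routine and relocates any feature that is lost or strays too far from the flock's median onto a nearby skin-colored pixel. The window size, spread threshold, and externally supplied skin mask are assumptions, and the pairwise minimum-distance constraint and the learned color model of Chapter 6 are omitted.

```python
import cv2
import numpy as np

LK = dict(winSize=(15, 15), maxLevel=2)  # pyramidal Lucas-Kanade parameters (placeholders)

def track_flock(prev_gray, gray, pts, skin_mask, max_spread=60.0):
    """One Flock-of-Features-style update (illustrative sketch, not HandVu's code).

    pts: (N, 2) float32 feature positions; skin_mask: binary image of skin-colored pixels.
    """
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, pts.reshape(-1, 1, 2), None, **LK)
    new_pts, ok = new_pts.reshape(-1, 2), status.ravel() == 1
    median = np.median(new_pts[ok] if ok.any() else new_pts, axis=0)

    # Candidate relocation targets: skin pixels within the allowed spread of the median.
    ys, xs = np.nonzero(skin_mask)
    cand = np.stack([xs, ys], axis=1).astype(np.float32) if len(xs) else np.empty((0, 2))
    near = cand[np.linalg.norm(cand - median, axis=1) < max_spread]

    for i in range(len(new_pts)):
        strayed = np.linalg.norm(new_pts[i] - median) > max_spread
        if (not ok[i] or strayed) and len(near):
            # A lost or straying feature is moved back onto skin near the flock's
            # median, which keeps the flock loosely clustered over the hand.
            new_pts[i] = near[np.random.randint(len(near))]
    return new_pts.astype(np.float32), median
```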

These results for detection and tracking provide an important step towards easily deployable and robust vision-based hand gesture interfaces. Further, a qualitative measure is presented that amounts to an a priori estimate of "detectability" with Viola-Jones-like detectors, alleviating the need for compute-intensive training by yielding results immediately instead.

HandVu was successfully demonstrated controlling two wearable computer systems. The first system features a head-worn camera and display, demonstrating the VBI's capability to act as the sole input modality. The other is a registered Augmented Reality system that adds speech recognition and a custom control for multimodal input, permitting efficient operation of its complex functionality. The VBI was used by different people and with both a stationary and a head-worn camera. HandVu has evolved into a software toolkit that allows out-of-the-box use of vision-based hand gesture recognition as an interface modality.

This vision-based interface offers novel and unencumbered data acquisition capabilities that are important for new functionalities, new devices, and new ways to interact with computers. The contributions of this dissertation have shown that VBIs are ready to be taken out of the lab into real applications. Further progress on topics that bring together the fields of computer vision, human-computer interaction, and graphics is expected to harbor many opportunities for the computer of the future.

1.5 Dissertation overview

The remainder of this document is organized as follows. First, the literature related to the dissertation is discussed in the next chapter. In Chapter 3, we investigate the human factors and biomechanical sides of hand gestures, in particular the physical ranges for hand gestures.

Computer vision as an enabling technology for hand gesture recognition is covered in Chapters 4 through 7. First, the HandVu system for interface implementation is introduced in Chapter 4. It presents the HandVu library's API and how applications can make use of the out-of-the-box gesture recognition toolkit. The functional division into the three subsequent chapters is motivated in that chapter as well. Our robust hand detection method is discussed in Chapter 5. The fast "Flock of Features" tracking method is introduced and evaluated in Chapter 6. As the last purely vision-concerned part, Chapter 7 explains our hand posture classifier and presents evaluation results.

Our experiences with putting HandVu to use are detailed in Chapter 8. The vision-based hand gesture interface controlled three different applications, two of them wearable computer systems. The last chapter, Chapter 9, relates our work to a larger context, provides an outlook on its potential and implications, and concludes the dissertation.


Chapter 2

Literature Review

This chapter discusses the most relevant research and publications pertaining to this dissertation. The path that this literature review takes follows user interface (UI) construction from theory to practice. First, a definition and some possible classifications of gestures are given. Next, research is described that concerns biomechanical and human factors issues of hand gestures. The lion's share of this chapter deals with computer vision (CV) methods and their suitability for implementing user interfaces. Last but not least, two promising application areas for hand gesture-based user interfaces are covered: the non-traditional realms of augmented environments as well as mobile and wearable computing.

2.1 Hand gestures

In its most general meaning, a gesture is any physical configuration of the body, whether the person is aware of it or not, whether performed with the entire body or just the facial muscles, whether static in nature or involving a movement. In the computer vision literature, gesture usually refers to a continuous, dynamic motion, whereas a posture is a static configuration. In this dissertation, the term gesture pertains to static and dynamic, continuous hand gestures. Only when discussing computer vision methods will 'posture' be used to explicitly address the static aspects of hand gestures. For example, posture classification refers to the estimation of finger configurations, that is, the ability to distinguish a fist from a flat palm and so on.

Kendon [83] describes different kinds of gestures along what has become known as Kendon's Gesture Continuum:

gesticulation → language-like gestures → pantomimes → emblems → sign languages

From left to right, "language-like properties" increase and the presence of speech decreases. McNeill [115] proposes a typology of strongly speech-related "gesticulation" gestures based on semiotic properties, that is, properties that pertain to the symbolics of the gesture. This typology is usually preferred over other classification schemes owing to its practical value and close relation to the accompanying speech. Language and gestures are also the topic of a comprehensive McNeill-edited publication [116]. A classification from Quek [136] focuses on non-speech-related gestures from Kendon's Gesture Continuum and adds unintentional movements and manipulative hand motions aside from communicative ones. Cadoz [23] distinguishes further between ergodic gestures (gestural manipulation; "ergodic" is a mathematical term meaning space-filling) and epistemic gestures (tactile exploration).

According to McNeill [115], gestures are composed of three stages, namely pre-stroke, stroke, and post-stroke. The pre-stroke prepares the movement of the hand. The hand waits in this ready state until the speech arrives at the point when the stroke is to be delivered. The stroke is often characterized by a peak in the hand's velocity and distance from the body. The hand is retracted during the post-stroke, but this phase is frequently omitted or strongly influenced by the following gesture.
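A recognition system that wants to exploit this three-phase structure could, for example, segment a hand trajectory by its speed profile, treating the fastest contiguous portion as the stroke. The toy sketch below only illustrates that idea; the threshold and the single-stroke assumption are arbitrary choices, not taken from this dissertation.

```python
import numpy as np

def label_phases(positions, fps=30.0, speed_thresh=0.5):
    """Label each frame of a hand trajectory as pre-stroke, stroke, or post-stroke.

    positions: (N, 2) array of hand centers per frame; speed_thresh is in
    position units per second (both the rule and the threshold are illustrative).
    """
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1) * fps
    speed = np.concatenate([[0.0], speed])
    fast = speed > speed_thresh
    if not fast.any():
        return ["pre-stroke"] * len(positions)  # no stroke detected
    first = int(np.argmax(fast))                         # first fast frame
    last = len(fast) - 1 - int(np.argmax(fast[::-1]))    # last fast frame
    return (["pre-stroke"] * first
            + ["stroke"] * (last - first + 1)
            + ["post-stroke"] * (len(fast) - last - 1))
```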

2.2 Human factors and biomechanics

The field of human factors is the study of the human body's physical and mental properties and abilities. Biomechanics is a subfield that is exclusively concerned with the body as a kinematic, mechanical structure. Research in these areas has established guidelines for occupational motions and postures that humans can assume without harm. Good introductions to workspace design can be found in several comprehensive reference manuals [25, 148, 185].

A number of researchers have investigated the range of arm and hand motions in great detail: Chaffin [24] devised a method for measuring fatigue with physiological indicators, and Wiker et al. [183] surveyed arm movement capabilities. Grandjean [51] states that the best grasping distance is about two thirds of the maximum reaching distance. The best reaching height with an upright upper body is around stomach or elbow height (with hanging arms). In addition to the work considering reachability and comfort, other topics relevant to evaluating space as an interaction area include the temporal characteristics of reaching movements, as well as the precision and accuracy of such movements. Fitts' Law [44], and more recent work such as Fikes [43], show that a good interaction range is also characterized by the time it takes to reach the various locations in this area. Specific to computer input of spatial data is a survey by Hinckley et al. [58].
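For reference, the Shannon formulation of Fitts' Law commonly used in HCI predicts movement time from target distance and width with two empirically fitted constants; this formula is standard background rather than a result of the works cited above.

```latex
% Fitts' Law (Shannon formulation): predicted movement time MT for acquiring
% a target of width W at distance D; a and b are fitted per device and task.
MT = a + b \,\log_2\!\left(\frac{D}{W} + 1\right)
```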

2.2.1 Postural comfort

Discomfort and fatigue are two closely related feelings. Discomfort is generally assumed to be a subjective quantity, whereas fatigue can be measured objectively (see below). Bhatnagar et al. [9], Karwowski et al. [78], and Liao et al. [104] showed that the frequency of non-work-related posture shifts is strongly related to the perceived discomfort. They measured discomfort exclusively with questionnaires given to the participants during and after the study. The standard format for whole-body fatigue assessment is Borg's Rate of Perceived Exertion, or RPE scale [12]. Borg found a logarithmic relationship between the power achieved on a treadmill and the participant's subjective exertion. The RPE is obtained with a questionnaire on which participants mark their rate of exertion on a scale of 6-20, with some numbers having a descriptive name, ranging from 6, "least effort," over 13, "somewhat hard," to 20, "maximal effort." The RPE scale is sometimes used to assess discomfort as well, especially since the boundary between discomfort and fatigue is not clearly defined. For assessment of discomfort of specific body parts, the Body Part Discomfort (BPD) scale by Corlett et al. [31] is used. With BPD, participants are asked to indicate on a body chart the body parts that are experiencing the highest degree of discomfort and to rate the severity of that discomfort. These body parts are then covered up on the chart and the next most uncomfortable body parts have to be identified, and so forth, until no more body parts experience discomfort. Drury et al. [41] added BPD Frequency and BPD Severity to the evaluation method for these questionnaires. BPD Frequency describes the total number of body parts with some discomfort, and BPD Severity measures the mean severity of discomfort over only those body parts with discomfort.
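To make the two derived measures concrete, the small helper below computes BPD Frequency and BPD Severity from a dictionary of questionnaire ratings; the data format and example numbers are invented for illustration and are not taken from [41].

```python
def bpd_frequency_and_severity(ratings):
    """Compute BPD Frequency and BPD Severity from questionnaire ratings.

    ratings: dict mapping body part name -> discomfort severity (0 = no discomfort).
    Frequency counts body parts with any discomfort; Severity averages the
    ratings over those parts only (illustrative data format).
    """
    uncomfortable = [s for s in ratings.values() if s > 0]
    frequency = len(uncomfortable)
    severity = sum(uncomfortable) / frequency if frequency else 0.0
    return frequency, severity

# Example: three of the four reported body parts show discomfort.
freq, sev = bpd_frequency_and_severity(
    {"neck": 2, "right shoulder": 3, "lower back": 1, "left wrist": 0})
# freq == 3, sev == 2.0
```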

Fatigue can be objectively observed by degraded performance (speed and accuracy) in task execution. Fatigue might also be measurable directly by means of superficial sensors for voltage potentials. Electromyographic (EMG) amplitudes and frequencies reflect the nervous contraction signals to the muscles. Chaffin [24] suggests that localized muscular fatigue (LMF), measured by a frequency shift towards lower frequency bands, is correlated with experienced discomfort. Other studies, however, found little or even inverse correlation between objective EMG frequency shifts and subjective fatigue [122].

Table 2.1: Measuring various degrees of comfort. The table shows which feelings along the comfort dimension can be measured, and how. BPD is Body Part Discomfort, RPE the Rate of Perceived Exertion, and EMG are electromyographic signals. Our work is detailed in Section 3.1.

  measure     | comfort               | discomfort (unaware) | discomfort (aware) | fatigue
  direct      | no                    | no                   | BPD (RPE)          | EMG
  indirect    | no                    | our work             | posture shifts     | RPE, performance
  objective   | no                    | our work             | posture shifts     | EMG, performance
  subjective  | absence of discomfort |                      | BPD (RPE)          | RPE

The characteristics of the methods to assess feelings along the "comfort dimension" are summarized in Table 2.1. For example, EMG is a direct and objective measure for fatigue, and BPD is a subjective method for direct ascertainment of aware discomfort. "Direct" refers to measurement methods that can detect the feeling as such, while "indirect" methods have to rely on measuring a secondary effect caused by the feeling. From this chart it is apparent that no traditional method can assess very subtle feelings of discomfort, particularly not with objective measurements. Our work, detailed in Chapter 3, tries to fill this void.


Kee [82] built a human-physics model in order to compute an "isocomfort workspace," surfaces of uniform discomfort based on perceived discomfort. With Chung et al. [26], he had earlier collected subjective discomfort data which can now be used for purely observational discomfort estimation, without the need for additional questionnaires. For this method, the combination of discomfort scores for the joint angles of various limbs results in a full-body discomfort estimation, hinting at postural workload.

2.3 Computer vision

There is an extensive body of related computer vision research which could fill many books. Here, we summarize the major works that could fit the bill for real-time user interface operation through hand gesture recognition in a fairly unconstrained environment. For an independent overview, the reader is referred to a 1998 paper by Freeman et al. that surveys "computer vision for interactive computer graphics" [47] and to an assessment of the state of the art by Turk [177]. The next three sections (2.3.1–2.3.3) mention seminal works in their functional contexts, while the remaining sections (2.3.4–2.3.11) cover specific approaches in more detail.


Three common tasks for computer vision processing are 1) the detection of the presence of an object in an image, 2) the spatial tracking of a once-acquired object over time, and 3) recognition of one of many object types, that is, classification of the observation in the image into one of many classes. The vision system described in Chapter 4 is organized in these three stages, and this review starts with a brief overview of them.

2.3.1 Detection

The human visual system has the amazing ability to detect hands in almost any configuration and situation, and possibly a single "hand neuron" is responsible for recording and signaling such an event, as Desimone et al. showed in early neuroscientific experiments [35]. Computer vision research has not yet achieved this goal. However, it is crucial that a hand that is supposed to function as an input mechanism to the computer is robustly and reliably detected in front of arbitrary backgrounds, because all further stages and functionality depend on it. Detection of artificial objects, such as the colored or blinking sticks of Wilson and Shafer [184], can achieve very high detection rates together with low false positive rates. The same is not true for faces, and even less so for hands. Face detection has attracted a great amount of interest, and many methods relying on shape, texture, and/or temporal information have been thoroughly investigated over the years. The interested reader is referred to two surveys, by Yang, Kriegman, and Ahuja [190] and by Hjelmås and Low [60].

Little work has been done on finding hands in grey-level images based on their appearance and texture. Wu and Huang [188] surveyed a number of methods for their suitability to hand detection. Very recently, boosted classifiers have achieved compelling results for view- and posture-independent hand detection, as Ong and Bowden demonstrated [124]. However, most hand detection methods resort to less object-specific approaches and instead employ color information (see, for example, Zhu et al. [196]), sometimes in combination with location priors (for example, Kurata et al. [100]), motion flow (see Cutler and Turk [34]), or background differencing (for example, Segen and Kumar [152]). The focus of the methods developed for this dissertation is on reliable detection without placing severe constraints on the scenery. Such methods must use the maximum amount of information available in the image, combining texture and color cues for reliability.

2.3.2 Tracking

If the detection method is fast enough to operate at the image acquisition frame rate, it can be used for tracking as well. However, tracking hands is notoriously difficult since they can move very fast and their appearance can change vastly within a few frames. Mostly rigid objects, on the other hand, can be tracked with very limited shape modeling effort. Some of the most effective head trackers, for example, use a fixed elliptical shape model, which is fast and sufficient for the rigid head structure, as in Birchfield [10]. Similarly, more or less rigid hand models work well for a few select hand configurations and relatively static lighting conditions (see Isard and Blake [66]).

Since tracking with a rigid appearance model is not possible for hands in general, most approaches resort to shape-free color information or background differencing, as in the mentioned works by Cutler and Turk [34], Kurata et al. [100], and Segen and Kumar [152]. These methods are vulnerable to unimodal failure modes, caused, for example, by a non-stationary camera or by other skin-colored objects in the neighborhood. In Chapter 6, we describe the Flock of Features, our multimodal technique that can overcome these vulnerabilities. Other multi-cue methods integrate, for example, texture and color information and can then recognize and track a small number of fixed shapes despite arbitrary backgrounds (for example, Bretzner et al. [19]). Shan et al. [155] also integrate color information with motion data. In their work, a particle filtering method is optimized for speed with deterministic mean shift, and dynamic weights determine the blend of color with motion data: the faster the object moves, the more weight is given to the motion data, and slower object movements result in the color cue being weighted higher. Some of their performance is surely due to a simple yet usually effective dynamical model (of the object velocity), which we could add to our method as well. Depth information combined with color, as in Grange et al. [52], also yields a robust hand tracker, yet stereo cameras are more expensive and cumbersome than the single imaging device required for our monocular approach.

Object segmentation based on optical flow (for example, normalized graph cuts as proposed by Shi and Malik [157]) can produce good results for tracking objects that exhibit a limited amount of deformation during global motions and thus have a fairly uniform flow [137]. The Flock of Features method relaxes this constraint and can track despite concurrent articulation and location changes.

2.3.3 Recognition

Recognizing or distinguishing different hand configurations is, in its generality, a very difficult and largely unsolved problem. First attempts have recently been made by Ong and Bowden [124] with fairly good results. However, to meet the more stringent requirements of user interface quality, robustness and a low false positive rate are more important than recognition of the complete hand configuration space from the entire view sphere. The posture recognition task becomes more tractable for a few select postures from fixed views. Segen and Kumar's Shadow Gestures [153] demonstrated how heavy constraints on the scenery make computer vision a viable user interface implementation modality: they require a calibrated point light source, and the hand has to be at a certain distance from the camera and from a bright, even background. Even then, only three gestures are recognized and distinguished reliably. Other, traditional methods based on edge detection can achieve fairly good results, as Athitsos and Sclaroff [1] demonstrate. Wu and Huang [188] show good classification performance but require extensive processing time.

We use a method (described in Chapter 7) similar to Ong and Bowden's but are able to achieve the real-time performance that is necessary for user interfaces. Additionally, the result of our classification is a known posture, not a similarity to a cluster that was learned without supervision and can thus contain syntactically similar but semantically different hand postures.

2.3.4 Skin color

That compelling results can be achieved merely with skin color properties was shown early on, for example, by Schiele and Waibel, who used skin color in combination with a neural network to estimate gaze direction [151]. Kjeldsen and Kender [85] demonstrate interface-quality hand gesture recognition solely with color segmentation means. Their method uses an HSV-like color space, which is debatably beneficial to skin color identification.2

2 HSV stands for Hue, Saturation, and Value. This color space separates the chrominance (hue and saturation) of light from its brightness (value).


The appearance of skin color varies mostly in intensity while the chrominance remains fairly consistent (see Saxe and Foulds [150]). Therefore, and according to Zarit et al. [193], color spaces that separate intensity from chrominance are better suited to skin segmentation when simple threshold-based segmentation is used. However, their results are based on only a few images, while Jones and Rehg [73] examined a huge number of images and found excellent classification performance with a histogram-based method in RGB color space.3 It seems that very simple threshold methods or other linear filters achieve better results in HSV space, while more complex methods, particularly learning-based, nonlinear models, excel in any color space.

3 RGB stands for Red, Green, and Blue. Most digital cameras natively provide images in this format.

Jones and Rehg also state that Gaussian mixture models are inferior to histogram-based approaches. This is true as long as a large enough training set is available; otherwise, Gaussians can fill in for insufficient training data and achieve better classification results. Bradski [14] and Comaniciu et al. [28] showed that object tracking based on color information is possible with a method called CamShift, which is based on the mean shift algorithm. These methods dynamically slide a "color window" along the color probability distribution to parameterize a thresholding segmentation, and a certain amount of lighting change can be dealt with. Patches or blobs of uniform color have also been used, especially in fairly controlled scenes (for example, by Wren and Pentland [186]). Zhu et al. [195] achieve excellent segmentation with dynamic adaptation of the skin color model based on the observed image.
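As an illustration of the color-window idea, the following sketch builds a hue histogram from a region presumed to contain skin and lets OpenCV's CamShift follow its back-projection from frame to frame. It is a minimal approximation of the cited approaches, not their implementations, and the masking thresholds are assumptions.

    # Color-window tracking sketch using OpenCV's CamShift. The initial
    # region (x, y, w, h) is assumed to contain skin.
    import cv2
    import numpy as np

    def make_hue_histogram(bgr_frame, roi):
        x, y, w, h = roi
        hsv = cv2.cvtColor(bgr_frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
        # Ignore very dark or unsaturated pixels, where hue is unreliable.
        mask = cv2.inRange(hsv, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
        hist = cv2.calcHist([hsv], [0], mask, [180], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        return hist

    def track_step(bgr_frame, hist, track_window):
        hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
        # Back-projection: per-pixel skin-color probability under the histogram.
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        # CamShift slides and resizes the search window toward the mode
        # of the probability image.
        rotated_box, track_window = cv2.CamShift(backproj, track_window, term)
        return rotated_box, track_window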

2.3.5 Shapes and contours

In theory, the contour or silhouette of an object reveals a lot about its shape and orientation. If perfect segmentation is possible, similarity based on curve matching is a viable approach to object classification, for example, based on polar coordinates as in Hamada et al. [53]. One can benefit even more from curve descriptors that are invariant to scale variation and rigid transformations, such as those by Gdalyahu and Weinshall [50] and Shape Context descriptors [8]. For less-than-perfect conditions, however, more robust 2D methods must be employed. Those usually bank on finding enough local clues in the image to place a shape model close to where the most likely incarnation of this shape can be found in the image data. Iterative methods frequently try to minimize an energy term defined by image misalignment (distance from an edge) plus a second energy component defined by shape deformation. For example, snakes are "rubber bands" that are attracted to strong gradients in gray-level images (see Kass et al. [79]). Active Shape Models (ASM, introduced by Cootes and Taylor [30]) are rubber bands that, without image constraints, default to shapes other than the round ones snakes would assume. Those shapes are learned from training data. The approach is based on PCA,4 which extracts "modes of variation" from the data. For a hand in top view, these modes could conceptually be the flexing of each finger. ASM's iterative matching method then tries to find the lowest-energy deformation of the shape while achieving the lowest image mismatch energy. The limitations of ASMs are partially due to the PCA, which does not prevent creation of shapes that are implausible according to the training data, especially for silhouettes of highly articulated objects. All pure shape models also require very good initialization in close proximity to the object in question, or background artifacts will make good registration difficult. Statistical models of an object's 3D shape, often called "point clouds," can also be built (as did, for example, Heap and Hogg [56]), but they shall not be further surveyed since they exhibit limited speed performance.

4 For an explanation of principal component analysis (PCA) see Section 2.3.7.

Conventional contour- or silhouette-based methods stand and fall with the quality of the segmentation or edge finding methods. They are sensitive to background variation and – when used for tracking – they are generally unable to recover from tracking loss. Particle filtering methods (see Section 2.3.9) have built-in tolerance for temporary false positives and recover from tracking errors more easily. This comes at the cost of higher computational requirements. Also, temporal data is usually part of the modeled state, whereas conventional shape models have no notion of time.

Athitsos and Sclaroff [1] took an increasingly popular approach and had their recognition method learn from rendered hand images instead of from actual photographs. During testing, edge data between the observation and the learned database are compared, and 3D hand configurations can be estimated from 2D grey-level edges. According to their paper, matching takes less than a second for an approximate result, but too long for interactive frame rates. Thayananthan et al. [170] detect hands in distinctive postures despite cluttered backgrounds with chamfer matching. The chamfer distance between two curves or contours is the mean of the distances between each point on one curve and its closest point on the other curve.
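A direct transcription of this definition follows; the symmetric variant that averages both directions is a common convention but an assumption here.

    # Chamfer distance as defined above: mean distance from each point on
    # one contour to its nearest point on the other.
    import numpy as np

    def chamfer(curve_a, curve_b):
        """curve_a: (N, 2) contour points, curve_b: (M, 2) contour points."""
        diffs = curve_a[:, None, :] - curve_b[None, :, :]   # (N, M, 2) point pairs
        dists = np.sqrt((diffs ** 2).sum(axis=2))           # pairwise distances
        return dists.min(axis=1).mean()                     # nearest neighbor, averaged

    def symmetric_chamfer(curve_a, curve_b):
        return 0.5 * (chamfer(curve_a, curve_b) + chamfer(curve_b, curve_a))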

2.3.6 Motion flow

The motion field of a frame specifies for every sample point the direction and distance with which it moved in image coordinates from one frame to the next. Optical flow and motion fields (see Barron et al. [6]) can serve two main purposes in hand gesture interfaces. First, if the motion field for the entire image is computed, regions of uniform flow (in direction and speed) can aid segmentation and attract other methods to areas of interest. This is particularly effective for setups with static camera positions, where motion blobs direct other processing stages to attention areas: regions in the image of likely locations for the desired object (Cutler and Turk [34], Cui and Weng [33], Freeman et al. [47]). Background differencing assumes a scene composed of a fairly static background and a more vivid foreground; it can be seen as a specialized motion flow model in which additional information about some of the objects in the scene is used. While these methods can achieve good results for stationary cameras, they are unsuited to moving cameras and to large depth differences between the objects that make up the background.
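For illustration, background differencing in its simplest form can be sketched as follows, assuming a static camera and a previously stored background frame; the threshold value is arbitrary.

    # Background differencing: pixels that differ enough from a stored
    # background frame are labeled foreground (e.g., a moving hand).
    import cv2

    def foreground_mask(frame_gray, background_gray, thresh=25):
        diff = cv2.absdiff(frame_gray, background_gray)
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        return mask   # 255 where the scene changed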

Second, motion flow can be used to track a few select image features, usually corners or other areas with high image gradients. KLT features – named after their inventors Kanade, Lucas, and Tomasi – can be matched efficiently to similar regions in the following frame (see "good features to track" in Shi and Tomasi [158] and pyramid-based matching in Lucas and Kanade [110]). For example, Quek [137] used this technique in his often-cited gesture recognition paper. An improvement on the feature selection method was recently proposed by Toews and Arbel [173], but it has not yet been evaluated for its effect on practical tracking performance. Feature tracking of course has its limitations, due to the constancy assumption (no change in appearance from frame to frame), the match window size (aperture problem), and the search window size (speed of the moving object, computational effort). Yet our Flock of Features method, introduced in Chapter 6, is able to capitalize on the benefits and makes up for some deficiencies by integration of color, a second image cue.
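A minimal sketch of such KLT-style feature tracking with OpenCV is given below. It illustrates only the generic corner selection and pyramidal matching steps, with assumed parameter values; it is not the Flock of Features method of Chapter 6, which adds a skin-color cue and flocking constraints on top of such features.

    # Generic KLT-style tracking: pick corners worth matching, then match
    # them frame to frame with pyramidal Lucas-Kanade.
    import cv2

    def init_features(gray, mask=None, max_corners=50):
        # Shi-Tomasi "good features to track": high-gradient patches.
        return cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=5,
                                       mask=mask)

    def track_features(prev_gray, gray, prev_pts):
        # Pyramidal Lucas-Kanade matching, coarse to fine.
        next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, prev_pts, None, winSize=(15, 15), maxLevel=2)
        good = status.ravel() == 1
        return next_pts[good], prev_pts[good]   # keep only surviving matches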

2.3.7 Texture and appearance based methods

Information in the image domain must play an important role in every object recognition or tracking method. This information is extracted to form image features: higher-level descriptions of the observations. The features' degree of abstraction and the scale of what they describe (small, local image artifacts or large, global impressions) have a strong influence on a method's characteristics. Features built from local, small-scale image information such as steep gray-level gradients are more sensitive to noise, they need a good spatial initialization, and, if all but the simplest objects are to be detected, a frequently large collection of many features is mandatory (see also Section 2.3.5, Shapes and contours). Once the features are extracted from the image, they need to be brought into context with each other, oftentimes involving an iterative and computationally expensive method.

If instead the features are composed of many more pixels, cover a larger region in the image, and abstract to more complex visuals, the dependent methods are usually better able to deal with clutter and might flatten the hierarchy of processing levels, since these features already contain much more information than smaller-scale features. These methods focus on an object's appearance, which is a description of its color and brightness properties at every point as it appears to the observer. Since these attributes are view-dependent, it only makes sense to talk about appearance from a given view. Appearance is further caused by texture, surface structure, and lighting. The appearance of deformable and articulated objects is naturally also caused by the object's current form.

One of the most influential procedures uses a set of training images and the Karhunen-Loève transform [77, 108]. This transformation is an orthogonal basis transformation (or "rotation") of the training space that maximizes sample variance along the new basis vectors; it is frequently known in the computer vision literature as principal component analysis (PCA, see Jolliffe [71]) or singular value decomposition (SVD). In their seminal work, Turk and Pentland applied this method to perform face detection and recognition and called it Eigenfaces [178], building on work by Kirby and Sirovich [84] on image representation of faces. Matching an observation to a PCA-based appearance model is computationally expensive. Active Appearance Models (AAM, see Cootes et al. [29]) encode shape and appearance information in one model, built in a two-step process with PCAs. Matching to an AAM can be done with a relatively fast, iterative approximation method, allowing for some real-time applications, in face recognition for example. Direct Appearance Models, proposed by Hou et al. [64], exploit an observation that is true at least for face images: that the shape is entirely determined by the appearance. Thus, their method avoids some of the complexity of AAMs.
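The core of these PCA-based appearance models can be sketched in a few lines. The following is an illustrative reduction of the Eigenfaces idea (vectorized images, SVD of the centered data, nearest neighbor in the subspace), not the cited implementations; the array shapes and the choice of k are assumptions.

    # PCA appearance model sketch: learn an orthogonal basis of the training
    # images, then match by nearest neighbor in the reduced space.
    import numpy as np

    def fit_pca(images, k):
        """images: (n_samples, n_pixels) array of vectorized training images."""
        mean = images.mean(axis=0)
        # SVD of the centered data; rows of vt maximize sample variance.
        _u, _s, vt = np.linalg.svd(images - mean, full_matrices=False)
        return mean, vt[:k]

    def project(image, mean, basis):
        return basis @ (image - mean)           # k appearance coefficients

    def match(image, gallery_coeffs, mean, basis):
        """Index of the nearest training sample in the PCA subspace."""
        c = project(image, mean, basis)
        return int(np.argmin(np.linalg.norm(gallery_coeffs - c, axis=1)))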

Further ways to learn and then test for common appearances of objects make use of neural networks. However, their performance limits (in terms of speed and accuracy) seem to have been surpassed by various other approaches. The following section describes a particular method that overcomes most of the speed problems associated with spatially large features. A more complete review of appearance-based methods for detection and recognition of patterns (faces, in fact) can be found in Yang et al.'s survey on face detection [190].

2.3.8 Viola-Jones detection method

This section describes the object detection method that is the basis for our robust hand detector and for the posture classification method. Proposed by Viola and Jones [180], this extremely fast and almost arbitrarily accurate approach has taken the vision community by storm, and a number of extension and application papers have shown its advantages for detection and recognition tasks.

Very simple features based on intensity comparisons between rectangular image areas serve as weak classifiers. A weak classifier may have just better-than-guessing ability to separate the two classes. AdaBoost [48] then combines many weak classifiers into a strong classifier: given a set of positive and negative training samples, it iteratively selects the best weak classifier from a usually large pool of possible weak classifiers. The best classifier achieves the smallest number of misclassifications – the smallest training error. The samples are then weighted according to the classification result, such that an incorrectly classified sample gets a larger weight and correctly classified ones obtain smaller weights. In subsequent iterations, the quality of the weak classifiers is evaluated on the weighted samples, the now-best weak classifier is picked, the samples are re-weighted, and so on. A strong classifier is a linear combination of the selected weak classifiers, with scalar coefficients that correspond to each classifier's training error.
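The training loop just described can be sketched as follows. Weak classifiers are modeled abstractly as callables returning labels in {+1, -1}, and exhaustive selection from a fixed pool is a simplification of Viola and Jones' feature selection, not their implementation.

    # Discrete AdaBoost sketch following the description above.
    import numpy as np

    def adaboost(weak_pool, X, y, rounds):
        """X: samples, y: labels in {+1, -1}, weak_pool: list of h(X) -> {+1, -1}."""
        n = len(y)
        w = np.full(n, 1.0 / n)                      # uniform sample weights
        strong = []                                  # (coefficient, weak classifier)
        for _ in range(rounds):
            errors = [np.sum(w * (h(X) != y)) for h in weak_pool]
            best = int(np.argmin(errors))            # smallest weighted training error
            err = max(errors[best], 1e-10)
            alpha = 0.5 * np.log((1.0 - err) / err)  # coefficient from training error
            pred = weak_pool[best](X)
            w *= np.exp(-alpha * y * pred)           # misclassified samples gain weight
            w /= w.sum()
            strong.append((alpha, weak_pool[best]))
        return strong

    def classify(strong, X):
        # Strong classifier: sign of the weighted vote of its weak classifiers.
        return np.sign(sum(alpha * h(X) for alpha, h in strong))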

Freund and Schapire show that asymptotically perfect classification is possible if enough training samples are available. The number of weak classifiers that form one good strong classifier is usually in the hundreds. In their application to pattern recognition, Viola and Jones arrange a number of these strong classifiers in sequence to form a "cascade." Lazy cascade evaluation during detection allows early elimination of negative image areas, resulting in high speed performance. A typical number of cascade stages is on the order of tens. The detector cascades achieve excellent detection rates on face images with a low false positive rate.

The second major contribution of Viola and Jones' paper [180] is a clever technique for constant-time computation of the feature values: a single-pass precomputation step creates an "integral image," better known as a data cube in the database community. With it, the pixel values in arbitrarily large rectangular areas can be summed up with three additions. This is crucial, as other techniques usually require time proportional to the area of the rectangle. Some of these "rectangle features" are depicted in Figure 2.1. Due to the exhaustive-search component used to find the best weak classifier at each AdaBoost iteration, training a cascade takes on the order of 24 hours on a 30+ node PC cluster.

Figure 2.1: The rectangular feature types for Viola-Jones detectors.
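A minimal sketch of the integral image and of a constant-time rectangle sum follows; the two-rectangle feature shown is one illustrative instance of the feature types in Figure 2.1, not the exact definition used by Viola and Jones.

    # Integral image: one cumulative-sum pass, after which any rectangle sum
    # needs only three additions/subtractions of corner values.
    import numpy as np

    def integral_image(gray):
        # Zero-padded so border rectangles need no special cases.
        ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
        ii[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)
        return ii

    def rect_sum(ii, x, y, w, h):
        """Sum of pixels in the rectangle with top-left corner (x, y), size w x h."""
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

    def two_rect_feature(ii, x, y, w, h):
        # Left half minus right half of the window: responds to vertical edges.
        return rect_sum(ii, x, y, w // 2, h) - rect_sum(ii, x + w // 2, y, w - w // 2, h)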

At detection time, the entire image is scanned at multiple scales. For example, a template of size 25x25 pixels, swept across a 640x480 image pixel by pixel, then enlarged in size by 25%, swept again, enlarged, swept, and so on, yields 355,614 classifications. Every stage of the cascade has to classify an area as positive for an overall positive match. This lazy, successive cascade evaluation, together with the rectangular features' constant-time property, allows the detector to run fast enough for the low-latency requirements of real-time object detection.

Overall, the method's accuracy and speed performance, as well as its sole reliance on grey-level images, make it very attractive for hand detection.


Boosting, and particularly Viola and Jones' method, has become one of the most actively researched areas in computer vision. For example, Pavlovic and Garg [129] showed applications to skin color segmentation and to face detection based on other features. Jones and Viola extended their work to multi-view face detection [72]. The three feature types shown in Figure 2.1 were not sufficient to accurately separate the classes, so they introduced a feature type based on four rectangular areas. For hand detection, we conceived a similar feature type that is, however, even more expressive because it is not restricted to adjacent rectangular areas.

Viola and Jones [179] also addressed the issue of asymmetrical training data: usually, there are many more negative examples than positive ones. Yet the cost for misclassification of either of them is equal in AdaBoost, which does not reflect the importance of the few positive examples. In AsymBoost, they propose a way to dynamically and regularly adjust the weights such that positive examples have an increased importance. Their results show a reduction in false positive rates by up to a factor of two for a given detection rate.

Zhang et al. [194] also tackled multi-view face detection. They improve the core AdaBoost with a method they call FloatBoost. During training, weak classifiers are not only added but are also removed from a strong classifier if they no longer contribute progressively to its performance. This reduces the number of required feature evaluations during detection and/or improves the accuracy of the detector.

Lienhart and Maydt [105] show that not only grid-aligned rectangular features can be computed in constant time, but that diagonal, diamond-shaped arrangements are equally possible. Rectangular areas rotated by 45 degrees can be summed up in constant time; however, the precomputation step requires two passes over the image versus one for the original integral image, not counting creation of the squared integral image that is required for constant-time normalization.

A template is a knowledge-based description of the typical components of an object. Components can be locations of prominent intensity levels and their spatial relationship, image symmetries, certain frequency ranges in Fourier-transformed images, and so on. Machine vision and image processing often use image moments to describe an object template. Schweitzer et al. [7] extended the idea of integral images to compute moments of any order in constant time, again with one or several precomputation steps.

Last but certainly not least, Ong and Bowden's posture- and view-independent hand (shape) detector [124] must be mentioned, as it is the closest to our hand posture recognition method. With the aid of unsupervised training, they automatically group thousands of training images into 300 clusters based on their shape appearance (contour), using k-medoid clustering. In a first stage during detection, a generic hand detector finds hands in any pose and posture; a second stage performs the classification into one of the 300 shape clusters. Different from our recognition method, their second stage has no influence on whether an area is considered a hand or not. This is likely to involve redundant classifier work, causing a speed penalty that might be prohibitive for user interface requirements (no computation times are stated in their paper). Also, their clusters have no immediate relation to a hand posture, as they are solely contour-based. Our classes directly correspond to one known posture each.

2.3.9 Temporal tracking and filtering

This section concerns methods that go beyond motion flow, that is, methods that track at the object level, not at the pixel or pattern level. While our vision system currently does not include a dynamic model of hand motion, adding one would likely improve its performance significantly. Such a dynamic filter can easily be implemented at the application layer.

Kalman filtering [75] takes advantage of the smooth movement of the object of interest. At every frame the filter predicts the object's location based on the previous motion history, collapsed into the state at the prior frame and the error covariances. The image matching is initialized with this prediction, and once the object is found, the Kalman parameters are adjusted according to the prediction error.
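The predict/correct cycle can be sketched for a constant-velocity model of the 2D hand position as follows; the state layout and the noise covariances are illustrative assumptions, not those of any cited system.

    # Constant-velocity Kalman filter for a 2D position, state (x, y, vx, vy).
    import numpy as np

    dt = 1.0                                   # one frame
    F = np.array([[1, 0, dt, 0],               # state transition: x += vx*dt, ...
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],                # only the position is observed
                  [0, 1, 0, 0]], dtype=float)
    Q = np.eye(4) * 0.01                       # process noise
    R = np.eye(2) * 4.0                        # measurement noise

    def predict(x, P):
        return F @ x, F @ P @ F.T + Q          # predicted state and covariance

    def correct(x_pred, P_pred, z):
        innovation = z - H @ x_pred            # prediction error
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
        x = x_pred + K @ innovation
        P = (np.eye(4) - K @ H) @ P_pred
        return x, P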

One of the limitations of Kalman filtering is the underlying assumption of a single Gaussian probability. If this assumption does not hold, and the probability function is essentially multi-modal, as is the case for scenes with cluttered backgrounds, Kalman filtering cannot cope with the non-Gaussian observations. The particle filtering or factored sampling method, often called Condensation (conditional density propagation) tracking, makes no implicit assumption of a particular probability function but rather represents it with a set of sample points into this function. Thus, irregular functions with multiple "peaks" – corresponding to multiple hypotheses for object states – can be handled without violating the method's assumptions. Factored sampling, introduced to the vision community by Isard and Blake [67], has been applied with great success to tracking various fast-moving, fixed-shape objects in very complex scenes (see Laptev and Lindeberg [101], Deutscher et al. [36], and Bretzner et al. [19]). How multiple models, one for each typical motion pattern, can improve tracking is shown in another paper by Isard and Blake [66]. Partitioned sampling, introduced by MacCormick and Isard [111], is a technique that reduces the complexity of particle filters. The modeled domain is usually a feature vector that combines shape-describing elements, such as the coefficients of B-splines, with temporal elements such as the object velocities.
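A single Condensation-style iteration can be sketched as follows; the trivial diffusion dynamics and the externally supplied likelihood function are placeholders for the dynamic and measurement models of the cited systems.

    # One particle-filter step: the posterior over the object state is a set
    # of weighted samples, so several hypotheses survive cluttered observations.
    import numpy as np

    def condensation_step(particles, weights, likelihood, motion_noise=2.0):
        """particles: (N, d) states; weights: (N,) summing to 1; likelihood: state -> score."""
        n = len(particles)
        # 1) Factored sampling: draw particles proportionally to their weights.
        idx = np.random.choice(n, size=n, p=weights)
        particles = particles[idx]
        # 2) Predict: diffuse each hypothesis with a (here trivial) dynamic model.
        particles = particles + np.random.normal(scale=motion_noise, size=particles.shape)
        # 3) Measure: reweight by how well each hypothesis explains the image.
        weights = np.array([likelihood(p) for p in particles], dtype=float) + 1e-12
        weights /= weights.sum()
        return particles, weights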


2.3.10 Higher-level models

While our method makes no attempt at estimating the articulation of the hand's phalanges, this topic is still covered here due to its close relation to our work. Also, logical extensions of our work certainly include systems for full state estimation of complex hand models.

Higher-level models describe properties of the tracked objects that are not immediately visible in the image domain. For hands, there are structural, anatomical models and kinematic models. Both are 3D models that have explicit knowledge about the hand's physique. For example, knowledge about the limbs and the fact that they can occlude each other can help avoid problematic situations for the vision algorithms, as Rehg and Kanade showed in [141, 142]. A kinematic model furthermore describes interactions between the limbs; see, for example, Lee and Kunii [102].

Anatomically, the hand is a connection of 18 elements: the 5 fingers with 3 elements each, the thumb-proximal part of the palm, and the two parts of the palm that extend from the pinky and ring fingers to the wrist (see Figure 2.2). The 17 joints that connect the elements have one, two, or three degrees of freedom (DOF). There are a total of 23 DOF, but for computer vision (CV) purposes the joints inside the palm are frequently ignored, as is the rotational DOF of the trapeziometacarpal joint. In addition to these 20 DOF, the hand reference frame has 6 DOF (location and orientation). See Braffort et al. [15] for an exemplary anthropometric hand model. A hand configuration is a point in this 20-dimensional configuration space.

[Figure 2.2 diagram: hand skeleton showing the distal, middle, and proximal phalanges, the metacarpals, carpals, radius, and ulna, with the joints labeled as listed in the caption below.]

Figure 2.2: The structure of the hand: the joints and their degrees of freedom are: distal interphalangeal joints (DIP, 1 DOF), proximal interphalangeal joints (PIP, 1 DOF), metacarpophalangeal joints (MCP, 2 DOF), metacarpocarpal joints (MCC, 1 DOF for pinky and ring fingers), the thumb's interphalangeal joint (IP, 1 DOF), the thumb's metacarpophalangeal joint (MP, 1 DOF), and the thumb's trapeziometacarpal joint (TMC, 3 DOF).

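For bookkeeping, the reduced 20-DOF articulation model just described can be written out explicitly. The joint naming follows Figure 2.2, while the grouping into a dictionary is our own illustration.

    # Reduced articulation model: 20 DOF for the posture plus a 6-DOF
    # global hand frame (location and orientation).
    FINGERS = ("index", "middle", "ring", "pinky")
    REDUCED_DOF = {
        "thumb TMC": 2,   # trapeziometacarpal; its rotational DOF is ignored here
        "thumb MP": 1,
        "thumb IP": 1,
        **{f"{f} MCP": 2 for f in FINGERS},
        **{f"{f} PIP": 1 for f in FINGERS},
        **{f"{f} DIP": 1 for f in FINGERS},
    }
    ARTICULATION_DOF = sum(REDUCED_DOF.values())   # 20-D posture configuration space
    GLOBAL_POSE_DOF = 6
    assert ARTICULATION_DOF == 20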

The difficulties of searching such a high-dimensional space are obvious. Lin, Wu, and Huang [106] suggest limiting the search to the interesting subspace of natural hand configurations and motions. Type I constraints limit the extent of the space by considering only anatomically possible joint angles for each joint (see also earlier work by Lee and Kunii [102]). Type II constraints reduce the dimensionality by assuming direct correlation between DIP and PIP flexion. Type III constraints limit the extent of the space again by eliminating generally impossible configurations and unlikely transitions between configurations. Altogether, and after a PCA, a dimensionality reduction to seven dimensions was achieved while retaining 95% of all configurations observed in their experiments. Wu and Huang also published a good high-level survey of the state of the art of hand modeling, analysis, and recognition [189].

Bray et al. [18] introduce a special gradient descent method called "Stochastic Meta Descent." It has adaptive step sizes so that it is unlikely to get stuck in a local minimum. Together with an anthropometric model and stereo video, they achieve good results in 3D hand tracking. The processing time is not short enough for real-time deployment (4.7 seconds per frame on a 1 GHz Sunfire). Another paper by the same authors [17] combines the Stochastic Meta Descent with a particle filtering method, thus making it easier for the deterministic component to march out of local minima. This improvement is bought with increased computation time.

Analysis by synthesis describes the estimation of model parameters by back-projection of the model into the image domain, then iteratively adjusting it toward the closest match between back-projection and observation (see, for example, Kameda et al. [76], Shimada [159], and Ström et al. [166]). These methods lack the capability to deal with singularities that arise from ambiguous views (see Morris and Rehg [118]). When more complex hand models are used that allow methods from projective geometry to generate the synthesis, self-occlusion is again modeled and can thus be dealt with. Stenger et al. [165], for example, use quadrics to model the hand geometry and show that good tracking of a few hand postures with a fairly unambiguous contour is possible. They use an "unscented Kalman filter" (UKF) to align the model with the observations. The UKF (introduced by Julier et al. [74]) improves on the widely used Extended Kalman filter (EKF) for nonlinear systems by not assuming an underlying Gaussian distribution of the parameters, by being faster to implement and to apply, and by achieving superior performance due to a more accurate estimation of the covariance matrix. It samples the distribution with a few, well-selected samples which are propagated through every estimation step. One of the drawbacks of all Kalman-based filters, however, and the reason why the proposed hand posture tracking method will most likely not generalize to more complex postures, is that these filters assume a unimodal distribution. This assumption is most likely violated by complex, articulated objects such as the hand. It remains to be noted that the UKF is faster than particle filtering, but that the combination with the quadrics representation slows the approach down below real-time performance.


Lu et al. [109] also employ an articulated hand model, parameterized with image edges, optical flow, and shading information, to estimate a hand posture in 3D. In an earlier system, Nolker and Ritter [120] start from the fingertip locations (found with prior work) and deduce the possible 3D hand postures. Under the hood, a neural network parameterizes an articulated hand model.

2.3.11 Temporal gesture recognition

When it comes to modeling and extracting the temporal movements of the hands, general physics-based motion models are called for. Kalman filtering in combination with a skeletal model can easily resolve simple occlusion ambiguities, as Wren and Pentland [186] demonstrated. Dynamic gesture recognition, that is, recognition of the continuity aspects of gestures, and especially their semantic interpretation, is tangential to this dissertation. Readily available mathematical tools can easily and independently be applied to the data produced by the proposed methods, should the need arise. Therefore, this part of the related work description is kept very brief.

Many methods in use are borrowed from the more evolved field of speech recognition due to the similarity of the domains: temporal and noisy. Hidden Markov Models (HMMs, see Bunke and Caelli [21] and Young [192]) are frequently employed to dissect and recognize gestures due to their suitability for processing temporal data. More discrete approaches also perform well at detecting spatio-temporal patterns; see work by Hong et al. [63]. The learning methods of traditional HMMs cannot model some structural properties of moving limbs very well.5 Brand [16] uses another learning procedure to train HMMs that overcomes these problems: it allows for estimation of 3D model configurations from a series of 2D silhouettes and achieves excellent results. The advantage of this approach is that no knowledge has to be hard-coded; instead, everything is learned from training data. This of course has its drawbacks when it comes to recognizing previously unseen motion.

With much more modeling effort, strong enough priors can be placed on the observations, and simple motions can be recognized even though they are not close to the training data. DiFranco et al. [38] showed that – when combined with manually annotated key frames – even very complex movements can be extracted successfully. They model the body as a kinematic chain and add a model for link dynamics plus angular restrictions on the joints.

5 Brand notes that the traditional learning methods are not well suited to modeling state transitions since they do not improve much on the initial guess about connectivity. Estimating the structure of a manifold with these methods thus yields extremely suboptimal results. See [16] for details.


2.4 User interfaces and gestures

This section covers literature related to issues of gesture-based user interfaces (UI). It sheds light on the suitability of gestures as an input modality. Approaches to matching gestures to application needs differ qualitatively; this is discussed first, followed by an introductory look at vision-based gesture recognition for human-computer interaction. Another detailed review of gesture recognition technology can be found in a book chapter by Turk [176].

2.4.1 Gesture-based user interfaces

Gesture interfaces can offer natural and easy ways to communicate with a computer and its programs. Hummels and Stappers [65] describe two experiments that show how well gestures are suited to convey geometric design information without requiring pre-defined semantics. For anything but geometric constructs, of course, the semantics have to be defined. Hauptmann showed early on that multi-modal interfaces combining gestures and speech are highly intuitive and the preferred mode of interaction (given speech-only and gestures-only as the other choices). In a Wizard-of-Oz study [55] he had participants manipulate graphical objects on a screen. Other recommendations based on his experience are not to try to substitute gestures for mouse input, since people want to use their hands in more than just one rigid configuration. He also stresses the importance of immediate feedback to allow for compensation of alignment errors when pointing. A study by Krum et al. [97] of applications of the Gesture Pendant (see below and [161]) confirmed these recommendations: their system maximizes gesture recognition robustness by allowing only four static hand configurations, followed by relative movements from the starting positions. This severely restricts the expressiveness and intuitiveness of the interface – which resulted in limited performance and in users disliking the gesture interface.

UI issues of hand gesture interfaces are usually handled in an ad-hoc manner without a systematic approach. That is to say, current research often limits the choice of gestures to those that are easily recognizable, or it employs habitual gestures out of convenience. Few projects take a systematic approach; exceptions can be found in Pausch et al. [127], Buxton et al. [22], and Leganchuk et al. [103], the latter explaining some of the benefits that two-handed interfaces have over single-handed interfaces. A high-level design method for gesture interfaces was developed by Sturman et al. [167], complete with an evaluation guide and implementation recommendations. They distinguish gestures by whether they are continuous or discrete. Gesture interpretations are classified according to whether the actions in the application domain have a direct relationship to the hand gestures, whether the gestures are mapped to arbitrary commands, or whether gestures are interpreted in a more context-sensitive, symbolic way.

The XWand [184] is a UI device in the shape of a wand or stick. It enables natural interaction with consumer devices through gestures and speech while overcoming many of the problems of hand gesture recognition. It closely matches "perfect" interaction with the environment: point at an object and gesture or speak commands to it. While none of the techniques for realizing the UI are new, their combination into such a package is novel. Earlier, Kohtake et al. [86] showed a similar wand-like device, the "InfoPoint," that enabled data transfer between consumer appliances such as a digital camera, a computer, and a printer by pointing it at them. A built-in camera is used for fiducial recognition.

Since the focus of this dissertation is on recognition of hand configurations from single-frame appearances (not hand motions as in Section 2.3.11), sign language recognition, given its strongly dynamic character, will not be covered in this literature review. The interested reader is referred to a review of some recognition methods for dynamic gestures, written in 1999 by Wu and Huang [187].

2.4.2 Vision-based interfaces

An early survey of methods and applications of vision-based hand gesture interfaces was written by Pavlovic, Sharma, and Huang [128]. One of the first papers that used finger tracking as real-time input was by Fukumoto et al. [49]. There, a cursor can be moved around on a projection screen by making a pointing hand posture and moving the hand within a space observed by two cameras. Two different postures can be distinguished (thumb up and down), with various interpretations used to control a VCR and to draw. The paper also deals with the problem of estimating where a person points. Crowley et al. [32] showed another vision-based interface that provides finger tracking input to an augmented reality (in its wider sense) painting application. While the CV method presented in that paper is not very robust and does not distinguish between different postures, they nicely showed correlations between tracking parameters (template area and search window sizes) and their effects on the UI (maximal speeds). A paper by Sato et al. [149] uses infrared images and similar methods to achieve simple interaction functionality such as browsing the web with finger gestures.

A background-robust system that uses shape information and color probabilities was demonstrated at various public venues by Bretzner et al. [19]. Their person-independent methods can distinguish five view-dependent hand configurations and control consumer electronics with the recognition data. In order to make the system real-time despite the compute-intensive particle filters employed, they use hierarchical sampling and a dual-processor machine.


With a camera positioned above the keyboard, Mysliwiec et al.’s FingerMouse project [119, 138] is able to detect the hand in one posture. Once it is detected, the extended index finger is found and its tip acts as a mouse pointer. A probability-based color-segmented image is the input to a finite state machine (FSM) which performs hand detection. The fingertip is found with a simple heuristic that looks for the north-most hand part.
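A minimal sketch of such a north-most fingertip heuristic, operating on a binary skin mask, could look as follows; this is only an illustration of the idea, not the FingerMouse implementation, and the function name and camera orientation are assumptions:

    import numpy as np

    def northmost_fingertip(skin_mask):
        # skin_mask: 2D boolean array, True where a pixel is classified as skin.
        # Returns the (row, col) of the top-most skin pixel, i.e. the "north-most
        # hand part", which then serves as the mouse-pointer position.
        rows, cols = np.nonzero(skin_mask)
        if rows.size == 0:
            return None                     # no hand visible in this frame
        top = np.argmin(rows)               # smallest row index = closest to the image top
        return int(rows[top]), int(cols[top])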

Twelve gestures are distinguished in a paper by Zhu et al. [197]. The gestures are a combination of motions and configurations, recognized with a trained model that combines affine transformations and ellipse fitting. A variation of dynamic time warping handles temporal variation. With a camera mounted opposite the user, a map browser interface is demonstrated to work fairly reliably. They observed that their measure of shape is less well recognized (just above seventy percent of the time) than their measure of motion (almost ninety percent of the time), which is not surprising given their choice of gestures and CV methods.
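For readers unfamiliar with the technique, the textbook form of dynamic time warping aligns two feature sequences of different lengths by minimizing accumulated frame-to-frame distances. The sketch below shows only this standard formulation; Zhu et al. use their own variation, whose details are not reproduced here:

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        # seq_a, seq_b: lists of per-frame feature vectors (e.g. shape or motion features).
        # Returns the accumulated cost of the best temporal alignment, so two
        # executions of the same gesture at different speeds compare as similar.
        a = [np.asarray(x, dtype=float) for x in seq_a]
        b = [np.asarray(x, dtype=float) for x in seq_b]
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]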

The Gesture Pendant [161] is a clever combination of hardware and software to implement a robust gesture recognition interface. An array of infrared LEDs illuminates the area before the device, which is worn like a necklace and ends up in front of the sternum. Hand gestures performed in front of it are easily segmented from the background with the aid of an infrared filter on a camera co-located with the LEDs. The system can recognize four postures and their movements very reliably in indoor settings. The authors note that the limited expressiveness and outdoor performance might be remedied with a laser beam that produces structured (grid-patterned) light instead of the constant illumination of the LEDs.

Computer vision can provide services to many different applications. Because it requires no physical devices (aside from the camera, of course) and because most of its implementation is in software rather than hardware, reconfiguration and in particular adaptation to the individual are a matter of software adjustments, as opposed to cumbersome or impossible device changes. The interested reader is referred to a fairly recent survey by Porta [135]. Also, Turk [177] recently evaluated the state of computer vision as an interface modality and gives advice as to which challenges must be overcome before widespread vision interfaces can become possible.

2.5 Virtual environments and applications

This section discusses some general issues of Virtual Environments (VEs) and Geographic Information Systems and then continues the previous section by looking at ways to perform vision-based human-computer interaction with VEs.


2.5.1 Virtual environments and GISs

Virtual Environments (see Isdale [68]) have many advantages over conventional desktop or other 2D displays. The two main capabilities that make them appealing to many other disciplines are their inherent 3D characteristics and the increased spatial extent of the display volume. Even though the actual display unit might not have a large field of view, viewpoint navigation can effectively extend the display into an unbounded space. Geographic Information Systems (GISs) are therefore a natural application area for VEs.

The Virtual GIS in Koller et al. [87] is arguably the first and most extensive combination of state-of-the-art GIS, visualization, and interaction techniques. System design for these complex applications poses many challenges (see Lindstrom et al. [107] for a project update). Interaction with VEs is also a very challenging problem and many researchers are working on solving it. Bowman’s thesis [13] is a comprehensive starting point to learn about methods and devices for interaction techniques for VEs. Shneiderman [160] summarizes that direct manipulation – which our vision-based interface inherently provides – is well suited for interaction with VEs due to at least the following reasons: “control-display compatibility, less syntax reduces error rates, errors are more preventable, faster learning and higher retention, encourages exploration.”


One frequently taken path leads to multimodal interfaces, usually involving speech and gesture recognition. These interfaces mimic natural spatial interaction as demonstrated in the seminal “Put that there” paper by Bolt [11]. Krum et al. [98] show a fairly simple yet robust multimodal interface to navigate a virtual earth model. They use the Gesture Pendant wearable gesture recognition device [161] and thus limit the gesture types to a body-relative reference frame. In contrast, Rauschert et al. [139] cannot leverage the users’ proprioceptive sense. On the other hand, their system allows for direct, absolute pointing to objects in the virtual world. That paper also briefly discusses “step in and use” operation, which refers to distinguishing multiple people without requiring person-specific training.

On a related note, Foxlin and Harrington [45] say about Mine et al. [117]: “Mine et al. have discussed very convincingly the benefits of designing virtual environment interaction techniques that exploit our proprioceptive sense of the relative pose of our head, hands and body. A variety of ‘preferred’ techniques were presented, such as: 1. Direct manipulation of objects within arms reach 2. Scaled-world grab 3. Hiding tools and menus on the user’s body 4. Body-relative gestures.” This is very encouraging for the style of interfaces that we propose since the proprioceptive sense is conveniently and automatically exploited with vision-based hand tracking.


Most of the VE-enhanced GIS systems are designed for indoor use; a project at UCSB, however, attempts to take the entire GIS and some of the novel visualization techniques outdoors. Clarke et al. [27] describe the UI design issues of such an approach. Nusser et al. [121] discuss current technological advances in various research areas and how they could affect the future of data sampling in the field. A good collection of wearable technology is maintained by Sutherland on his Wearables web site. 6 The demands on a wearable GIS are even higher than they are on traditional VEs: full immersion of the wearer into the virtual world is usually not desired, but instead a mixed reality that places virtual objects into the real world.

Augmented Reality (AR) is considered by many to be the new exciting frontier in VE research. 7 It is a fairly young field that still has many opportunities for solving technology problems. For an in-depth overview of AR see Azuma’s survey [4] and a more recent update [2]. One of the earliest attempts to take AR to the outdoors was by Starner [162]. This paper also briefly deals with user input to their wearable AR system, for which they employed a chording keyboard. These devices take a long time to learn, and familiarity with them in the general population is extremely low. Another early, more elaborate outdoor AR system is the Touring Machine from Columbia University (see Feiner et al. [42]).

6 http://home.earthlink.net/~wearable/

7 Dieter Schmalstieg (TU Wien) suggested on a panel discussion at the VR 2003 conference to focus on AR research for now, because purely virtual reality (VR) research seems to have stalled a little and could definitely benefit from most AR advances as well. He considers AR to be harder, so there would be more room for good research. As one of the steps that could bring the field on the right track he proposed to integrate camera and video support into existing VR toolkits, almost instantly yielding AR toolkits.

A paper by Höllerer et al. [62] mostly concerns the design of interaction elements, but it also describes the implementation of various ways of user input. They use, for example, selection by gaze (head) direction and following menus either on the HMD or a hand-held computer. Indoors, wireless InterSense position sensors and wireless trackballs take advantage of more available infrastructure. In another paper [61], the application of strategies for designing interface techniques is advocated. When displaying information, it should not only be relevant to the user at the respective time; it should also come in a package appropriate for the current environment and be visually placed in a sensible spot.

Head-worn cameras inspire many novel applications. The Remembrance Agent by Rhodes [146], for example, associates computer vision-recognized faces with learned names and can thus augment the wearer’s memory. Rekimoto and Nagao proposed similar augmentations with their NaviCam system [144]. A built-in video recorder that operates on the wearer’s commands, such as in Jebara et al.’s “DyPERS” system [69], is conceivable, as is using the camera to track his or her head orientation or even 6DOF location in the environment. The latter can be done in the unprepared outdoors (see Azuma et al. [3]) or in infrastructured settings (see, for example, Park et al.’s take on vision-based pose estimation [126], Welch et al.’s paper describing the HiBall Tracker [182], Rekimoto’s Matrix [143], and Kato and Billinghurst’s AR Toolkit [80]). Purely vision-based ego-pose tracking without any infrastructure is very hard, but hybrid tracking can yield better accuracy and precision than each of the involved methods in isolation. This has been demonstrated for a combination of magnetic trackers and vision-based tracking of artificial fiducials by State et al. [164], with a database of known images by Kourogi and Kurata [96], as well as for a combination of compass and inertial trackers with CV-based methods that dynamically identify and track features in the environment, see You et al. [191]. The current trend for commercial trackers certainly seems headed that way since InterSense is developing a hybrid tracker with exciting performance characteristics (see Foxlin and Naimark [46]). A related project from InterSense [45] had previously looked at ultrasound-based hand tracking from a head-worn reference unit. This would surpass the accuracy of current computer vision approaches that also attempt hand tracking in a head-centered reference frame, but a hand-worn unit is required and no finger configuration information can be acquired.

Another well-known outdoor AR project, ARQuake by Piekarski et al. [130, 171], accomplishes registration mainly with huge fiducials in the environment, tracked with the AR Toolkit. Whenever the vision-based tracking delivers unreliable results, a differential GPS and compass tracker take over to supply the 6DOF spatial parameters to the application. They use these fiducials also for hand tracking [132], which achieves performance somewhere in between an ultrasound tracker and purely vision-based tracking. Again, no finger configurations can be distinguished. Piekarski and Thomas [133, 131] also strongly advocate development of standard interfaces to both virtual and augmented realities, and tools to make application content creation a process that can be executed within the augmented environment to support immediate visualization and interaction capabilities.

2.5.2 Vision-based interfaces for virtual environments

Broll et al. [20] state that the “absence of tether-free solutions” to tracking is one of the big problems of interfacing in a convenient manner with VRs. The Studierstube [168] and a precursor project of the AR Toolkit by Kato et al. [81] use a plain object with fiducials on it to visually track one tool that can virtually assume many functionalities. In another application of the same technology, Hedley et al. [57] investigate how AR can be used in simple geography applications on the desktop. In addition, a simple hand gesture recognition system allows for free-hand input which is interpreted similarly to a mouse pointer. Due to its strong relation to the above projects, one non-vision-based application shall be mentioned: the RV-Border Guards game [123] also uses one device (an electromagnetically tracked glove) for various purposes, depending on the performed gestures: shooting and shielding.

Dorfmüller-Ulhaas and Schmalstieg use extensive amounts of special equipment [40]: users must wear gloves with infrared-reflecting markers, the scene is infrared-illuminated, and a pair of cameras is used for stereo vision. However, the system’s accuracy and robustness are quite high even with cluttered backgrounds. It is capable of delivering the accuracy necessary to grab and move virtual checkerboard figures. The tradeoff between special equipment and constraints on the environment is illustrated by comparing that paper with one presented by Segen and Kumar [154]: their techniques also operate with two cameras, but they do not require the user to wear gloves. As a direct result, the background has to be very uniform and easy to segment from the hand. The recognized gestures constitute input to a simple flight simulator or a robot arm.

Hardenberg and Bérard [181] show three applications: painting, presentation control, and moving of virtual objects. They focus on usability, and the paper discusses the requirements for a usable interface. An implementation is shown that is based on a frame differencing algorithm with a decaying average. Assuming good segmentation, they first extract the fingertip locations. From those, the hand configuration is estimated.
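To make the segmentation step concrete, the following sketch differences each frame against an exponentially decaying running average of past frames. It is only a generic illustration of that idea with illustrative parameter values, not Hardenberg and Bérard’s implementation:

    import numpy as np

    def decaying_average_foreground(frames, alpha=0.05, threshold=25.0):
        # frames: iterable of grey-level images (2D numpy arrays).
        # alpha: how quickly the background estimate decays toward the current frame.
        # threshold: minimum per-pixel difference to count as foreground.
        background = None
        for frame in frames:
            f = frame.astype(np.float32)
            if background is None:
                background = f.copy()                        # initialize with the first frame
            mask = np.abs(f - background) > threshold        # pixels that changed recently
            background = (1.0 - alpha) * background + alpha * f   # decaying average update
            yield mask                                       # candidate hand/finger pixels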


Chris Hand [54] took on the tremendously important task of investigating interaction techniques for 3D environments without limiting himself to a specific implementation technology. He classifies the uses of gestures in virtual environments into viewpoint control, selection, manipulation, and system control. Only by establishing device-independent metrics for which gestures are suited to what task can the development of applications and implementation technologies progress without slowing each other down.

2.5.3 Mobile interfaces

This section reviews VBIs for mobile computers and applications.

Starner et al. pioneered mobile VBIs for American Sign Language recognition [163]. A cap-mounted camera tracked skin-colored blobs whose spatial progression was analyzed over time with Hidden Markov Models. Their system worked with non-instrumented hands, just as ours does. However, our system integrates multiple image cues (skin color and texture information) to overcome the robustness limitations associated with relying on the accuracy of single-cue image segmentation. Recognizing a set of communicative gestures (which frequently exhibit distinct spatial trajectories) requires more semantic post-processing, but manipulative and discrete postures as recognized by our methods are more demanding on the CV methods. Another color-based VBI was shown by Dominguez et al. [39], who implemented a compelling wearable VBI that enabled the user to encircle objects in view with a pointing gesture.

In a later project, Krum, Starner, et al. built the Gesture Pendant, a previously mentioned mobile system for recognizing gestures and speech [98] (see also Section 2.5.1). It employed specialized imaging hardware with active infrared illumination and provided a small interactive area at sternum height in front of the wearer’s body. Our vision hardware is entirely passive; that is, it does not include light sources. A related user study [97] found that the relatively static hand position for extended periods of time caused fatigue. We hope to avoid fatigue and discomfort symptoms even for long-term interaction through use of a much larger interaction area and less rigid hand postures. The reader is referred to Section 2.2 and Chapter 3 for more on fatigue and discomfort.

Kurata et al.’s HandMouse [100] is a VBI for mobile users wearing an HMD and camera very similar to ours, allowing for the registered manipulation technique (see Section 8.3). It differs in that the hand has to be the visually prominent object in the camera image and that it relies solely on skin color. The robustness gained with our multi-modal approach makes it possible for the image of the hand to be much smaller. Via wireless networking, they employ a stationary cluster for processing [99], whereas our vision methods run on a single laptop without sacrificing real-time performance. Going beyond the interaction methods they demonstrated, we characterize additional techniques and their suitability for mobility and the outdoors. Our system then shows how this improves the mobile user interface’s usability and effectiveness.

A few recent research projects use the ARtoolkit [80] software to obtain the hand’s 6 degree-of-freedom (DOF) position purely by means of grey-level image processing. For example, for the previously mentioned Outdoor Tinmith Backpack Computer [172], Thomas and Piekarski attached a fiducial to the back of the hand. Our system requires no markers, tracks without restrictions on rotation, and can obtain posture information in addition to 2D location. Their system is an excellent example of a high-fidelity wearable computer, but also of the amount of equipment required to facilitate this functionality. We designed our system to minimize extraneous hardware requirements and instead make the computer disappear as much as possible. Only the head-worn devices are exposed; everything else is carried in a small backpack.

Wearable Augmented Reality systems such as described in Feiner et al. [42] are related in that they are a prime recipient for our interaction methods, as they lack sufficient interface capabilities and are thus still limited in their application.

System usability evaluation is becoming more and more important, as witnessed by a few projects. Dias et al. [37] found that participants “positively adhered to the concept of Tangible Interaction in” augmented reality environments. They advocate the development of standardized tests for usability as well as standard metrics to that end. A first step in that direction is taken by Paelke et al. [125], who have started creating an easy-to-use testbed for interaction techniques in augmented and virtual environments. Koskela et al. [94, 95] show in their concept papers that wearable augmented reality technology has created interest. They have approached user evaluation on proof-of-concept systems as a next step.


Chapter 3

Hand Gestures in the Human Context

Hands are our most versatile tool for accomplishing everyday tasks, from preparing a cup of tea or coffee in the morning to switching off the light at the end of the day. We use our hands and arms without thinking about them, and instead focus on the task they are to do: when writing a letter, we concentrate on the words and sentences and their meanings, not on how our fingers guide the pencil on the paper. This is the holy grail of user interfaces: controls and interaction metaphors should not occupy foreground brain processing resources but recede to unawareness in order to make space for task-related thinking. Three rules of thumb help to achieve this goal for gesture interfaces.

• First, the gesture should be as natural as possible to control the task at hand. If the task resembles an action that we perform in other situations, the gesture should also be similar.

• Second, the gesture should not involve uncomfortable or even painful postures or motions, which would attract one’s attention and distract from the task.

• Third, the tradeoff between learning curve and efficacy of the gesture must be considered carefully. A gesture that requires more learning, as keyboard shortcuts do, only pays off over a generic graphical user interface for the frequent and experienced user.

In this chapter, we address issues from the second bullet: comfort of hand postures and motions. We investigated the interaction range in front of a standing human in the transverse plane at about stomach height. This is of particular interest for our envisioned scenario in which a head-worn camera and display provide input and output capabilities to a wearable computer, for example, while walking or standing. The questions that directed our research were the following:

- What range is the hand likely to operate in?

- What percentage of people appreciate a more expansive interaction area?

- Does this range change with longer interaction times?

- When assuming direct, registered manipulation (versus the unregistered mouse- and pointer-style interaction), where shall items be placed to allow for convenient access with hand gestures?

- Where in the image should the computer vision methods search for the hand?

These issues are important at two levels of building a vision-based interface. At the implementation level, the computer vision methods can be optimized in a number of ways, for example, by restricting the search region for the hand. The hand detection stage of our HandVu vision interface takes these optimizations into account. At the user level, general acceptance depends to a large degree upon the ease of use and sustained positive experience with the interface. This satisfaction in turn depends on hand postures and motions that are convenient and comfortable to perform for a majority of the users. Our vision-based gesture interfaces described in Chapter 8 were designed with these recommendations in mind.

To find answers to these questions, we conducted a user study that employed a novel method to measure comfort. That method does not require the participants to fill in questionnaires but enables fine-grained, objective comfort assessment instead. The main idea is that study participants will naturally and intuitively select comfortable postures if they are given the chance to do so. This chapter summarizes our results, which are also available in two publications that were presented at the Human Factors Society’s annual meeting in 2003 [88, 89].

3.1 Postural comfort

This section describes the steps that lead to a quantifiable indication of postural comfort. The main concept is to allow compensation for uncomfortable postures through alternative body motions or postures and to measure under what conditions the compensation takes over. The goal of this operational definition is to lay out a generic method which can detect a sub-range corresponding to comfort within the full base range of what is physically possible. Observing the range of postures that participants intuitively assume will give an indication of their comfort zone.

3.1.1 Operational definition of comfort

Base: Given is an object under investigation (“OI,” a body part or joint) for which a comfortable sub-range of positions is to be determined during execution of a certain task. Select a base range of the OI’s possible positions, which can be either the anthropometrically feasible range (that is, the physical reach limit, for example) or an ergonomically sensible range (for example, the maximum non-fatiguing reach) for this body part or joint.

Compensation: A way to adjust the OI must be identified such that the OI can assume any location within its full base range without hampering task execution. This compensation can either be another body part or joint that can substitute for the movement of the OI, or it can be a way to adjust the experiment settings. For example, assume the OI is sideways head rotation (around the longitudinal axis). Then, compensation by another body part could be full-body rotation around the same axis. Or, assume users are performing a visual display terminal task (VDT task, reading on a monitor). A monitor that can be rotated around the body’s longitudinal axis is a device that can compensate for the head’s rotation by adjusting the experiment settings. It provides an alternative for head rotation in the longitudinal axis and, thus, is a compensation for the OI. Care must be taken to choose a compensation that does not require considerable effort from the participant. That is to say, it is important that these alternatives are not anthropometrically more “expensive” than the adoption of an uncomfortable posture.


Comfort Zone: In an experiment where the participants are free to use both the OI and the compensation as they desire, the sub-range which corresponds to the comfort zone will be determined. Design a set of tasks such that, if compensation were not permitted, areas in the base range of the OI’s positions would have to be assumed at sufficient sample density. The tasks must not evoke the participant’s desire to compromise posture for task performance. Now allow for compensation and observe which sub-range of the OI’s positions is still assumed, and which complementary region is compensated for by the alternative motion or posture. This boundary delineates the comfort zone, which is the main result of the procedure. Frequently, a second range can be observed: the positions assumed by the OI after compensating motion. Strictly speaking, this range is independent of the comfort zone. However, we generally expect it to fall within the limits of the comfort zone. This in fact validates and reinforces the result of the comfort zone and was observed in the experiment described in the following section.

The concept of devising experiments to determine comfort zones is illustrated below with two potential experiments.

Example 1: A potential experiment could include displaying text at locations around the head and having the participants read it aloud. For some locations, the participants will be more likely to turn their heads towards the text without moving their bodies, and for other locations they will compensate for extreme head rotations by rotating their entire bodies. The range of assumed rotational head-to-body offsets corresponds to the comfort zone for this OI.

Example 2: A comfortable arm and hand motion is one that requires little or no trunk motion, provided that the trunk is free to move. Presenting a set of targets at various locations will elicit trunk motions for some locations, while others will not. The comfort zone is the range of targets which elicits little or no trunk motion.

3.2 The comfort zone for reaching gestures

We determined the physical range in which humans prefer to perform fine motor hand motion through a user study that is described in the following. The important distinctions between this study and those in previous human factors research are the following. First, the participants had to sustain only the weight of their arms and hands, as opposed to carrying additional weight. Second, as our participants were naïve to the study’s purpose, we were able to expect that demand characteristics of the study itself did not play a significant role in influencing our results. In fact, we measured an objective quantity as opposed to acquiring subjective, questionnaire-based data as was done traditionally.


3.2.1 Method and design

According to our definition of comfort, a comfortable arm and hand motion is one that requires little or no trunk motion, provided that the trunk is free to move. Presenting a set of targets at various locations will elicit trunk motions for some locations, while others will be reached solely by arm and hand gestures. The comfort zone is indicated indirectly as the range of targets which elicits little or no trunk motion. We hypothesized that there is such a range that satisfies most users; that is, users would prefer to operate (are comfortable) therein.

We used a 13 x 2 within-participants design that studied the effect of 13 target locations and 2 trial durations. The locations were chosen to optimally sample the frontal transverse plane at stomach height across angle and distance with respect to the right shoulder joint. These locations are depicted as little circles in Figure 3.1, which shows a sketch of the experiment area from a bird’s eye view. In every trial, a trackball mounted on a tripod was placed at one of the 13 locations. The participants had to perform a task that required their hand to remain suspended in this position for five or twenty seconds. In Figure 3.2 a participant can be seen performing the skill task on the trackball.

The dependent variables were the locations of the participants’ shoulder joint and of the hand’s palm top. Both trajectories were measured in 3D space with electromagnetic trackers throughout each trial. Of particular interest were two derived measures from these variables: the shoulder motion from the initial starting position to the location at the end of each trial, and the distance between hand and shoulder.

Figure 3.1: Plan view of the experiment setup: shown are the 13 target locations (open circles at line intersections) and the size of the displayed object for the skill task on the right. Rays indicate the six azimuthal directions (-55°, -30°, -5°, 20°, 45°, and 70°); concentric partial circles depict the four radial distances (20, 33, 46, and 60 cm). Also shown are two measured values: the initial shoulder locations (dot cluster at the coordinate system origin) and the hand locations during the trials (dot clusters left of the targets).

3.2.2 Participants

Three females and four males from the campus community, ranging in age between 21 and 30 years, participated in this study. All were naïve to the purpose and predictions of the study. Prior computer exposure varied from “email and web” to computer science majors. All participants reported being right-hand dominant, and we recorded their body, shoulder, and elbow heights from the floor. The participants were compensated for their efforts with material goods valued at approximately USD 5.

Woodson [185] delineates reaching distances for arms in seated positions. After converting his head-centric coordinate system to our shoulder-centric one, we obtain the following percentiles: the 5th percentile of reaching distance is at 61cm, the 50th at 66cm, and the 95th at 70cm. Taking into account about 7cm for the distance between the target reached with the curled fingertips and the tracker location on top of the palm, our participants’ physiques spread from the 10th to the 90th percentile, thus being a representative sample of the population.


3.2.3 Materials and apparatus

The participants had one Ascension electromagnetic six-degrees-of-freedom tracker attached to the right palm top and one above the right shoulder joint. For each participant, a starting foot position was established and marked on the floor such that the shoulder joint was at approximately the same horizontal location for all participants. This is the origin of the coordinate system for all further discussion. The target object was the ball of a conventional computer trackball. The choice of this device is irrelevant to the study because its only objectives were a) to require the participants to keep their hands suspended in the air and b) to distract the participants from this fact. Furthermore, various input devices fare very similarly with regard to perceived discomfort of use (see Kee [82]).

The skill task required participants to turn the trackball in both directions around the horizontal axis, that is, forward and backward from the participant’s point of view (roll). The goal was to keep a virtual object from rotating beneath a virtual ground level. The object exhibited a randomly changing velocity and acceleration of its rotational speed around an anchor point on the ground. This rotation had to be counteracted by the participant. The nature of the skill task itself is not relevant to the study; its purpose was merely to keep the participants occupied and distracted from the study objective. The object was projected with a video projector on the wall about 2.5 meters from the participant’s initial position. Its size was 80 centimeters in diameter. It was rendered in real-time with OpenGL at 60Hz. A countdown timer for the time remaining in each trial was shown to the right and an error counter to the left of it. A few pictures from the setup can be seen in Figure 3.2.

Figure 3.2: A participant performing the skill task: note one tracker attached to the participant’s shoulder and one on his hand. Projected in front of him is the virtual object, a spiky sphere, which had to be balanced with the trackball.
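The disturbance that participants had to counteract can be pictured as a simple random walk on the object’s angular velocity. The following sketch illustrates that idea only; all constants are made up, since the actual parameters of the OpenGL task are not reported here:

    import random

    def rotational_disturbance(duration=20.0, dt=1.0 / 60, sigma=0.5):
        # Yields (time, angle) pairs of a randomly accelerating rotation around
        # the anchor point; the participant's trackball input would have to
        # cancel this rotation to keep the object above the virtual ground level.
        t, angle, velocity = 0.0, 0.0, 0.0
        while t < duration:
            velocity += random.gauss(0.0, sigma) * dt   # randomly changing acceleration
            angle += velocity * dt                      # integrate to the rotation angle
            yield t, angle
            t += dt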

To prevent participants from pre-planning their movements, they were required to close their eyes during setup between all trials. This was mandated since pre-planning can change the way motions are performed, see Rosenbaum et al. [147]. To also eliminate sound cues to target placement, participants wore mono headphones that rendered spatial sound cues ineffective.


Two constants were defined. First, body sway and body rotation cause shoulder movements. Participants moved their shoulders no more than 9cm without taking a step forward or backward. We defined this as the threshold for significant compensating movement, tnc = 9cm. Second, the offset between the tracker location on the hand and the fingertips (which had to reach the target) was also constant: we placed the tracker at a distance of dtf = 7cm. This amount was also observed in the data.

3.2.4 Procedure

A trial consisted of one rotate task executed for 5 or 20 seconds at one of the 13 trackball locations. After concluding a trial, the participants were told to momentarily walk to a location about one meter from the starting location, and thereafter go back to the initial position and close their eyes. This was implemented after pilot data showed a tendency for participants to move less and less over the course of subsequent trials. We speculated that promoting a general level of walking would help reduce this ostensible impedance. The trackball was repositioned to the next location and the countdown timer reset to either 5 or 20 seconds. The start of a new trial was indicated by vocal announcement. The participants would then open their eyes, move towards the target (trackball), and execute one rotate task. The target locations and motion durations were randomized.


During each trial, after opening their eyes, participants were free to move around. In particular, they were allowed to move their bodies towards or away from the target. To reinforce this possibility for compensating motion, we positioned the trackball well out of arm’s reach for a few test trials before the beginning of data collection. The entire experiment lasted about 1.5 hours per participant for three repetitions, including a 5-10 minute break after 50 minutes.

3.2.5 Instructions to participants

Participants were told that the study tested motor skill levels at various locations and that they should focus on the rotate task and produce as accurate a performance as possible. In the pilot study, participants had slid their palms back and forth on the trackball, causing arm weight offload onto the trackball mounting structure. Therefore, they were also instructed to use a “finger-walking” motion on the trackball to minimize the detrimental effect of weight offload. It was made explicit that the time between the instructor’s start command and the participants starting to rotate the trackball did not matter. The intention was to avoid participants compromising comfort for speed. It was in fact observed that participants took different amounts of time to “settle” before they touched the trackball.

The dialog for each trial was:

(after having completed the previous trial)

instructor: “please move out of the tracker range, then go to the starting position and close your eyes”

participant: (does so) “eyes closed”

instructor: (repositions trackball) “stand still please”

participant: (stands motionless)

instructor: (starts recording) “go!”

participant: (opens eyes and starts the trial)

3.2.6 Results

Fairly consistently, the participants’ body positions remained constant shortly (500ms) after their hands had reached the target. Thereafter, very little motion of hand and body was observed until the end of the trial. We now define a few symbols for the sake of a thorough discussion. Let tE be the time at the end of each trial. Let db be the amount of body movement, that is, the distance between the initial position of the shoulder and the shoulder’s position at time tE. More precisely, db is defined as only the component of this movement along the vector from the origin to the target location. Negative numbers indicate a movement backwards, away from the target.
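Computed from the tracker data, db is simply a signed projection. The following sketch shows one way to compute it, assuming the initial shoulder position is used as the coordinate origin; it is an illustration, not the analysis code used for this study:

    import numpy as np

    def body_movement_db(shoulder_start, shoulder_end, target):
        # All arguments are 3D positions in cm from the electromagnetic trackers.
        # Returns the signed component of the shoulder displacement along the
        # direction from the origin (initial shoulder position) to the target;
        # negative values mean the participant moved away from the target.
        s0 = np.asarray(shoulder_start, dtype=float)
        direction = np.asarray(target, dtype=float) - s0
        direction /= np.linalg.norm(direction)               # unit vector toward the target
        displacement = np.asarray(shoulder_end, dtype=float) - s0
        return float(np.dot(displacement, direction))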

Figure 3.3 shows contour lines (isolines or “isocomfort”) for medians of |db| over the entire study area, interpolated with Matlab’s v4 algorithm, which produces smoother results than cubic or spline interpolation. The central region about 35-45 cm from the shoulder joint is clearly visible as the region of least body movement (1cm isoline). The participants engaged in compensational motion of less than 1cm for targets positioned in this area. Thus it is the most comfortable region for gesture interaction in the horizontal plane at about elbow or stomach height. The entire area at this radial distance around the shoulder is also highly preferred: the median body movement was less than 2cm (2cm isoline). The variation in compensational motion was not significant across different angles (p=0.46). The standard deviation of absolute body movement shown in Figure 3.3 confirms a high consistency for all participants in this area. Results were consistent (p > 0.5) throughout the repetitions and naturally showed a high significance in the target distance parameter (p < 0.001).

Figure 3.3: Mean and standard deviation of body movement: absolute body movement |db| in cm is considered. The rays and coordinate system are identical to Figure 3.1.
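Readers who wish to reproduce such isoline plots from the 13 sampled locations can interpolate the per-target medians onto a regular grid. Note that SciPy offers no exact counterpart of Matlab’s v4 (biharmonic) method, so cubic interpolation is used below as a stand-in; this is a sketch only:

    import numpy as np
    from scipy.interpolate import griddata

    def isocomfort_grid(target_xy, median_abs_db, resolution=100):
        # target_xy: (13, 2) array of target locations in the transverse plane (cm).
        # median_abs_db: median absolute body movement |db| per target (cm).
        pts = np.asarray(target_xy, dtype=float)
        xs = np.linspace(pts[:, 0].min(), pts[:, 0].max(), resolution)
        ys = np.linspace(pts[:, 1].min(), pts[:, 1].max(), resolution)
        gx, gy = np.meshgrid(xs, ys)
        gz = griddata(pts, np.asarray(median_abs_db, dtype=float), (gx, gy), method="cubic")
        return gx, gy, gz   # contour with e.g. matplotlib: plt.contour(gx, gy, gz, levels=[1, 2, 5])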

One could expect that the absolute size of the comfort zone for a particular person depends only on the person’s physical reach, that is, the person’s actual arm length (as measured with a tape). However, our experiments could not confirm this. Similarly, the observed greatest or median reaching distances for a particular participant did not play a significant role either. On the contrary, the comfort zones that we measured seem to be a function of personal preference or habit.

Let dhs be the distance between hand and shoulder at time tE. Note that dhs is measured from the shoulder to the tracker attached to the palm top, not to the location of the fingertips. Figure 3.4 shows, for all trials at -30° azimuth, the distance dhs on the y-axis, plotted over the distance db on the x-axis.

Figure 3.4: Hand-to-shoulder distance over body movement: for four target distances at -30° azimuth (to the right). The comfort zone is defined only for dhs, as cz− ≤ dhs ≤ cz+.

The linear relationship between body movement and hand-to-shoulder distance is clearly visible in the diagonal arrangement of the data for each target distance dt: either the hand reaches further towards the target (larger dhs) and the compensating body movement db is smaller, or vice versa. In every case, however, the sum db + dhs roughly equals the target distance (minus the tracker-to-fingertip offset dtf). Figure 3.4 also shows that some participants chose to use the compensating motion, that is, to take a step forward or backward, for target distances dt ∈ {20, 46, 60}. The respective clusters of data points spread to the left and right outside of the range of body movements which are possible without compensating stepping, |db| < tnc.

There are two important observations to make. First, these participants take a step if the target is outside their comfort zone for hand-to-shoulder distances, cz− ≤ dhs ≤ cz+. This range marks the limits of the participant’s comfortable reaching distance which, when exceeded, is compensated for by body movement. Second, the hand-to-shoulder distance dhs that these participants assume thereafter is again within a tight range. This can be observed in the data points to the left and right of the tnc threshold lines: they are confined to within the two dashed horizontal lines. It is critical to note that, strictly speaking, these two ranges are independent of each other, yet they seem to correspond strongly. This indeed indicates the existence of a “comfortable interaction range.” In other words, if a participant takes steps for dt ∉ K for some K ⊂ ℜ, then dhs ∈ K is assumed thereafter. For reaching straight forward (-5° azimuth) we found cz− = 23cm and cz+ = 38cm.
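One straightforward way to extract such a range from the recorded trials is to look at the hand-to-shoulder distances of the no-step trials only. The sketch below does exactly that; the percentile bounds are an illustrative choice, not the procedure actually used to derive the numbers above:

    import numpy as np

    TNC = 9.0   # no-step threshold for shoulder movement in cm (Section 3.2.3)

    def estimate_comfort_zone(db, dhs, lower_pct=5, upper_pct=95):
        # db:  per-trial body movement in cm (signed, along the target direction).
        # dhs: per-trial hand-to-shoulder distance in cm at the end of the trial.
        # Returns (cz_minus, cz_plus), the spread of dhs over trials in which the
        # participant did not take a compensating step (|db| < TNC).
        db = np.asarray(db, dtype=float)
        dhs = np.asarray(dhs, dtype=float)
        no_step = np.abs(db) < TNC
        return tuple(np.percentile(dhs[no_step], [lower_pct, upper_pct]))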

The two interaction durations turned out to be too similar: no significant difference in the participants’ behavior was found between 5 and 20 second trials (p=0.60). However, our guess is that with interactions longer than a few minutes participants will start changing their postures slightly during the trials.

Figure 3.5: The comfort ratings in front of the human body: the isolines depict the percentiles of all study participants’ trials in which the absolute body movement was less than the no-step threshold tnc = 9cm.

The main results of this user study can be explained with Figure 3.5. It shows, again in bird’s-eye view, the 13 target locations as points denoted by little circles along the six rays emanating from the coordinate system’s origin. The origin of the coordinate system is the shoulder location at the beginning of each trial. The participants are facing to the right. The isolines depict the percentage of all participants’ trials that did not evoke compensating body motion, that is, in which participants did not step from their starting position (thus, |db| < tnc). According to our operational definition, this behavior corresponds to the participants’ comfort zone for hand and arm postures. For example, to reach a target located straight forward from the right shoulder joint at a distance of about 45 cm, about 85 percent of the time the participants did not take a step to alleviate an uncomfortable reaching gesture and were thus within their comfort zone.
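The quantity plotted in Figure 3.5 is easy to recompute from the trial data. A minimal sketch, again assuming the tnc threshold from Section 3.2.3 and a simple per-target aggregation rather than the exact plotting pipeline, follows:

    from collections import defaultdict

    TNC = 9.0   # cm, no-step threshold

    def no_step_percentage(trials):
        # trials: iterable of (target_id, db) pairs, with db in cm.
        # Returns, per target location, the percentage of trials in which the
        # participant stayed within the no-step threshold (|db| < TNC).
        counts = defaultdict(lambda: [0, 0])          # target -> [no-step trials, all trials]
        for target, db in trials:
            counts[target][0] += abs(db) < TNC
            counts[target][1] += 1
        return {t: 100.0 * ok / total for t, (ok, total) in counts.items()}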

3.3 Discussion

3.3.1 The meaning of comfort

While our definition of comfort clearly describes how to measure a quantity, the relationship between this quantity and (dis)comfort is not inherent. In other words, the comfort zone, defined as the area of the most comfortable motions or postures for a given task, does not predicate an absolute measure of well-being. Further experiments have to be designed to study this relationship. For the purpose of evaluating biomechanical postures with respect to their relative sustainability, however, our definition delivers the desired results: users within their comfort zone are unlikely to change into other postures.

Similarly, com<strong>for</strong>t does not effectuate a risk-free posture. Anthropometric<br />

soundness of a posture or motion has to be established with complementary means.<br />

Also, personal differences might cause a generally com<strong>for</strong>table <strong>and</strong> safe posture<br />

78


Chapter 3. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in the Human Context<br />

to be uncom<strong>for</strong>table or even risk-afflicted <strong>for</strong> select individuals. This is inherent<br />

in results that can only be stated as population percentiles. Furthermore, a gen-<br />

eral impairment in a person’s mobility might negatively affect the quality of our<br />

com<strong>for</strong>t measure.<br />

3.3.2 Comfort results and related work

Here, we will put our results in the context of two previous studies on postural workload and discomfort. Note, however, that our method does not usually evoke a conscious experience of discomfort, as is essential for all other methods. Please refer to the related work in Section 2.2 for a more general embedding of our research into the body of literature.

After adding the tracker-to-fingertip offset dtf = 7 cm to cz− = 23 cm and cz+ = 38 cm (at -5° azimuth, almost straight forward), our findings correspond to recommendations on workspace design and tool or materials positioning: Grandjean [51] suggests an optimal placement within a 35-45 cm radius from the lowered elbow. Our study goes further as it not only allows recommendation of an optimal reaching distance, but also quantification of how well this and other areas are likely to be experienced by the human. Figure 3.5 details this result of our study. To build comfortable gesture interfaces, designers should stay within the 95th or at least the 90th percentile to accommodate most users.


If the participants were not allowed the compensating motion, the arm motions necessary to reach all target locations in our study would provoke arm-joint discomfort scores in the range of 1 to 8 according to Chung et al. [26], where a higher number reflects greater discomfort. After allowing for compensating motion, no participant assumed postures with a score higher than 1. This shows that our findings are in line with these previous results as well. Our study delivered more fine-grained results, however, as those scores are discrete values that stem from discrete intervals of joint angles: 0-45 degrees produce a score of one, 45-90 degrees a score of three, and so on. Thus, the comfort zone is a true subspace of the area of no experienced discomfort.

Our definition does not rely on physiological data about muscle fatigue but instead considers postural comfort, which, to the best of our knowledge, is not physiologically manifested. Also, most human factors work is targeted towards decreasing the risk of musculoskeletal injuries resulting directly from the position actually assumed during task performance. Our work, however, aims at finding postures that do not motivate the desire to change posture, thus eliminating the risk of postures that the task designer has not anticipated.


3.3.3 Miscellaneous

We did not collect subjective, questionnaire-based user discomfort data for three reasons. First, we did not expect them to deviate from the previous studies' results. Second, the information gathered with conventional methods would have been too coarse for a meaningful comparison. Third, it is essential to the comfort definition that participants are oblivious to the study objective. Intermittent data collection is therefore prohibitive. Post-trial data collection would require either recall of 26 experiment configurations (an unlikely feat) or re-execution of them with interleaved evaluation, which again would interfere with the participants' naïveté towards the purpose of the compensational motion.

User interfaces that utilize both hands provide many benefits over single-handed interaction [59, 134]. The comfort zone for the non-dominant hand is expected to be very similar to the mirrored image of the comfort zone that we observed for the dominant hand. However, when both hands are to be used concurrently in a user interface, special considerations might be necessary.

3.3.4 Open issues

Our definition of comfort represents a novel evaluation tool for the detailed investigation of human postures and motions. The following are directions for further research that are, however, outside the scope of this dissertation.


• The relation between the effort for the compensating motion/posture and the effort for the primary motion/posture might influence the extent of the comfort zone. For example, if the compensation is physically too costly, the study participant might choose to put up with the uncomfortable primary motion or posture. A related question is how the effort of assuming a certain posture compares to the effort of executing a certain motion, and what implications this has for compensating for one with the other.

• Postural shifts during long-term stationary postures such as sitting indicate some degree of discomfort; see Liao and Drury [104]. More frequent shifts indicate an increase in discomfort. If shifts occurred more frequently for postures outside the comfort zone than for "comfortable" postures, that would be an independent indication of the validity of our comfort measure.

• The temporal evolution of comfort and discomfort is largely unknown. It is unclear for how long people can be comfortable within a certain motion range or posture, as well as what the preferable remedies are for this temporally acquired discomfort. The size of the comfort zone for very long-term postures and motions could remain unchanged or it could shrink. In any case, the validity of the method to determine the comfort zone would be assured if the aforementioned increase in postural shifts happened with different magnitudes for areas inside the comfort zone than for areas outside.

• The link between our definition of comfort and observed postures that pose a health risk is not yet established. An experiment designed to show this link for certain postures would make the proposed theory an even more important contribution. The objective is to determine whether participants who perform a task that is outside their comfort range will assume postures that are known to compromise their health. On the other hand, for a positive link to be established, they would have to perform a task that is within their comfort range without assuming those postures.

3.4 Conclusions

The comfort assessment method described in Section 3.1.1 provides a principled approach for identifying the comfort zone of bodily motions and postures. The particular study example of hand reach (Section 3.2) was chosen because it surveyed an important aspect of the space that is likely to be chosen for a hand gesture interface. The study results define a fine-grained, two-dimensional comfort function over the area in front of the body: the optimal distance for hand placement is within a half-moon-shaped area about 35 to 45 centimeters from the shoulder joint, at an angular range from 70 degrees adduction to 50 degrees abduction (away from the body center). Researchers and designers of gesture interfaces should primarily make use of this comfort zone. In general, novel interfaces should be evaluated for comfort or they could expose users to risk-fraught, unanticipated use patterns.

The results of this study were used in the remainder of this dissertation work in the following ways. 1) The camera's field of view includes the entire comfort zone in front of the human body. 2) Hand detection occurs in the most comfortable interaction area. 3) The pointer-based interaction style is preferred whenever possible, allowing hand movements in a dynamically determined area due to the input-to-output coordinate translation. 4) Also for the pointer-based interaction style, the input range is scaled to a larger output range, thus allowing smaller hand movements that do not exit the comfort zone and still reach all interaction elements with the pointer.



Chapter 4

HandVu: A Computer Vision System for Hand Interfaces

Hand gestures can be recognized with various means and varying fidelity. Data gloves, for example, are gloves equipped with bend-sensing elements that can accurately report the intrinsic parameters of the hand: flexion/extension and abduction/adduction of the various joints. A position- and orientation-sensing device, mounted on the wrist of the glove, can track the hand's extrinsic parameters with six degrees of freedom (DOF). These devices set the high bar for body posture estimation. However, they require gear to be worn on the hand and usually some fixed-mounted infrastructure.

Computer vision-based approaches hold the promise of avoiding these disadvantages. While the presence of a camera is inevitable, be it mounted on the ceiling or strapped to the wrist, the observed body part is unhindered by cloth or worn gear. In addition, cameras worn on the body allow for mobility of the recognition system and thus for use of the data as input to wearable devices.

This chapter presents the structure and main characteristics of the computer vision system that we built. This software system is capable of detecting the human hand in monocular video, tracking its location over time, and recognizing a set of finger configurations (postures). It operates in real time on commodity hardware and its output can thus function as a user interface. We will first describe the physical setup, then the software organization, and finally the characteristics of the system as a whole.

4.1 Hardware setup

A camera is assumed to be worn on the forehead, facing forward and downward to cover the lower quarter-sphere in front of the body that the hands operate in. This configuration is advantageous because gestures performed in the observed space include the most convenient hand/arm postures, as detailed in Chapter 3. Furthermore, that location allows mounting the camera atop a head-worn display, as is frequently employed for virtual and augmented (mixed) reality applications (see Chapter 8). The co-location can in turn be exploited to realize video see-through capabilities by feeding the recorded video to the display in real time. This is a popular way of facilitating augmented reality.

Figure 4.1: Our mobile user interface in action: all hardware components aside from display and camera are stowed in the backpack.

Output is realized through a head-worn display (HMD, in our case Sony Glasstron LDI-A55 glasses), atop which we mounted a small digital camera (FireFly, Point Grey Research); see Figure 4.1. The camera has a horizontal field of view (FOV) of 70 degrees. The live video stream, augmented with the application overlay described in Chapter 8, is fed into the display to achieve video see-through mixed reality. This alleviates problems with the HMD's small 30 degree FOV because it makes a 70 degree FOV available to the wearer. The resulting spatial compression takes users a few minutes to get used to, but seems quite natural after that time. Use of this fish-eye-style lens reduced the tunnel effect that most optical see-through mixed reality displays exhibit. The high FOV is also important for interface functionality because both the hands and a more forward-facing view direction are within the FOV, which allows direct feedback as well as a registered interaction style (see Section 8.3). Furthermore, the FOV encompasses the entire comfort zone as discussed in the previous chapter, leaving the possibility for the interface designer to leverage the full area of convenient hand motions.

Note that no other input device such as a Twiddler keyboard or 3D mouse has to be used. Instead, the input and output interface is combined into a single head-worn unit. The other logical component of the system, a laptop plus a few adapters and batteries, is stored away in a conventional backpack. Overall, this makes for a fairly easy-to-assemble and relatively inexpensive mobile computer.

4.2 Vision system overview

The software system that realizes the vision-based hand gesture recognition and allows for its utilization as a user interface consists of a number of software components that will be described in the following. HandVu (pronounced "hand-view") is a library and the core gesture recognition module that implements all of the computer vision methods for detection, tracking, and recognition of hand gestures. This module receives the video feed from a DirectShow pipeline and supplies the gesture results to an MFC application. This application, called HandVu WinTk, handles pipeline initialization and implements convenience functions. In addition to these runtime components, there is also an offline module that implements AdaBoost training for the detection and recognition components. The training module is described in Chapter 5.

4.2.1 Core gesture recognition module

The core module is a combination of recently developed methods with novel algorithms to achieve real-time performance and robustness. Careful orchestration and automatic parameterization are largely responsible for the high-speed performance, while multi-modal image cue integration guarantees robustness.

There are three stages: the first stage detects the presence of the hand in one particular posture. (It is undesirable to have the vision interface always active since coincidental gestures may be interpreted as commands. Also, processing is faster and more robust if only one gesture is to be detected.) After this gesture-based activation, the second stage serves as an initialization to the third stage, the main tracking and posture recognition stage.

This multi-stage approach makes it possible to take advantage of less general situations at each stage. Exploiting spatial and other constraints that limit the dimensionality and/or extent of the search space achieves better quality and faster processing speed. We use this at a number of places: the generic skin color model is adapted to the specifics of the observed user (see Section 5.9 in Chapter 5), and the search window for posture recognition is positioned with fast model-free tracking (see Chapter 6). However, staged systems are more prone to error propagation and failures at each stage. To avoid these pitfalls, every stage makes conservative estimations and uses multiple image cues (grey-level texture and local color information) to increase confidence in the results.

HandVu, the core vision component, is entirely platform independent and its only necessary dependency is the OpenCV library.¹ The Maintenance Support application (see Chapter 8) that is built into the HandVu WinTk MFC component requires the Magick image library for loading the icon overlays; it is started on demand only. Image operations are kept scalable to different frame sizes as much as possible.

HandVu serves as a library for gesture recognition that can be built into any application that demands a hand gesture user interface. However, it does not handle any platform-specific operations such as image acquisition or display, and thus requires some programming before it can be used. Section 4.2.7 describes the WinTk application that embeds the library to provide a set of versatile and easy-to-access interfaces to the gesture recognition results.

¹ However, using Intel's Integrated Performance Primitives (in particular the Image and Video Processing part, formerly IPL) as the OpenCV subsystem increases performance on Intel platforms.

Figure 4.2: Arrangement of the computer vision methods: only on successful hand detection will the tracking method start operating. Posture recognition is attempted after each tracking step. If successful, features and color are re-initialized.

The final output of the vision system indicates for every frame the 2D location of the hand if it is tracked, or that it has not been detected yet. Chapter 6 defines what exactly is meant by the "hand location." The location of a second hand within view can also be determined in certain cases. If the dominant hand's posture is recognized, it is described with a string identifier as a classification into a set of predefined, recognizable hand configurations. HandVu's API and the various ways of obtaining output from the system are described in Section 4.2.5.

The diagram in Figure 4.2 and the following paragraphs briefly introduce the components of our vision system and their interactions. More detail can then be found in the following chapters.


Hand Detection

The initial stage of the vision system attempts detection of the hand in a particular posture. Since hands are frequently over-exposed in comparison to the background, the vision system performs automatic exposure correction for an area smaller than the entire image (see Section 4.2.2). To facilitate this, the hand is only detected in a rectangular region that can be specified by the interface designer. For our applications, we chose the central part of the comfort zone as discussed in Chapter 3 to function as the hand detection area. If the detection was not successful, vision processing for the current frame ends here. If the hand was detected, the location of a 2D bounding box is sent to the tracking initialization stage. More information on how the detection is facilitated can be found in Chapter 5.

Tracking initialization

The observed hand color is then learned in a color histogram, given the bounding box location and a probability map that specifies the likelihood that pixels within that area belong to the hand. This color is contrasted with a reference set of background image areas. Next, a "Flock of Features" is placed on what is believed to be the hand. (See Chapter 6 for tracking details.) No further processing is done on the current video frame, and the following frames are sent to the third stage, introduced in the next paragraph.

Figure 4.3: A screen capture with verbose output turned on: the image is partially color-segmented, illustrating how skin color by itself is not a reliable modality. In color prints, the green KLT features are also visible. This shot was taken while walking.

Tracking and recognition

The Flock of Features follows small grey-level image artifacts. A weak global constraint on the features' locations is enforced, keeping the features tightly together. Features that are not likely to still be on the hand area are relocated to the close proximity of the remaining features, onto an area with high skin color probability. This technique integrates grey-level texture and dimensionless color cues, resulting in more robustness towards tracking disturbances caused by background artifacts. From the feature locations, a small area is determined and scanned for the key postures for which recognition is attempted. Posture recognition is described in detail in Chapter 7, while all tracking aspects are covered in Chapter 6. If the posture recognition succeeds, the feature locations and the color lookup table are re-initialized as described in the previous paragraph.
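Schematically, this staged control flow can be summarized as in the sketch below. The state and function names are illustrative placeholders rather than HandVu's internal identifiers; the actual detection, color learning, tracking, and recognition steps are the subject of Chapters 5 through 7.

    // Illustrative control-flow sketch of the staged recognition; the state and
    // function names are placeholders, not HandVu's internal identifiers.
    typedef struct _IplImage IplImage;                     // OpenCV image type, forward-declared

    bool DetectHandInActivationPosture(IplImage* frame);   // detection, Chapter 5
    void LearnForeAndBackgroundColor(IplImage* frame);     // color histogram, Section 5.9
    void PlaceFlockOfFeatures(IplImage* frame);            // tracking initialization, Chapter 6
    void TrackFlockOfFeatures(IplImage* frame);            // tracking, Chapter 6
    bool RecognizeKeyPosture(IplImage* frame);             // posture recognition, Chapter 7

    enum Stage { DETECT, TRACK_AND_RECOGNIZE };

    void ProcessOneFrame(IplImage* frame, Stage& stage) {
      if (stage == DETECT) {
        // Stage 1: wait for the hand in the activation posture.
        if (DetectHandInActivationPosture(frame)) {
          // Stage 2: initialize tracking on the same frame, ...
          LearnForeAndBackgroundColor(frame);
          PlaceFlockOfFeatures(frame);
          stage = TRACK_AND_RECOGNIZE;                     // ... then hand subsequent frames to stage 3.
        }
      } else {
        // Stage 3: track the flock, then attempt posture recognition near it.
        TrackFlockOfFeatures(frame);
        if (RecognizeKeyPosture(frame)) {
          LearnForeAndBackgroundColor(frame);              // successful recognition re-initializes
          PlaceFlockOfFeatures(frame);                     // the color model and the feature locations
        }
      }
    }

Collapsing detection and tracking initialization into one branch reflects that the initialization operates on the same frame in which the hand was detected.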

4.2.2 Area-selective exposure control

The automatic exposure control that most digital cameras perform does not suit the purposes of a vision-based hand gesture interface. Ideally, the hand would always be perfectly exposed. Yet the cameras optimize exposure for the entire image area, not just where the hand is located. We therefore designed and implemented a software-based exposure correction function that only considers a sub-area of the frame. For it to work, the camera (or another component in the video pipeline) must provide a simple interface to read and set its exposure level.

    class CameraController {
    public:
      virtual double GetCurrentExposure() = 0;               // [0..1]
      // true if change has an effect, false if step is too small
      virtual bool SetExposure(double exposure) = 0;          // [0..1]
      virtual bool SetCameraAutoExposure(bool enable=true) = 0;
      virtual bool CanAdjustExposure() = 0;
    };

For example, the DirectShow interface IAMCameraControl that many camera source filters expose can fulfill these demands with only minute changes. The correction algorithm runs periodically, currently every 500 or 1000 milliseconds. It counts the number of highly exposed pixels within the current scan area, where a highly exposed pixel has 80% or more of the maximum brightness scale. If that number is greater than max_frac = 30% of the total number of pixels within the scan area, the image area is considered over-exposed. In that case, the exposure time per frame is corrected by a factor of max_frac over the current percentage, bounded to a factor greater than 0.75 to avoid abrupt changes.

Under-exposure is corrected for if less than min_frac = 10% of the pixels are within the upper 20 percent of the brightness scale. Exposure time is then extended by a factor of 1.1 over the current percentage of pixels in the top 20 percent, bounded to a maximum correction factor of 1.25. See Figure 4.4 for a pseudo-code notation of the algorithm.

Setting a new exposure level might or might not result in an actual adjustment to the camera because of the camera's exposure level resolution. To obtain the exact effect, the new exposure setting has to be checked after a call to SetExposure. However, as a coarse measure, the function returns true if the CameraController changed its exposure.

    input:
        bbox    : scan area
        img     : grey-level image, values [0..1]
        exposure: last computed exposure level
    constants:
        bright_exp = 0.8
        max_frac   = 0.3
        min_frac   = 0.1
    algorithm:
        bright_pixels = num pixels in bbox with brightness >= bright_exp
        bright_pixels = bright_pixels / area(bbox)
        correction_factor = 1.0
        if (bright_pixels > max_frac)
            correction_factor = max(0.75, max_frac/bright_pixels)
        else if (bright_pixels < min_frac)
            correction_factor = min(1.25, 1.1*min_frac/bright_pixels)
        exposure = exposure * correction_factor

Figure 4.4: Pseudo-code of the area-selective exposure correction algorithm.

While no formal experiment has been conducted, subjective image results show a large reduction of strongly over-exposed pixels within the area of adjustment. As a direct result of our area-selective software adjustment, the detection method consistently (over time and with different cameras) succeeds in lighting conditions in which it consistently fails when relying on the cameras' built-in exposure control mechanisms.
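For concreteness, one periodic correction step could be driven through the CameraController interface roughly as follows. This is only a sketch: the helper names and the plain grey-level buffer are assumptions, and the under-exposure expression mirrors the reconstructed branch of the pseudo-code above rather than HandVu's actual implementation.

    #include <algorithm>

    // Illustrative translation of the pseudo-code above into calls against the
    // CameraController interface; names, the grey-level buffer layout, and the
    // under-exposure expression are assumptions, not HandVu's implementation.
    struct ScanArea { int left, top, right, bottom; };

    void CorrectExposureOnce(const unsigned char* gray, int img_width,
                             const ScanArea& box, CameraController* cam) {
      if (cam == 0 || !cam->CanAdjustExposure()) return;

      const double bright_exp = 0.8, max_frac = 0.3, min_frac = 0.1;
      int bright = 0, total = 0;
      for (int y = box.top; y < box.bottom; ++y)
        for (int x = box.left; x < box.right; ++x, ++total)
          if (gray[y * img_width + x] >= bright_exp * 255) ++bright;

      const double frac = (total > 0) ? double(bright) / total : 0.0;
      double factor = 1.0;
      if (frac > max_frac)                        // over-exposed: darken, bounded below by 0.75
        factor = std::max(0.75, max_frac / frac);
      else if (frac > 0.0 && frac < min_frac)     // under-exposed: brighten, bounded above by 1.25
        factor = std::min(1.25, 1.1 * min_frac / frac);

      if (factor != 1.0)
        cam->SetExposure(std::min(1.0, cam->GetCurrentExposure() * factor));
    }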

4.2.3 Speed and size scalability

The vision module adapts to the available processing power of the hardware it is running on. This is important for the responsiveness of the user interface, and necessary because some computations can take longer than the time between two successive frames.

Incoming frames are tagged based on their latency, which is the time that has passed between frame capture and the frame's arrival at HandVu's processing module. If this time is less than a threshold t_max_normal_latency, the frame is tagged to be fully processed. For the most part, this turns on the detection and recognition subroutines. If the frame's latency is greater than that but less than a second threshold t_max_abnormal_latency, the frame is tagged to be "skipped." This means that partial processing is done and the HandVu-wrapping application is recommended to also perform only minimal processing steps before displaying the video frame. If HandVu is tracking the hand, it will perform an update step on the Flock of Features. If the hand has not been detected, HandVu will not do any processing. Lastly, if the frame arrives with more than t_max_abnormal_latency delay, only Flock of Features tracking is performed and the frame should be dropped by the wrapping application. This behavior is transparent to applications that are connected through the event server.
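The tagging decision itself is a simple threshold comparison; a schematic sketch, using the RefTime type and HVAction values introduced in Section 4.2.5 and purely illustrative threshold values, might look as follows.

    // Schematic sketch of the latency-based frame tagging; RefTime and the HVAction
    // values are introduced in Section 4.2.5, and the two thresholds below are
    // illustrative numbers, not HandVu's actual settings.
    HandVu::HVAction TagIncomingFrame(RefTime capture_usec, RefTime now_usec) {
      const RefTime t_max_normal_latency   = 100 * 1000;  // e.g. 100 ms (assumed)
      const RefTime t_max_abnormal_latency = 400 * 1000;  // e.g. 400 ms (assumed)
      const RefTime latency = now_usec - capture_usec;

      if (latency < t_max_normal_latency)
        return HandVu::HV_PROCESS_FRAME;   // fully process: detection and recognition enabled
      if (latency < t_max_abnormal_latency)
        return HandVu::HV_SKIP_FRAME;      // partial processing: display, but skip the heavy steps
      return HandVu::HV_DROP_FRAME;        // tracking update only; frame should not be displayed
    }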

The described method results in smooth scaling down to machines with about 1 GHz CPU speed, but becomes increasingly choppy and uneven for slower machines.

A minimum resolution of 320x240 pixels is highly recommended. Beyond that, all functionality is independent of the video scale. However, larger video frames on slower machines require longer processing times per frame.

4.2.4 Correction for camera lens distortion

Most camera lenses introduce spatial distortions into the video frames. The HandVu system optionally corrects for those artifacts with the help of a function from the OpenCV library, turning every frame into a correct perspective projection. This is important for many applications, particularly for augmented reality applications that need to draw aligned, perspectively correct geometry over the real world as seen through the camera. All frames are undistorted except those that are dropped. Since this operation takes a considerable amount of time (on the order of the Flock of Features tracking), it can be turned on and off without affecting other settings.


Features are tracked in every frame, even in frames with such a high latency that they are to be dropped by the wrapping application and no undistortion is performed on them. Thus, features are tracked in the original, distorted image. To be compatible with the image output, the locations reported by the event server are converted to their respective coordinates in the undistorted frame as a last step of processing.

4.2.5 Application programming interface

HandVu is primarily a library and this section describes its application programming interface (API). It also introduces ways to connect to the HandVu WinTk, described in Section 4.2.7. To keep the core as platform-independent as possible, HandVu requires an application that uses the library to implement one or two functionalities. The camera controller that interfaces to specific camera functions was already introduced in Section 4.2.2 and is in fact optional. The other, mandatory functionality to be provided by the wrapping application is a clock that can report both the time when an image frame was captured and the current time, in microseconds:

    typedef long long RefTime;

    class RefClock {
    public:
      virtual RefTime GetSampleTimeUsec() const = 0;
      virtual RefTime GetCurrentTimeUsec() const = 0;
    };
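For illustration, a minimal RefClock could be backed by std::chrono as sketched below; in the real system the sample time comes from the capture pipeline, so equating it with the current time here is an assumption made only to keep the sketch self-contained.

    #include <chrono>

    // Minimal illustrative RefClock implementation; not part of HandVu itself.
    class SteadyRefClock : public RefClock {
    public:
      virtual RefTime GetCurrentTimeUsec() const {
        using namespace std::chrono;
        return duration_cast<microseconds>(
            steady_clock::now().time_since_epoch()).count();
      }
      virtual RefTime GetSampleTimeUsec() const {
        // Assumption: no capture timestamp is available here, so report "now";
        // a real pipeline would return the frame's capture time instead.
        return GetCurrentTimeUsec();
      }
    };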


The state of an "object" (currently the right hand, object ID 0) is accessed via the state data structure HVState; see Section 4.2.8 below. Given these two definitions, we can introduce the main HandVu class. The most important functions of HandVu's API are shown below.

    class HandVu {
    public:
      enum HVAction {      // specify recommendations to application:
        HV_PROCESS_FRAME,  // fully process and display the frame
        HV_SKIP_FRAME,     // display but do not further process
        HV_DROP_FRAME      // do not display the frame
      };

      void Initialize(int width, int height, RefClock* pClock,
                      CameraController* pCamCon);
      void LoadConductor(const string& filename);
      void StartRecognition(int obj_id=0);
      void StopRecognition(int obj_id=0);
      HVAction ProcessFrame(IplImage* inOutImage);
      void GetState(int obj_id, HVState& state) const;
      void SetOverlayLevel(int level);
      void CorrectDistortion(bool enable=true);
      void SetAdjustExposure(bool enable=true);
    };

The operation is fairly straightforward: first, HandVu is initialized with the width and height of the video stream, and the RefClock and camera controller are supplied. After a conductor configuration file (see Section 4.2.9 below) has been loaded with LoadConductor, recognition can be started and stopped at will. Every video frame needs to be passed to HandVu via the ProcessFrame function, which returns a recommendation on what to do with that frame. The main result, the state of the hand gesture recognition, is available via the GetState function call. For CorrectDistortion to work, the conductor configuration file has to specify a valid camera calibration file. For the exposure adjustment to be possible (turned on via SetAdjustExposure), HandVu must have been initialized with a non-NULL CameraController.
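A hypothetical embedding of the library might therefore look roughly like the sketch below. GrabNextFrame, Display, and HandlePosture stand in for the wrapping application's own frame source, renderer, and gesture handler, and the frame size and conductor file name are merely example values.

    // Hypothetical use of the HandVu API; GrabNextFrame, Display and HandlePosture
    // are placeholders for the wrapping application's own functionality.
    IplImage* GrabNextFrame();                 // assumed: returns NULL when the stream ends
    void Display(IplImage* frame);             // assumed: renders the (possibly annotated) frame
    void HandlePosture(const HVState& state);  // assumed: reacts to a recognized key posture

    void RunGestureLoop(RefClock* clock, CameraController* cam) {
      HandVu hv;
      hv.Initialize(640, 480, clock, cam);             // example frame size
      hv.LoadConductor("config/default.conductor");    // placeholder conductor file
      hv.StartRecognition(0);                          // object 0: the right hand

      while (IplImage* frame = GrabNextFrame()) {
        HandVu::HVAction action = hv.ProcessFrame(frame);
        if (action == HandVu::HV_DROP_FRAME) continue; // too late: do not display

        HVState state;
        hv.GetState(0, state);
        if (state.m_recognized) HandlePosture(state);  // one of the key postures was seen
        Display(frame);                                // HV_PROCESS_FRAME or HV_SKIP_FRAME
      }
    }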

User applications (such as those described in Chapter 8) can also connect to the "HandVu WinTk" toolkit, a stand-alone Windows MFC application that is built on top of the library using DirectShow (see Section 4.2.7). Applications are either embedded in the pipeline as an extension to the recognition module's DirectShow (DX) filter, inserted into the pipeline by the MFC application as a separate DX filter, or connected through one of the two networked interfaces, the Gesture Server or the OSC protocol.

Independent of the connection channel that allows access to the recognition state, a separate channel can be opened to transfer the actual video data from HandVu's WinTk to applications that do not have direct access to the pipeline's video stream. This is essential in case the application requires access to the video for input or for its own output capabilities. It can request that the frames are not displayed through DX but instead saved to a shared memory location that resides in a DLL called FrameDataLib.² Its API is shown below.

² Many thanks to Ryan Bane, who implemented the DLL.


    void FDL_WaitForInitialization();
    void FDL_GetDimensions(int* width, int* height, int* channels);
    void FDL_GetImage(unsigned char** img);

The application calls FDL_WaitForInitialization, a blocking call that returns when HandVu's WinTk has initialized the buffer. FDL_GetDimensions returns the frame size and the number of color channels. Finally, FDL_GetImage obtains a pointer to the library-internal shared memory data structure that contains the latest frame that HandVu has processed.
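A client might consume this API roughly as in the following sketch; the interleaved 8-bit pixel layout and the decision to copy the frame before use are assumptions made for the example.

    #include <vector>

    // Illustrative FrameDataLib client; the interleaved 8-bit pixel layout is an assumption.
    void FetchLatestFrame(std::vector<unsigned char>& out) {
      int width = 0, height = 0, channels = 0;
      FDL_WaitForInitialization();                       // blocks until WinTk has set up the buffer
      FDL_GetDimensions(&width, &height, &channels);

      unsigned char* img = 0;
      FDL_GetImage(&img);                                // pointer into the shared memory buffer
      out.assign(img, img + width * height * channels);  // copy before the next frame overwrites it
    }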

4.2.6 Verbosity overlays

One of the most important aspects of user interfaces is immediate feedback to the user once a command, or even a slight change in the input vector, is recognized. A lack thereof decreases usability, in particular the speed with which the interface can be used.

HandVu can give feedback about its gesture recognition state by overlaying information on the processed video frame. The amount and verbosity of the overlay can be selected based on the application programmer's and the application user's needs. HandVu provides timely and direct feedback about the most important vision-level information: whether detection and tracking of the hand were successful and whether one of the key postures was recognized. The following is a detailed explanation of HandVu's different verbosity levels. Each level displays its own information in addition to all lower levels' information.

Level 0: No overlay; only the (possibly distortion-corrected) video stream is rendered.

Level 1: A textual display in the upper right corner shows the frames per second that HandVu achieved and, in parentheses, the frames per second at which posture recognition was attempted. The fastest and slowest processing times within the last second are stated in milliseconds. A single dot on the tracked hand shows the Flock of Features' mean location.

Level 2: The frames' incoming latency is shown, along with a white rectangle around the current scan area. A large dot identifies the hand's mean location, and little dots mark each of the features in the flock. Recognized hand postures have a green box drawn around them.

Level 3: During tracking, the scan area is color-segmented with a 0.5 probability threshold. This back-projection turns non-skin pixels black and leaves skin pixels unchanged. The number of individually detected areas is shown.

In addition to HandVu's feedback mechanism, applications can implement their own ways to signal event recognition to the user. Section 8.4 on page 212 explains the implications with some examples.


4.2.7 HandVu WinTk: video pipeline and toolkit

For convenient access to the hand tracking and recognition results, we built a prototypical application for the Microsoft Windows platform that embeds the core vision components. It provides true out-of-the-box utilization of hand gestures as a user interface. It is a stand-alone application that leverages Microsoft's DirectShow API to support almost any video source, be it a camera or file. The recognition results are made available in two network protocol formats so that client applications can run on the same or another machine. Figure 4.5 is a schematic diagram of the data flow and interfaces.

Figure 4.5: The vision module in the application context: embedding of the vision module into the video pipeline and stand-alone application.


The DirectShow filter serves three purposes. First, it prepares each frame for processing by the vision module. This mostly amounts to creating the appropriate image structure (an IplImage) and possibly flipping the image upside down to accommodate different video source properties. Second, it takes the vision system's recommendation on subsequent frame processing into account and forwards or drops the frame (see Section 4.2.3). Third, it is a thin interface wrapper that allows COM access to the vision module's functionality. This is necessary because the main application lives in a different process space than the DX filter. The Maintenance Support application (see Chapter 8) and two user studies are built into the filter and controlled partially through input to the MFC application. If active, buttons and other interactive visualizations are overlaid over the rendered video.

The application has a number of responsibilities. First, it has to build the DirectShow graph, including a video source, the DX filter that wraps the vision module, and a rendering filter. Second, it implements the RefClock and CameraController interfaces with the help of a few DirectShow COM interfaces, and announces their availability to the DX filter. Third, it spawns the gesture event server's thread and initiates the sending of events after every video frame. Fourth, if desired, it also makes the entire image frame available to the FrameDataLib DLL (see page 101) so other applications can display the video in a custom format. Lastly, some keyboard and mouse input is interpreted as commands to the toolkit or the HandVu library, and other traditional input is forwarded unprocessed to the DX filter.

4.2.8 Recognition state distribution

There are three main ways to obtain the current state of HandVu. The first is through a library call to the GetState function, the second is through a TCP/IP client-server connection, and the third uses a UDP packet format frequently used in the music and arts community. They are described below.

GetState

The common way for applications to obtain the result of processing a frame with HandVu is to call the following function:

    void HandVu::GetState(int obj_id, HVState& state) const;

where

    class HVState {
    public:
      int m_obj_id;
      bool m_tracked;
      bool m_recognized;
      double m_center_xpos, m_center_ypos;
      string m_posture;
    };


The obj_id is currently fixed to a value of zero, identifying the right hand. The two boolean member variables indicate whether the object is successfully tracked and whether one of the key postures was recognized, respectively. The location of the tracked object is reported in relative image coordinates, with the image origin in the upper left corner of the image. If one of the key postures was recognized, the posture string contains the identifier string of the detecting cascade (see also Section 4.2.9).
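Because the coordinates are relative, a client that needs pixel positions scales them by its own frame dimensions, for example:

    // Converting the relative coordinates to pixels (illustrative); frame_width and
    // frame_height are the application's own values for the displayed frame.
    int px = int(state.m_center_xpos * frame_width);
    int py = int(state.m_center_ypos * frame_height);   // origin: upper left corner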

Gesture events

The client-server architecture sends events from the computer vision module to any connected gesture event listener, somewhat similar to the VRPN server [169]. VRPN is a VR periphery system that makes device differences between conventional UIs and trackers transparent to the clients.

The gesture event server component is currently implemented within the platform-specific WinTk, but plans are for its inclusion into the library in a platform-independent manner. The server opens a TCP/IP port (the default port is 7045) and runs the accept loop in its own thread, accepting at most five concurrent clients (an arbitrary limit). The blocking send commands are invoked from the main application thread; thus, client applications should read events promptly from their sockets.


The protocol is a unidirectional stream of events. Each event is a string of ASCII characters, delimited by a carriage return and a line feed. The current protocol version is 1.2, which has the following format (an example event follows the field list below):

    1.2 tstamp id: t, r, "posture" (x, y) [s, a]\r\n

where

• 1.2 is the protocol version number,

• tstamp is a long integer timestamp of the respective image capture time, in milliseconds starting with the first seen frame,

• id is an identifier for the object this event belongs to, currently fixed to 0,

• t is 1 if the object is being tracked, 0 otherwise,

• r is 1 if one of the key postures was recognized, 0 otherwise,

• posture is a string identifier of one of the six recognized postures, or the empty string "",

• x, y are the tracked location in relative image coordinates, with the image origin in the top left,


• s, a are currently unused but will eventually contain a scale identifier and a rotation angle.
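As mentioned above, a hypothetical event for a tracked right hand whose "open" posture was just recognized near the image center might read as follows; all field values, including the placeholders for the unused s and a, are invented for illustration.

    1.2 154230 0: 1, 1, "open" (0.52, 0.47) [0, 0]\r\n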

Open Sound Controller interface

The format of the Open Sound Controller (OSC) packets is very similar to the custom packet format described above:

    gesture_event, siiiisffff, tstamp, id, t, r, posture, x, y, s, a

Note that the OSC identifier is gesture_event and that the cryptic siiiisffff encodes the type information for all arguments: the following four arguments are integers, posture is a string argument, and the last four arguments are float numbers.

4.2.9 The vision conductor configuration file

For the HandVu application programmer and user who desires more control over the interface operation, the vision module's main settings are stored in and read from a configuration file. This file can be conveniently modified to fit specific needs. Due to the orchestrating nature of the settings, we termed it a "vision conductor" file. We will briefly describe its format and refer to the respective places in the dissertation that cover the details. The following is a typical example of a conductor configuration file.


    HandVu VisionConductor file, version 1.5
    camera calibration: -
    #camera calibration: config/FireFly4mm_calib.txt
    camera exposure: software
    #camera exposure: camera
    detection params: coverage 0.3, duration 0, radius 10.0
    tracking params: num_f 30, min_f 10, win_w 7, win_h 7, \
      min_dist 3.0, max_err 400
    tracking style: OPTICAL_FLOW_COLORFLOCK
    #tracking style: CAMSHIFT_HSV
    #tracking style: CAMSHIFT_LEARNED
    recognition params: max_scan_width 0.4, max_scan_height 0.6
    1 detection cascades
    config/closed_30x20.cascade
    area: left 0.6, top .2, right 0.94, bottom .84
    params scaling: start 1.0, stop 8.0, inc_factor 1.2
    params misc: translation_inc_x 2, translation_inc_y 3, \
      post_process 1
    0 tracking cascades
    1 recognition cascades
    config/all_hands_combined.cascade
    area: left 0.47, top .2, right 0.94, bottom .84
    params scaling: start 1.0, stop 8.0, inc_factor 1.2
    params misc: translation_inc_x 2, translation_inc_y 3, \
      post_process 0
    7 masks
    config/Lpalm.mask
    config/Lback.mask
    config/sidepoint.mask
    config/closed.mask
    config/open.mask
    config/victory.mask
    config/closed_30x20.mask

The backslash \ at line endings in the above printout indicates that there must not be a line break in the actual configuration file. All configuration settings must be present in the order shown above. Blank lines and comment lines, prefixed with a pound sign #, are ignored.

• camera calibration specifies whether a correction for lens distortion is to be performed and which file holds the calibration information. See Section 4.2.4 for details. A dash - indicates that no calibration is desired.

• camera exposure can be either camera or software and specifies whether the camera's automatic exposure control is to be used or the software-based, area-selective exposure control introduced in Section 4.2.2.

• detection params are three general settings pertaining to hand detection. The coverage specifies the relative amount of the masked hand area that has to have skin color, as determined with the fixed color histogram method from Section 5.7. The duration specifies for how many milliseconds a hand must be detected in every successive frame for it to be considered a match and a valid system initialization; a value of 0 prompts acceptance after only one frame. The radius parameter is only used for durations greater than 0 and delimits the radius in pixels within which subsequent hand detections must lie from the first one to be considered a match. The discussion in Section 5.10 explains when these settings might be helpful.

• tracking params are used exclusively for the Flock of Features tracking style and specify: num_f, the target number of features that is maintained; min_f, the minimum number of features that must be successfully tracked from one frame to the next before tracking is considered lost; win_w, the width of the search window for KLT features; win_h, the window height; min_dist, the minimum-distance flocking constraint; and max_err, the maximum area mismatch before a KLT feature is considered lost. All units but the last are in pixels. More details about the meaning of these parameters can be found in Chapter 6.

• tracking style determines the method to be used for tracking a once-detected hand: OPTICAL_FLOW_COLORFLOCK causes tracking with a Flock of Features, CAMSHIFT_HSV with CamShift based on a fixed HSV skin color distribution, and CAMSHIFT_LEARNED with CamShift based on a color distribution learned at detection time. Again, please see Chapter 6 for more.


• recognition params limit the maximum size of the area that is scanned for hand postures during tracking to the width and height specified through max_scan_width and max_scan_height, relative to the video size.

• n detection cascades is a list of length n of detector cascades and their detection parameters. The first line of a list entry points to a file that describes a detector cascade. In addition to all weak classifiers, each cascade file contains a textual identifier (a fanned detector contains multiple identifiers). This name is used for associating the correct masks (probability maps) and for giving detected appearances a name, for example when reporting detected postures (see Section 4.2.8). Specifics about the detection method and cascades can be found in Chapter 5. The remaining three lines of a list entry are described in the following.

• area defines a rectangular region that is to be scanned with the respective cascade, in relative coordinates.

• params scaling specifies the scales at which the respective cascade is to be scanned across the area. For example, a start scale of 1.0 is the minimum template resolution, a stop scale of 8.0 says to increase the scale incrementally while it is smaller than eight times the template resolution, and an inc_factor of 1.2 asks for scale increase steps of 20% over the previous size.

• params misc specifies the translation of the cascade during scanning in pixel-sized increments, both in the horizontal and the vertical dimension. The increments are for the smallest scale and are scaled with the cascade size thereafter. post_process can be 0 or 1, where 1 means that all intersecting matches found in a single frame are to be combined into a single rectangular area as suggested by Viola and Jones in [180], and 0 causes all individual matches to be reported. See Section 2.3.8 for more details on detector scanning.

• n tracking cascades is currently not used and n must be 0.

• n recognition cascades are the cascades used for recognizing different postures as described in Chapter 7. The list of cascades has the same format as for the detection cascades, but the area line is ignored; only params scaling and params misc are used.

• n masks are the names of n files that contain the hand pixel probability maps as described in Section 5.8. Each of these files contains a textual posture identifier that is used to match a map to its cascade, and a template-sized matrix of probabilities for the respective pixel to belong to the hand area.


A file that follows these specifications can be read with the LoadConductor API call. Upon successful parsing, the changes take effect immediately. A small sketch of how one such parameter line can be parsed follows.
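As an illustration of the line format only – not HandVu's actual configuration parser – the following hypothetical C++ snippet reads the "detection params" line shown in the example file above.

// Hypothetical sketch: parsing the "detection params" line of a vision
// conductor file. For illustration of the format only; this is not the
// parser used by HandVu itself.
#include <cstdio>

int main() {
  const char* line = "detection params: coverage 0.3, duration 0, radius 10.0";
  double coverage = 0.0, radius = 0.0;
  int duration_ms = 0;
  if (std::sscanf(line,
                  "detection params: coverage %lf, duration %d, radius %lf",
                  &coverage, &duration_ms, &radius) == 3) {
    std::printf("coverage=%.2f duration=%d ms radius=%.1f px\n",
                coverage, duration_ms, radius);
  }
  return 0;
}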

4.3 Vision system performance

The quality and usability of any vision-based interface is determined by four main aspects of the computer vision methods: speed, accuracy, precision, and robustness. In addition, the usability of the application interface is of course an important factor, but it is not considered here. While the main results of user studies and runtime data are reported in the following chapters, this section summarizes the performance as it pertains to the entire vision system.

The tracking component requires 2-18 ms processing time on a 3 GHz Xeon CPU. The combination of tracking, recognition, and color re-learning takes between 50 ms and 90 ms total time, with C++ code compiled at the -O2 optimization level. On a 1.13 GHz laptop, the respective times are 18-33 ms and 50-140 ms. The latency from frame capture to render completion time as reported by DirectShow is a few milliseconds higher.

Sheridan and Ferrell found a maximum latency of 45 ms between event occurrence and system response to be experienced as “no delay” [156]. While HandVu does not achieve that end-to-end latency for all methods in combination, together they are well below the 300 ms threshold at which interfaces start to feel sluggish, might provoke oscillations, and cause the “move and wait” symptom [156]. With the cameras that we used, the tracking always runs at capture rate (up to 15 Hz), while recognition is interlaced at 6-10 Hz, time permitting. In comparison to other mobile VBIs, our method is significantly more responsive than the Hand Mouse [100], judging from a video available on the authors' web site. The following chapters will detail the processing time for each of the vision module's components.

The combination of vision methods is generally robust to different environmental conditions, including different lighting, different users, cluttered backgrounds, and non-trivial motion such as walking. The methods are largely camera-independent and can cope with the automatic image quality adjustments of digital cameras. No performance degradation was observed even for severely distorting camera lenses. Two conditions will still violate the system's assumptions and might impact recognition and tracking negatively. First, an extremely over- or under-exposed hand appearance does not contain a sufficient amount of skin-colored pixels for successful detection. Second, if the color changes dramatically in between two consecutive successful posture classifications, the tracking degenerates into single-cue grey-level KLT tracking with flocking constraints. Since the system updates its color model periodically, however, it is able to cope with slowly changing lighting conditions.


The detection and posture recognition classifiers were trained with images taken with different still picture cameras, while the system was successfully tested with three different digital video cameras. In addition, none of the training images was shot with as short a focal length lens as our mobile camera has. These facts suggest that the entire system will run with almost any color camera available. No user calibration is necessary; the methods are largely person-independent.

4.4 Delimitation

Motions of the head-mounted camera are not treated explicitly. Only the AR application has independent means to obtain the extrinsic camera parameters (location and orientation). Given these parameters, it is also possible to transform the hand location from image coordinates into an absolute world reference frame, save for the distance from the camera, which is not available from HandVu.

The gesture recognition facilitated by the described system is in fact a more challenging achievement than recognition of events that have a temporal extent, such as hand waving. Frame-based methods can easily be extended to recognition of dynamic, continuous motions with an independent module that analyzes sequences of single-frame results. It is up to the application designer to include Hidden Markov Models or similar techniques to recognize dynamic gestures.

No 3D model of the hand was built, due to the time constraints of the fastest currently known parameterization methods. However, such a model might prove helpful in order to enforce posture consistency over time and, of course, to facilitate more fine-grained gestural commands. It is therefore a good candidate for a possible extension of this dissertation work.

None of the system's functionality explicitly detects or models hand occlusions. Yet brief occlusions of the tracked hand by foreign objects or the other hand generally do not cause all KLT features to be lost, and tracking might continue.



Chapter 5

Hand Detection

Hand detection for user interfaces must favor reliability over expressiveness: false positives are less tolerable than false negatives. Since detecting hands in arbitrary configurations is a largely unsolved problem in computer vision, the detector for HandVu allows reliable and fast detection of the hand in one particular pose from a particular view direction. Starting the interaction from this initiation pose is particularly important for a hand gesture interface that serves as the sole input modality, as it functions as a switch to turn on the interface: without it, with an always-on interface instead, any gesture might inadvertently be interpreted as a command. The output of the detection stage amounts to the extent of the detected hand area in image coordinates.

This chapter describes the methods used for HandVu's robust hand detector: a combination of an adapted Viola-Jones detection method and skin color verification. The particularities of hands for the purpose of reliable detection were researched. For training, a large set of hand images was collected in various configurations and views. Detector training was performed with an MPI-parallelized training program on Linux clusters. Different hand configurations and views were compared for their suitability to be detected before arbitrary backgrounds. The best one was chosen as the initialization posture for the vision system described in the previous chapter. The detection parameters were then optimized for this and other postures; in particular, the amount of in-plane rotation was tuned during training to allow for detection of the widest range of in-plane rotations. Another training parameter modification reduced the training time, yet another increased detection performance. Lastly, a new rectangular feature type that allows comparison of non-adjacent areas was conceived and its superior performance on hand appearances demonstrated.

5.1 Data collection

We collected over 2300 images of the right hands of ten male and female students with two different digital still cameras. The pictures were taken indoors and outdoors with widely varying backgrounds and lighting conditions, but without direct sunlight on the hands. The rectangular bounding boxes of the areas containing hand posture appearances were manually marked and rotated to a standard orientation. Figure 5.1 shows five examples for each of the six postures for which we trained detectors.

Figure 5.1: Sample areas of the six hand postures (closed, sidepoint, victory, open, Lpalm, Lback):
they are shown in the smallest resolution necessary for detection (25x25 pixels).

The posture closed is a flat palm with all fingers extended and touching each other; open is the same but with the fingers spread apart. Sidepoint is a pointing posture with only the index finger extended, seen from the thumb side. The victory or peace posture has index and middle finger extended. The “L” posture involves an abducted thumb and an extended index finger and can be seen from the Lpalm side and the Lback side of the hand. Two additional gestures were investigated but no detectors were trained for them: the grab gesture is suited to picking up coffee mugs, seen from the top, and the fist posture is viewed from the back of the hand.

Table 5.1: The hand image data collection:
the number of training images and the bounding box ratios (width over height) for each posture. Template size and bounding box ratio determine the template resolution along the vertical and horizontal dimensions.

                           closed  sidepoint  victory  open  Lpalm  Lback
  number of images            389        331      341   455    382    433
  bounding box ratio (w/h) 0.6785        0.5      0.5   1.0    0.9    0.9

The rectangular areas had different but fixed aspect ratios for each of the postures (Table 5.1). Since we wanted uniform template sizes for all postures for better comparability, this resulted in varying resolutions for the interpolation step. For example, the posture sidepoint with a template of size 25 by 25 pixels has twice the sample density along the horizontal dimension compared with its resolution in the vertical dimension. Similarly, during matching of each detector, different scale factors have to be applied. The effect of this is covered in Section 5.4 of this chapter.

The non-cascaded detectors were trained with more than 23000 negative examples: randomly selected areas from the pictures containing the hand images, but not intersecting the hand areas. To avoid over-training, AdaBoost was performed on one half of the hand images and error-rate validation on the other half. For the cascaded detectors, 180 random images not containing hands were scanned to periodically increase the negative training set, as explained in Section 2.3.8 and in [180]. Again, half of them were added to the training set and the other half was used for validation.

5.2 Parallel training with MPI

Our implementation of the heavily compute- and memory-intensive AdaBoost for a Viola-Jones detection method is a processor-scalable parallel program. It uses MPI (Message Passing Interface) for remote process instantiation and communication. It was run on two Linux clusters, one with 16 nodes and one with 32 nodes and two CPUs per node. The workload was split such that each CPU evaluated some instances of a single feature type on all examples. The disadvantage compared with splitting across the examples¹ is that every CPU needs all examples for processing, which can require a large amount of memory. For most of the experiments, however, the image areas in question fit into each CPU's 2 GB memory, almost eliminating any performance penalty. On the other hand, the advantage is that much less information needs to be communicated during synchronization (which occurs with the same frequency in both cases, once for each weak classifier).

Each process can determine the feature instance with the smallest cumulative error over all images and needs only to send this information back to a root process. If the work were split across image examples, every feature instance's partial error sum would have to be communicated, and there are hundreds of thousands of feature instances to be considered.

¹ Michael Jones has implemented his improved training method in that manner [private conversation].
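A minimal sketch of this communication pattern – not the dissertation's actual training code – is shown below. It assumes each MPI rank has already computed the cumulative error of its best locally evaluated feature instance; the placeholder values are hypothetical.

// Minimal sketch of the reduction described above: each rank evaluates its
// share of feature instances on all examples, and only the best
// (error, index) pair is sent to the root process per weak classifier.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Assume this rank already found the smallest cumulative error among its
  // assigned feature instances (the values here are placeholders).
  struct { double error; int index; } local, best;
  local.error = 0.1 * (rank + 1);   // placeholder cumulative error
  local.index = rank * 1000;        // placeholder global feature instance id

  // One MINLOC reduction replaces communicating every instance's
  // partial error sums.
  MPI_Reduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, 0, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("best feature instance %d, error %f\n", best.index, best.error);
  MPI_Finalize();
  return 0;
}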

Depending on the feature types utilized, and particularly on the desired ratio of negative to positive example images, training for one detector took between a few hours and two days.

5.3 Classification potential of various postures

Hand appearances – the combinations of postures and the directions from which they are viewed – differ in their potential for classification from background and other objects, their “detectability.” In order to pick the appearance with the best separability from background (that is, the one that allows detectors to achieve the best performance), one could train a detector for each combination and analyze their performance a posteriori. As previously mentioned, training for the Viola-Jones detection method takes far too long to explore all possible hand posture and view combinations for their suitability for detection.

This section presents a frequency analysis-based method for instantaneous estimation of class separability, without the need for any training, based on only a few training images for each posture. The receiver operating characteristics of detectors for our postures confirm the estimates. This estimator contributes to a systematic approach to building an extremely robust hand appearance detector, providing an important step towards easily deployable and reliable vision-based hand gesture interfaces. This research was also published in [92].

5.3.1 Estimation with frequency spectrum analysis

We investigated eight postures from fixed views, which were selected based on their different appearances and because they can be performed easily. A prototypical example for each posture is shown in Figure 5.2.

Figure 5.2: Mean hand appearances and their Fourier transforms:
larger s-values (see Equation 5.4) indicate more high-amplitude frequency components being present, suggesting better suitability for classification from background. The bottom row of images shows the “artifact-free” Fourier transforms. The s-values are: closed 0.435339, sidepoint 0.38612, victory 0.323325, open 0.391111, Lpalm 0.335228, Lback 0.315761, grab 0.263778, fist 0.202895.

The separability of two classes depends on many factors, including feature dimensionality and the method of classification. In particular for AdaBoost classifiers, it is desirable to predict a priori the potential for successful classification of hand appearances from background, due to the detector's computationally expensive training phase. The estimator presented here is the first method to approximate the performance of the Viola-Jones detection method. It is based on the intuition that appearances with a prominent pattern can be detected more reliably than very uniformly shaded appearances. The advantage of the estimator is that it only requires a single prototypical example of the positive class. There is no need for an explicit or formal representation of the negative class, “everything else.”

We collected up to ten training images of each of the eight hand postures from similar views and computed their mean image (top row in Figure 5.2). Due to limited training data for the fist posture we took only one image and manually set non-skin pixels to a neutral grey. The areas of interest were resized and rescaled to 25x25 pixels (see Table 5.1). The higher-frequency components of a Fourier transform describe the amount of grey-level variation present in an image – exactly what we are looking for. However, the transformation F (Equation 5.1) introduces strong artificial frequencies, caused by the image's finite and discrete nature. Choosing a power-of-two image side length would avoid some artifacts; however, we did not want to deviate from the template size that the detectors would be built for.


F(u, v) = \frac{1}{25 \cdot 25} \sum_{m=0}^{24} \sum_{n=0}^{24} I(m, n) \, e^{-i 2\pi \left( \frac{m u}{25} + \frac{n v}{25} \right)}   (5.1)

The Fourier transform P of a neutrally colored 25x25-sized image patch is therefore subtracted from F. This ensures that frequencies resulting from image cropping are eliminated, yielding an artifact-free difference-transform D:

D(u, v) = \log \left| F(u, v) - P(u, v) \right| ,   (5.2)

where

P(u, v) = \frac{1}{25 \cdot 25} \sum_{m=0}^{24} \sum_{n=0}^{24} \frac{1}{2} \, e^{-i 2\pi \left( \frac{m u}{25} + \frac{n v}{25} \right)} .   (5.3)

In the last step (Equation 5.4), the sum of all frequency amplitudes is computed, normalized by the Fourier transform's resolution. This sum is the sought-for estimator, giving an indication of the amount of appearance variation present in the image:

s = e^{\frac{1}{k} \sum_{u,v} D(u, v)} .   (5.4)
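For concreteness, the following sketch computes the estimator directly from Equations 5.1-5.4. It assumes a 25x25 grey-level mean image with values in [0, 1], a neutral grey of 1/2 as in Equation 5.3, and k equal to the number of frequency components (25*25, i.e. the transform resolution); it is an illustration, not the original implementation.

// Sketch of the s-value estimator (Equations 5.1-5.4). Assumes a 25x25
// grey-level mean image in [0,1], neutral grey 1/2, and k = 25*25.
// A direct O(N^4) DFT is fast enough at this small size.
#include <cmath>
#include <complex>

const int N = 25;
const double PI = 3.14159265358979323846;

static std::complex<double> Dft(const double img[N][N], int u, int v) {
  std::complex<double> sum(0.0, 0.0);
  for (int m = 0; m < N; ++m)
    for (int n = 0; n < N; ++n)
      sum += img[m][n] *
             std::polar(1.0, -2.0 * PI * (m * u + n * v) / double(N));
  return sum / double(N * N);                        // Equation 5.1
}

double SValue(const double img[N][N]) {
  double neutral[N][N];
  for (int m = 0; m < N; ++m)
    for (int n = 0; n < N; ++n) neutral[m][n] = 0.5;  // patch for Equation 5.3

  double sum_d = 0.0;
  for (int u = 0; u < N; ++u)
    for (int v = 0; v < N; ++v)
      // D(u,v) = log |F(u,v) - P(u,v)|; the tiny epsilon only guards
      // against log(0) when the two transforms coincide exactly.
      sum_d += std::log(std::abs(Dft(img, u, v) - Dft(neutral, u, v)) + 1e-12);
  return std::exp(sum_d / double(N * N));             // Equation 5.4, k = N*N
}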

The bottom row in Figure 5.2 presents the postures' artifact-free Fourier transforms D, annotated with s, the sums of their log amplitudes over the entire frequency spectrum. The sums' absolute values have limited meaning; they are to be regarded in relation to each other. As expected after visual inspection, the closed hand appearance has the largest amount of grey-level variation, reflected in a high amplitude sum. The fist, being mostly a uniformly grey patch, has the least amount of appearance variation and thus also a low s-value.

In the following section, a comparison of the estimates with actual detectors' performances will confirm the hypothesis that appearances with larger s-values can be detected more reliably. Computing s-values therefore alleviates the need for the compute-intensive training of many detectors in order to gauge their performance potentials.

5.3.2 Predictor accuracy

To evaluate predictor accuracy, we built detectors with unmodified AdaBoost, which produces a single set of weak classifiers for each detector. Cascaded detectors – composed of multiple, staged sets of weak classifiers – are covered in the following sections; their performance also follows the estimator's prediction. Here, the three traditional feature types (see Figure 2.1 on page 30) were used.

The detectors were evaluated for their false positive rates by scanning a test set of 200 images of varying sizes not containing hands, some obtained from a web crawl and some taken in and around our lab. Note that the false positive rate is relative to all detector evaluations, and that 355,614 evaluations are required to scan a VGA-sized image (see Section 2.3.8).

Figure 5.3: ROC curves for monolithic classifiers:
the curves for all hand postures are shown, trained on integral images with 25x25 pixel resolution. Each of the six detectors consists of 100 weak classifiers. The x-axis is in log scale.

Results: The receiver operating characteristic (ROC, see Trees [175]) curves in Figure 5.3 show the results of evaluating the six detectors. The posture closed fares much better than its competitor hand postures, in that it achieves a higher detection rate for a given false positive rate. This is in line with the prediction of the spectrum-analysis estimator. The sidepoint posture does second-best for high detection rates, but then deviates from the prediction. We will later see, however, that it again does comparatively much better for very low false positive rates with the cascaded detector. Another prediction failure can be observed for the Lback and Lpalm curves: the more structured Lpalm appearance should achieve better class separability. Again, the more expressive features in the cascaded detector actually do bring out this advantage and are in line with the prediction.

5.4 Effect of template resolution

Before training detectors with more expressive but also more expensive feature types (on the order of two magnitudes more computational effort during training), we wanted to make sure the integration templates did not contain any redundant information. Therefore, we varied the size of the template area for the best-faring appearance, hoping for a resolution reduction without sacrificing accuracy. The impact of different integral image resolutions on a monolithic detector can be seen in Figure 5.4. These results are also published in [92].

Figure 5.4: ROC curves for different template resolutions:
ROC curves for the closed posture detector with 100 weak classifiers. The detectors with higher resolution in the horizontal (35x35 and 30x20) outperform the other two.

Unsurprisingly, the finest-resolution integral (35x35 pixels) achieves the best performance. Remember that the observed image area is constant; only the sampling resolution differs. But higher resolution in the vertical dimension contributes little to this improvement, as witnessed by the lower detection rates of the 20x30 curve. On the other hand, the 30x20 curve has high resolution along the dimension for which the estimator's frequency analysis showed more high amplitudes – see the bright horizontal extent in the frequency image for the closed posture in Figure 5.2. This seems to enable the detector to capitalize much more on appearance peculiarities and rewards us with detection rates comparable to the highest-resolution detector.

It is interesting to note that the detector with 30x20 templates performs better for low false positive rates, while the 20x30 resolution performs better for higher false positive rates. We speculate that the stretch in the vertical produces large, uniform areas that allow for easy distinction between hands and many other appearances. However, the lack of horizontal resolution compresses away the fine finger structures that are required for separation from most other appearances.

5.5 Rotational robustness

The research described in this section, published in [90], analyzes the in-plane rotational robustness of the Viola-Jones detection method when used for hand appearance detection. This is necessary because the object detection method is not inherently invariant to in-plane object rotations. When trained with only strictly aligned data and then used for gesture interfaces, it would require the users to perform very precise gestures – a daunting task with a head-worn camera. Viola and Jones' face detectors handle about 30 degrees of in-plane rotation of frontal and profile views, 15 degrees in either direction [72]. However, we found detectors for hands to be much more sensitive to in-plane rotations. This prompted the research presented here.

Viola and Jones recently extended their method to detect objects exhibiting arbitrary in-plane rotations as well as side views of faces [72]. This extension requires additional effort algorithmically, during training, and during detection: in a first stage of classification, implemented with a decision tree, one of twelve detectors is selected. Each of these handles detection of faces within about 30 degrees of in-plane rotation. While this approach is still very fast, it adds training time and about doubles detection time.

Similarly, we investigated detection of in-plane rotations of various hand postures. However, our focus was not on covering the entire 360-degree range of rotations. Instead, we wanted to increase each detector's range of detected rotations without adding any computational overhead and without negatively affecting the false positive rate, that is, without incurring a performance penalty. Objects other than faces have different appearance characteristics that warrant specific treatment. This is motivated in the following section.


5.5.1 Rotation baseline

First, a baseline was established against which the subsequent results could be compared. For the closed posture, both the training and validation sets were rotated by various amounts around the image area's center. Then, one detector was trained for each angle. Consistent parameters for the training caused equally-complex cascade stages throughout all experiments in this section. The evaluation (Figure 5.5) shows that there are no large differences in the accuracy of the detectors, especially for low false positive rates. Establishing this baseline is important because some rotations could be intrinsically harder to detect than others – these experiments dismiss this possibility.

5.5.2 Problem: rotational sensitivity

To demonstrate the sensitivity of the detection method when used for hand appearances, a detector that had been trained on well-aligned examples was tested for its accuracy. In contrast to the detector's application to face detection, for hands it achieved poor accuracy on test images rotated by as little as 4 degrees (Figure 5.6). The performance decrease is roughly symmetric for clockwise (negative angles) and counter-clockwise (positive angles) rotations.

A second set of experiments shows that this is not caused by peculiarities of the unrotated appearance of the particular hand posture. Eight detectors were built, each trained on examples rotated by a certain, fixed amount. They were then tested with examples rotated randomly between 0 and 15 degrees. The results in Figure 5.7 demonstrate their high rotational sensitivity in contrast to a detector (top curve) that was trained on examples that also exhibited varying degrees of rotation. The difference is even larger for false positive rates below 10⁻⁴.

Figure 5.5: ROC curves for various training data rotations:
these curves constitute the baseline for our experiments and are from detectors trained and evaluated on the same rotation angle.


The data in Figure 5.8 stems from a very similar experiment, differing only in that the detectors were evaluated on a test set whose examples exhibited rotations by discrete amounts, not random on a continuous scale. The graph shows that smaller deviations in rotation from the training data achieve better detection rates: detectors trained for angles “in the middle” of the rotation spectrum, 6 and 9 degrees in particular, fare better than those trained on angles 0 and 15.

5.5.3 Rotation bounds for undiminished performance

The objective of this set of experiments was to determine the angles by which the training examples could be rotated while still achieving good detection performance on the equally-rotated test set. Four repetitions of the original training set for the closed posture were rotated by 0, 15, 30, and 45 degrees, respectively, and joined into one large training set. The Viola-Jones detection method over time keeps the positive examples that are reliably detectable, while it successively ignores those that would require an unacceptably high false positive rate. The experiment's assumption is that well-detectable examples will be retained and all others sacrificed in order to achieve a low false positive rate. The evaluation in Figure 5.9 shows this effect. It suggests that the examples with 0 and 15 degrees of rotation are more consistently recognizable than those with 30 and 45 degrees, the latter ones being sacrificed to achieve a low false positive rate. Therefore, the bounds for rotating the training examples were set to within 15 degrees.

5.5.4 Rotation density of training data

Next, we were interested in the influence that different rotation angle densities have on training and detection performance. Three detectors were trained; their training and validation sets contained examples rotated in varying steps: A = {0, 5, 10, 15}, B = {0, 3, 6, 9, 12, 15}, and C = {0..15} with random angles. They consisted of 198, 190, and 239 weak classifiers, respectively. The detectors were evaluated on examples randomly rotated between 0 and 15 degrees.

No significant accuracy variation can be observed in Figure 5.10, leading to the conclusion that detector accuracy is not affected by the rotation angle density for step sizes of 5 degrees or less. This is an important result because wider steps allow for fewer training examples, reducing both the data collection effort and the computational training cost.

5.5.5 Rotations of other postures

Finally, we confirmed the applicability of the main results that we had obtained for the closed posture to the other five postures, shown in Figure 5.11. Plotted in Figure 5.12 are the detection rates of detectors built with rotated training sets (0-15 degrees random) divided by those of detectors built with unrotated training sets. Both were evaluated with a test set with all examples rotated by 15 degrees. The detectors trained on rotated examples achieve at least equal performance, and for low false positive rates they outperform the detectors trained on fixed examples by about one order of magnitude. They also have a lower minimum false positive rate while still detecting some hand appearances.

5.5.6 Discussion

The number of weak classifiers required for a certain accuracy did not differ significantly between detectors trained on rotated and unrotated training images. Since consistent training parameters (number of weak classifiers per cascade stage and their accuracy) had been used for all detectors, the resulting detection speed of detectors for 0-15-degree-rotated images is about equal to that of detectors that detect unrotated images only.

The results presented in this section pertaining to rotational robustness are likely to generalize to other objects because the surveyed hand appearances exhibit very different characteristics, such as their convexity (open versus closed), their texture variation (Lback versus Lpalm), and the background-to-foreground ratio (closed versus sidepoint). As detailed in Section 5.5.2, presenting training images that are rotated within these bounds is crucial to good accuracy for object appearances other than faces.

In summary, the result is that only about 15 degrees of rotation can be efficiently detected with one detector, in contrast to the method's performance on faces (30 degrees total). The difference from faces stems from the hand's smaller features (fingers) being more sensitive to correct alignment during training, as well as from less inter-person appearance variation for a certain posture and view. Most importantly, the training data must contain rotated example images within these rotation limits. Detection rates on rotated appearances then improve by about one order of magnitude without algorithmic modifications. This also has no negative impact on detection speed. These results are consistent for a number of hand postures and appearances. The implications of the results include both savings in training costs and increased naturalness and comfort of vision-based hand gesture interfaces.

We employed the improved detectors in our mobile vision interface and can report better and faster initialization due to the more natural and less rigid hand postures required for detection.


Figure 5.6: ROC curves showing the rotational sensitivity:
the classifiers in this figure were trained on unrotated training images and evaluated on test images rotated by various angles. There is a sharp decrease in detection accuracy for in-plane rotations of 4 degrees or more. Note the symmetry for rotations to the left and right. Also note the scale of the y-axis; unlike in the other graphs it starts at 0.5.


Figure 5.7: ROC curves for detection of randomly rotated images:
the classifiers were trained for the stated angle and evaluated on a randomly rotated test set. None of the fixed-angle detectors achieves accuracy close to that of the detector trained for various angles.


Figure 5.8: ROC curves for detection of discrete-rotated images:
these classifiers were trained for the stated angle and evaluated on rotated examples with various angles. The detector favors angles “in the middle.”


Figure 5.9: ROC curves for the bounds of training with rotated images:
the curves show that a detector created on a training set with multiple rotations does not treat all angles equally. Instead, examples rotated by 30 degrees and 45 degrees are more likely to be dropped in favor of examples with smaller rotations.


Figure 5.10: ROC curves for different rotation steps:
the classifiers were trained for various rotational densities and evaluated on examples randomly rotated between 0 and 15 degrees.


Figure 5.11: The six hand postures and rotated images:
shown are typical images of the postures in 25x25 pixel resolution; the bottom row is rotated by 15 degrees. From left to right: closed, open, sidepoint, victory, Lpalm, and Lback.


Figure 5.12: Overall gain of training with rotated images:
shown is the ratio of detection rates for “trained on rotated” over “trained on unrotated,” evaluated on areas rotated by 15 degrees. There are no data points where the unrotated detectors have a detection rate of zero.



5.6 A new feature type

In this section, we show that the particular choice of feature types influences the relative detectability of hand appearances. For each posture, a cascaded detector was trained that could select its weak classifiers from a set of four feature types – instead of from only the three types that were used by Viola and Jones [180] and are shown in Section 2.3.8. The novel feature type, called the “Four Box” feature, is a comparison of four rectangular areas. This type is similar to the “diagonal” filters proposed in an extension of their work in [72]; however, our filters can compare non-adjacent rectangular areas. During training, the areas can move about relative to each other with few strings attached, even partially overlapping each other; only their sizes are restricted. These more powerful features allow the detector to achieve better accuracy, as demonstrated in Figure 5.15.

5.6.1 Four Box feature instance generation

It is important to note the difference between a feature type and a feature instance. The feature type describes general properties, while a feature instance is one particular constellation of rectangular areas that has these properties. Instances can differ in the size of their rectangular areas, the location of the areas in the sample images, or both. An example type would be described by “two rectangular areas of identical size that share one vertical edge.” An example instance of this type would be the area that contains the pixels (x=5, y=8) and (5, 9) and its adjacent area with pixels (6, 8) and (6, 9). During AdaBoost training, all possible weak classifiers must be tested on all sample images. This requires that all possible feature instances for all feature types are evaluated on every image. Since this pool of feature instances is large and instance generation is cheap, we do not cache instances but instead generate them anew at every iteration of AdaBoost.

The following is an algorithmic description of how all instances of the Four Box feature type are generated sequentially.

First a brief definition: a rectangular area is described by four coordinates, (left, top, right, bottom). It is defined to contain pixels (x, y) with left < x ≤ right and top < y ≤ bottom. Thus, the area with coordinates (-1, -1, 0, 0) contains exactly one pixel, the one at (0, 0). This is the actual leftmost and topmost pixel in a sample image.
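This coordinate convention can be checked with a few lines of code, and it maps directly onto the usual integral-image lookup of Section 2.3.8 (the -1 coordinates index a zero padding row and column). The helper names below are purely illustrative.

    import numpy as np

    def contains(area, x, y):
        # (left, top, right, bottom): (x, y) is inside iff left < x <= right and top < y <= bottom.
        left, top, right, bottom = area
        return left < x <= right and top < y <= bottom

    def num_pixels(area):
        left, top, right, bottom = area
        return (right - left) * (bottom - top)

    def area_sum(ii, area):
        # ii[y + 1, x + 1] holds the sum of all pixels (x', y') with x' <= x and y' <= y.
        left, top, right, bottom = area
        return (ii[bottom + 1, right + 1] - ii[top + 1, right + 1]
                - ii[bottom + 1, left + 1] + ii[top + 1, left + 1])

    patch = np.arange(25 * 25, dtype=float).reshape(25, 25)     # arbitrary test patch
    ii = np.zeros((26, 26))
    ii[1:, 1:] = patch.cumsum(axis=0).cumsum(axis=1)

    assert contains((-1, -1, 0, 0), 0, 0) and num_pixels((-1, -1, 0, 0)) == 1
    assert area_sum(ii, (-1, -1, 0, 0)) == patch[0, 0]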

The starting point is the first, initial instance of the Four Box feature type, together with the width and height of all sample images (also called the template resolution). The initial instance is shown on the left in Figure 5.13. The leftmost and topmost edges (numbers 5 and 11) are at the pixel coordinates -1 in x and -1 in y; edges 4 and 10 are at coordinates 0, edges 3 and 9 at coordinates 1, and so forth.


In every iteration of the sequential generation of all instances, one or more edges are moved. To avoid repeated construction of the same instance, an order is defined on the edges, represented by their numbers in Figure 5.13. Edges with smaller numbers are moved first. If an edge would move beyond the size of the sample images, the next-higher numbered edge is moved. After an edge has been moved, all its dependent edges are reset to a coordinate that is in a fixed relation to its ancestor. Edge dependency is indicated through an arrow: dependent edges connected through a double-lined arrow are placed at the same coordinate as their ancestor, and those connected through a solid arrow are placed one coordinate further than the ancestor edge. For example, after placing the leftmost and topmost edges, all other edges are updated because – directly or indirectly – they all depend upon those two edges’ locations. This results in the initial instance; all four rectangular areas are 2 pixels wide and 2 pixels high.

All further feature instances are created successively. To obtain the next instance from a certain instance, the following procedure is applied (a generic code sketch of this odometer-like enumeration follows the list).

• The edge numbered 1 is moved one pixel to the right. If this is a valid coordinate, this is the next instance.

• If the coordinate is not valid and instead exceeds the template area’s dimensions, the right edge of the topmost box (numbered 2) is moved one pixel to the right.


[Figure 5.13 drawing: the edge numbers 1-11 label the rectangle edges of the two feature types; “+1”, “==”, and “<” mark the dependency and ordering relations between edges.]

Figure 5.13: The Four Box and Four Box Same feature types:
the Four Box feature type is shown on the left and the Four Box Same feature type is shown on the right. Note that the two less-than conditions must always be met, while the “+1” and equality dependencies are only enforced when the respective ancestor edge is moved. In the right feature type, the width and height of all boxes is the same.

• All edges that depend on edges numbered 2 are updated; those are the right edge of the bottom box (which is set to the same pixel) and the right edge of the rightmost box (which is set one pixel further to the right).

• If the rightmost box’s right edge has a valid pixel coordinate, this is the next instance.

• If the coordinate is not valid, edge 3 is moved one pixel to the right.

• All edges depending on edge 3 are updated; those are edges numbered 2 and 1.


• ... and so on for edges 4 and 5.

• If (after moving edge 5 and updating all dependent edges) the edge numbered 1 is on a valid pixel, this is the next instance.

• If edge 1 is not on a valid pixel, edge number 6 is moved downwards one pixel and edge number 5 is reset to coordinate -1. All dependent edges are updated.

• If edge 6 is on a valid pixel, this is the next instance.

• If edge 6 is not on a valid pixel, edge 7 is moved down one pixel and all dependent edges (number 6) are updated.

• ... and so on for edges 8, 9, 10, and 11.

• Eventually, after moving down edge 11 and updating all dependent edges, edge 6 will not be on a valid pixel coordinate. This indicates that all instances have been produced and no new ones can be created.
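The control flow above is essentially an odometer with resets: advance the lowest-numbered edge, and when it overflows, advance the next-higher edge and re-place its dependents. The generator below sketches only this generic pattern; the concrete edge dependencies, reset rules, and validity conditions of Figure 5.13 would have to be supplied in reset_dependents and are not reproduced here, so neither the exact ordering nor the instance counts of the actual Four Box type are claimed.

    def enumerate_instances(limits, reset_dependents):
        # limits[i] is the largest valid coordinate of edge i;
        # reset_dependents(positions, i) re-places every edge that depends on edge i.
        n = len(limits)
        positions = [-1] * n
        for i in reversed(range(n)):         # produce the initial instance
            reset_dependents(positions, i)
        while True:
            yield tuple(positions)
            i = 0
            while i < n and positions[i] + 1 > limits[i]:
                i += 1                       # edge i would overflow; try the next-higher edge
            if i == n:
                return                       # even the highest edge overflowed: enumeration done
            positions[i] += 1
            reset_dependents(positions, i)   # moving an edge resets all of its dependents

    # Trivial smoke test: three independent "edges" in a 5-pixel template, no dependencies.
    first = next(enumerate_instances([4, 4, 4], lambda p, i: None))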

5.6.2 Four Box Same feature type

The number of instances that the Four Box feature type can produce is immense (15,144,529,400 ≈ 1.5 ∗ 10^10 instances in a 25x25 template), and AdaBoost’s exhaustive search component takes many hours to create a single weak classifier, even on a cluster with over 60 processors. A simplified version of this feature type was therefore conceived, called “Four Box Same.” It produces only 8,233,632 ≈ 8.2 ∗ 10^6 instances in a 25x25 template and cuts computation time by more than three orders of magnitude. Three additional constraints are enforced: first, all rectangles have the same size in any particular instance. Second, rectangles move in pairs from one instance to the next. The circled numbers in the right drawing of Figure 5.13 indicate which edges move concurrently. Third, the “less-than” condition is enforced for every instance. If it is violated after an edge labeled n has been updated, the instance is considered invalid and the edges numbered n + 1 are moved (prompting their dependent edges to be updated in turn).

The procedure to get the next instance is slightly different for the Four Box Same feature type. As before, edges are dependent upon edges with higher numbers. However, updating dependent edges happens in a different manner: the new edge location is not a function of the ancestor edge’s location but instead is always set to a fixed initial value. This value is indicated by the numbers on the arrows. Together, these conditions allow for instances as they are shown in Figure 5.14. Note that by adding and subsequently subtracting partial rectangles, fairly irregular areas can be compared. Also note that the rectangular areas need not be adjacent to each other as for Jones’ and Viola’s feature type in [72].


Figure 5.14: Example instances of the Four Box Same feature type:
the framed areas are subtracted from solid-black areas.

5.6.3 Results

The relative performance of detectors for different postures stays roughly the same, even though the curves are not as smooth as with non-cascaded detectors due to the staged cascading and the resulting evaluation method (details in Section 2.3.8 and [180]). Of particular interest are the left parts of the curves since a fail-safe hand detection for vision-based interfaces must be on the conservative side with very few false positives. There, the cascaded detectors show ROCs along the lines of the performance predicted in Section 5.3.1: closed outperforms all others, sidepoint is second-best, and the more structured appearance Lpalm now does better than the more uniform Lback.

Extrapolating from the results of this study, we suggest that mostly convex appearances with internal grey-level variation are better suited to the purpose of detection with the Viola-Jones detection method. The open posture, for example, already has a lower Fourier structure “s” value, hinting that background noise hinders extraction of consistent patterns. The detector’s accuracy confirms the difficulty of distinguishing hands from other appearances.

[Figure 5.15 plot: ROC curves, detection rate (0.75-1.0) versus false positive rate (log scale), for the detectors closed (30x20), sidepoint (20x25), victory (25x25), open (25x25), Lpalm (25x25), and Lback (25x25).]

Figure 5.15: ROC curves for detectors with Four Box Same features:
this feature type is less constrained and the areas need not be adjacent. Note that the scale on the y axis is different from previous figures.

The final hand detector that we chose for our application detects the closed posture. For scenarios where we desire fast detection, we picked the parameterization that achieved a detection rate of 92.23% with a false positive rate of 1.01 ∗ 10^−8 in the test set, or one false hit in 279 VGA-sized frames. For most scenarios it is sufficient, however, to pick a parameterization that had a detection rate of 65.80%, but not one false positive in the test set. The high frame rate of the algorithm almost guarantees that the posture is detected within a few consecutive frames.
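As a rough consistency check on these numbers (the sub-window count per frame is back-derived here, it is not stated in the text):

    fp_rate = 1.01e-8                    # false positives per scanned sub-window (test set)
    frames_per_false_hit = 279           # reported for VGA-sized frames
    windows_per_frame = 1 / (fp_rate * frames_per_false_hit)
    print(round(windows_per_frame))      # roughly 3.5e5 scanned sub-windows per 640x480 frame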

5.7 Fixed color histogram

Upon detection of a hand area, it is tested for the amount of skin-colored pixels that it contains. To this end, we built a histogram-based statistical model in HSV space from a large collection of hand-segmented pictures from many imaging sources, similar to Jones and Rehg’s approach [73]. We used a histogram-based method because such methods achieve better results in general, user-independent cases. If a sufficient fraction of the area’s pixels is classified as skin pixels, the hand detection is considered successful and control is passed to the second stage. This coverage threshold can be set in the vision conductor configuration file, see Section 4.2.9. For good performance in this step, the hand must not be vastly over- or under-exposed. The software exposure control can correctly expose a selective area in the video (see Section 4.2.2).
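A minimal sketch of this test, assuming a pre-computed HSV histogram that stores per-bin skin probabilities; the bin counts, the 0.5 cut-off, and the default coverage threshold are illustrative values, not HandVu’s actual parameters (the HSV channels are assumed to be scaled to 0-255).

    import numpy as np

    def skin_coverage(hsv_patch, skin_hist, bins=(32, 32, 16)):
        # hsv_patch: detected area in HSV (uint8); skin_hist: H x S x V array of P(skin | bin).
        h = np.minimum(hsv_patch[..., 0].astype(int) * bins[0] // 256, bins[0] - 1)
        s = np.minimum(hsv_patch[..., 1].astype(int) * bins[1] // 256, bins[1] - 1)
        v = np.minimum(hsv_patch[..., 2].astype(int) * bins[2] // 256, bins[2] - 1)
        skin_prob = skin_hist[h, s, v]               # per-pixel skin probability
        return float(np.mean(skin_prob > 0.5))       # fraction of pixels labeled as skin

    def verify_detection(hsv_patch, skin_hist, min_coverage=0.3):
        # The detection counts as a match only if enough of the area is skin colored.
        return skin_coverage(hsv_patch, skin_hist) >= min_coverage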

The “area of the hand” for this and other postures is defined with the help of a probability map. These maps have the same scale and resolution as the corresponding detectors and state for every pixel the probability that it belonged to the hand versus to the background in the training data.

No grey-level fiducials or colored markers are employed, and still a good detection accuracy is achieved. This is possible due to the multi-cue integration of texture and skin color. Unlike in the tracking method explained in the following chapter, both modalities must report a positive match for the detection to be successful. This makes sense as a false positive is potentially more harmful than a false negative, assuming that subsequent frames will eventually correctly detect the initialization posture.

5.8 Hand pixel probability maps

The grey-level appearance-based hand detector finds rectangular areas that contain hands. Not every pixel within those areas is likely to belong to a hand, however, and some pixels will belong to the background. This spatial probability distribution is defined for each posture and estimated from training images. Figure 5.16 shows the probability maps for six postures. A brighter pixel indicates a higher probability that the respective pixel in the detected area belongs to the hand, that is, is of skin color.


The maps were constructed by averaging a number of grey-level training images. Since the hand is usually brighter than the background area, pixels belonging to the hand showed up in the mean image as brighter pixels. This mean image was normalized to have values between zero and one. Areas that have skin color but are darker, for example, those between two adjacent fingers, were manually set to high probability values. Similarly, high-value pixels that were known to be from the background were set to low values.
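A minimal sketch of this construction, assuming a stack of aligned grey-level training patches and optional hand-drawn correction masks; the variable names and the 0.95/0.05 correction values are illustrative.

    import numpy as np

    def build_probability_map(patches, force_high=None, force_low=None):
        # patches: N x 25 x 25 array of aligned grey-level training images.
        mean_img = patches.mean(axis=0)
        prob_map = (mean_img - mean_img.min()) / (mean_img.max() - mean_img.min() + 1e-9)
        if force_high is not None:        # e.g. dark skin areas between adjacent fingers
            prob_map[force_high] = 0.95
        if force_low is not None:         # bright pixels known to belong to the background
            prob_map[force_low] = 0.05
        return prob_map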

Figure 5.16: The probability maps for six hand postures:
the maps are shown in 25x25 pixel resolution. A brighter pixel indicates a higher probability for a pixel observed at that location to belong to the hand appearance and thus to be of skin color.

The probability maps are used in two places in HandVu: first, the color of the detected hand appearance itself is learned upon detection, based on pixels with high probability values in the map. To this end, the map is scaled to the actual detection area’s size. Second, the “Flock of Features” tracker (see Chapter 6) favors high map values when initially placing the features.


5.9 Learned color distribution

At hand detection time, the observed hand color is learned in a normalized-RGB histogram and contrasted to the background color as observed in a horseshoe-shaped area in the image around the hand, see Figure 5.17. This assumes that no other exposed skin of the person whose hand is to be tracked is within that background reference area. Since our applications mostly assume a forward- and downward-facing head-worn camera, this assumption is reasonable. We ensured that it was met for our test videos, which also included other camera locations. The segmentation quality that this dynamic learning achieves is very good as long as the hand’s lighting conditions do not change dramatically and the reference background is representative of the actual background. For example, wooden objects that are not within the reference background area during learning will frequently be classified wrongly as foreground color.
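A sketch of this dynamic learning step, assuming binary masks for the detected hand area and the horseshoe-shaped background region; the 32-bin normalized-rg histograms and the simple ratio-based probability are illustrative choices, not necessarily HandVu’s exact formulation.

    import numpy as np

    def rg_histogram(img, mask, bins=32):
        # Normalized-RGB (chromaticity) histogram over the masked pixels of an RGB image.
        rgb = img[mask].astype(float) + 1e-6
        chrom = rgb / rgb.sum(axis=1, keepdims=True)           # (r, g, b) sums to 1
        r_idx = np.minimum((chrom[:, 0] * bins).astype(int), bins - 1)
        g_idx = np.minimum((chrom[:, 1] * bins).astype(int), bins - 1)
        hist = np.zeros((bins, bins))
        np.add.at(hist, (r_idx, g_idx), 1)
        return hist / max(hist.sum(), 1)

    def learn_hand_color(img, hand_mask, horseshoe_mask):
        hand_hist = rg_histogram(img, hand_mask)                # foreground (hand) colors
        bg_hist = rg_histogram(img, horseshoe_mask)             # presumed background colors
        # Per-bin probability that a color belongs to the hand rather than the background.
        return hand_hist / (hand_hist + bg_hist + 1e-9)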

5.10 Discussion

In this section we will discuss practical considerations for the detection module. To achieve the detection rates that are reported in this chapter, the following conditions must be met. First, there should be no intense, direct sunlight on the hand, particularly no hard, cast shadows that cover parts of the hand. Second, the exposure of the hand area should be approximately correct; this can be corrected for automatically with the method proposed in Section 4.2.2. Third, the hand area in the image must be at least as big as the properly scaled recognition template. For example, the 30x20 resolution template with the 0.6785 aspect ratio (the best detector) has a minimum size of 30x37 pixels in width and height. Fourth, camera and hand must not both be held static at the same time since this produces unchanging video frames. This is problematic, as discussed in the following.

Figure 5.17: The areas for learning the skin color model:
After the hand was detected, the color in the hand-masked area (white) is learned in a histogram. The pixelized look stems from scaling the 30x20 sized maps to the detected hand’s size. A second histogram is learned from the horseshoe-shaped area around the hand (black); it is presumed to contain only background.

The detection probabilities in two consecutive frames are not independent. In particular, this means that the chances for hand detection are reduced if it did not succeed in the previous frame. This is an inherent property of video processing and not a shortcoming of any particular computer vision method. The results of this chapter were obtained with still-image cameras, where this is much less of an issue. While there is no technical solution to this problem beyond improving the per-frame performance, a user of a vision-based interface is expected to adapt to these characteristics over time. In HandVu’s case this means that after a few unsuccessful detections a user could move the hand, the camera, or both in order to present slightly different images to the vision methods. This discussion also applies to the posture recognition method that is covered in Chapter 7.

Some parameters of the detection module can easily be changed in the vision conductor configuration file, see Section 4.2.9. In particular, the relative amount of masked hand area that must be of skin color (determined with the fixed histogram) in order to regard a detection as a match can be used to adapt VBI performance to different environments: if the hand area is expected to always be well exposed and the lighting is such that it does not result in many specularities, the parameter can be turned up to about 80%. This is because the skin color can then be segmented much more reliably, which reduces the number of false positives, for example, in well-lit indoor environments. Outdoors, on the other hand, the parameter should not be higher than 30% since the apparent skin color can vary a lot more and more weight should be given to the grey-level information.


If the application designer wishes to make hand detection a more distinct event and thus distinguish it more strongly from undesired activations, the duration parameter in the configuration file can help to avoid inadvertent VBI initializations: the value of this parameter specifies in milliseconds the time that a posture has to be recognized continuously before a match is reported. Similarly, initialization can be restricted to postures performed within a certain pixel radius from the first detection. This allows distinguishing moving from fixed-pose hands that are in the same posture.

Also, we give the HandVu user the opportunity to detect different postures for initialization by specifying more than one entry in the “detection cascades” list. The additional cascades can be scanned over the same detection area as the cascade for the first posture or over a different one. For example, this can be conveniently exploited for an extension to HandVu that can be initialized with both the left hand and the right hand in separate locations, with applications for left-handed users.

Lastly, the processing time of the detection module by itself shall be mentioned, using the most accurate detector for the closed hand posture as detailed in Section 5.6.3. To scan an entire VGA-sized video frame (640x480 pixels), with initial translation increments of two pixels in the horizontal and three pixels in the vertical, between the scale 1.0 and the maximum that the frame size allows, with a scale increment factor of 1.2, it takes between 114ms and 211ms with a mean of 118.625ms and a median of 118ms on a 3GHz Xeon running Windows XP. For the 218x308-sized area that we most frequently used (for example, in the Maintenance Application described in Section 8.6), with the detector scaled from its minimum size (scale factor 1.0) to scale factor 8.0 with an increment factor of 1.2, the processing time is between 21ms and 27ms with a mean of 22.373ms and a median of 22ms.
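The following sketch enumerates the scales and translations of such a scan and counts the evaluated sub-windows. It is a rough illustration of the scan loop only: the actual scanner may adjust the translation increments with scale and stops early in most cascade stages, and the 20x30 template size (width by height) is an assumption, so the counts are not meant to reproduce the measured timings.

    def count_windows(frame_w, frame_h, templ_w, templ_h, dx, dy, scale_inc):
        # Enumerate scales from 1.0 up to the largest that still fits the frame and
        # count the template placements at each scale.
        total, scale = 0, 1.0
        while True:
            w, h = int(round(templ_w * scale)), int(round(templ_h * scale))
            if w > frame_w or h > frame_h:
                break
            nx = (frame_w - w) // dx + 1
            ny = (frame_h - h) // dy + 1
            total += nx * ny
            scale *= scale_inc
        return total

    # VGA frame with the translation and scale increments given above.
    print(count_windows(640, 480, 20, 30, dx=2, dy=3, scale_inc=1.2))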



Chapter 6

Tracking of Articulated Objects

The objective of the vision component described in this chapter is to follow the hand as robustly as possible after it has been detected. Preliminary studies showed that shape-based methods are unsuited for non-rigid objects because the variety of contours would have to be dealt with in an explicit and thus high-dimensional manner. Color-based methods work well only as long as the hand is the predominant skin-colored object in view. To overcome these problems, this chapter introduces the “Flock of Features,” a fast tracking method for non-rigid and highly articulated objects such as hands. It combines optical flow and a learned color probability distribution to facilitate 2D position tracking of the object as a whole (not each articulation) from a monocular view. The tracker’s benefits include its speed and its ability to track rapid hand movements despite arbitrary finger configuration changes (postures). It can deal with arbitrary and dynamic backgrounds, significant camera motion, and some lighting changes. It does not require a shape-based hand model, thus it is in principle applicable to tracking any deformable or articulated object. A more distinct and uniform object color increases performance but is not essential. Tracker performance is evaluated on hand tracking with a non-stationary camera in unconstrained indoor and outdoor environments. The main results of this work were published in [91].

6.1 Preliminary studies

Shape or contour-based tracking is not very robust. Even for very rigid, optimally distinct, and optimally concave/convex objects as in Figure 6.1, tracking is very sensitive to background noise. The shown example was implemented with an Active Shape Model on image pyramids. It requires good initialization and small frame-to-frame differences for tracking. As can be seen, the tracking is disturbed by intensity variations on the hand (third, top right image). Eventually the shape is attracted to high gradients not caused by the hand but by the keyboard and the display border (bottom right image).

A combination of ASMs with predictive filters such as Kalman or particle filters improves the tracking performance, but does not eliminate the sensitivity to noise. Some texture- or appearance-based methods, on the other hand, track more robustly and do not require a model learned during training.


Figure 6.1: Tracking a hand with an Active Shape Model:
these images of tracking with the ASM were taken every 2 seconds. The thin, static line is only used for initialization purposes. The thick line represents the ASM-estimated shape.

6.2 Flocks of Features

The tracker’s core idea is motivated by the seemingly chaotic flight behavior of a flock of birds such as pigeons. While no single bird has any global control, the entire flock still stays tightly together, a large “cloud.” This decentralized organization has been found to mostly hinge upon two simple constraints that can be evaluated on a local basis: birds like to maintain a minimum safe flying distance to the other birds, but desire not to be separated from the flock by more than another threshold distance; see, for example, Reynolds [145].


Figure 6.2: The Flock of Features in action:
tracking despite a non-stationary camera, hand articulations, and changing lighting conditions. The images are selected frames from sequence #5.

The hand tracker consists of a set of small image areas, or features, moving from frame to frame in a way similar to a flock of birds. Their “flight paths” are determined by optical flow, and then constrained by observing a minimum distance from all other features and by not exceeding a maximum distance from the feature median. If these conditions are violated, the feature is repositioned to a location that has a high skin color probability. This fall-back on a second modality counters the drift of features onto nearby background artifacts that exhibit strong grey-level gradients.

The speed of pyramid-based KLT feature tracking (see Section 6.2.1 below) allows our method to overcome the computational limitations of model-based approaches to tracking, easily achieving the real-time performance required for vision-based interfaces.¹ It delivers excellent results for tracking quickly moving rigid objects. The flocking feature behavior was introduced to allow for tracking of objects whose appearance changes over time, that is, to make up for features that are “lost” from one frame to another because the image mark they were tracking disappeared. Since mere feature re-introduction within proximity of the flock cannot provide any guarantees on whether it will be located on the object of interest or some background artifact, color as the second modality is consulted to aid in the choice of location. An overview of the entire algorithm is given in Figure 6.3.

¹ The color distribution can be seen as a model, yet it is not known a priori but learned on the fly.

6.2.1 KLT features and tracking initialization

KLT features are named after Kanade, Lucas, and Tomasi,² who found that a steep brightness gradient along at least two directions makes for a promising feature candidate to be tracked over time (“good features to track,” see [158]). In combination with image pyramids (a series of progressively smaller-resolution interpolations of the original image [110]), a feature’s image area can be matched efficiently to the most similar area within a search window in the following video frame. The feature size determines the amount of context knowledge that is used for matching. If the feature match correlation between two consecutive frames is below a threshold, the feature is considered “lost.”

² KLT trackers are not to be confused with the Karhunen-Loeve Transform, often abbreviated KLT as well.

    input:
        h_size  - rectangular area containing hand
        mindist - minimum pixel distance between features
        n       - number of features to track
        winsize - size of feature search windows

    initialization:
        learn color histogram
        find n*k good-features-to-track with mindist
        rank them based on color and fixed hand probability maps
        pick the n highest-ranked features

    tracking:
        update KLT feature locations with image pyramids
        compute median feature
        for each feature
            if less than mindist from any other feature
               or outside h_size, centered at median
               or low match correlation
            then relocate feature onto good color spot
                 that meets the flocking conditions

    output:
        the average feature location

Figure 6.3: The Flock of Features tracking algorithm:
k is an empirical value, chosen so that enough features end up on good colors; we use k = 3. The fixed hand probability map is a known spatial distribution for pixels belonging to some part of the hand in the initialization posture.

Recently, Toews and Arbel [173] proposed a method for finding good candidates for tracking and claim better performance than KLT features achieve. Picking our features based on their criterion might in fact improve our Flock of Features tracking even further.

The hand detection component, described in the previous chapter, supplies both a rectangular bounding box and a probability distribution to initialize tracking. This probability “map” is particular to the recognized gesture and was learned offline. It states for every pixel in the bounding box the likelihood that it belongs to the hand and is described in Section 5.8. A set of approximately 100 features is chosen according to the goodness criterion and observing a pairwise minimum distance. A subset of the features is then selected based on the map and color probability. The subset’s cardinality is the target number of features, which will be maintained throughout tracking by replacing lost features with new ones.

Each feature is tracked individually from frame to frame. That is, its new location becomes the area with the highest match correlation between the two frames’ areas. The features will not move in a uniform direction; some might be lost and others will venture far from the flock.
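A condensed sketch of the initialization and the per-frame feature update using OpenCV’s pyramidal Lucas-Kanade implementation; the parameter values and the ranking heuristic are illustrative, not the exact ones used in HandVu.

    import cv2
    import numpy as np

    def init_features(gray, bbox, prob_map, color_prob, n=50, k=3, mindist=3):
        # bbox = (x, y, w, h) from the hand detector; prob_map is the posture's
        # hand-pixel probability map scaled to the bbox, color_prob a per-pixel
        # skin probability image (both in [0, 1]).
        x, y, w, h = bbox
        mask = np.zeros(gray.shape, np.uint8)
        mask[y:y + h, x:x + w] = 255
        pts = cv2.goodFeaturesToTrack(gray, maxCorners=n * k, qualityLevel=0.01,
                                      minDistance=mindist, mask=mask)
        pts = pts.reshape(-1, 2)
        # Rank candidates by hand probability and skin color at their location.
        scores = [prob_map[int(py) - y, int(px) - x] * color_prob[int(py), int(px)]
                  for px, py in pts]
        order = np.argsort(scores)[::-1][:n]
        return pts[order].astype(np.float32).reshape(-1, 1, 2)

    def track_features(prev_gray, gray, features, winsize=11):
        # Pyramidal KLT update; features the tracker could not follow are flagged as lost
        # (err could additionally be thresholded to mimic the match-correlation test).
        new_pts, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, features, None,
            winSize=(winsize, winsize), maxLevel=2)
        lost = (status.ravel() == 0)
        return new_pts, lost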


6.2.2 Flocking behavior

The flocking behavior is a way of enforcing a loose global constraint on the feature locations that keeps them spatially confined. During tracking, the feature locations are first updated just like regular KLT features as described in the previous section, and their median is computed. Then, the two flocking conditions are enforced at every frame: no two features must be closer to each other than a threshold distance, and no feature must be further from the feature median than a second threshold distance. Unlike birds that will gradually change their flight paths if the flocking conditions are not met, the tracking method abruptly relocates affected features to a new location that fulfills the conditions. The flock of features can be seen in Figure 6.4 as clouds of little dots.

The effect of this method is that individual features can latch on to arbitrary artifacts of the object being tracked, such as the fingers of a hand. They can then move independently along with the artifact, without disturbing most other features and without requiring the explicit updates of model-based approaches, resulting in flexibility and speed. Overly dense concentrations of features that would ignore other object parts are avoided because of the minimum-distance constraint. Stray features that are likely to be too far from the object of interest are brought back into the flock due to the maximum-distance constraint.


Figure 6.4: Images taken during tracking:
these images are individual frames from sequence #3 with highly articulated hand motions. 200x230 pixel areas were cropped from the 720x480-sized frames. The cloud of little dots represents the flock of features, the big dot is their mean. Note the change in size of the hand appearance between the first and fifth image and its effect on the feature cloud.

The median was chosen over the mean location to enforce the maximum-distance constraint because of its robustness towards spatial outliers. In fact, the furthest 15% of features are also removed from the median computation to achieve temporally more stable results. However, the location of the tracked object as a whole is considered to be the mean of all features since this measure changes more smoothly over time than the median. The gained precision is important for the vision-based interface’s usability.
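A sketch of this per-frame enforcement step under the assumptions just described (trimmed median, minimum pairwise distance, maximum distance from the median); relocate_to_color stands in for the color-based relocation of Section 6.2.3, and the function names and trimming fraction mirror the text rather than HandVu’s exact code.

    import numpy as np

    def trimmed_median(points, trim=0.15):
        # Median over the features, ignoring the 15% that are furthest from it.
        med = np.median(points, axis=0)
        dist = np.linalg.norm(points - med, axis=1)
        keep = dist <= np.quantile(dist, 1.0 - trim)
        return np.median(points[keep], axis=0)

    def enforce_flock(points, lost, mindist, maxdist, relocate_to_color):
        # points: n x 2 array of updated KLT feature locations; lost: boolean flags.
        med = trimmed_median(points)
        for i in range(len(points)):
            too_close = any(j != i and np.linalg.norm(points[i] - points[j]) < mindist
                            for j in range(len(points)))
            too_far = np.linalg.norm(points[i] - med) > maxdist
            if lost[i] or too_close or too_far:
                points[i] = relocate_to_color(points, med, i)
        # The reported hand position is the mean, which varies more smoothly than the median.
        return points, points.mean(axis=0)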


6.2.3 Color modality and multi-cue integration

A histogram-based probability for a pixel’s color to be of hand color is obtained as described in the previous chapter, Section 5.9. The color information is used as a probability map (of a pixel’s color belonging to the hand) in three places. First, the CamShift method, which the tracker was compared to, solely operates on this modality. Second, at tracker initialization time, the KLT features are placed preferably onto locations with a high skin color probability. This is true even for the two tracking styles that did not use color information in subsequent tracking steps, see Section 6.3.

Third, the new location of a relocated feature (due to low match correlation or violation of the flocking conditions) is chosen to have a high color probability. If this is not possible without repeated violation of the flocking conditions, the location is chosen randomly. The goodness-to-track criterion is not taken into account at this point anymore, but doing so would probably not improve tracking because the features quickly move to such locations anyway.

This method leads to a very natural multi-modal integration, combining cues from feature movement based on grey-level image texture with cues from textureless skin color probability. The relative contribution of the modalities can be controlled by changing the threshold of when a KLT feature is considered lost between frames. If this threshold is low, features are relocated more frequently, raising the importance of the color modality, and vice versa. The threshold can be set in the vision conductor configuration, see Section 4.2.9.
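A possible implementation of the color-guided relocation described above (and of the relocate_to_color stand-in used in the earlier sketch): candidate positions are sampled near the trimmed median, the flocking conditions are checked, and the highest skin probability wins, with a random fall-back. The sampling scheme and candidate count are illustrative assumptions.

    import numpy as np

    def relocate_feature(points, i, median, color_prob, mindist, maxdist, n_candidates=100):
        # Draw candidate positions within maxdist of the (trimmed) feature median.
        rng = np.random.default_rng()
        candidates = median + rng.uniform(-maxdist, maxdist, size=(n_candidates, 2))

        def ok(c):
            # Flocking conditions: keep a minimum distance to every other feature
            # and stay within maxdist of the median.
            if np.linalg.norm(c - median) > maxdist:
                return False
            others = np.delete(points, i, axis=0)
            return bool(np.all(np.linalg.norm(others - c, axis=1) >= mindist))

        valid = [c for c in candidates if ok(c)]
        if not valid:
            return median + rng.uniform(-maxdist, maxdist, size=2)   # random fall-back
        h, w = color_prob.shape

        def prob(c):
            x, y = int(np.clip(c[0], 0, w - 1)), int(np.clip(c[1], 0, h - 1))
            return color_prob[y, x]

        return max(valid, key=prob)                                   # best skin-color spot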

6.3 Experiments

The main objective of the experiments was to assess the tracker’s performance in comparison to a frequently used, state-of-the-art tracker. The CamShift tracking method (see Bradski [14]) was chosen because it is widely available and because it is representative of single-cue approaches. The contribution of both the flocking behavior and of the multi-cue integration was also of interest. Five tracking styles were therefore compared:

• CamShift: The OpenCV implementation of CamShift [14] was supplied with the learned color distribution. A pilot study using a fixed HSV histogram yielded inferior results.

• KLT features only: The KLT features were initialized on the detected hand and subject to no restrictions during subsequent frames. If their match quality from one to the next frame was below a threshold, they were reinitialized randomly within proximity of the feature median.


• KLT features with flocking behavior: As above, but the constraints on minimum pairwise feature distance and maximum distance from the median were enforced at every frame (see Section 6.2.2).

• KLT features with color: As plain KLT features, but resurrected features were placed onto pixels with high skin-color probabilities (see Section 6.2.3).

• Combined flocking and color cue: This tracking style combines the above two methods into the actual Flock of Features tracker.

All styles used color information that was obtained in identical ways. All KLT-based styles used the same feature initialization technique, based on a combination of known hand area locations and learned hand color. This guarantees equal starting conditions to all styles.

Feature tracking was performed with three-level pyramids in 720x480 video, which arrived at our DirectShow filter at approximately 13 frames per second. The tracking results were available after 2-18ms processing time, depending on search window size and the number of features tracked.

Aside from comparing different tracking styles, we also experimented with different parameterizations of the Flock of Features. The following independent variables were varied: the number of features tracked, the minimum pairwise feature distance, and the feature search window size.


6.3.1 Video sequences

A total of 518 seconds of video footage was recorded in seven sequences. Each sequence follows the motions of the right hand of one of two people, some filmed from the performer’s point of view, some from an observer’s point of view. For 387 seconds (or 4979 frames) at least one of the styles successfully tracked the hand. Table 6.1 details the sequences’ main characteristics. The videos were shot in our lab and at various outdoor locations, the backgrounds including walkways, random vegetation, bike racks, building walls, etc. The video was recorded with a hand-held DV camcorder, then streamed with FireWire to a 3GHz desktop computer and processed in real-time. The hand was detected automatically in the initialization posture as described in the previous chapter.

6.4 Results

We define tracking to be lost when the mean location is not on the hand anymore, with extremely concave postures being an exception. The tracking for the sequence was stopped then, even though the hand might later have coincidentally “caught” the tracker again due to the hand’s path intersecting the erroneously tracked location. Since the average feature location cannot be guaranteed to be on the center of the hand or any other particular part, merely measuring the distance between the tracked location and some ground truth data cannot be an accurate measure for determining tracking loss. Thus, the results were visually inspected and manually annotated.

Table 6.1: The video sequences and their characteristics:
three sequences were taken indoors, four in the outdoors. In the first one, the hand was held in a mostly rigid posture (fixed finger flexion and orientation); all other sequences contained posture changes. The videos had varying amounts of skin-colored background within the hand’s proximity. Their full length is given in seconds, counting from the frame in which the hand was detected and tracking began. The maximum time and number of frames that the best method tracked a given sequence are stated in the last column.

    id   outdoors   posture changes   skin backgrnd   total length   max tracked
    1    no         no                yes             95s            79.3s  1032f
    2    no         yes               yes             76s            75.9s   996f
    3    no         lots              little          32s            18.5s   226f
    4    yes        yes               little          72s            71.8s   923f
    5    yes        yes               yes             70s            69.9s   907f
    6    yes        yes               yes             74s            31.4s   382f
    7    yes        yes               yes             99s            40.1s   513f

6.4.1 Comparison to CamShift

Figure 6.5 illustrates our method’s performance in comparison to a CamShift tracker that is purely based on color. The leftmost bar for each of the seven sequences shows that CamShift performs well on sequences three and four due to the limited amount of other skin-colored objects nearby the tracked hand. In all other sequences, however, the search region and area tracked quickly expand too far and lose the hand in the process.

[Figure 6.5 plot: fraction of sequence tracked (0 to 1) for each of the seven sequences and their normalized sum (group 8); bars for CamShift, the worst flock per sequence, the mean flock per sequence, and the overall best flock.]

Figure 6.5: Results of tracking with Flocks of Features:
this graph shows the time until tracking was lost for each of the different tracking styles, normalized to the best style’s performance for each video sequence. Groups 1-7 are the seven video sequences, group 8 is the normalized sum of all sequences. The Flocks of Features track the hand much longer than CamShift, the comparison tracker.

The other bars are from twelve Flock of Features trackers with 20-100 features and search window sizes between 5 and 17 pixels squared. Out of these twelve trackers, the worst and mean tracker for the respective sequence is shown. In all but two sequences, even the worst tracker outperforms CamShift, while the best tracker frequently achieves an order of magnitude better performance (each sequence’s best tracker is normalized to 1 on the y-axis and not explicitly shown). The rightmost bar in each group represents a single tracker’s performance: the overall best tracker, which had 15x15 search windows, 50 features, and a minimum pairwise feature distance of 3 pixels.

Next, we investigated the relative contributions of the flocking behavior and the color cue integration to the combined tracker’s performance. Figure 6.6 indicates that adding color as an additional image cue contributes more to the combined tracker’s good performance than the flocking behavior in isolation. The combination of both techniques achieves vast improvements over the CamShift tracker.

6.4.2 Parameter optimizations

Figure 6.7 presents the tracking results after varying the target number of features that the flocking method maintains. The mean fraction’s plateau suggests that 50 features cover the hand area as well as 100 features do. The search window size of 11x11 pixels allows for overlap of the individual feature areas, making this a plausible explanation for no further performance gains after 50 features.


[Figure 6.6 plot: normalized number of frames tracked (0 to 1) for CamShift, KLT only, flock only, color only, and the combined Flock of Features.]

Figure 6.6: Contributors towards the Flock of Features’ performance:
both the flocking behavior and the color cue add to the combined tracker’s performance. Shown is the normalized sum of the number of frames tracked with each tracker style, similar to the eighth group in Figure 6.5. The combination into the actual Flock of Features tracker shows significant synergy effects over the other trackers’ performances.


[Figure 6.7 plot: fraction of sequence tracked (0 to 1) versus target number of features (20-100), one bar group per sequence plus the mean.]

Figure 6.7: Tracking with different numbers of features:
varying the number of features influences the performance for each of the video sequences. The KLT features were updated within an 11x11 search window and a pairwise distance of 2.0 pixels was enforced. (The bars are normalized for each sequence’s best tracker, which might not be shown here.)


In a related result (not shown), no significant effect was found for the minimum pairwise feature distance in the range between two and four pixels. However, smaller threshold values (especially the degenerate case of zero) allow very dense feature clouds that retract to a confined part of the tracked hand, decreasing robustness significantly.

The number of features, the minimum feature distance, and the search window size should ideally depend on the size of the hand and possibly the size of its articulations. These parameters were not dynamically adjusted since our experiments were conducted exclusively on hands that were also within a size factor of about two of each other (an example of scale change are the first and fifth images in Figure 6.4). The window size has two related implications. First, a larger size should be better at tracking global motions (position changes), while a smaller size should perform advantageously at following finger movements (hand articulations). Second, larger areas are more likely to cross the boundary between hand and background, making it more difficult to pronounce a feature lost based on its match correlation. However, Figure 6.8 does not explicitly show these effects. Other factors could play a role in how well the sequences come off, which warrants further investigation. On the other hand, the general trend is very pronounced.


[Figure 6.8 plot: fraction of sequence tracked (0 to 1) versus feature search window side length (3-17 pixels, squared windows), one bar group per sequence plus the mean.]

Figure 6.8: Tracking with different search window sizes:
this graph shows how tracker performance is affected by search window size (square; side length given on x-axis). Larger window sizes improve tracking dramatically for sequences with very rapid hand location changes (sequences 3, 4, 5), but tracking of fast or complicated configuration variations suffers with too large windows (sequences 3, 7).


6.5 Discussion

The experiments show that the performance improvement must be attributed to two factors. First, the purely texture-based and thus within-modality technique of flocking behavior contributes positively, as witnessed by comparing KLT features with and without flocking. Second, the cross-modality integration adds further to the performance, visible in improvements from flocking-only and color-only to the combined approach.

A perfect integration technique for multiple image cues would reduce the failure modes to simultaneous violations of all modalities’ assumptions. To achieve this for the Flock of Features and its on-demand consultation of the color cue, a failure in the KLT/flocking modality would have to be detectable autonomously (without help from the color cue). To the best of our knowledge, this cannot be achieved theoretically. In practice, however, each feature’s match quality between frames is a good indicator for when the modality might not be reliable. This was confirmed by the above experiments, as the features could be observed to flock towards the center of the hand (and its fairly stable appearance there) as opposed to the borders to the background where rapid appearance changes are frequent.

The presented method’s limitations can thus be attributed to two causes: undetected failure of the KLT tracking, and simultaneous violation of both modalities’ assumptions. The first case occurs when features gradually drift off to background areas without being considered lost or violating flocking constraints. The second case occurs if the background has a high skin-color probability, has high grey-level gradients to attract and capture features, and the tracked hand undergoes transformations that require many features to reinitialize.

There is a performance correlation between the target number of features, the minimum distance between features, and the search window size. The optimal parameters also depend on the size of the hand, which is currently assumed to vary after initialization by no more than approximately a factor of two in each dimension.

The Flock of Features method was designed for coarse hand tracking over a time span on the order of ten seconds to one minute. Its purpose is to provide 2D position estimates to an appearance-based posture recognition method that does not require an overly precise bounding box on the hand area. Thus, it was sufficient to obtain the location of some hand area, versus that of a particular spot such as the index finger’s tip. In the complete vision system (see Chapter 4), every successful posture classification re-initializes tracking and thus extends the tracking period into the long-term range.

Depending on parameterization, processing one frame took between 2ms and 18ms. Thus, the achieved frame rate of 13 frames per second was limited by the image acquisition and transmission hardware and not by the tracking algorithm. Higher frame rates will allow vastly better performance because KLT feature tracking becomes much faster and even less error prone with shorter between-frame latencies.

We have not found a good solution for automatic detection of tracking loss. Heuristics based on KLT feature locations provide some clues, but on a few occasions the system would track some non-hand object.

The frequency of posture recognitions that re-initialize tracking depends entirely on the user task and can thus not be evaluated without that context. However, research (at least for communicative gestures) has shown that fingers usually move while the hand is in a fixed pose [137]. This means that mostly a rigid appearance is to be tracked through large position changes, which is more reliable, and that articulations presumably result in the formation of one of the key postures, causing frequent re-initializations.

The accuracy of the average KLT feature location (the pointer’s location) with respect to some fixed point on the hand cannot be guaranteed because of the entirely object-independent tracking method. However, this is only of concern for registered manipulation tasks, as other interaction techniques involve pointer location transformations or are location independent.


Flocks of Features frequently track the hand successfully despite partial occlusions. Full object occlusions are impossible to handle reliably at the image level. They are better dealt with at a higher level, such as with physical and probabilistic object models (used, for example, by Jojic et al. [70] and Wren and Pentland [186]). A Flock of Features improves the input to these models, providing them with better image observations that will in turn result in better model parameter estimates.

A brief evaluation of hand tracking precision with an object-following task yielded no significant differences from the performance of a handheld trackball (in terms of mean and median distance of the pointer from the object). Empirically, the tracking precision is excellent; even minute hand movements are tracked. Illustrating the naturalness of the interface, people frequently employed hand movements at first for the trackball task before remembering that the hand was no longer being tracked.

6.6 Two-handed tracking and temporal filters

HandVu is designed to detect and track multiple objects within view, and its output interfaces can report multiple objects’ states. This allows two-handed interfaces and even tracking of non-hand objects. The current computer vision methods allow for rudimentary detection and tracking of a second hand in view to provide that input modality in specific instances. This functionality can be turned on with a library call and is implemented as follows.

A blob of color similar to the learned probability distribution is sought, at a fixed minimum distance from the first hand, to its diagonal lower-left, and with a fixed minimum cross section perpendicular to that diagonal. The motivation behind searching only within the lower-left quadrant with respect to the right hand’s position comes from the application interface that was to be realized with two-handed interaction: image area selection to take a snapshot of the hand-enclosed “frame,” as shown in Figure 8.4.
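The following Python sketch illustrates the kind of constrained blob search just described. It is not the HandVu source; the probability threshold, the distances, and the simple pixel-based cross-section test are illustrative assumptions.

```python
# Sketch of a second-hand search (illustrative, not the HandVu source):
# look for skin-colored pixels to the diagonal lower-left of the tracked
# (right) hand, at a minimum distance and with a minimum cross section
# perpendicular to that diagonal.
import numpy as np

def find_left_hand(skin_prob, right_hand_xy, min_dist=80.0,
                   min_cross_section=30.0, prob_thresh=0.5):
    """skin_prob: HxW array of skin-color probabilities in [0, 1].
    right_hand_xy: (x, y) of the tracked right hand.
    Returns the centroid (x, y) of a candidate left-hand blob, or None."""
    ys, xs = np.nonzero(skin_prob > prob_thresh)
    if xs.size == 0:
        return None
    rx, ry = right_hand_xy

    # Keep only pixels in the lower-left quadrant relative to the right hand
    # (image y grows downward).
    keep = (xs < rx) & (ys > ry)
    xs, ys = xs[keep], ys[keep]
    if xs.size == 0:
        return None

    # Project candidate pixels onto the lower-left diagonal and onto the
    # direction perpendicular to it.
    diag = np.array([-1.0, 1.0]) / np.sqrt(2.0)   # toward the lower-left
    perp = np.array([1.0, 1.0]) / np.sqrt(2.0)
    rel = np.stack([xs - rx, ys - ry], axis=1).astype(float)
    along = rel @ diag
    across = rel @ perp

    # Enforce the minimum distance from the first hand along the diagonal.
    far_enough = along > min_dist
    if not np.any(far_enough):
        return None

    # Require a minimum spread perpendicular to the diagonal (cross section).
    spread = across[far_enough].max() - across[far_enough].min()
    if spread < min_cross_section:
        return None

    return float(xs[far_enough].mean()), float(ys[far_enough].mean())
```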

A more sophisticated second-hand detection could employ a mirrored detector as described in the previous chapter. Since the two hands can usually be expected to have very similar colors, the color-based verification will help achieve a very high confidence in matches. It is straightforward to extend the vision module to track a once-detected object with a second Flock of Features. Mutual hand occlusions and similar high-level artifact interactions would, however, be a bigger issue than when only single-handed input styles are allowed, and appropriate high-level measures, for example Kalman filtering, would have to be taken.


Kalman filtering and/or a Condensation filter can be added at the application level to keep track of the temporal aspects and to improve tracking results. The input to and output of such a filter are the extracted features, such as the hand location and derivatives thereof, but no image-level features.
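As a concrete illustration of such an application-level filter, the sketch below runs a constant-velocity Kalman filter over the 2D hand location reported by the tracker. The state model and noise magnitudes are illustrative assumptions; a Condensation (particle) filter could be substituted in the same place.

```python
# Illustrative constant-velocity Kalman filter over the reported 2D hand
# location (not part of HandVu itself). Noise magnitudes are assumptions.
import numpy as np

class HandKalman:
    def __init__(self, dt=1.0 / 15.0, process_noise=5.0, meas_noise=4.0):
        # State: [x, y, vx, vy]; measurement: [x, y].
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_noise
        self.R = np.eye(2) * meas_noise
        self.P = np.eye(4) * 100.0
        self.x = np.zeros(4)

    def update(self, measured_xy):
        # Predict one frame ahead.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct with the tracker's hand location, if one was reported.
        if measured_xy is not None:
            z = np.asarray(measured_xy, dtype=float)
            y = z - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]   # smoothed hand location
```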



Chapter 7

Posture Recognition

The third computer vision component of the hand gesture interface HandVu attempts posture classification at and near the image location of the tracked hand. The terms posture classification and recognition are used in this dissertation to mean view-dependent hand configuration classification, that is, determining the configuration formed by the fingers. A posture in this sense is in fact a combination of a posture and a view direction, allowing for the possibility to distinguish two different views of the same finger configuration.

The classification method does not require highly accurate output of the hand tracking module, for two reasons. First, an area larger than the exact tracked location is scanned for the key postures. Tracking imprecision, and in particular feature drift along the hand, are thus countered. Second, the method has explicit knowledge of a “no known hand posture” class and can therefore produce correct results without requiring knowledge about the presence of a hand in the image area.

The focus of the recognition method is on reliability, not expressiveness. That is, it distinguishes a few postures reliably and does not attempt less consistent recognition of a larger number of postures. Also, a large number of postures would, at least in initial user interfaces, put a high cognitive load on the user, who has to memorize all of them. Thus, a vocabulary of six postures was chosen.

Classification based on shape or contour information has not proven to be a very reliable method because of its sensitivity to background clutter. While it provides good results for easily segmentable images, the general case with lots of background noise does not produce sufficiently stable classifications. Our recognition method uses a texture-based approach to fairly reliably classify image areas into seven classes: six postures and “no known hand posture.” A two-stage hierarchy achieves both accuracy and good speed performance.

The following section describes the exact method we used for posture recognition. The classifier was evaluated with multiple users; the data collection method is described in Section 7.2, and its results are presented in Section 7.3. The chapter concludes with a discussion section.


7.1 Fanned detection for classification

Sequential execution of six traditional posture classifiers on commodity hardware cannot meet the real-time requirements of a user interface. We thus modified the Viola-Jones method for this multi-class classification problem. In a first step, a detector looks for any of the six hand postures without distinguishing between them. Eliminating a given number of hand candidates is faster with this approach than when executing six separate detectors because different postures’ appearances share common features and thus need not be evaluated multiple times. In the second step, only those areas that passed the first step successfully, that is, those that look like some hand appearance, are investigated further. Each of the detectors for the individual postures was trained on the result of the combined classifier, which had already eliminated 99.999879% of the image areas in the validation set. Cross-training, that is, using the positive examples for one classifier as the negative examples for all others, ensures that the classes are sufficiently distinct.

Figure 7.1 shows a schematic view of this hierarchical recognition method. One could call this organization of classifiers a “fan” due to its single stem and the unique branch point into multiple leaves.¹

¹ An extension of this organization might yield still better speed performance: a tree structure has multiple branch points and the number of branches varies. Partial classifiers can share the same weak classifiers as long as the classification accuracy does not suffer. The fanned organization takes an “all-or-nothing” approach to this, while a tree structure allows some classifiers to retain the same root while others have branched out already.


Figure 7.1: Fanned arrangement of partial detectors: a common “hand?” root classifier branches into the six posture-specific detectors (closed, open, sidepoint, victory, Lback, Lpalm); if none of the individual detectors is successful, the class “no known hand posture” is chosen.

For the fan to work as expected, the templates have to have identical resolutions and the image areas they are compared to must have identical size ratios. For detection purposes as described in Chapter 5, those two parameters could be optimized per posture, but for two detectors to share the same root, this cannot be done. To accommodate all postures as well as possible without cutting off parts of the hand or including too much background, we chose an image size ratio (width/height) of 0.8. The open hand posture is occasionally a bit too wide and a substantial amount of background is present in sidepoint images, but overall it is a good compromise. Templates were trained at a uniform size of 25x25 pixels.
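The run-time evaluation of the fan can be summarized in a few lines. The sketch below assumes hypothetical detector objects with accepts() and confidence() methods standing in for the trained cascades; it illustrates the control flow only, not HandVu’s actual classes.

```python
# Run-time sketch of the fanned classifier (hypothetical detector interface):
# a shared root rejects most windows cheaply, and only surviving windows are
# passed on to the six posture-specific branches.
POSTURES = ["closed", "open", "Lback", "Lpalm", "victory", "sidepoint"]

def classify_window(window, root_detector, branch_detectors):
    """Return one of the six posture names or 'no known hand posture'."""
    # Stage 1: the common fan root ("some hand posture?").
    if not root_detector.accepts(window):
        return "no known hand posture"

    # Stage 2: evaluate only the posture-specific extensions; among the
    # branches that accept, pick the one with the highest confidence.
    best_name, best_score = "no known hand posture", float("-inf")
    for name in POSTURES:
        det = branch_detectors[name]
        if not det.accepts(window):
            continue
        score = det.confidence(window)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```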

Training a fanned detector is straightforward: first, example images of all six postures constitute the positive sample set and non-hand pictures the negative sample set. Training with this configuration is stopped when further improvements to the detector are too costly, that is, when too many weak classifiers need to be added in order to reduce the false positive rate for a given detection rate. Thereafter, this partial classifier is extended in six independent training sessions, one for each of the six postures. The formerly positive training images of the respective five other postures are now added to the set of negative examples. This allows for the inter-posture classification.
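The following sketch spells out how the sample sets for the root stage and the six branch stages could be assembled under this cross-training scheme; the data layout (dictionaries of example images) is an assumption made for illustration.

```python
# Sketch of assembling training sets for the fan (data layout is assumed;
# "images" may be file paths or arrays).
def build_training_sets(posture_images, non_hand_images):
    """posture_images: dict mapping each posture name to its example images.
    non_hand_images: list of background (negative) examples."""
    # Root stage: all postures together versus non-hand images.
    root_pos = [img for imgs in posture_images.values() for img in imgs]
    root_neg = list(non_hand_images)

    # Branch stages (cross-training): for each posture, the other postures'
    # positives are added to the negatives so the classes stay distinct.
    branches = {}
    for name, positives in posture_images.items():
        negatives = list(non_hand_images)
        for other, other_imgs in posture_images.items():
            if other != name:
                negatives.extend(other_imgs)
        branches[name] = (positives, negatives)
    return (root_pos, root_neg), branches
```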

The results from Chapter 5 with respect to rotational invariance and training were used to achieve the same degree of tolerance towards in-plane rotations in the fanned classifier. Gestures can therefore be recognized when performed with about 0-15 degrees of counter-clockwise in-plane rotation. The robustness towards out-of-plane rotations was not investigated.

In addition to the thousands of images that the detectors were trained on, the last strong classifier’s threshold of every detector was calibrated on 5153 image frames. They were collected with a head-worn camera from three people (different from the participants in the larger data collection, see Section 7.2).

False positives, that is, recognitions of hands when there are none, are less likely to occur in this stage than in the initial detection stage because of context knowledge: the presence of the hand was established recently during detection and presumably the hand was tracked thereafter. Posture misclassifications occurred occasionally (see below), but no background artifacts were erroneously classified as postures. For these reasons, no second image cue is consulted to verify “some posture” versus “no posture.” To further improve the performance of posture classification, a verification of the classification into one of the posture classes is desirable. However, color is only of limited use to this end, especially since the postures’ comparatively small appearance differences require skin/no-skin color classification to a degree of accuracy that cannot easily be achieved for such small structures as fingers (especially due to specular light components). One might, for example, investigate feature vectors built from the spatial color distribution within the hand area to get a second opinion about postures.

7.2 Data collection for evaluation

HandVu’s recognition component was validated during training with the same number of hand images as the test set contained. However, we wanted to test the posture recognition against a larger number of test images, obtained in more realistic scenarios and with an actual head-worn camera. To this end, we built an automated data collection application that had dual roles: first, it was to give the user live feedback on her performed hand postures by running the recognition module on approximately every other video frame and displaying its results. Second, it was to record the video to disk at frame rate for more detailed offline processing. Note that the application’s primary purpose was system evaluation and not user evaluation.

Participants wore the HMD and camera and carried additional hardware in a backpack. An experimenter saw a copy of the screen that was displayed in the HMD on a laptop that he carried. He would guide the participants to different locations for the sessions. Overlaid over the video see-through images from the camera, participants were presented with a textual identifier and a picture for one posture at a time. A rectangular area of about twice the width and height of a hand was shown approximately ten inches from the body in front of the right arm. A visually displayed countdown of three seconds started immediately as a new posture was shown. After completion of the countdown, recognition started and was active for five seconds. Participants were given three pieces of feedback: first, a bar graph showed the time progress, starting high at five seconds and ending low after five seconds. Second, the rectangular area turned from white to green during a successful posture recognition. Third, a second bar graph increased according to the time fraction of successful recognition for the current posture. Two screenshots of the display as it was visible to the participants are shown in Figure 7.2.
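A compressed sketch of one such trial follows, with the timing constants taken from the description above; the frame grabbing, recognition, recording, and feedback calls are placeholder hooks rather than the application’s actual functions.

```python
# Sketch of one data-collection trial (timing constants from the text;
# grab_frame, recognize, show_feedback, and record are placeholder hooks).
import time

COUNTDOWN_S = 3.0   # posture is shown, participant gets ready
ACTIVE_S = 5.0      # recognition runs and feedback is given

def run_trial(posture_name, grab_frame, recognize, show_feedback, record):
    t0 = time.time()
    while time.time() - t0 < COUNTDOWN_S:       # visual countdown
        record(grab_frame())                     # video is saved throughout

    recognized_time = 0.0
    start = prev = time.time()
    while True:
        now = time.time()
        if now - start >= ACTIVE_S:
            break
        frame = grab_frame()
        record(frame)
        # In the real application, recognition ran on roughly every other frame.
        hit = recognize(frame) == posture_name
        if hit:
            recognized_time += now - prev        # accumulate "correct" time
        prev = now
        show_feedback(remaining=ACTIVE_S - (now - start),
                      success=hit,
                      fraction=recognized_time / ACTIVE_S)
    return recognized_time / ACTIVE_S
```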

Participants were instructed to perform the posture immediately after it was shown, already during the countdown.
195


Chapter 7. Posture Recognition<br />

Figure 7.2: The data collection for evaluating the posture recognition: two screenshots of the entire display of the application that we used to collect data, one taken during the countdown and one while the requested posture was being performed.

If they did not receive positive feedback, they were told to make slight changes to the posture and/or to move their entire bodies slightly. These instructions were followed to varying degrees by the different participants.

One female and two male participants were recruited from our campus community and performed each posture three times in each session. The first session was a practice run and the recorded data was not included in the results. The second session was situated in a lab environment with backgrounds including tables, chairs, carpeted floor, etc. The next session was on a patio area next to our lab. This was an extremely bright area, even though direct sunlight was avoided both on the hand and on the background. The background included a light grey, stone-like floor, aluminum-colored railings, wooden panels, etc. The last session had natural vegetation, leaves, and soil as primary background objects. Note that two thirds of all data collected is from outdoor environments with natural lighting; in fact, one participant was recorded in the morning, one in the afternoon, and one in the early evening.

7.3 Results

7.3.1 Accuracy

A total of 19,134 frames were recorded at 15 Hz; this is approximately equivalent to 1276 seconds. A few frames with different postures can be seen in Figure 7.3. Note the different backgrounds and lighting conditions. Table 7.1 shows the results in their entirety. Each row contains data for a particular posture that the user was supposed to perform. The number of times that a certain posture was recognized is stated in the columns. Note that errors can be due to the participant performing the wrong posture or the system misclassifying a correctly executed posture.

A total of 9,137 postures were recognized, sometimes two or more different postures in the same frame. While this amounts to an average of one recognition roughly every other frame, the fast processing speed in the application virtually guarantees a sufficiently fast response time of the system.


Figure 7.3: Sample images for evaluation of the posture recognition: shown is a random selection of images of three different people that were recorded during the posture recognition data collection.

In 8,567 frames (44.77%) the correct posture was recognized; in 570 frames (2.98%) a wrong posture was recognized. Of all recognitions, 93.76% were correct, as shown in the row titled “ratio” in Table 7.1.

We note that a majority of the misclassified frames are due to a single requested posture, open. It appears to share many elements with other postures, especially the closed and Lback postures. As mentioned, this could be because of incorrectly performed postures rather than a failure of the computer vision classification. If this posture is disregarded, the numbers are as follows.

In a total of 16,021 frames, 7,852 postures were recognized. In 7,752 frames (48.39%) this was the correct posture; in 100 frames (0.62%) it was an incorrect posture. A recognized posture was correct 98.71% of the time.


Table 7.1: Summary of the recognition results:
the “total” column states the number of frames that were recorded and tested for each of the six postures. The “ratio” row shows, for all frames that were recognized as a certain posture, the fraction that were identified correctly.

should \ is    closed    open   Lback   Lpalm  victory  sidepoint   total
closed           1735       0       0       8        0          0    3272
open              100     815     353       0        1          0    3113
Lback              19      16    1211       4        6         10    3135
Lpalm               0       0       1    1343        0          1    3208
victory             0       0       0       7     1837          0    3234
sidepoint           2       0       5      36        1       1626    3172
ratio          0.9348  0.9807  0.7713  0.9607   0.9957     0.9933  0.9376
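The summary figures quoted in the text can be reproduced directly from Table 7.1; the short script below recomputes the overall 93.76% figure and the per-column “ratio” row.

```python
# Recomputing the summary figures of Section 7.3.1 from Table 7.1.
postures = ["closed", "open", "Lback", "Lpalm", "victory", "sidepoint"]
# Rows: requested ("should") posture; columns: recognized ("is") posture.
confusion = {
    "closed":    [1735,   0,    0,    8,    0,    0],
    "open":      [ 100, 815,  353,    0,    1,    0],
    "Lback":     [  19,  16, 1211,    4,    6,   10],
    "Lpalm":     [   0,   0,    1, 1343,    0,    1],
    "victory":   [   0,   0,    0,    7, 1837,    0],
    "sidepoint": [   2,   0,    5,   36,    1, 1626],
}

correct = sum(confusion[p][i] for i, p in enumerate(postures))
total_recognitions = sum(sum(row) for row in confusion.values())
print(correct, total_recognitions)             # 8567 of 9137 recognitions
print(round(correct / total_recognitions, 4))  # 0.9376, the overall ratio

# Per-column ratios (the "ratio" row of the table).
for i, p in enumerate(postures):
    col = sum(confusion[q][i] for q in postures)
    print(p, round(confusion[p][i] / col, 4))
```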

7.3.2 Speed

The speed performance of the fanned recognition method allowed us to provide instantaneous feedback to the participants at interactive frame rates, while simultaneously recording a 192x215-pixel area to disk at 15 Hz, all on a 1.1 GHz laptop. The scan time per area on a 3 GHz desktop computer was between 15 and 80 milliseconds, with a mean of around 50 milliseconds. The 25x25 template detector was scanned at scales from 3.0 to 8.0 with a factor-1.2 scale increase, initially stepping 2 pixels in the horizontal and 3 pixels in the vertical direction.
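For concreteness, the small loop below enumerates the scan scales these parameters imply; taking the window height as 25 times the scale and applying the 0.8 width/height ratio from Section 7.1 to the scanned areas is our assumption for the printed sizes.

```python
# Enumerating the scan scales implied by the parameters above.
scale, scales = 3.0, []
while scale <= 8.0:
    scales.append(scale)
    scale *= 1.2      # factor-1.2 scale increase up to the 8.0 limit

for s in scales:
    h = round(25 * s)           # window height for the 25x25 template
    w = round(0.8 * h)          # assumed 0.8 width/height ratio (Section 7.1)
    print(f"scale {s:.2f}: scan window about {w}x{h} px")
# Six scales result (3.00, 3.60, 4.32, 5.18, 6.22, 7.46), each scanned with a
# small pixel step (2 px horizontally, 3 px vertically at the initial scale).
```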


7.3.3 Questionnaire

Participants were asked to complete an exit survey with a few personal and ten technical questions. While no statistically significant information was obtained, a quick report on trends follows.

All participants felt that postures were recognized fairly accurately. The open, Lback, victory, and sidepoint postures were consistently not experienced as difficult or uncomfortable to perform. For the closed and Lpalm postures, opinions differed.

We asked which aspects the participants would like to see improved in the future. They pointed to the encumbrance of the head-worn display/camera unit. They also mentioned the desire for user-specific, customized gestures, especially for cases when a general gesture is inconvenient for a particular user.

7.4 Discussion and conclusions

While we would have preferred to compare our results to those of other researchers, a lack of testing standards and sample data made this infeasible. Furthermore, the task of distinguishing exactly these six postures is very specific, and comparison with results from smaller, larger, or different sets of postures and their recognition rates would not be meaningful. The evaluation of our computer vision module thus has to rely on the recognition rates as detailed above.

The two main results are as follows. First, the fanned recognition method allows for high frame rates. This is due to the elimination of all but 1.21 × 10⁻⁶ (0.000121%) of the candidate areas (on the validation set during training) during evaluation of the common fan root. Thus, only a small fraction of areas has to be tested for every hand posture. Second, cross-training the posture detectors for that second stage is an effective means to build a multi-class classifier as is required for a recognition method.

Recognized postures are correct in 93.76% of the cases. In general, this is not reliable enough for high-end user interfaces. However, considering that this result was achieved for different people than the recognizer was trained on, that it was evaluated in indoor and outdoor environments, and that it relied on a single image cue, the method achieves surprisingly good performance. Furthermore, the rate of incorrect recognitions can be reduced to 1.29% (98.71% correct) when the open posture is not permitted in the interaction.

Again, misclassifications can be due to participants performing wrong postures or the system recognizing a wrong posture. The number of incorrectly performed postures in the test set has not been determined. On the other hand, users got feedback if their posture was not recognized at all, and some participants made small changes to improve recognition.

We have not yet studied the mean time to recognition after the moment the user performs the correct posture. This is an issue similar to the one discussed in Section 5.10: if the conditions are very unfavorable, for example, due to shadows cast onto the hand, then the posture will not be recognized for as long as no apparent image change occurs. That is to say, there is a correlation between the probabilities of recognition in two successive video frames. Over time, the user is expected to react to such a condition and actively change his hand’s position, the camera location, or both. This issue will naturally become less important as recognition methods improve beyond today’s performance.

It remains to be mentioned that the validity of the general approach has been confirmed through independent research. Ong and Bowden [124] use a technique similar to our fanned recognition method. However, their focus is on the detection of arbitrary hand postures. A single detector for all postures must eliminate all false positives, and a subsequent classification sorts the results into posture classes that were obtained with unsupervised training. Their method has two disadvantages for our purpose. First, the stages of detection and classification are not combined; thus more weak classifiers need to be evaluated to obtain the same result. This can lead to processing times too long for interactivity. Second, their posture clusters are based on the contours of skin-color regions and not on verified hand postures as for our training examples. Thus, the classification results that we obtain carry semantically higher meaning than those of purely appearance-based clusters.



Chapter 8

Hand Gestures in Application

The previous three chapters were concerned with the methods to implement vision-based gesture interfaces, and Chapter 3 investigated an aspect of the human factors of hand gestures. This chapter describes the culmination that the gained insights and technology developments provide: input to augmented reality (AR) and other non-traditional applications. After a brief overview of the applications and their contributions, we make the case for device-external interfaces. Then, we schematize ways in which applications can interpret hand location and posture to achieve their interaction needs. Next, we set forth general approaches to providing feedback for the user’s actions. All three applications are explained in detail in Sections 8.5 through 8.7, and concluding remarks wrap up this chapter.


8.1 Application overview and contributions

Each of the applications contributes an important aspect to the dissertation thesis. These aspects are introduced below. Overall, they show that wearable computers and non-traditional environments such as augmented reality benefit from the enriched interaction modalities that computer vision offers.

The first demonstration of HandVu’s functionality was an alternative input modality (besides mouse and keyboard input) for a 3D map display to allow translating, scaling, rotating, and zooming. The display was an alternative output interface for Battuta, a wearable GIS (Geographic Information System). It showed the ease of replacing or complementing traditional input methods with hand gestures. This in turn benefits the user with more flexibility in his choice of interaction modalities.

The second, the “Maintenance” application, gives a building facilities manager tools at hand to receive, inspect, and record work orders on the go, such as investigating a broken pipe, videotaping the scene, and leaving voice instructions for the plumbers. The interface’s benefits lie in its “deviceless” manifestation, which leaves the user’s hands unoccupied so he can perform manual tasks. All functional input to the application is realized with computer vision means, demonstrating the feasibility of a stand-alone vision-based user interface. This is important as it opens another path for wearable computers and their applications through the elimination of constraints imposed by traditional interfaces.

The third application uses hand gestures to complement speech and trackball input in order to control a complex application interface. It gives the user a tool for visualizing information that is otherwise hard or impossible to see by providing him with “virtual x-ray vision” in the shape of a wearable AR system. The application shows the gained flexibility in designing convenient input methods and the gained versatility for concurrent manipulation of many input parameters.

8.2 The case for external interfaces

Wearable computers have evolved into powerful devices: PDAs¹, cell phones with integrated video recorders, smart phones, wrist-watches with heart monitor, compass, altimeter, and GPS². Most of these devices run full-featured operating systems and can house arbitrary applications. Unfortunately, their human interface capabilities did not evolve as rapidly and are in fact severely limited by the devices’ continuously shrinking form factors. Traditional interfaces, such as keyboards and LCD screens, can only be as big as the device’s surface. When plotting the interaction area over the device size, as in Figure 8.1, this limitation manifests itself in data points that are on or below the identity line. If the interaction devices can be folded to a smaller form factor for transportation and stowing, the wearable computer loses one of its important properties: interactional constancy, the property of being always accessible [113].

¹ PDA: Personal Digital Assistant.
² GPS: Global Positioning System; a satellite-based localization system with passive terrestrial receivers.

Figure 8.1: Interaction area size versus interface device size:
traditional interface devices are larger than the interface area they provide; their data points are below the identity line. External interfaces provide for interaction outside the physical dimensions of the implementing devices.
1) considering the microphone only; 2) HandVu as pars pro toto of vision-based interfaces; 3) VR trackers such as ultrasound or electromagnetic trackers that require mounted infrastructure; 4) virtual keyboards such as Canesta’s [174].

Fortunately, the device size problem can be overcome by expanding the interaction area beyond the physical device’s dimensions, yielding data points above the identity line in Figure 8.1. For example, the output can be extended beyond the physical size of the display by augmenting reality through head-worn displays, allowing for information visualization in the entire field of view. The input area can be enlarged through, for example, hand gestures performed in free space and recognized with a head-worn camera, since these are not constrained to the hardware unit’s dimensions. Industry has recently embraced this concept, as exemplified by the Canesta Virtual Keyboard [174], which is projected onto any flat surface in front of the device. Speech recognition is another technology that allows interaction “outside” the actual device used for recognition.

8.3 Hand gesture interaction techniques

In [93], we distinguished three styles of interpreting hand gestures for user interface purposes. By hand gestures we mean hand translation in a two-dimensional input image plane and a set of key hand configurations, that is, HandVu’s output. The styles do not include temporal gestures and their meanings, for example, of hand waving. The styles’ characteristics and the manipulation techniques that they support are described in the following paragraphs.

Registered manipulation means that the pointer is co-located with the hand in the video see-through display. The hand can therefore virtually touch objects that it is interacting with. This style is especially suitable for interaction with virtual objects in mixed reality scenarios and for interaction with the view of the real world. However, it is hard to perform this kind of manipulation while walking. Also, care must be taken not to require too much interaction outside the user’s comfort zone, as discussed in Chapter 3.

Pointer-based manipulation, such as in [49, 138], describes gestures and their interpretation in the style of a computer mouse: movements in an input plane control a pointer on a distinct manipulation plane. The input plane is fixed relative to the camera coordinate system, while the (“direct”) manipulation plane is fixed relative to the screen coordinate system. The transformation between the two planes requires some attention:

1) The initial offset should be chosen so that all interaction can be performed with hand motions that do not exit the comfort zone (see Chapter 3). This suggests, for example, that the initial pointer location is chosen centrally among the interaction items, such as buttons.

2) A method for “clutching” [112] must be provided because with hand tracking the user cannot “pick up and reposition the mouse.” Instead, clutching could happen automatically when the pointer reaches the confines of the screen or, better still, of the comfort zone. Further hand movements will then dynamically modify the translation offset between the two planes.

3) We found that constraining pointer movements to one dimension (for example, to a horizontal line) is very convenient as it reduces the required precision of hand movements. This in turn appears to reduce fatigue that is sometimes caused by unnecessarily strict gesture requirements.

4) Larger-than-identity scaling factors avoid overly extensive hand movements while at the same time allowing for big, easily visible buttons. On the other hand, too-large scaling factors again introduce unnecessarily strict requirements and might even subject the input to involuntary jitter during general body motion. Another positive effect of scaling is that the size of the comfort zone does not restrict the size of the manipulation plane. That is, hand motions can remain within comfortable bounds while the pointer makes much larger movements.

5) Snapping the pointer to the default button can ensure that in most cases no hand movement but only the selection gesture has to be performed. While this behavior might be disruptive in a desktop environment, it is more convenient within the mobile user interface context.

6) Humans do not notice slight changes in the mapping of input speed to control speed (see, for example, “redirected walking” in virtual environments by Razzaque et al. [140]). This can be exploited to artificially increase the control area while maintaining precision. For example, a nonlinear speed translation is frequently employed in mouse interaction, called mouse pointer acceleration. As another example, imagine that the initial offset has been chosen unfavorably and the hand is not within the comfort zone. Imbalanced mappings for hand movements in opposite directions can then be leveraged to gradually allow the interacting hand to return to its comfort zone without requiring clutching. A sketch combining several of these points follows this list.
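The sketch referred to above combines a central starting position, a larger-than-identity gain, an optional one-dimensional constraint, a mild nonlinear speed mapping, and clamping that acts as clutching. It is a minimal illustration with made-up constants, not the mapping used in our applications.

```python
# Minimal sketch (all constants illustrative) of an input-plane to
# manipulation-plane mapping in the spirit of points 1)-6) above.
class PointerMapping:
    def __init__(self, gain=2.5, accel=0.3, screen=(800, 600), constrain=None):
        self.gain = gain              # point 4: scaling factor > 1
        self.accel = accel            # point 6: nonlinear speed mapping
        self.screen = screen
        self.constrain = constrain    # point 3: None, "horizontal", "vertical"
        self.pointer = [screen[0] / 2.0, screen[1] / 2.0]   # point 1
        self.prev_hand = None

    def update(self, hand_xy):
        if self.prev_hand is None:
            self.prev_hand = hand_xy
            return tuple(self.pointer)
        dx = hand_xy[0] - self.prev_hand[0]
        dy = hand_xy[1] - self.prev_hand[1]
        self.prev_hand = hand_xy

        speed = (dx * dx + dy * dy) ** 0.5
        g = self.gain * (1.0 + self.accel * speed)   # faster hand, more gain

        if self.constrain != "vertical":
            self.pointer[0] += g * dx
        if self.constrain != "horizontal":
            self.pointer[1] += g * dy

        # Clamping at the screen bounds re-anchors the offset between the
        # two planes, which is one simple way to realize clutching (point 2).
        self.pointer[0] = min(max(self.pointer[0], 0.0), self.screen[0])
        self.pointer[1] = min(max(self.pointer[1], 0.0), self.screen[1])
        return tuple(self.pointer)
```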

Location-independent interaction refers to hand postures that can be performed anywhere within the camera field of view (FOV) and produce a single event. Hauptmann [55] pointed out that pointer-based manipulation should not be the only mode of interaction for hand gesture interfaces. Location-independent gestures are thus an important mode of interaction, especially for people “on the move.”

A selection gesture (a “mouse click”) is a necessary concept for many pointer-based and registered interfaces. It can be implemented with two techniques. The first is selection by action, which involves a distinct posture to signal the desire to select. If the same hand is employed for both pointing and selection, some movement during the selection action must be expected and should not interfere with pointing precision. For high precision demands, a selection by suspension technique might be more appropriate, in which the desire to select is conveyed by not moving the pointer for a threshold period of time. Requiring the user to be idle for a few seconds, or to constantly move her hand to avoid selection, is usually unwise, particularly in mobile contexts.
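A selection-by-suspension trigger reduces to a small dwell timer, sketched below; the dwell time and the movement tolerance are illustrative assumptions.

```python
# Sketch of "selection by suspension": a selection fires when the pointer
# has not moved more than a small tolerance for a dwell period.
import time

class DwellSelector:
    def __init__(self, dwell_s=1.0, tolerance_px=8.0):
        self.dwell_s = dwell_s
        self.tolerance = tolerance_px
        self.anchor = None
        self.anchor_time = None

    def update(self, pointer_xy, now=None):
        """Feed the current pointer position; returns True when a selection
        should be triggered."""
        now = time.time() if now is None else now
        if self.anchor is None:
            self.anchor, self.anchor_time = pointer_xy, now
            return False
        dx = pointer_xy[0] - self.anchor[0]
        dy = pointer_xy[1] - self.anchor[1]
        if (dx * dx + dy * dy) ** 0.5 > self.tolerance:
            # Pointer moved: restart the dwell timer from this position.
            self.anchor, self.anchor_time = pointer_xy, now
            return False
        if now - self.anchor_time >= self.dwell_s:
            self.anchor, self.anchor_time = pointer_xy, now  # avoid repeats
            return True
        return False
```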


8.4 Feedback

One of the most important aspects of user interfaces is the immediate feedback to the user once a command, or even a slight change in the input vector, is recognized (see, for example, Hauptmann [55]). A lack thereof decreases usability, in particular the speed with which the interface can be used. HandVu itself can give feedback by overlaying information on the video stream. The amount and verbosity of the overlay can be selected based on the application programmer’s and the application user’s needs. Section 4.2.6 on page 102 explains HandVu’s different verbosity levels in detail.

In addition to HandVu’s feedback mechanism, applications can implement their own ways to signal event recognition to the user, both through visual means and through other channels such as audio. In general, the gesturing user should be notified of the state of the recognition system, that is, whether the hand has been detected and is being tracked, or whether the system is waiting for the user to perform that first gesture. Some sort of location feedback should be given during hand tracking, for example, with a small icon overlaid over the hand. Lastly, recognition of one of the key postures should also be signaled, for example, with a button-click visualization.


All additional feedback depends on the gesture interpretation in the application space. For example, a red border could be drawn around buttons that the user hovers over, signaling that executing a selection gesture will “click” that button. An iconic hand could be drawn as the cursor for the pointer-based manipulation techniques.

If the location of the hand is used as an input parameter but has no a priori visual representation, the first stage should probably also show a pointer at the location of the hand. For unregistered pointing tasks, the location feedback can also be given in the control coordinate system only, not co-located with the hand in the video. This is sufficient feedback, since humans are easily capable of mapping between two spatial planes (as demonstrated by the example of the computer mouse).

A third level of feedback is entirely in the domain of the common user again. It is given depending on the recognized gesture. If the task controlled with a particular gesture has an intrinsic visual outcome, no additional feedback has to be provided. An example is a map translation task in which the map directly follows the hand movements. On the other hand, if the task has no visual representation per se, such as turning up the volume or switching modes in a moded interface, some feedback has to be artificially created. It can be a specifically designed overlay such as a volume slide bar, an iconic representation of the recognized gesture in a fixed location on the screen, or a symbol displayed at a hand-stabilized location, such as a pen if a drawing hand posture is recognized.

8.5 Battuta: a wearable GIS

We first demonstrated the feasibility and the ease of replacing conventional interfaces with HandVu’s vision-based interface. To this end, we built a stand-alone module that replicated the basic interface functionality of a wearable, mobile GIS³ application [27]. That application had been built in the context of the Battuta project, and different input modalities were being explored for their suitability for controlling the application on a wearable platform. Our main reason for replicating parts of the interface component was the need for a much more efficient implementation, namely for hardware support to render the 3D scenery.

The previous interface devices had one aspect in common: input in a continuous domain was only possible by converting a temporal duration into the desired one-dimensional signal. For example, depressing a key on a handheld keyboard would translate a displayed map with constant speed in one direction. The opposite direction would be achieved with a different key. Unfortunately, this mode of operation is less natural and less efficient than, for example, dragging the map with a mouse. That, however, is not available to a wearable user.

³ GIS: Geographic Information System.


One particular device, a so-called ring mouse, distinguished different forces of input, and thus the signal could be varied in strength, for example, to set the velocity of the translation. This can alleviate some of the disadvantages of such a time-to-space conversion, but it also introduces more complexity since now the derivative of the actual variable is being controlled.

8.5.1 The gesture interface

Hand gesture interaction is thus particularly promising for application parameters in continuous domains, such as the already mentioned map translation, and also map scaling, map rotation, and possibly generic slide bar controls. The functions that we chose to support with gestures are:

• Selection of one of four menu entries. The menu consists of a selectable interaction surface in each corner of the screen, as in the original Battuta application (see the screenshot in Figure 8.2). Menu selection starts with a location-independent “menu” hand gesture that brings up the menu corners in the display. A brief motion of the hand (in any posture) in the direction of the desired corner selects the respective menu entry. This also demonstrates how HandVu can facilitate the interpretation of dynamic gestures as a single event: by supplying posture and location data to the application domain, which analyzes it over time and produces one event after observation of a particular trajectory (see the sketch after this list).

Figure 8.2: The map display of the Battuta wearable GIS.

• Translating the map in the two dimensions of its plane. A dedicated, location-independent “translation” gesture picks a reference point in 2D hand location space. Hand movements in arbitrary postures thereafter move the map along with the hand until the “translation” gesture is performed again.

• Zooming the map towards and away from a fixed point on the map. Again, a reference point in hand space is chosen by performing a dedicated “zoom” gesture. Moving the hand further away from the body zooms out, moving it closer zooms in. Performing the “zoom” gesture a second time ends the zooming mode.
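As an illustration of the first bullet, the snippet below turns the brief hand motion observed after the “menu” gesture into one of the four corner selections; the travel threshold and the corner naming are assumptions made for this sketch.

```python
# Sketch (thresholds illustrative) of interpreting the brief hand motion
# after the "menu" gesture as a selection of one of the four screen corners.
def corner_from_motion(start_xy, end_xy, min_travel=40.0):
    """Return 'top-left', 'top-right', 'bottom-left', 'bottom-right',
    or None if the hand did not travel far enough."""
    dx = end_xy[0] - start_xy[0]
    dy = end_xy[1] - start_xy[1]          # image y grows downward
    if (dx * dx + dy * dy) ** 0.5 < min_travel:
        return None
    vertical = "top" if dy < 0 else "bottom"
    horizontal = "left" if dx < 0 else "right"
    return f"{vertical}-{horizontal}"

# Example: a quick move up and to the right selects the top-right entry.
assert corner_from_motion((160, 120), (220, 70)) == "top-right"
```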

8.5.2 Benefits of HandVu for Battuta

As explained in the literature review (Section 2.4.1), this type of gestural interaction is not ideal in terms of user performance and preference. Users prefer not to have semantically fixed gesture interpretations but instead intuitive and spontaneous mappings. In particular, replacing the mouse with gestures is not a favored interaction modality. However, for a wearable scenario where there is no mouse, no keyboard, and no table available, even these gesture mappings can provide the essential means of interaction.

The most intuitive mixed-reality map imaginable would afford the same properties as its physical counterpart, especially with regard to picking it up and moving the sheet around. While HandVu has not yet achieved this goal in its entirety, it requires no intermediary hand-held or hand-worn devices such as a mouse or data glove. It is thus a step closer to that most natural way of interaction.

On the technical side, we used HandVu’s event server to interface between the gesture recognition module and the Battuta display. The TCP/IP-based communication mode made HandVu’s implementation language (C++) and other server specifics entirely transparent to the map display, which was written in Java using the Java3D rendering library. This shows the wide availability of HandVu’s output.
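To give a flavor of such a client, the sketch below connects to a gesture event server over TCP and parses simple line-based messages. The host, port, and message format shown here are hypothetical placeholders and not HandVu’s documented protocol; the point is only that any language with sockets can consume the events independently of the server’s implementation language.

```python
# Illustration only: a small TCP client in the spirit of connecting an
# application to a gesture event server. The host, port, and the line-based
# message format used here are hypothetical placeholders.
import socket

def read_gesture_events(host="127.0.0.1", port=7045):
    """Connect to a (hypothetical) event server and yield parsed events."""
    with socket.create_connection((host, port)) as sock:
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                # Assumed example line: "posture=Lpalm x=0.42 y=0.61"
                fields = dict(p.split("=", 1) for p in line.decode().split())
                yield fields

# An application (for example, a map display) would iterate over these events
# and translate them into its own commands.
```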

8.6 Vision-only interface for mobility

We also tested the functionality of our VBI with a custom-built user interface component for a facilities maintenance application that runs on our wearable computer. This was a much later project than the previous one, and the vision system had matured significantly. The main contribution was its actual deployment outdoors.

The hardware setup of our system is described in Section 4.1. This section describes the application interface.

8.6.1 Functionality of the Maintenance Application

The “Maintenance Application” consists of a set of application panes with a number of tools to aid facilities personnel. It was designed to demonstrate the suitability of VBIs for mobile use. Its suggested functionality supports building facilities managers in their daily tasks of performing maintenance operations and immediate-attention work requests, for example, investigating a water leak or a power failure in a particular room. The wearer of our mobile system can utilize three main panes: an audio recorder, a digital still and video camera, and a work order and communication pane. The active pane is selected by performing a location-independent task-switch gesture for a short period of time, which cycles through the application panes and a “blank screen” pane, one by one.

Voice recorder: A small microphone clipped to the goggles allows auditory recordings, activated by gesture commands that start, pause, resume, and stop a sound recording. This interface utilizes the pointer-based manipulation technique in combination with a location-independent “select” posture. Buttons⁴ are horizontally aligned and the pointer is restricted to moving along this dimension. A red border gives visual feedback whenever the hand pointer is in the area of a button (hovering above it).

Image and video capture: The image capture pane has three modes of operation, which are selected via buttons. The interaction technique with the image/video capture menu is very similar to that of the voice recorder, except that the buttons are arranged in a vertical fashion and the pointer movement is constrained to that dimension. The first mode allows the user to take a picture of the entire visible area. A count-down timer is overlaid after activating this mode. A picture is taken and stored at the end of the count-down.

⁴ Thanks to James Chainey for drawing the icons and interaction elements for this application.

219


Chapter 8. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in Application<br />

Figure 8.3: Image of pointer-based interaction:<br />

shown is the image <strong>and</strong> video capture pane. The pointer movement is constrained<br />

to the vertical dimension. Note that the interface images shown in the various<br />

figures were taken in different environments, illustrating the ability of our system<br />

to adjust to varying backgrounds <strong>and</strong> lighting conditions.<br />

records a video stream instead, stopping as soon as the h<strong>and</strong> is detected within<br />

the interaction initiation area.<br />

The third mode allows taking snapshots of selective areas. HandVu searches for the left hand as the nearest skin-colored blob to the lower-left of the right hand (see Section 6.6). The rectangular area enclosed by both hands is highlighted in the display, shown in Figure 8.4. When the positions of both hands have stabilized with respect to the camera, the snapshot is taken. Implementations of the same functionality that use only one pointer are conceivable but less convenient to use. This is the only task where hand suspension was the selection method of choice, because the user will most likely have assumed a stationary body position and performing a selection-by-action gesture would interfere with the pointing precision.

Figure 8.4: Image of two-handed interaction: the user has selected an area, and the snapshot will be taken when the hands have settled for five seconds.
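The following sketch illustrates the two-handed trigger just described: the rectangle spanned by the two hand positions is tracked, and the snapshot fires once both hands have stayed nearly still for five seconds. The per-frame motion tolerance is an assumed value; the actual system’s threshold is not specified here.

```cpp
// Snapshot trigger: fire once both hand positions have been stable for 5 s.
#include <cmath>
#include <cstdio>

struct Point { double x, y; };
struct Rect { double left, top, right, bottom; };

class TwoHandSnapshot {
 public:
  // Feed the current hand positions once per frame; returns true when the
  // snapshot should be taken.
  bool update(Point rightHand, Point leftHand, double frameSeconds) {
    const double motion =
        std::hypot(rightHand.x - lastRight_.x, rightHand.y - lastRight_.y) +
        std::hypot(leftHand.x - lastLeft_.x, leftHand.y - lastLeft_.y);
    lastRight_ = rightHand;
    lastLeft_ = leftHand;
    stableSeconds_ = (motion < kTolerance) ? stableSeconds_ + frameSeconds : 0.0;
    return stableSeconds_ >= kHoldSeconds;
  }

  // Rectangle currently enclosed by the two hands (used for highlighting).
  Rect selection() const {
    return { std::fmin(lastLeft_.x, lastRight_.x), std::fmin(lastLeft_.y, lastRight_.y),
             std::fmax(lastLeft_.x, lastRight_.x), std::fmax(lastLeft_.y, lastRight_.y) };
  }

 private:
  static constexpr double kTolerance = 0.01;   // assumed per-frame motion budget
  static constexpr double kHoldSeconds = 5.0;  // "settled for five seconds"
  Point lastRight_{0, 0}, lastLeft_{0, 0};
  double stableSeconds_ = 0.0;
};

int main() {
  TwoHandSnapshot snap;
  for (int f = 0; f < 200; ++f)  // hands held perfectly still at 30 fps
    if (snap.update({0.7, 0.6}, {0.3, 0.4}, 1.0 / 30.0)) {
      Rect r = snap.selection();
      std::printf("snapshot of (%.2f,%.2f)-(%.2f,%.2f)\n", r.left, r.top, r.right, r.bottom);
      break;
    }
}
```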

Work order scheduler: With the aid of this pane, the person in the field can retrieve, view, and reply to work requests. Up to three work orders with title and status (open, closed, follow-up) are shown concurrently; automatic scrolling brings hidden orders into view (see Figure 8.5). Three dedicated, static hand gestures allow for selection and manipulation of work requests: one gesture selects the work order above the current one, another gesture selects the one below the current one. We chose the discrete posture technique over pointer-based manipulation because scrolling with a pointer and “scrollbars” is an unnatural, awkward operation, especially for mobile user interfaces. The third gesture facilitates activation of the currently selected work order. “Attachments” to a report can be selected from the previously recorded media clips (voice recording, still picture, or video) with “registered” hand movements. We decided this based upon the possibly large number of clips and the convenience of random access compared to sequential access. The selection gesture picks the currently highlighted number.

Figure 8.5: Image of location-independent interaction: location-independent postures (up, down) change the highlighted work order. The gesture being performed in the picture selects the highlighted item.

222


Chapter 8. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in Application<br />

Figure 8.6: Selecting from many items with registered manipulation.<br />

8.6.2 Benefits of HandVu for mobility

Over-exposure of the hand area was a significant problem during outdoor operation at first. The reason is that the hand is often brighter than the background, yet it occupies only a small area in the video frame. Most digital video cameras can only optimize the exposure for the entire frame, not for a selective and dynamically changing area. This lack of functionality prompted the conception and implementation of the automatic exposure control in our software environment that is explained in Section 4.2.2. HandVu now handles difficult illumination both during initial hand detection and during hand tracking and posture recognition. This in turn extends the range of conditions in which HandVu can be deployed as a user interface, specifically to the dynamic conditions experienced with mobile applications.
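The sketch below shows the basic idea of region-of-interest exposure control as motivated above: the mean brightness inside the tracked hand’s bounding box drives a simple proportional adjustment of an abstract exposure setting. The camera interface, the gain constant, and the target value are placeholders; the actual controller is the one described in Section 4.2.2.

```cpp
// Region-of-interest exposure control: keep the hand region near mid-gray.
#include <algorithm>
#include <cstdint>
#include <vector>

// Mean gray value of a rectangular region in an 8-bit grayscale frame.
double meanBrightness(const std::vector<std::uint8_t>& gray, int width,
                      int x0, int y0, int x1, int y1) {
  long long sum = 0, count = 0;
  for (int y = y0; y < y1; ++y)
    for (int x = x0; x < x1; ++x) { sum += gray[y * width + x]; ++count; }
  return count ? static_cast<double>(sum) / count : 0.0;
}

// One control step: nudge the exposure so the hand region approaches mid-gray.
// 'exposure' is an abstract value in [0,1]; mapping it to shutter and gain
// settings is camera-specific and not shown here.
double adjustExposure(double exposure, double handMean) {
  const double target = 128.0;  // aim for mid-gray inside the hand region
  const double gain = 0.0005;   // assumed proportional gain, tuned empirically
  exposure += gain * (target - handMean);
  return std::max(0.0, std::min(1.0, exposure));
}

int main() {
  // Synthetic 64x64 frame that is over-exposed inside the "hand" box.
  const int w = 64, h = 64;
  std::vector<std::uint8_t> frame(w * h, 90);
  for (int y = 20; y < 40; ++y)
    for (int x = 20; x < 40; ++x) frame[y * w + x] = 250;
  double exposure = adjustExposure(0.5, meanBrightness(frame, w, 20, 20, 40, 40));
  return exposure < 0.5 ? 0 : 1;  // exposure is reduced for the bright hand
}
```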

Two-handed interaction offers a particularly attractive way to control certain tools, such as snapping a picture of a selective area. The registered manipulation is no doubt the best interaction technique for this task: even in real life, professional photographers and videographers use their hands to form an artificial picture frame around the scene they intend to immortalize. The correlation (with respect to the hands and the FOV of the world) between the content on the input plane and the manipulation/output plane is a strong indicator of the benefits of this technique.

Most of the interaction techniques that we demonstrated with the Maintenance Application were mouse-based. Again, this generally does not make the best use of hands as an interface modality [55]. However, it realizes generic interface capabilities for mobile deployment of almost any WIMP-style application⁵. Beyond mouse functionality, HandVu recognizes key postures to offer discrete “keys,” another important input capability that is difficult to provide in the wearable computer context. We employed this to switch between the different application panes with a dedicated posture. Furthermore, we showed techniques (restricted pointer movements, scaling, and snapping, see Section 8.3) that improve usability, mostly by decreasing the required extent and precision of a user’s input actions.

⁵ WIMP is “Short for Windows, Icons, Menus and Pointing device, the type of user interface made famous by the Macintosh computer and later imitated by the Windows operating systems.” (Webopedia.com)

8.7 A multimodal augmented reality interface

In this section we describe the third application for which we demonstrated the usefulness of vision interfaces. It uses hand gesture input in combination with voice commands and a handheld trackball to control virtual objects, overlaid over and registered with actual outdoor building structures and indoor room environments. The main contribution with regard to hand gestures is the demonstration of the vision interface in concert with two other modalities and the resulting gain in input expressiveness.

The mobile augmented reality system, developed by Ryan Bane, visualizes otherwise “invisible” information encountered in urban environments. A versatile filtering tool allows for interactive display of occluded infrastructure and of dense data distributions such as room temperature or wireless network strength, with applications for building maintenance, emergency response, and reconnaissance missions. Bane and Höllerer recently extended this interface to a very comprehensive visualization toolkit [5]. Operation of the complex application functionality demands more context-specific interaction techniques than traditional desktop paradigms can offer.

The motivation behind the system is to show how multimodal interfaces can stretch the boundaries of what tasks can be performed on a wearable platform. While mobile computers are constrained in their interaction possibilities by their form factors, they do allow for interaction in and with the real world, in places and situations in which desktop-style computer support would be hard to come by. Through a blend of multimodal integration styles we are able to achieve good overall usability, avoiding the use of any one input modality for purposes to which it is not suited.

8.7.1 System description

The wearable platform is simulated by a bulky prototype backpack system based on two medium-performance laptops. Figure 8.7 shows a diagram of the structure of our system implementation, overlaid over pictures of the actual devices. We need two laptops for performance reasons: the gesture recognizer’s performance drops dramatically if it has to compete with other compute- or bus-communication-intensive applications on the same machine, a 1.1GHz Mobile Pentium 3 running Windows XP. The other laptop runs a custom-built OpenGL-based visualization engine which is described in [5].

In addition to the hardware setup as shown in Section 4.1, we also mounted an InterSense InertiaCube2 orientation tracker atop the glasses. We assumed position tracking to be provided by auxiliary means and manually set our location. A newscaster-style microphone is clipped to the side of the glasses. A tv-one CS-450 Eclipse scan converter overlays the luminance-keyed output from the rendering computer over the mostly unmodified video feed from the first computer. The combined signal provides the input to the head-worn display – for video see-through augmented reality – and to a DV camera that we used to record the images shown in this section.

Figure 8.7: An overview of the hardware components: together with the HandVu software they make up the multimodal AR application. (The diagram’s components are the voice recognizer, gesture recognizer, tracker manager, input handler, tool manager, tools – tunnel tool, path tool, picking tool – renderer, tracker, trackball, mike, video combiner, glasses, and camera.)

In a second configuration, both major software components – recognition and rendering – can be run on the same machine, communicating the image frames between the two processes through a shared memory segment. This significantly reduces setup time and the amount of required gear, as no analog video overlay is necessary. However, the achievable frame rates and the interactivity suffer from sharing the processing resources on commodity hardware.
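For illustration, the minimal POSIX sketch below shows the general pattern of passing frames through shared memory. The segment name and frame size are assumed values, the actual system ran on Windows and would use the equivalent Win32 file-mapping calls, and the synchronization (for example, a semaphore per frame) needed in practice is omitted.

```cpp
// Producer side of a shared-memory frame channel (POSIX sketch).
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  const char* kName = "/handvu_frames";      // assumed segment name
  const size_t kFrameBytes = 640 * 480 * 3;  // one RGB video frame

  // Create and map the shared segment (gesture recognizer side).
  int fd = shm_open(kName, O_CREAT | O_RDWR, 0600);
  if (fd < 0 || ftruncate(fd, kFrameBytes) != 0) return 1;
  void* mem = mmap(nullptr, kFrameBytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (mem == MAP_FAILED) return 1;

  unsigned char frame[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // stand-in pixel data
  std::memcpy(mem, frame, sizeof(frame));             // "publish" the frame

  // The renderer process would shm_open() the same name, mmap() it, and read
  // the pixels without any copy over a network or analog video path.
  munmap(mem, kFrameBytes);
  close(fd);
  shm_unlink(kName);
  return 0;
}
```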

HandVu gets its hands on the video frames first and performs its gesture recognition tasks. The camera image is also corrected for lens distortion. This is important for subsequent registration with the virtual world: graphics hardware can only display geometric projections and is not able to correct for nonlinear distortions. Thus, the image to align with must not be distorted either.
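To illustrate why a nonlinear warp (rather than a projective transform) is required, the sketch below applies a standard two-coefficient radial distortion model. The intrinsics and coefficients are placeholder values; the dissertation’s system relies on its own camera calibration, which is not reproduced here.

```cpp
// Inverse-mapping step used when building an undistortion remap table.
#include <cstdio>

struct Intrinsics {
  double fx, fy;  // focal lengths in pixels
  double cx, cy;  // principal point
  double k1, k2;  // radial distortion coefficients
};

// Given a pixel (u,v) in the *undistorted* output image, compute where to
// sample in the *distorted* camera image.
void distortPoint(const Intrinsics& c, double u, double v, double* ud, double* vd) {
  const double x = (u - c.cx) / c.fx;  // normalized image coordinates
  const double y = (v - c.cy) / c.fy;
  const double r2 = x * x + y * y;
  const double scale = 1.0 + c.k1 * r2 + c.k2 * r2 * r2;
  *ud = c.fx * (x * scale) + c.cx;
  *vd = c.fy * (y * scale) + c.cy;
}

int main() {
  const Intrinsics cam{500.0, 500.0, 320.0, 240.0, -0.25, 0.07};  // assumed values
  double ud, vd;
  distortPoint(cam, 600.0, 60.0, &ud, &vd);  // a point near the image corner
  std::printf("sample distorted image at (%.1f, %.1f)\n", ud, vd);
  // A full undistortion pass evaluates this for every output pixel and
  // bilinearly interpolates the source image at (ud, vd).
}
```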

The tracking result and the undistorted picture are passed to the AR rendering module. This module generates 3D graphics that are registered with the current camera view. It also generates a screen-stabilized user interface that provides feedback and conventional mouse interaction functionality to the user as a fallback solution and for development.

8.7.2 The Tunnel Tool and other visualizations

The Tunnel Tool is a sophisticated technique to visualize complex information while in the field and immersed in augmented reality, conceived and implemented by Ryan Bane in collaboration with Tobias Höllerer and myself.⁶ In essence, the Tunnel Tool selects a volume from the entire field of view and displays it in a different manner than its surroundings. The tool is apparent to the viewer as a bluish plane that occludes the real world and occupies a part of the screen. It can “see through walls” and visualize occluded objects in front of the bluish plane. It can also filter dense and overlapping data sets and present them in the form of visually more informative data slices. We fixed the cutout volume’s position with respect to the user’s viewpoint because initial experiments showed no gain from relocating it off the center of the view axis. The three-dimensional layout of the Tunnel Tool that creates this impression is shown in Figure 8.8.

⁶ A related publication [5] describes the tool and an extension in more detail.
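Figure 8.8, whose caption follows below, distinguishes a Focus Region (rendered in full), a Context Region (rendered as wireframes), and normally rendered surroundings. The sketch below illustrates that classification step only; the region boundaries, the screen-footprint test, and all numeric values are assumptions for illustration, not taken from the implementation described in [5].

```cpp
// Decide how an object should be rendered relative to the Tunnel Tool.
#include <cstdio>

enum class RenderStyle { Normal, Wireframe, Full };

struct Tunnel {
  double halfWidth;                // half-extent of the screen-fixed footprint
  double focusNear, focusFar;      // Focus Region along the view axis (meters)
  double contextNear, contextFar;  // Context Region enclosing the focus
};

RenderStyle classify(const Tunnel& t, double lateralOffset, double depth) {
  if (lateralOffset > t.halfWidth) return RenderStyle::Normal;  // outside tunnel
  if (depth >= t.focusNear && depth <= t.focusFar) return RenderStyle::Full;
  if (depth >= t.contextNear && depth <= t.contextFar) return RenderStyle::Wireframe;
  return RenderStyle::Normal;
}

int main() {
  const Tunnel t{2.0, 10.0, 20.0, 5.0, 40.0};  // assumed example extents
  std::printf("%d %d %d\n",
              static_cast<int>(classify(t, 0.5, 15.0)),   // inside focus: Full
              static_cast<int>(classify(t, 0.5, 30.0)),   // context: Wireframe
              static_cast<int>(classify(t, 5.0, 15.0)));  // off-tunnel: Normal
}
```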

Figure 8.8: Schematic view of the Tunnel Tool: only objects inside the Focus Region are rendered in full. Objects that fall within the Context Region are rendered as wireframes. Objects to either side of the apparent tunnel are rendered in their normal fashion. The drawing is courtesy of Ryan Bane.

Users can take 3D snapshots of the virtual environment and the current view of the real world. Virtual objects that can be placed in the virtual environment improve expressiveness by allowing for realistic visual annotations. This helps with explaining the location and relation of real and virtual objects to other people, for example, to someone who has to mount a piece of hardware to a wall in a certain way.

8.7.3 Speech recognition

We use a prototype automatic speech recognition library (ASRlib), provided to us by the Panasonic Speech Technology Laboratory. ASRlib is targeted towards computationally very efficient, speaker-independent recognition of simple grammars. We use the English dataset, about 70 keywords, and a grammar that allows around 300 distinct phrases with these keywords. Command phrases must be preceded and followed by a brief pause in the speech, but the words can be concatenated naturally. The recognizer performed well in our tests, not producing any false positives despite listening in on all of our conversations, but it sometimes required the repetition of a command. It consumed few enough resources to run alongside the power-hungry computer vision application. It lives in its own process and sends recognized phrases as text strings to our rendering application, where they are parsed again and interpreted as AR commands.
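As an illustration of that last step, the sketch below maps recognized phrase strings to application actions. The handful of phrases shown is a small, assumed subset of the roughly 300 phrases the grammar allows; the actual parser is part of the rendering application and may differ.

```cpp
// Turn phrase strings delivered by the recognizer into AR commands.
#include <functional>
#include <iostream>
#include <map>
#include <string>

int main() {
  bool tunnelOpen = false;
  bool snapshotPending = false;

  // Map exact phrases (as delivered over the text channel) to actions.
  const std::map<std::string, std::function<void()>> commands = {
      {"open tunnel",   [&] { tunnelOpen = true; }},
      {"close tunnel",  [&] { tunnelOpen = false; }},
      {"take snapshot", [&] { snapshotPending = true; }},
      {"discard",       [&] { snapshotPending = false; }},
  };

  for (const std::string phrase : {"open tunnel", "take snapshot"}) {
    auto it = commands.find(phrase);
    if (it != commands.end()) it->second();  // execute the AR command
    else std::cout << "unrecognized phrase: " << phrase << "\n";
  }
  std::cout << "tunnel open: " << tunnelOpen
            << ", snapshot pending: " << snapshotPending << "\n";
}
```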

8.7.4 Interacting with the visualized invisible

Not every input interface is well-suited to every task. We will briefly describe the motivations behind choosing a particular interface for the variety of tasks in our multimodal application.

Keyboards and mice are unsuited for immersion in the HMD-created mixed reality world. We felt that discrete, “binary” parameters (previously mapped to one key each) are best accessed by equally discrete and binary speech commands, some of which are shown in Table 8.1.

Positioning, sizing, and orienting objects could be done with multiple sequential 1-dimensional input steps, but at least for repositioning this is very awkward and differs starkly from the direct interaction employed for positioning a physical object. We therefore combined the 2-dimensional input from hand tracking with the 1-dimensional input from a “ring trackball” (see the picture in Figure 8.9) to achieve concurrent input of 3-dimensional control data. The trackball is convenient because it is easy to retrieve and to stow away, especially after we provided for attachment to the user’s pants with a Velcro strip. In addition, it allows for less encumbered hand gesture input: if the user chooses to use the same hand for gesturing and trackball operation, she can leave the trackball dangling from the index finger during gesturing. Only one dimension of the input was needed for our system, but the device has the full functionality of a 3-button mouse and thus allows for UI extensibility. In addition, since trackballs and similar concepts deliver information about unbound relative movement, their output domain is infinite. This is important if the interaction range is also unlimited – as in the case of moving the regions in our tunnel tool.

Figure 8.9: A hand-worn trackball: this was our favorite device to provide one-dimensional input to our system: a trackball that can be worn similar to a ring. We also provided an attachment to the user’s pants with a Velcro strip.

Table 8.1: The mapping of input to effect: this table details the modalities and commands that controlled the various application parameters.

control parameter                         input modality
non-dimensional:                          voice commands:
  ⊲ take/save/discard snapshot              ⊲ “take snapshot,” “save,” “discard”
  ⊲ tunnel mode (float, open, close)        ⊲ “open/float/close tunnel”
  ⊲ add/remove viz to/from env.             ⊲ “add bundle networking to tunnel”
  ⊲ etc.                                    ⊲ etc.
one-dimensional:                          ring trackball:
  ⊲ adjust Focus Region distance            ⊲ roll forwards or backwards
two-dimensional:                          gesture with speech-driven modes:
  ⊲ pencil tool for finger                  ⊲ point + “save,” “discard”
three-dimensional:                        gesture+trackball+speech-driven mode:
  ⊲ position, size, orient objects          ⊲ point + roll + “finished”

Virtual objects in our system can be manipulated with hand gestures, trackball input, and voice commands. The user either selects an object with his finger with the voice command “select picking tool for finger,” or inserts a new virtual object into the world with a voice command. The system then enters the “relocate” mode: the object can be moved in all three dimensions with a combination of hand gesture (x, y) and trackball input (z). The gesture commands work as follows: the user first makes the closed hand gesture, which sets the system to track the motions of his hand and apply them to the object’s position. The user moves his hand left and right to move the object on the x axis, and up and down to move the object along the y axis. When he is satisfied with the object’s position, he again makes the closed hand gesture, which stops the system from applying his hand motions to the object. Alternatively, the voice command “finished” has the same effect. Next, the “resize” mode is automatically entered and the same input modalities allow 3-dimensional resizing of the object. Again, closed or “finished” exits this mode and enters the next, in which the object can be rotated around each axis, again with the same input modalities. The voice commands “move object,” “resize object,” and “orient object” enter the respective manipulation modes, which are exited with the “finished” command.
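The moded manipulation loop just described can be summarized by a small state machine, sketched below: the closed-hand gesture (or the spoken “finished”) advances through relocate, resize, and orient, and while a mode is active the hand’s (x, y) motion and the trackball’s rolls (z) are applied to the selected object. Event names and magnitudes are assumptions for illustration.

```cpp
// State machine for gesture + trackball object manipulation.
#include <iostream>

enum class Mode { Idle, Relocate, Resize, Orient };

struct Object3D {
  double pos[3] = {0, 0, 0};
  double scale[3] = {1, 1, 1};
  double rot[3] = {0, 0, 0};
};

class Manipulator {
 public:
  // The "closed" hand posture or the spoken "finished" both end the current mode.
  void onModeToggle() {
    switch (mode_) {
      case Mode::Idle:     mode_ = Mode::Relocate; break;
      case Mode::Relocate: mode_ = Mode::Resize;   break;
      case Mode::Resize:   mode_ = Mode::Orient;   break;
      case Mode::Orient:   mode_ = Mode::Idle;     break;
    }
  }

  // dx, dy come from hand tracking; dz from the ring trackball.
  void onMotion(Object3D& obj, double dx, double dy, double dz) {
    double* target = nullptr;
    switch (mode_) {
      case Mode::Relocate: target = obj.pos;   break;
      case Mode::Resize:   target = obj.scale; break;
      case Mode::Orient:   target = obj.rot;   break;
      case Mode::Idle:     return;             // motions ignored outside a mode
    }
    target[0] += dx; target[1] += dy; target[2] += dz;
  }

  Mode mode() const { return mode_; }

 private:
  Mode mode_ = Mode::Idle;
};

int main() {
  Object3D box;
  Manipulator m;
  m.onModeToggle();                 // closed hand: enter "relocate"
  m.onMotion(box, 0.2, -0.1, 0.5);  // hand moves the object; trackball sets depth
  m.onModeToggle();                 // "finished": advance to "resize"
  std::cout << "pos=(" << box.pos[0] << "," << box.pos[1] << "," << box.pos[2] << ")\n";
}
```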

8.7.5 Multimodal integration

The integration of four modalities is capable of controlling the entire AR system, including in particular the Tunnel Tool: hand gesture input, voice commands, unidirectional trackball motions, and head orientation. The most frequent interaction, in fact, happens almost without the user being aware of it: head motions are tracked and immediately reflected in the rendered virtual objects. Features are extracted and interpreted independently on every channel. That is, the modalities are combined with late integration, after grammatically correct sentences have been extracted and the location and posture of the hand have been determined. The style of high-level interpretation differs according to input commands and system state. Three styles of late integration are blended in a way that maximizes the overall usability while choosing input from the best-suited modality for a task.

Independent, concurrent interpretation:

Input of this style is interpreted immediately and as atomic commands; think mouse movements over the same window while typing. In our system, most speech commands can be given at any time and have the same effect at any time. For example, the speech directive “add networking to surroundings” can occur simultaneously with gesture or trackball commands and it is interpreted independently of their state.

Singular interpretation of redundant commands:

Redundant commands, that is, commands from one channel that can substitute for commands from a different input channel, are useful for giving the user a choice to pick the momentarily most convenient way to give an instruction. They are interpreted in exactly the same way, and in the case that multiple, mutually redundant commands are given, they are treated as a single instruction. We currently have two cases of this style: “select picking tool for finger” achieves the same result as performing a dedicated hand posture while the hand is being tracked, and the ‘release’ gesture during object manipulation is equivalent to the “finished” speech command. For the case of concurrent commands, we chose two seconds as an appropriate interval in which the mutually redundant commands are to be considered as one. In the first case we avoid such an arbitrary threshold implicitly by entering the picking mode, in which the two commands are not associated with a meaning.
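The two-second window can be implemented as a simple debouncing filter, sketched below. Commands from different channels are first mapped to a shared canonical name (an assumption for illustration); the filter then swallows a duplicate that arrives within the window mentioned above.

```cpp
// Treat mutually redundant commands within a two-second window as one.
#include <iostream>
#include <string>

class RedundantCommandFilter {
 public:
  // Returns true if the command should be executed, false if it duplicates an
  // equivalent command that already fired within the last two seconds.
  bool accept(const std::string& canonicalCommand, double nowSeconds) {
    if (canonicalCommand == lastCommand_ && nowSeconds - lastTime_ < kWindow)
      return false;  // redundant: swallow the duplicate
    lastCommand_ = canonicalCommand;
    lastTime_ = nowSeconds;
    return true;
  }

 private:
  static constexpr double kWindow = 2.0;  // seconds, as chosen in the text
  std::string lastCommand_;
  double lastTime_ = -1e9;
};

int main() {
  RedundantCommandFilter filter;
  // The 'release' gesture and the spoken "finished" map to the same canonical
  // command, so only the first of the two takes effect.
  std::cout << filter.accept("finished", 10.0) << "\n";  // 1: executed
  std::cout << filter.accept("finished", 10.8) << "\n";  // 0: redundant
  std::cout << filter.accept("finished", 13.5) << "\n";  // 1: outside the window
}
```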

Sequential, moded interpretation:

This style does the opposite of redundant commands: it requires users to provide input first in one modality, then in another. This is a common style within the desktop metaphor – first a mouse click to give focus to a window, then keyboard interaction with that window – and, there, it has the drawback of an associated switching time between mouse and keyboard. In our system, however, there is no such switching time since the two involved modalities do not both involve the hands: the drawing and virtual object manipulation modes use gestures for spatial input and voice commands for mode selection. In fact, we chose this style because it makes the best use of each modality without creating a conflict.

Overall, the modalities work together seamlessly, allowing for interaction that has an almost conversational character. Voice commands allow the user to easily switch features or tools on and off and to enter non-spatial input such as adding items to visualization environments. Hand gestures make for a very natural input interface for spatial data, and a few key hand postures make it possible to perform simple action sequences entirely with gestures. Finally, the trackball provides exact, continuous 1-dimensional input for situations where hand gestures are less convenient or 3-dimensional input is desired.

8.7.6 Benefits of HandVu for powerful interfaces

The benefits of using HandVu and our multimodal interface to control the powerful and complex application functionality became most apparent during the early stages of outdoor system deployment. All application parameters were initially controlled through a keyboard and mouse interface that required access to the two laptop computers. We could only achieve this by spreading out the hardware on a bench (aside from the head-worn components), prohibiting almost all user mobility. Furthermore, immersion in the head-worn display made the interaction virtually impossible, and a second person had to operate keyboard and mouse on command of the HMD wearer. The ring trackball (see Figure 8.9), along with a keyboard held by the main user, improved matters, but this was obviously still significantly flawed. It was not until we had most input functionality from the speech and gesture recognition available that the system became less cumbersome to operate. Finally, we were able to completely stow away all computational components in a backpack and regained mobility. The user now interfaces with the application only through the head-worn camera, the ring trackball, the microphone, and the head-worn orientation tracker.

HandVu’s gesture recognition also contributes significantly to the multimodal interface as such. For example, spatial input capabilities are much more easily achieved with pointing actions than with spoken commands. By employing the most favorable modality for each task we have avoided awkward input procedures such as pointer- and menu-based command selection. But we have also given the user the possibility to select from two or more modalities to perform the same task, allowing for situational flexibility and personal preferences.

Both the availability of the one best-suited modality and the flexibility are steps towards more natural interaction, increasingly approximating human-human communication. This is not meant as a value statement about human-human interaction and whether human-computer interfaces should attempt its complete replication. However, speech recognition and the ability to interpret gestures build on long-evolved human skills that should not be ignored when searching for alternative input capabilities for wearable computers.

Wearable and mobile computers bring new applications to new fields of deployment. HandVu, together with the multimodal user interface, increases the expressiveness available to wearable computer users, which the devices can leverage, and it achieves respectable usability even for demanding application interfaces.

8.8 Conclusions

This chapter showed how vision-based hand gesture recognition, facilitated with the HandVu library and the WinTk Windows toolkit, can provide user input to applications with different characteristics and thereby achieve a number of objectives. Hand gestures as a replacement interface were shown in the first application; vision providing input in the absence of handheld devices in a wearable computer context was the contribution of the second demonstration; and the benefits of vision-based interfaces in cooperation with three other modalities were exemplified in the last application.

Logical extensions of the current interface include the following. First, hand gestures’ inherent 3D capabilities can supply additional input parameters, especially for manipulation of virtual 3D objects in small-scale virtual or augmented environments. Second, more robust two-handed interaction is equally promising, again due to its naturalness in manipulation, exploration, and communication contexts. Third, recognition of dynamic gestures such as clapping is also an interesting system extension.

239


Chapter 8. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in Application<br />

We showed the potential <strong>and</strong> benefits of vision-based interfaces in various<br />

contexts, hoping to have stimulated interest in further exploration of VBIs as<br />

interaction modality.<br />

240


Chapter 9

The Future in Your Hands

9.1 Recapitulation

We have developed HandVu¹, a computer vision system for the recognition of hand gestures in real time. Novel and improved vision methods had to be devised to meet the strict demands of user interfaces. Tailoring the system and applications for hand motions within a comfort zone that we have established improves user satisfaction and helps optimize the vision methods. Multiple applications demonstrated HandVu in action and showed that it adds to the options for interaction with non-traditional computing environments.

¹ HandVu is pronounced “hand-view.”


9.2 Limitations

The definition of comfort and the extent of the comfort zone are valuable tools to assess the convenience of postures and motions. Comfort does not predicate risk-free postures – injury-prone biomechanics must be determined independently.

HandVu performs well on the tasks that it is designed for. However, it is a research tool that can be fooled easily; that is, HandVu does not currently provide consumer-grade reliability. It recognizes syntactic gestures only; hand waving and other, more semantically laden gestures have to be recognized in subsequent processing stages to which HandVu supplies its results.

Keyboards and mice are probably going to be the preferred user input modalities for many applications. Gesture recognition should not be expected to replace traditional interfaces. The limitations of gesture interfaces must be kept in mind when looking for applications of the technology.

9.2.1 Limits of hand gesture interfaces

Hand gesture interfaces are not the silver bullet that solves all human-computer interaction problems. Rather, they are one of many interaction technologies that need to coexist and cooperate to jointly enhance our communication abilities with computers. There are and will be situations when the hands are occupied with other tasks, when it is socially unacceptable to gesture, or when disabilities prevent their use. Speech recognition is often a well-suited complementary input means that can be conveniently employed in situations when hand gestures fail, and vice versa.

Tactile feedback is lacking when haptically interacting with virtual objects or performing free-hand gestures for commands. The actual disadvantage this inflicts must still be determined, but it is foreseeable that proprioception and human vision cannot fully compensate for this lack of fidelity.

Too large a gesture vocabulary could inhibit adoption of otherwise advantageous interfaces. The evolution of the computer mouse – from a single-button device to multiple buttons and scroll wheels – gave a generation enough time to slowly accept new additions to a by-then-familiar device.

Social acceptance must not be overlooked, and it is advisable to incrementally introduce unusual concepts such as gesturing wildly in the presence of other people.

9.2.2 Limits of vision

Computer vision processing still commands high computing power and high data rates. While smart cameras and vision chips promise to mitigate these factors, commodity hardware will dominate the market for vision platforms at least for the next five years. However, only with smaller, lighter, and less power-hungry devices will the most auspicious environment for vision interfaces be exploitable: computing in mobile and wearable scenarios.

At least monocular vision is theoretically incapable of recovering the full finger configuration of a hand due to occlusions and other circumstances that introduce ambiguity.

Social factors pertaining to computer vision must also not be overlooked. The camera that must be worn on the body might raise privacy concerns for people in the vicinity of the user. The camera could be recording at any time, which might not be in the interest of the filmed party. It has also recently come under discussion whether filming in public places should be generally forbidden if there is reason to believe that the recordings could be used to plan or help execute malicious activities. Social norms develop slowly over time, and a sudden novelty like a permanently worn camera might take considerable adjustment time. Steve Mann pioneered permanently worn computers and gained many insights about being a “photographic cyborg” over more than twenty years, reported, for example, in [114].

244


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

9.3 Next-generation computer interfaces<br />

Provided all or most of these difficulties can be overcome, the potentials <strong>for</strong><br />

vision-based interfaces are great. Not only will we finally be able to communicate<br />

with a computer like a peer, but with someone who underst<strong>and</strong>s subtle notions<br />

<strong>and</strong> speech-accompanying gestures, we will also be one step closer to virtual worlds<br />

that af<strong>for</strong>d all properties of the real world, with benefits <strong>for</strong> design, travel, meet-<br />

ings, <strong>and</strong> the health sector to name just a few. For example, mute people will be<br />

able to have their gesturing in a sign language (such as the American or the Chi-<br />

nese Sign Language) translated instantaneously into computer-generated speech.<br />

A more distant hope is to recreate some of the human retina’s functionality <strong>and</strong><br />

to build artificial vision systems that would allow blind people to gain at least a<br />

rudimentary sense of vision.<br />

Closer to actual deployment are h<strong>and</strong> gesture interfaces <strong>for</strong> the surgeon. Anti-<br />

septic environments prohibit use of a conventional keyboard <strong>and</strong> mouse interface<br />

to a computer. Inefficient mediation through an operation assistant is currently<br />

the only way <strong>for</strong> the surgeon to access a computer – frequently an essential tool to<br />

modern medicine. Free-h<strong>and</strong> gesture recognition immediately alleviates problems<br />

of sepsis.<br />

245


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

Once a camera is worn on the body, tremendous opportunities present them-<br />

selves, especially in combination with a head-worn display unit: recognition of<br />

familiar faces can augment the human name memory, scene recognition can aid<br />

in navigation, personal video albums can be created with ease, <strong>and</strong> so on. More<br />

technically, registration <strong>for</strong> augmented reality has been shown to be very accurate<br />

with vision sensors – a head-worn camera is in the ideal place to facilitate this<br />

function.<br />

As Turk notes in [177], the technical challenges that presently constitute the<br />

main hurdles are robustness of the vision methods, their speed, automatic ini-<br />

tialization, interface usability, <strong>and</strong> contextual integration of the interface into the<br />

application.<br />

Very immediate challenges are, <strong>for</strong> example, robust real-time detection of<br />

h<strong>and</strong>s in arbitrary postures <strong>and</strong> 3D h<strong>and</strong> posture estimation (the recovery of each<br />

finger’s joint angles). Other applications such as marker-less full body tracking<br />

would benefit from advances in this topic as well, with applications, <strong>for</strong> example,<br />

in motion capture in unprepared environments.<br />

A few developments will increase the speed with which these dreams can be<br />

realized.<br />

• Vision chips are image acquisition chips with integrated, programmable circuitry. Highly parallel processing at the data source allows for very fast image processing, avoiding off-chip bandwidth limitations. However, no standard programming models are established for vision chips; only custom and often hardware-based solutions allow their programming. A low-level language similar to OpenGL is needed to bridge the gap between hardware developers and computer vision researchers.

• Z-cameras, or depth cameras, report each pixel’s distance from the focal plane. Object segmentation, in particular of proximal objects such as the hands, is almost trivial given these data. The challenges are high-resolution chips and models that can incorporate the large amount of information to achieve superior precision and accuracy.

• Standardization at various levels is mandatory for fast progress, yet it requires concerted efforts, oftentimes of business competitors. Standards at the intersection between hardware and software promise equally accelerated progress, as OpenGL did for graphics. Models of gestural interaction must be developed that are independent of specific implementation methods (such as computer vision or data gloves) and their momentary shortcomings. This is important to free application developers from the burden of mastering the gesture recognition implementation technology.

• Hardware miniaturization and performance improvements, especially of the vision and display components, would help to lower the threshold for user acceptance of the gear required to implement pioneering applications in non-traditional computing environments.

• Virtual reality and augmented reality research into user interaction paradigms has in the past brought about the most compelling applications for hand gesture interfaces. Continuing this trend is important to bring these inextricable technologies forward in close synchronization.

• Wearable and mobile computer systems are equally important because they allow computers to penetrate more areas of our lives and because of their particularly high demands on the user interface. The challenge is to offer versatile, highly efficient interaction methods without a large form factor and without impeding the wearer’s ability to interact normally with the environment.

• The famed killer application would tremendously accelerate progress, for example, a computer game for the embedded-devices market that makes compelling use of computer vision. This would help break the chicken-and-egg problem of mutually required hardware capabilities and consumer market size.

248


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

9.4 Conclusions<br />

We have broken the ground <strong>for</strong> easy deployment of vision-based h<strong>and</strong> gesture<br />

interfaces in many application domains. Our integrated approach constitutes an<br />

example <strong>for</strong> how novel user interface technologies should be introduced: careful<br />

technology selection <strong>and</strong> evaluation at every level, from theory <strong>and</strong> human factors<br />

considerations to the practical issues of balancing latency with accuracy.<br />

With this dissertation, we have shown that computer vision is on the brink of<br />

becoming a viable user interface technology <strong>for</strong> consumer-grade applications. The<br />

research conducted produced contributions in reliable detection <strong>and</strong> fast tracking<br />

of h<strong>and</strong>s in video images, <strong>and</strong> robust posture recognition; it made possible the<br />

definition of postural com<strong>for</strong>t so that it can be measured with entirely objective<br />

means; <strong>and</strong> lastly, it demonstrated the enhanced interaction capabilities that the<br />

newly available modality enables.<br />

Computers are immensely powerful <strong>and</strong> we have only begun to explore their<br />

far-reaching capabilities. By enabling new ways to interact with computers, <strong>and</strong><br />

by building a toolkit of available interaction modalities, we open the door <strong>for</strong> new<br />

functionalities, new devices, <strong>and</strong> new ways to think about <strong>and</strong> to think with com-<br />

puters. Interaction with h<strong>and</strong> gestures is an important step in that direction since<br />

h<strong>and</strong> motions <strong>and</strong> actions assume such crucial <strong>and</strong> diverse roles in our daily lifes.<br />

249


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

Computer vision in particular offers unencumbered data acquisition capabilities,<br />

<strong>and</strong> our work has shown that it is ready to be taken out of the lab into real appli-<br />

cations. We are in anticipation of further progress on topics that bring together<br />

the fields of computer vision, human-computer interaction, <strong>and</strong> graphics.<br />

250


Bibliography<br />

[1] V. Athitsos <strong>and</strong> S. Sclaroff. Estimating 3D <strong>H<strong>and</strong></strong> Pose from a Cluttered<br />

Image. In Proc. IEEE Conference on Computer <strong>Vision</strong> <strong>and</strong> Pattern Recognition,<br />

volume 2, pages 432–439, 2003.<br />

[2] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, <strong>and</strong> B. MacIntyre.<br />

Recent Advances in Augmented Reality. IEEE Computer Graphics <strong>and</strong><br />

Applications, 21(6):34–47, Nov/Dec 2001.<br />

[3] R. Azuma, J. W. Lee, B. Jiang, J. Park, S. You, <strong>and</strong> U. Neumann. Tracking<br />

in Unprepared Environments <strong>for</strong> Augmented Reality Systems. ACM<br />

Computers & Graphics, 23(6):787–793, December 1999.<br />

[4] R. T. Azuma. A Survey of Augmented Reality. Presence: Teleoperators <strong>and</strong><br />

Virtual Environments, 6(4):355 – 385, August 1997.<br />

[5] R. Bane <strong>and</strong> T. Höllerer. Interactive Tools <strong>for</strong> Virtual X-Ray <strong>Vision</strong> in<br />

Mobile Augmented Reality. In Proc. IEEE <strong>and</strong> ACM Intl. Symposium on<br />

Mixed <strong>and</strong> Augmented Reality, November 2004.<br />

[6] J. L. Barron, D. J. Fleet, <strong>and</strong> S. S. Beauchemin. Per<strong>for</strong>mance of Optical<br />

Flow Techniques. Int. Journal of Computer <strong>Vision</strong>, 12(1):43–77, 1994.<br />

[7] H. S. J. Bell <strong>and</strong> F. Wu. Very Fast Template Matching. In European<br />

Conference on Computer <strong>Vision</strong>, pages 358–372, May 2002.<br />

[8] S. Belongie, J. Malik, <strong>and</strong> J. Puzicha. Shape Matching <strong>and</strong> Object Recognition<br />

Using Shape Contexts. In IEEE Trans. Pattern Analysis <strong>and</strong> Machine<br />

Intelligence, volume 24, pages 509–522, April 2002.<br />

[9] V. Bhatmager, C. Drury, <strong>and</strong> S. Schiro. Posture, Postural Discom<strong>for</strong>t <strong>and</strong><br />

Per<strong>for</strong>mance. Human Factors, 27:189–199, 1985.<br />

251


Bibliography<br />

[10] S. Birchfield. Elliptical head tracking using intensity gradients <strong>and</strong> color<br />

histograms. In Proceedings of the IEEE Conference on Computer <strong>Vision</strong><br />

<strong>and</strong> Pattern Recognition, pages 232–237, June 1998.<br />

[11] R. A. Bolt. Put-That-There: Voice <strong>and</strong> <strong>Gesture</strong> in the Graphics Interface.<br />

Computer Graphics, ACM SIGGRAPH, 14(3):262–270, 1980.<br />

[12] G. Borg. Psychophysical bases of perceived exertion. Medicine <strong>and</strong> Science<br />

in Sports <strong>and</strong> Exercise, 14(5):377–381, 1982.<br />

[13] D. Bowman. Interactive Techniques <strong>for</strong> Common Tasks in Immersive Virtual<br />

Environments: Design, Evaluation, <strong>and</strong> Application. PhD thesis, Georgia<br />

Tech, 1999.<br />

[14] G. R. Bradski. Real-time face <strong>and</strong> object tracking as a component of a perceptual<br />

user interface. In Proc. IEEE Workshop on Applications of Computer<br />

<strong>Vision</strong>, pages 214–219, 1998.<br />

[15] A. Braf<strong>for</strong>t, C. Collet, <strong>and</strong> D. Teil. Anthropomorphic model <strong>for</strong> h<strong>and</strong> gesture<br />

interface. In Proceedings of the CHI ’94 conference companion on Human<br />

factors in computing systems, April 1994.<br />

[16] M. Br<strong>and</strong>. Shadow Puppetry. In Proc. Intl. Conference on Computer <strong>Vision</strong>,<br />

1999.<br />

[17] M. Bray, E. Koller-Meier, <strong>and</strong> L. V. Gool. Smart Particle Filtering <strong>for</strong> 3D<br />

<strong>H<strong>and</strong></strong> Tracking. In Proc. IEEE Intl. Conference on Automatic Face <strong>and</strong><br />

<strong>Gesture</strong> Recognition, 2004.<br />

[18] M. Bray, E. Koller-Meier, L. V. Gool, <strong>and</strong> N. M. Schraudolph. 3D <strong>H<strong>and</strong></strong><br />

Tracking by Rapid Stochastic Gradient Descent Using a Skinning Model. In<br />

European Conference on Visual Media Production (CVMP), March 2004.<br />

[19] L. Bretzner, I. Laptev, <strong>and</strong> T. Lindeberg. <strong>H<strong>and</strong></strong> <strong>Gesture</strong> Recognition using<br />

Multi-Scale Colour Features, Hierarchical Models <strong>and</strong> Particle Filtering. In<br />

Proc. IEEE Intl. Conference on Automatic Face <strong>and</strong> <strong>Gesture</strong> Recognition,<br />

pages 423–428, Washington D.C., 2002.<br />

[20] W. Broll, L. Schäfer, T. Höllerer, <strong>and</strong> D. Bowman. Interface with Angels:<br />

The Future of VR <strong>and</strong> AR <strong>Interfaces</strong>. IEEE Computer Graphics <strong>and</strong> Applications,<br />

21(6):14–17, Nov./Dec. 2001.<br />

252


Bibliography<br />

[21] H. Bunke and T. Caelli, editors. Hidden Markov Models in Vision, volume 15(1) of International Journal of Pattern Recognition and Artificial Intelligence. World Scientific Publishing Company, 2001.

[22] W. Buxton, E. Fiume, R. Hill, A. Lee, and C. Woo. Continuous hand-gesture driven input. In Proceedings of Graphics Interface '83, 9th Conference of the Canadian Man-Computer Communications Society, pages 191–195, May 1983.

[23] C. Cadoz. Les réalités virtuelles, 1994.

[24] D. B. Chaffin. Localized Muscle Fatigue – Definition and Measurement. Journal of Occupational Medicine, 15(4):346–354, 1973.

[25] D. B. Chaffin and G. B. J. Andersson. Occupational Biomechanics. Wiley-Interscience, 1984.

[26] M. K. Chung, I. Lee, D. Kee, and S. H. Kim. A Postural Workload Evaluation System Based on a Macro-postural Classification. Human Factors and Ergonomics in Manufacturing, 12(3):267–277, 2002.

[27] K. C. Clarke, A. Nuernberger, T. Pingel, and D. Qingyun. User Interface Design for a Wearable Field Computer. In Proc. of National Conference on Digital Government Research, 2002.

[28] D. Comaniciu, V. Ramesh, and P. Meer. Real-Time Tracking of Non-Rigid Objects Using Mean Shift. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 142–149, 2000.

[29] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active Appearance Models. In Proc. European Conference on Computer Vision, pages 484–498, 1998.

[30] T. F. Cootes and C. J. Taylor. Active Shape Models: Smart Snakes. In Proceedings of the British Machine Vision Conference, pages 9–18. Springer-Verlag, 1992.

[31] E. N. Corlett and R. P. Bishop. A Technique for Assessing Postural Discomfort. Ergonomics, 19(1):175–182, 1976.

[32] J. L. Crowley, F. Berard, and J. Coutaz. Finger Tracking as an Input Device for Augmented Reality. In Intl. Workshop on Automatic Face and Gesture Recognition, 1995.

[33] Y. Cui and J. Weng. A Learning-Based Prediction and Verification Segmentation Scheme for Hand Sign Image Sequence. IEEE Trans. Pattern Analysis and Machine Intelligence, pages 798–804, 1999.

[34] R. Cutler and M. Turk. View-based Interpretation of Real-time Optical Flow for Gesture Recognition. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 416–421, April 1998.

[35] R. Desimone, T. D. Albright, C. G. Gross, and C. Bruce. Stimulus-Selective Properties of Inferior Temporal Neurons in the Macaque. Journal of Neuroscience, 4(8):2051–2062, August 1984.

[36] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 126–133, 2000.

[37] M. Dias, J. Jorge, J. Carvalho, P. Santos, and J. Luzio. Usability Evaluation of Tangible User Interfaces for Augmented Reality. In IEEE Intl. Augmented Reality Toolkit Workshop, 2003.

[38] D. E. DiFranco, T.-J. Cham, and J. M. Rehg. Reconstruction of 3-D Figure Motion from 2-D Correspondences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, November 2001.

[39] S. M. Dominguez, T. Keaton, and A. H. Sayed. Robust Finger Tracking for Wearable Computer Interfacing. In ACM PUI 2001, Orlando, FL, 2001.

[40] K. Dorfmüller-Ulhaas and D. Schmalstieg. Finger Tracking for Interaction in Augmented Environments. In Proc. ACM/IEEE Intl. Symposium on Augmented Reality, 2001.

[41] C. G. Drury and B. G. Coury. A methodology for chair evaluation. Applied Ergonomics, 13(3):195–202, 1982.

[42] S. Feiner, B. MacIntyre, T. Höllerer, and T. Webster. A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring the Urban Environment. In Proc. First Intl. Symp. on Wearable Computers, October 1997.

[43] T. G. Fikes. System Architecture Analysis for Reaching and Grasping. PhD thesis, University of California at Santa Barbara, 1993.

[44] P. M. Fitts. The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47:381–391, 1954.

[45] E. Foxlin and M. Harrington. WearTrack: A Self-Referenced Head and Hand Tracker for Wearable Computers and Portable VR. In 4th Intl. Symp. on Wearable Computers, pages 155–162, October 2000.

[46] E. Foxlin and L. Naimark. VIS-Tracker: A Wearable Vision-Inertial Self-Tracker. In Proc. of the IEEE Virtual Reality Conference, 2003.

[47] W. T. Freeman, D. B. Anderson, P. A. Beardsley, C. N. Dodge, M. Roth, C. D. Weissman, and W. S. Yerazunis. Computer Vision for Interactive Computer Graphics. IEEE Computer Graphics and Applications, pages 42–53, May-June 1998.

[48] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: EuroCOLT, pages 23–37. Springer-Verlag, 1995.

[49] M. Fukumoto, Y. Suenaga, and K. Mase. Finger-Pointer: Pointing Interface by Image Processing. Computers & Graphics, 18(5):633–642, 1994.

[50] Y. Gdalyahu and D. Weinshall. Flexible Syntactic Matching of Curves and Its Application to Automatic Hierarchical Classification of Silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12), December 1999.

[51] E. Grandjean. Fitting the Task to the Man – An Ergonomic Approach. Taylor & Francis Ltd, London, 1969.

[52] S. Grange, E. Casanova, T. Fong, and C. Baur. Vision-based Sensor Fusion for Human-Computer Interaction. In Intl. Conference on Intelligent Robots and Systems, October 2002.

[53] Y. Hamada, N. Shimada, and Y. Shirai. Hand Shape Estimation Using Sequence of Multi-Ocular Images Based on Transition Network. In VI 2002, 2002.

[54] C. Hand. A Survey of 3D Interaction Techniques. Computer Graphics Forum, 16(5):269–281, 1997.

[55] A. G. Hauptmann. Speech and Gesture for Graphic Image Manipulation. In ACM CHI, pages 241–245, May 1989.

[56] T. Heap and D. Hogg. Towards 3D Hand Tracking Using a Deformable Model. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 1996.

[57] N. Hedley, M. Billinghurst, L. Postner, R. May, and H. Kato. Explorations in the Use of Augmented Reality for Geographic Visualization. Presence, 11(2):119–133, 2002.

[58] K. Hinckley, R. Pausch, J. C. Goble, and N. F. Kassell. A survey of design issues in spatial input. In Proceedings of the 7th Annual ACM Symposium on User Interface Software and Technology, pages 213–222. ACM Press, 1994.

[59] K. Hinckley, R. Pausch, D. Proffitt, and N. F. Kassell. Two-handed virtual manipulation. ACM Transactions on Computer-Human Interaction (TOCHI), 5(3):260–302, 1998.

[60] E. Hjelmås and B. K. Low. Face Detection: A Survey. Computer Vision and Image Understanding, 83(3):236–274, September 2001.

[61] T. Höllerer, S. Feiner, D. Hallaway, B. Bell, M. Lanzagorta, D. Brown, S. Julier, Y. Baillot, and L. Rosenblum. User Interface Management Techniques for Collaborative Mobile Augmented Reality. Computers and Graphics, 25(5):799–810, October 2001.

[62] T. Höllerer, S. Feiner, T. Terauchi, G. Rashid, and D. Hallaway. Exploring MARS: Developing Indoor and Outdoor User Interfaces to a Mobile Augmented Reality System. Computers and Graphics, 23(6):779–785, December 1999.

[63] P. Hong, M. Turk, and T. S. Huang. Gesture Modeling and Recognition Using Finite State Machines. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 410–415. IEEE Computer Society, March 2000.

[64] X. Hou, S. Z. Li, H. Zhang, and Q. Cheng. Direct Appearance Models. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[65] C. Hummels and P. J. Stappers. Meaningful Gestures for Human Computer Interaction: Beyond Hand Postures. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, April 1998.

[66] M. Isard and A. Blake. A mixed-state CONDENSATION tracker with automatic model-switching. In ICCV, pages 107–112, 1998.

[67] M. Isard and A. Blake. Condensation – Conditional Density Propagation for Visual Tracking. Int. Journal of Computer Vision, 1998.

[68] J. Isdale. What Is Virtual Reality? A Web-Based Introduction, September 1998. http://vr.isdale.com/WhatIsVR.html.

[69] T. Jebara, B. Schiele, N. Oliver, and A. Pentland. DyPERS: Dynamic Personal Enhanced Reality System. In Image Understanding Workshop, November 1998.

[70] N. Jojic, M. Turk, and T. Huang. Tracking Self-Occluding Articulated Objects in Dense Disparity Maps. In Proc. Intl. Conference on Computer Vision, September 1999.

[71] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

[72] M. Jones and P. Viola. Fast Multi-view Face Detection. Technical Report TR2003-96, MERL, July 2003.

[73] M. J. Jones and J. M. Rehg. Statistical Color Models with Application to Skin Detection. Int. Journal of Computer Vision, 46(1):81–96, January 2002.

[74] S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte. A new approach for filtering nonlinear systems. In Proc. American Control Conference, pages 1628–1632, June 1995.

[75] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME, Journal of Basic Engineering, pages 34–45, 1960.

[76] Y. Kameda, M. Minoh, and K. Ikeda. Three dimensional pose estimation of an articulated object from its silhouette image. In Proceedings of Asian Conference on Computer Vision, pages 612–615, 1993.

[77] K. Karhunen. Über Lineare Methoden in der Wahrscheinlichkeitsrechnung. Annales Academiae Scientiarum Fennicae, 37:3–79, 1946.

[78] W. Karwowski, R. Eberts, G. Salvendy, and S. Noland. The effects of computer interface design on human postural dynamics. Ergonomics, 37(4):703–724, 1994.

[79] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. In Proc. Intl. Conference on Computer Vision, pages 259–268, 1987.

[80] H. Kato and M. Billinghurst. Marker Tracking and HMD Calibration for a Video-Based Augmented Reality Conferencing System. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pages 85–94, October 1999.

[81] H. Kato, M. Billinghurst, I. Poupyrev, K. Imamoto, and K. Tachibana. Virtual Object Manipulation on a Table-Top AR Environment. In Proc. Intl. Symp. Augmented Reality, pages 111–119. IEEE CS Press, 2000.

[82] D. Kee. A method for analytically generating three-dimensional isocomfort workspace based on perceived discomfort. Applied Ergonomics, 33(1):51–62, 2002.

[83] A. Kendon. How gestures can become like words. In Cross-Cultural Perspectives in Nonverbal Communication, pages 131–141, 1988.

[84] M. Kirby and L. Sirovich. Application of the Karhunen-Loève Procedure for the Characterization of Human Faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):103–108, January 1990.

[85] R. Kjeldsen and J. Kender. Finding Skin in Color Images. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, pages 312–317, October 1996.

[86] N. Kohtake, J. Rekimoto, and Y. Anzai. InfoPoint: A Device that Provides a Uniform User Interface to Allow Appliances to Work Together over a Network. Personal and Ubiquitous Computing, 5(4):264–274, 2001.

[87] D. Koller, P. Lindstrom, W. Ribarsky, L. F. Hodges, N. Faust, and G. Turner. Virtual GIS: A Real-Time 3D Geographic Information System. In Proceedings of Visualization '95, pages 94–100, October 1995.

[88] M. Kölsch, A. C. Beall, and M. Turk. An Objective Measure for Postural Comfort. In HFES Annual Meeting Notes, October 2003.

[89] M. Kölsch, A. C. Beall, and M. Turk. The Postural Comfort Zone for Reaching Gestures. In HFES Annual Meeting Notes, October 2003.

[90] M. Kölsch and M. Turk. Analysis of Rotational Robustness of Hand Detection with a Viola-Jones Detector. In IAPR International Conference on Pattern Recognition, 2004.

[91] M. Kölsch and M. Turk. Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration. In IEEE Workshop on Real-Time Vision for Human-Computer Interaction (at CVPR), 2004.

[92] M. Kölsch and M. Turk. Robust Hand Detection. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, May 2004.

[93] M. Kölsch, M. Turk, and T. Höllerer. Vision-Based Interfaces for Mobility. In Intl. Conference on Mobile and Ubiquitous Systems (MobiQuitous), August 2004.

[94] T. Koskela and I. Vilpola. Usability of MobiVR Concept: Towards Large Virtual Touch Screen for Mobile Devices. In Proc. Intl. Conference on Mobile HCI, 2004.

[95] T. Koskela, I. Vilpola, and I. Rakkolainen. User Requirements for Large Virtual Display and Finger Pointing Input for Mobile Devices. In Proc. Intl. Conference on Mobile and Ubiquitous Multimedia, December 2003.

[96] M. Kourogi and T. Kurata. A method of personal positioning based on sensor data fusion of wearable camera and self-contained sensors. In Proc. IEEE Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 287–292, 2003.

[97] D. M. Krum, O. Omoteso, W. Ribarsky, T. Starner, and L. F. Hodges. Evaluation of a Multimodal Interface for 3D Terrain Visualization. In IEEE Visualization, pages 411–418, October 27–November 1, 2002.

[98] D. M. Krum, O. Omoteso, W. Ribarsky, T. Starner, and L. F. Hodges. Speech and Gesture Multimodal Control of a Whole Earth 3D Visualization Environment. In Proc. Joint Eurographics and IEEE TCVG Symposium on Visualization (VisSym), pages 195–200, May 2002.

[99] T. Kurata, T. Kato, M. Kourogi, J. Keechul, and K. Endo. A Functionally-Distributed Hand Tracking Method for Wearable Visual Interfaces and Its Applications. In Proc. IAPR Workshop on Machine Vision Applications, pages 84–89, 2002.

[100] T. Kurata, T. Okuma, M. Kourogi, and K. Sakaue. The Hand Mouse: GMM Hand-color Classification and Mean Shift Tracking. In Second Intl. Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time Systems, July 2001.

[101] I. Laptev and T. Lindeberg. Tracking of multi-state hand models using particle filtering and a hierarchy of multi-scale image features. Technical Report ISRN KTH/NA/P-00/12-SE, Department of Numerical Analysis and Computer Science, KTH (Royal Institute of Technology), September 2000.

[102] J. Lee and T. L. Kunii. Model-Based Analysis of Hand Posture. IEEE Computer Graphics and Applications, 15(5):77–86, 1995.

[103] A. Leganchuk, S. Zhai, and W. Buxton. Manual and cognitive benefits of two-handed input: an experimental study. ACM Transactions on Computer-Human Interaction (TOCHI), 5(4):326–359, 1998.

[104] M.-H. Liao and C. G. Drury. Posture, Discomfort and Performance in a VDT Task. Ergonomics, 43(3):345–359, 2000.

[105] R. Lienhart and J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. In Proc. IEEE Intl. Conference on Image Processing, volume 1, pages 900–903, September 2002.

[106] J. Lin, Y. Wu, and T. S. Huang. Modeling the Constraints of Human Hand Motion. In Proceedings of the 5th Annual Federated Laboratory Symposium, 2001.

[107] P. Lindstrom, D. Koller, W. Ribarsky, L. F. Hodges, A. O. den Bosch, and N. Faust. An Integrated Global GIS and Visual Simulation System. Technical Report GIT-GVU-97-07, Georgia Tech, 1997.

[108] M. M. Loève. Probability Theory. Van Nostrand, 1955.

[109] S. Lu, D. Metaxas, D. Samaras, and J. Oliensis. Using Multiple Cues for Hand Tracking and Model Refinement. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2003.

[110] B. D. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proc. Imaging Understanding Workshop, pages 121–130, 1981.

[111] J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface-quality hand tracking. In Proc. European Conf. Computer Vision, 2000.

[112] I. S. MacKenzie. Input devices and interaction techniques for advanced computing. In W. Barfield and T. A. Furness III, editors, Virtual Environments and Advanced Interface Design, pages 437–470. Oxford University Press, 1995.

[113] S. Mann. Smart clothing: The wearable computer and wearcam. Personal Technologies, 1(1), March 1997.

[114] S. Mann. Wearable Computing: A First Step Toward Personal Imaging. IEEE Computer, 30(2), February 1997.

[115] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, 1992.

[116] D. McNeill, editor. Language and Gesture. Cambridge University Press, 2000.

[117] M. Mine, F. Brooks, and C. Sequin. Moving Objects in Space: Exploiting Proprioception in Virtual Environment Interaction. In Proc. ACM SIGGRAPH, 1997.

[118] D. D. Morris and J. M. Rehg. Singularity Analysis for Articulated Object Tracking. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998.

[119] T. A. Mysliwiec. FingerMouse: A Freehand Computer Pointing Interface. Technical Report VISLab-94-001, Vision Interfaces and Systems Lab, The University of Illinois at Chicago, October 1994.

[120] C. Nölker and H. Ritter. GREFIT: Visual recognition of hand postures. In Gesture-Based Communication in HCI, pages 61–72, 1999.

[121] S. Nusser, L. Miller, K. Clarke, and M. Goodchild. Future views of field data collection in statistical surveys. In Proc. of National Conference on Digital Government Research, 2001.

[122] T. Oberg, L. Sandsjo, and R. Kadefors. Subjective and Objective Evaluation of Shoulder Muscle Fatigue. Ergonomics, 37(8):1323–1333, 1994.

[123] T. Ohshima, K. Satoh, H. Yamamoto, and H. Tamura. RV-Border Guards: A Multi-Player Mixed Reality Entertainment. Trans. Virtual Reality Soc. Japan, 4(4):699–705, 1999.

[124] E. J. Ong and R. Bowden. A Boosted Classifier Tree for Hand Shape Detection. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 889–894, 2004.

[125] V. Paelke, J. Stöcklein, C. Reimann, and W. Rosenbach. Supporting User Interface Evaluation of AR Presentation and Interaction Techniques with ARToolkit. In IEEE Intl. Augmented Reality Toolkit Workshop, 2003.

[126] J. Park, B. Jiang, and U. Neumann. Vision-based pose computation: Robust and accurate augmented reality tracking. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pages 3–12, October 1999.

[127] R. Pausch and R. D. Williams. Tailor: creating custom user interfaces based on gesture. In Proceedings of the Third Annual ACM SIGGRAPH Symposium on User Interface Software and Technology, 1990.

[128] V. Pavlovic, R. Sharma, and T. S. Huang. Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677–695, July 1997.

[129] V. I. Pavlovic and A. Garg. Boosted Detection of Objects and Attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[130] W. Piekarski and B. Thomas. ARQuake: The Outdoor Augmented Reality Gaming System. Communications of the ACM, 45(1):36–38, January 2002.

[131] W. Piekarski and B. Thomas. Tinmith-Hand: Unified User Interface Technology for Mobile Outdoor Augmented Reality and Indoor Virtual Reality. In IEEE VR, pages 287–288, March 2002.

[132] W. Piekarski and B. Thomas. Using AR Toolkit for 3D Hand Position Tracking in Mobile Outdoor Environments. In The First IEEE Workshop on the Augmented Reality Toolkit, September 2002.

[133] W. Piekarski and B. H. Thomas. Developing Interactive Augmented Reality Modelling Applications. In International Workshop on Software Technology for Augmented Reality Systems, 2003.

[134] J. S. Pierce, B. Stearns, and R. Pausch. Two Handed Manipulation of Voodoo Dolls in Virtual Environments. In Symposium on Interactive 3D Graphics, pages 141–145, 1999.

[135] M. Porta. Vision-based user interfaces: methods and applications. Int. Journal of Human-Computer Studies, 57:27–73, 2002.

[136] F. K. H. Quek. Eyes in the Interface. Image and Vision Computing, 13, August 1995.

[137] F. K. H. Quek. Unencumbered Gestural Interaction. IEEE Multimedia, 4(3):36–47, 1996.

[138] F. K. H. Quek, T. Mysliwiec, and M. Zhao. FingerMouse: A Freehand Pointing Interface. In Proc. Int'l Workshop on Automatic Face and Gesture Recognition, pages 372–377, June 1995.

[139] I. Rauschert, P. Agrawal, R. Sharma, S. Fuhrmann, I. Brewer, A. MacEachren, H. Wang, and G. Cai. Designing a Human-Centered, Multimodal GIS Interface to Support Emergency Management. In GIS, November 2002.

[140] S. Razzaque, Z. Kohn, and M. C. Whitton. Redirected Walking. In EUROGRAPHICS, 2001.

[141] J. M. Rehg and T. Kanade. Visual Tracking of High DOF Articulated Structures: an Application to Human Hand Tracking. In Third European Conf. on Computer Vision, pages 35–46, May 1994.

[142] J. M. Rehg and T. Kanade. Model-Based Tracking of Self-Occluding Articulated Objects. In Proc. Intl. Conference on Computer Vision, pages 612–617, June 1995.

[143] J. Rekimoto. Matrix: A Realtime Object Identification and Registration Method for Augmented Reality. In Proc. Asia Pacific Computer Human Interaction (APCHI), 1998.

[144] J. Rekimoto and K. Nagao. The World through the Computer: Computer Augmented Interaction with Real World Environments. In Proceedings of the Eighth Annual Symposium on User Interface Software and Technology (UIST '95), pages 29–36, 1995.

[145] C. W. Reynolds. Flocks, Herds, and Schools: A Distributed Behavioral Model. Computer Graphics, 21(4):25–34, 1987. SIGGRAPH '87 Conference Proceedings.

[146] B. J. Rhodes. The wearable remembrance agent: a system for augmented memory. Personal Technologies Journal; Special Issue on Wearable Computing, pages 218–224, 1997.

[147] D. A. Rosenbaum, R. J. Meulenbroek, J. Vaughan, and C. Jansen. Posture-Based Motion Planning: Applications to Grasping. Psychological Review, 108(4):709–734, 2001.

[148] G. Salvendy, editor. Handbook of Human Factors and Ergonomics. John Wiley & Sons, Inc., 2nd edition, 1997.

[149] Y. Sato, Y. Kobayashi, and H. Koike. Fast Tracking of Hands and Fingertips in Infrared Images for Augmented Desk Interface. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, March 2000.

[150] D. Saxe and R. Foulds. Toward robust skin identification in video images. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 379–384, September 1996.

[151] B. Schiele and A. Waibel. Gaze tracking based on face-color. In Proceedings of the International Workshop on Automatic Face- and Gesture-Recognition, pages 344–349, June 1995.

[152] J. Segen and S. Kumar. GestureVR: Vision-Based 3D Hand Interface for Spatial Interaction. In The Sixth ACM Intl. Multimedia Conference, September 1998.

[153] J. Segen and S. Kumar. Shadow Gestures: 3D Hand Pose Estimation Using a Single Camera. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1479–1486, 1999.

[154] J. Segen and S. Kumar. Look Ma, No Mouse! Communications of the ACM, 43(7):102–109, July 2000.

[155] C. Shan, Y. Wei, T. Tan, and F. Ojardias. Real Time Hand Tracking by Combining Particle Filtering and Mean Shift. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 2004.

[156] T. Sheridan and W. Ferrell. Remote Manipulative Control with Transmission Delay. IEEE Transactions on Human Factors in Electronics, 4:25–29, 1963.

[157] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proc. Intl. Conference on Computer Vision, pages 1154–1160, 1998.

[158] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Seattle, June 1994.

[159] N. Shimada, Y. Shirai, Y. Kuno, and J. Miura. Hand Gesture Estimation and Model Refinement Using Monocular Camera – Ambiguity Limitation by Inequality Constraints. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 268–273, April 1998.

[160] B. Shneiderman. Direct Manipulation and Virtual Environments. In Designing the User Interface: Strategies for Effective Human-Computer Interaction, chapter 6. Addison-Wesley, 3rd edition, March 1998.

[161] T. Starner, J. Auxier, D. Ashbrook, and M. Gandy. The Gesture Pendant: A Self-illuminating, Wearable, Infrared Computer Vision System for Home Automation Control and Medical Monitoring. In International Symposium on Wearable Computers, 2000.

[162] T. Starner, S. Mann, B. Rhodes, J. Healey, K. B. Russell, J. Levine, and A. Pentland. Wearable Computing and Augmented Reality. Technical report, MIT Media Lab, Vision and Modeling Group, November 1995.

[163] T. E. Starner, J. Weaver, and A. Pentland. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375, December 1998.

[164] A. State, G. Hirota, D. Chen, W. Garrett, and M. Livingston. Superior augmented reality registration by integrating landmark tracking and magnetic tracking. In Proceedings of SIGGRAPH, pages 439–446, August 1996.

[165] B. Stenger, P. R. S. Mendonça, and R. Cipolla. Model-Based 3D Tracking of an Articulated Hand. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 310–315, December 2001.

[166] J. Ström, T. Jebara, S. Basu, and A. Pentland. Real Time Tracking and Modeling of Faces: An EKF-based Analysis by Synthesis Approach. In ICCV, 1999.

[167] D. J. Sturman and D. Zeltzer. A Design Method for "Whole-Hand" Human-Computer Interaction. ACM Transactions on Information Systems, 11(3):219–238, July 1993.

[168] Z. Szalavári and M. Gervautz. The personal interaction panel – a two-handed interface for augmented reality. In Proc. 18th Eurographics, Eurographics Assoc., pages 335–346, 1997.

[169] R. M. Taylor II, T. C. Hudson, A. Seeger, H. Weber, J. Juliano, and A. T. Helser. VRPN: A Device-Independent, Network-Transparent VR Peripheral System. In VRST, 2001.

[170] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla. Shape Context and Chamfer Matching in Cluttered Scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 127–133, Madison, USA, June 2003.

[171] B. Thomas, B. Close, J. Donoghue, J. Squires, P. De Bondi, M. Morris, and W. Piekarski. ARQuake: An Outdoor/Indoor Augmented Reality First-Person Application. In Proc. of the Fourth International Symposium on Wearable Computers, pages 139–146, October 2000.

[172] B. H. Thomas and W. Piekarski. Glove Based User Interaction Techniques for Augmented Reality in an Outdoor Environment. Virtual Reality: Research, Development, and Applications, 6(3), 2002.

[173] M. Toews and T. Arbel. Entropy-of-likelihood Feature Selection for Image Correspondence. In Proc. Intl. Conference on Computer Vision, October 2003.

[174] C. Tomasi, A. Rafii, and I. Torunoglu. Full-Size Projection Keyboard for Handheld Devices. Communications of the ACM, 46(7):70–75, July 2003.

[175] H. L. Van Trees. Detection, Estimation, and Modulation Theory, volume 1. Wiley, 1968.

[176] M. Turk. Gesture recognition. In K. Stanney, editor, Handbook of Virtual Environments: Design, Implementation and Applications. Lawrence Erlbaum Associates Inc., December 2001.

[177] M. Turk. Computer Vision in the Interface. Communications of the ACM, 47(1):60–67, 2004.

[178] M. Turk and A. Pentland. Eigenfaces for Recognition. J. Cognitive Neuroscience, 3(1):71–86, 1991.

[179] P. Viola and M. Jones. Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade. In Neural Information Processing Systems, December 2001.

[180] P. Viola and M. Jones. Robust Real-time Object Detection. In Intl. Workshop on Statistical and Computational Theories of Vision, July 2001.

[181] C. von Hardenberg and F. Bérard. Bare-hand human-computer interaction. In Perceptual User Interfaces, 2001.

[182] G. Welch, G. Bishop, L. Vicci, S. Brumback, K. Keller, and D. Colucci. The HiBall Tracker: High-Performance Wide-Area Tracking for Virtual and Augmented Environments. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST), December 1999.

[183] S. F. Wiker, G. D. Langolf, and D. B. Chaffin. Arm Posture and Human Movement Capability. Human Factors, 31(4):421–441, 1989.

[184] A. Wilson and S. Shafer. XWand: UI for Intelligent Spaces. In ACM CHI, 2003.

[185] W. E. Woodson, B. Tillman, and P. Tillman. Human Factors Design Handbook. McGraw-Hill Professional, 2nd edition, 1992.

[186] C. R. Wren and A. P. Pentland. Dynamic Models of Human Motion. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 22–27. IEEE Computer Society, April 1998.

[187] Y. Wu and T. S. Huang. Vision-based gesture recognition: A review. In A. Braffort, R. Gherbi, S. Gibet, J. Richardson, and D. Teil, editors, Gesture-Based Communication in Human-Computer Interaction, volume 1739 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin Heidelberg, 1999.

[188] Y. Wu and T. S. Huang. View-independent Recognition of Hand Postures. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 84–94, 2000.

[189] Y. Wu and T. S. Huang. Hand Modeling, Analysis, and Recognition. IEEE Signal Processing Magazine, May 2001.

[190] M.-H. Yang, D. J. Kriegman, and N. Ahuja. Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34–58, January 2002.

[191] S. You, U. Neumann, and R. Azuma. Orientation Tracking for Outdoor Augmented Reality Registration. IEEE Computer Graphics and Applications, 19(6):36–42, November/December 1999.

[192] S. J. Young. HTK: Hidden Markov Model Toolkit V1.5, December 1993. Entropic Research Laboratories Inc.

[193] B. D. Zarit, B. J. Super, and F. K. H. Quek. Comparison of Five Color Models in Skin Pixel Classification. In Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pages 58–63, September 1999.

[194] Z. Zhang, M. Li, S. Li, and H. Zhang. Multi-View Face Detection with FloatBoost. In Proc. IEEE Workshop on Applications of Computer Vision, 2002.

[195] Q. Zhu, K.-T. Cheng, C.-T. Wu, and Y.-L. Wu. Adaptive Learning of an Accurate Skin-Color Model. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 2004.

[196] X. Zhu, J. Yang, and A. Waibel. Segmenting Hands of Arbitrary Color. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 2000.

[197] Y. Zhu, H. Ren, G. Xu, and X. Lin. Toward Real-Time Human-Computer Interaction with Continuous Dynamic Hand Gestures. In Proceedings of the Conference on Automatic Face and Gesture Recognition, pages 544–549, 2000.
