
UNIVERSITY OF CALIFORNIA
Santa Barbara

Vision Based Hand Gesture Interfaces for
Wearable Computing and Virtual Environments

A Dissertation submitted in partial satisfaction
of the requirements for the degree of
Doctor of Philosophy
in
Computer Science

by

Mathias Kölsch

Committee in Charge:
Professor Matthew Turk, Chair
Ronald T. Azuma, Ph.D.
Professor Andrew C. Beall
Professor Keith C. Clarke
Professor Tobias Höllerer
Professor Yuan-Fang Wang

September 2004


The Dissertation of Mathias Kölsch is approved:

Ronald T. Azuma, Ph.D.
Professor Andrew C. Beall
Professor Keith C. Clarke
Professor Tobias Höllerer
Professor Yuan-Fang Wang
Professor Matthew Turk, Committee Chairperson

August 2004


Vision Based Hand Gesture Interfaces for
Wearable Computing and Virtual Environments

Copyright © 2004
by
Mathias Kölsch


To my parents Ute and Otto Kölsch,
as a humble sign of my gratitude for their
love, their sacrifices for me, and for
supporting my sometimes inscrutable paths.


Acknowledgements

You would not hold this document in your hands had it not been for the direct and indirect help of many, many people who shared their love, inspiration, time, advice, and many other essentials with me. It would fill countless more pages than the technical content of this dissertation to name all of them and their contributions. However, I would like to thank in particular:

My advisor Matthew for the freedom he allowed me in my studies and the opportunities he presented me with, for his advice in uncountable respects and situations; my committee members Andy for critical thinking, his humor, and the awesome hot chocolate on a windy night; Keith for his excitement for my work; Ron for his time, the many tips for graduate life, and for sharing his thoughts on how (not) to write dissertation acknowledgements; Tobias for his enthusiasm, involvement, and dedication; and Yuan-Fang Wang for his insights, especially at an early stage of my thesis work. Further, I want to thank Professor Tichy, Klaus, and Urs for their inspiring teaching; Jerry for an awesome internship and how it shaped my career in computer science; Juli and the entire Computer Science staff for super-friendly assistance, especially when I was forgetful or burst into the office at the last minute; our equally friendly and helpful computer support staff Richard, Andy, Andreas, and Jeff; Anurag for supporting me through my first years at UCSB; my lab colleagues Arun, Changbo, Haiying, James, Jason, Jeff, Lihua, Rogerio, Ryan B., Ryan G., Seb, Steve, Vineet, and Ya for much help and many a good research discussion.

My deep and sincere gratitude goes to:

My parents and their unconditional love and support; Steffen for putting up with his brother and still baking him delicious cookies; Schorsch, Markus, and Frank for their continued friendship; Kai for being the first computer geek I met and for helping me with my German; Chris for all that she taught me; S. Schwenk for telling me about remote islands; Simone for the motivation to cross the Atlantic; Christoph B. for a diverse time in Karlsruhe; Corneliu for his wit and wisdom; Kris for lots of computer help, many trips, delicious Italian dinners, and our special friendship; Alex, Claudia, and Gunnar for infinite hospitality; Elizabeth and Mark for all I learned from them; Wine Wednesday and its followers for weekly joys since October 1999 and Matthew Allen for taking the reins after three years; Radu and Karin for being my friends; Leonie for her inspiration, unconventionalism, and mostly for being herself; Mike L. for winning the most-tolerant-neighbor prize; Todd and Krista for the comfort of their company; Christa-Lynn for happy camping and all those other Canadian qualities; my hiking and mountaineering friends for allowing me to experience the "freedom of the hills;" Egle for her honesty and musical pleasures; Eric for his incredible energy and motivation; and last but not least, to the East Beach volleyball players for great workouts with such a welcoming group of people!

Thank you! Mathias.


Curriculum Vitæ
Mathias Kölsch

Education

2003  Master of Science in Computer Science, University of California, Santa Barbara.
1997  Bachelor of Science in Informatik (Vordiplom and one year of graduate studies), Universität Karlsruhe, Germany.
1993  Abitur, Kurfürst Ruprecht Gymnasium, Neustadt an der Weinstraße, Germany.

Experience

1999 – 2004  Graduate Research Assistant, University of California, Santa Barbara.
1997 – 2001  Teaching Assistant, University of California, Santa Barbara.
1996 – 1997  Teaching Assistant, Universität Karlsruhe, Germany.

Selected Publications

Mathias Kölsch, Matthew Turk, and Tobias Höllerer: "Vision-Based Interfaces for Mobility," In Proc. IEEE Intl. Conference on Mobile and Ubiquitous Systems (Mobiquitous), August 2004.

Mathias Kölsch and Matthew Turk: "Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration," In Proc. IEEE Workshop on Real-Time Vision for Human-Computer Interaction (at CVPR), July 2004.

Mathias Kölsch and Matthew Turk: "Robust Hand Detection," In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, May 2004.

Mathias Kölsch, Andrew C. Beall, and Matthew Turk: "An Objective Measure for Postural Comfort," In Human Factors and Ergonomics Society's Annual Meeting Notes, October 2003.


Abstract

Vision Based Hand Gesture Interfaces for
Wearable Computing and Virtual Environments

by

Mathias Kölsch

Current user interfaces are unsuited to harness the full power of computers. Mobile devices like cell phones and technologies such as virtual reality demand a richer set of interaction modalities to overcome situational constraints and to fully leverage human expressiveness. Hand gesture recognition lets humans use their most versatile instrument – their hands – in more natural and effective ways than currently possible. While most gesture recognition gear is cumbersome and expensive, gesture recognition with computer vision is non-invasive and more flexible. Yet, it faces difficulties due to the hand's complexity, lighting conditions, background artifacts, and user differences.

The contributions of this dissertation have helped to make computer vision a viable technology to implement hand gesture recognition for user interface purposes. To begin with, we investigated arm postures in front of the human body in order to avoid anthropometrically unfavorable gestures and to establish a "comfort zone" in which humans prefer to operate their hands.

The dissertation's main contribution is "HandVu," a computer vision system that recognizes hand gestures in real time. To achieve this, it was necessary to advance the reliability of hand detection to allow for robust system initialization in most environments and lighting conditions. After initialization, a "Flock of Features" exploits optical flow and color information to track the hand's location despite rapid movements and concurrent finger articulations. Lastly, robust appearance-based recognition of key hand configurations completes HandVu and facilitates input of discrete commands to applications.

We demonstrate the feasibility of computer vision as the sole input modality to a wearable computer, providing "deviceless" interaction capabilities. We also present new and improved interaction techniques in the context of a multimodal interface to a mobile augmented reality system. HandVu allows us to exploit hand gesture capabilities that have previously been untapped, for example, in areas where data gloves are not a viable option.

This dissertation's goal is to contribute to the mosaic of available interface modalities and to widen the human-computer interface channel. Leveraging more of our expressiveness and our physical abilities offers new and advantageous ways to communicate with machines.


Zusammenfassung

Visual Hand Gesture Recognition as an Interface for
Wearable Computers and Virtual Worlds

Mathias Kölsch

With today's prevailing human-machine interfaces, neither human expressiveness nor the potential of modern computers can fully unfold. Wearable computers and unconventional technologies in particular, such as mobile phones and virtual realities, demand a more diverse set of interaction modalities in order to overcome situational constraints and to exploit the full bandwidth of human expressiveness.

Computer-based hand gesture recognition gives people the ability to use their most flexible tool – the hand – in a more natural and effective way than has so far been possible. Most devices with which gesture recognition can be realized, however, are cumbersome to use and expensive. Gesture recognition by means of computer vision, in contrast, is more flexible and non-invasive. It faces, however, the difficulties inherent to vision methods, above all owing to interpersonal variance, the complexity of hand shape and hand motion, varying lighting conditions, and complex backgrounds.

The contributions of this dissertation serve to make computer vision a practicable implementation technology for hand-based user interfaces. First, manipulative arm postures in front of the body were investigated in order to avoid anthropometrically unfavorable hand gestures and to establish a comfortable radius of action.

As the main contribution, "HandVu" was developed: a computer vision system that is able to recognize hand gestures in real time. To achieve this, it was necessary to improve the reliability of hand detection so that robust initialization of the system is guaranteed in the most diverse environments and lighting situations. The "Flock of Features" is a novel method for tracking hands. After initialization, color information and optical flow are used to follow the position of the hand despite fast global movements and simultaneous finger motion. Robust, appearance-based recognition of certain key gestures completes HandVu and thus permits the interpretation of discrete hand commands.

Using the techniques named above, this work further demonstrates ways to employ hand gestures, recognized by computer vision, as the sole input modality for a wearable computer, thereby providing the user with a "deviceless" means of input. In combination with a multimodal interface, new and improved techniques are also presented for interacting with a system for mixed real-virtual reality. HandVu thus grants access to aspects of hand gestures that were previously inaccessible – for example, in areas where data gloves are unsuitable.

The goal of this dissertation is to add further modalities to the mosaic available for user interfaces and thereby to widen the human-machine communication channel. The prospect of being able to employ more of one's expressiveness and physical abilities opens up new and advantageous ways for humans to communicate with machines.


Contents

Acknowledgements
Curriculum Vitæ
Abstract
Zusammenfassung
List of Figures
List of Tables

1 Introduction
  1.1 User interfaces: bottleneck of proliferation
  1.2 Hand gesture interfaces
  1.3 Problem statement
  1.4 Key contributions
  1.5 Dissertation overview

2 Literature Review
  2.1 Hand gestures
  2.2 Human factors and biomechanics
    2.2.1 Postural comfort
  2.3 Computer vision
    2.3.1 Detection
    2.3.2 Tracking
    2.3.3 Recognition
    2.3.4 Skin color
    2.3.5 Shapes and contours
    2.3.6 Motion flow
    2.3.7 Texture and appearance based methods
    2.3.8 Viola-Jones detection method
    2.3.9 Temporal tracking and filtering
    2.3.10 Higher-level models
    2.3.11 Temporal gesture recognition
  2.4 User interfaces and gestures
    2.4.1 Gesture-based user interfaces
    2.4.2 Vision-based interfaces
  2.5 Virtual environments and applications
    2.5.1 Virtual environments and GISs
    2.5.2 Vision-based interfaces for virtual environments
    2.5.3 Mobile interfaces

3 Hand Gestures in the Human Context
  3.1 Postural comfort
    3.1.1 Operational definition of comfort
  3.2 The comfort zone for reaching gestures
    3.2.1 Method and design
    3.2.2 Participants
    3.2.3 Materials and apparatus
    3.2.4 Procedure
    3.2.5 Instructions to participants
    3.2.6 Results
  3.3 Discussion
    3.3.1 The meaning of comfort
    3.3.2 Comfort results and related work
    3.3.3 Miscellaneous
    3.3.4 Open issues
  3.4 Conclusions

4 HandVu: A Computer Vision System for Hand Interfaces
  4.1 Hardware setup
  4.2 Vision system overview
    4.2.1 Core gesture recognition module
    4.2.2 Area-selective exposure control
    4.2.3 Speed and size scalability
    4.2.4 Correction for camera lens distortion
    4.2.5 Application programming interface
    4.2.6 Verbosity overlays
    4.2.7 HandVu WinTk: video pipeline and toolkit
    4.2.8 Recognition state distribution
    4.2.9 The vision conductor configuration file
  4.3 Vision system performance
  4.4 Delimitation

5 Hand Detection
  5.1 Data collection
  5.2 Parallel training with MPI
  5.3 Classification potential of various postures
    5.3.1 Estimation with frequency spectrum analysis
    5.3.2 Predictor accuracy
  5.4 Effect of template resolution
  5.5 Rotational robustness
    5.5.1 Rotation baseline
    5.5.2 Problem: rotational sensitivity
    5.5.3 Rotation bounds for undiminished performance
    5.5.4 Rotation density of training data
    5.5.5 Rotations of other postures
    5.5.6 Discussion
  5.6 A new feature type
    5.6.1 Four Box feature instance generation
    5.6.2 Four Box Same feature type
    5.6.3 Results
  5.7 Fixed color histogram
  5.8 Hand pixel probability maps
  5.9 Learned color distribution
  5.10 Discussion

6 Tracking of Articulated Objects
  6.1 Preliminary studies
  6.2 Flocks of Features
    6.2.1 KLT features and tracking initialization
    6.2.2 Flocking behavior
    6.2.3 Color modality and multi-cue integration
  6.3 Experiments
    6.3.1 Video sequences
  6.4 Results
    6.4.1 Comparison to CamShift
    6.4.2 Parameter optimizations
  6.5 Discussion
  6.6 Two-handed tracking and temporal filters

7 Posture Recognition
  7.1 Fanned detection for classification
  7.2 Data collection for evaluation
  7.3 Results
    7.3.1 Accuracy
    7.3.2 Speed
    7.3.3 Questionnaire
  7.4 Discussion and conclusions

8 Hand Gestures in Application
  8.1 Application overview and contributions
  8.2 The case for external interfaces
  8.3 Hand gesture interaction techniques
  8.4 Feedback
  8.5 Battuta: a wearable GIS
    8.5.1 The gesture interface
    8.5.2 Benefits of HandVu for Battuta
  8.6 Vision-only interface for mobility
    8.6.1 Functionality of the Maintenance Application
    8.6.2 Benefits of HandVu for mobility
  8.7 A multimodal augmented reality interface
    8.7.1 System description
    8.7.2 The Tunnel Tool and other visualizations
    8.7.3 Speech recognition
    8.7.4 Interacting with the visualized invisible
    8.7.5 Multimodal integration
    8.7.6 Benefits of HandVu for powerful interfaces
  8.8 Conclusions

9 The Future in Your Hands
  9.1 Recapitulation
  9.2 Limitations
    9.2.1 Limits of hand gesture interfaces
    9.2.2 Limits of vision
  9.3 Next-generation computer interfaces
  9.4 Conclusions

Bibliography


List of Figures

2.1 The rectangular feature types for Viola-Jones detectors
2.2 The structure of the hand
3.1 Plan view of the experiment setup
3.2 A participant performing the skill task
3.3 Mean and standard deviation of body movement
3.4 Hand-to-shoulder distance over body movement
3.5 The comfort ratings in front of the human body
4.1 Our mobile user interface in action
4.2 Arrangement of the computer vision methods
4.3 A screen capture with verbose output turned on
4.4 The algorithm for area-selective software exposure control
4.5 The vision module in the application context
5.1 Sample areas of the six hand postures
5.2 Mean hand appearances and their Fourier transforms
5.3 ROC curves for monolithic classifiers
5.4 ROC curves for different template resolutions
5.5 ROC curves for various training data rotations
5.6 ROC curves showing the rotational sensitivity
5.7 ROC curves for detection of randomly rotated images
5.8 ROC curves for detection of discrete-rotated images
5.9 ROC curves for the bounds of training with rotated images
5.10 ROC curves for different rotation steps
5.11 The six hand postures and rotated images
5.12 Overall gain of training with rotated images
5.13 The Four Box and Four Box Same feature types
5.14 Example instances of the Four Box Same feature type
5.15 ROC curves for detectors with Four Box Same features
5.16 The probability maps for six hand postures
5.17 The areas for learning the skin color model
6.1 Tracking a hand with an Active Shape Model
6.2 The Flock of Features in action
6.3 The Flock of Features tracking algorithm
6.4 Images taken during tracking
6.5 Results of tracking with Flocks of Features
6.6 Contributors towards the Flock of Features' performance
6.7 Tracking with different numbers of features
6.8 Tracking with different search window sizes
7.1 Fanned arrangement of partial detectors
7.2 The data collection for evaluating the posture recognition
7.3 Sample images for evaluation of the posture recognition
8.1 Interaction area size versus interface device size
8.2 The map display of the Battuta wearable GIS
8.3 Image of pointer-based interaction
8.4 Image of two-handed interaction
8.5 Image of location-independent interaction
8.6 Selecting from many items with registered manipulation
8.7 An overview of the hardware components
8.8 Schematic view of the Tunnel Tool
8.9 A hand-worn trackball


List of Tables

2.1 Measuring various degrees of comfort
5.1 The hand image data collection
6.1 The video sequences and their characteristics
7.1 Summary of the recognition results
8.1 The mapping of input to effect


Chapter 1

Introduction

First we thought the PC was a calculator. Then we found out how to turn numbers into letters with ASCII – we thought it was a typewriter. Then we discovered graphics, and we thought it was a television. With the World Wide Web, we've realized it's a brochure.

Douglas Adams (1952–2001)

What else can computers do for us? How can we tap into more of their vast resources? Which applications are we overlooking because of a limiting view of computers?

1.1 User interfaces: bottleneck of proliferation

Current user interfaces are unsuited to harness the full power of computers. Keyboards and desktop displays – the most prevalent interfaces – offer only a narrow bridge across the barrier between brain and circuitry. 3D applications and wearable computers such as cell phones are in particular need of more natural and effective means of interaction. In fact, the limitations of current human-computer interfaces hinder expansion of the computer's abilities to serve humans in many aspects of life. The goal of this dissertation is to add to the mosaic of available interface modalities and, thereby, to widen the human-computer interface channel. Leveraging more of our expressiveness and our physical abilities in particular offers new and advantageous ways to communicate with machines.

1.2 Hand gesture interfaces

Hand gesture interfaces are computer input methods that utilize the 3-dimensional location of a person's hand, its orientation, and its posture (the finger configuration). These perceptual interfaces promise to blend input and output spaces, allowing for unmodulated interactions between real and virtual objects. Gesture interfaces have many uses, for example, for live manipulation of objects in virtual environments, for automated video transcription of user studies, as an aid for people with special requirements, for contactless interfaces in antiseptic environments, or for character animation through motion capture.
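One way to picture the data such an interface delivers per video frame is a small record holding exactly these three quantities. The sketch below is only an illustration with hypothetical names and units; it is not part of any particular system's API.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Tuple

class Posture(Enum):
    """Illustrative finger configurations; the actual set is application-defined."""
    OPEN_PALM = auto()
    FIST = auto()
    POINTING = auto()
    UNKNOWN = auto()

@dataclass
class HandObservation:
    position: Tuple[float, float, float]     # 3D hand location, e.g. in camera coordinates (meters)
    orientation: Tuple[float, float, float]  # e.g. roll, pitch, yaw of the palm (radians)
    posture: Posture                         # discrete finger configuration
    timestamp: float                         # capture time in seconds
```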

Hand gesture interfaces through means of computer vision are particularly advantageous from a user's point of view because they are untethered, no gloves need be worn, and they can be deployed anywhere a camera can be taken. Also, for example, while the interaction surface of a conventional keyboard or touch screen cannot be larger than the hardware, the input space observable with a camera is not limited by the camera's form factor. This property of "device-external interfaces" is of increasing importance to ever-shrinking portable electronics. However, dealing with person-specific variations, cluttered backgrounds, changing lighting conditions, camera specifics, rapid relative motion, and hard real-time requirements has so far prevented vision-based interfaces (VBIs) from achieving robustness and usability in settings other than the lab environment. Recently, Sony's EyeToy (a full-body VBI for the PlayStation 2) and Canesta's virtual keyboard (type on a keyboard projected onto any flat surface) have proven that consumer-grade applications have become feasible. Yet, they are still very limited in their interaction capabilities, and the keyboard requires customized hardware.

3


Chapter 1. Introduction<br />

Thesis Statement<br />

This dissertation introduces <strong>and</strong> evaluates novel <strong>and</strong> improved computer<br />

vision methods that facilitate robust, user-independent h<strong>and</strong> detection,<br />

h<strong>and</strong> tracking, <strong>and</strong> posture recognition in real-time. It demonstrates<br />

that computer vision is a feasible means to provide h<strong>and</strong> gesture<br />

interfaces. <strong>Wearable</strong> computers <strong>and</strong> non-traditional environments such<br />

as augmented reality benefit from the enriched interaction modalities.<br />

The remainder of this chapter motivates the problem setting further, states our<br />

main contributions to overcome those problems, <strong>and</strong> finally provides an overview<br />

of the dissertation organization.<br />

1.3 Problem statement

Hand gestures are very powerful human interface components, yet their expressiveness has not been leveraged for computer input. No currently available technology is able to recognize these gestures in a convenient and easily accessible manner. Computer vision promises to achieve that, but even its most advanced methods are either too fragile or deliver too coarse-grained output to be of universal use for hand gesture recognition. In particular, methods for hand gesture interfaces must surpass current performance in terms of speed and robustness to achieve interactivity and usability. No reliable hand detection methods exist, and recognition of finger configurations fails due to the high degrees of freedom of the hand, presenting insurmountable complexity if approached with traditional modeling means.

1.4 Key contributions<br />

This dissertation’s first contribution is the establishment of a com<strong>for</strong>table in-<br />

teraction range <strong>for</strong> h<strong>and</strong> gesture interfaces. Since the field of human factors lacked<br />

the means to objectively assess the subtle feeling of postural discom<strong>for</strong>t, a defini-<br />

tion of postural com<strong>for</strong>t had to be introduced that allows precise quantification,<br />

without the need <strong>for</strong> subjective questionnaires. Equipped with this definition, a<br />

user study was conducted to chart the range of com<strong>for</strong>table h<strong>and</strong> positions in<br />

front of a st<strong>and</strong>ing person at about stomach height, an important consideration<br />

<strong>for</strong> free-h<strong>and</strong> gesture interfaces. Chapter 3 covers com<strong>for</strong>t in greater detail <strong>and</strong><br />

explains how these results contribute towards the remainder of this dissertation.<br />

Next, "HandVu" (pronounced "hand-view") was built, the first vision-based interface (VBI) that allows real-time detection of the unmarked hand, hand tracking, and posture recognition while being mostly invariant to changes in background, lighting, camera, and person. The very fast texture-based hand detection and posture classification methods operate on unconstrained monocular grey-level images. Together with color-based verification, the detector has a very low false positive rate of a few false matches per hour of live processing, indoors and outdoors.
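HandVu's detector is not reproduced here, but the two-stage idea can be sketched with off-the-shelf pieces: a boosted cascade classifier (in the spirit of Viola-Jones) scans the grey-level image, and candidate windows are accepted only if enough of their pixels pass a coarse skin-color test. In the sketch below the cascade file name, the HSV thresholds, and the 40% skin criterion are assumptions for illustration only, not values from this dissertation.

```python
import cv2
import numpy as np

# Hypothetical cascade trained on a hand posture (not shipped with OpenCV).
cascade = cv2.CascadeClassifier("hand_posture_cascade.xml")

def skin_fraction(bgr_roi, lo=(0, 40, 60), hi=(25, 180, 255)):
    """Fraction of ROI pixels falling into a coarse HSV skin range (assumed thresholds)."""
    hsv = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lo, np.uint8), np.array(hi, np.uint8))
    return float(np.count_nonzero(mask)) / mask.size

def detect_hand(frame_bgr, min_skin=0.4):
    """Texture-based detection followed by color-based verification."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4):
        if skin_fraction(frame_bgr[y:y+h, x:x+w]) >= min_skin:
            return (x, y, w, h)  # accept only texture hits that also look skin-colored
    return None
```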

The "Flock of Features," a novel hand tracking method, also employs multiple image cues – globally constrained optical flow in combination with a dynamic skin color probability function – to surpass the performance of single-modality trackers such as CamShift on deformable objects. The method is robust to most distractions: it operates indoors and outdoors, with different people, and despite dynamic backgrounds and camera motion. It requires no object model and thus might be applicable to tracking other very deformable and articulated objects such as human bodies. Its computation time of 2-18 ms per 720x480 RGB frame leaves room for the subsequent hand posture recognition.
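As a rough illustration of the multi-cue idea (not HandVu's implementation), the sketch below tracks a small flock of features with OpenCV's pyramidal Lucas-Kanade routine and relocates any feature that is lost or strays too far from the flock's median onto a nearby skin-colored pixel. The window size, spread threshold, and externally supplied skin mask are assumptions, and the pairwise minimum-distance constraint and the learned color model of Chapter 6 are omitted.

```python
import cv2
import numpy as np

LK = dict(winSize=(15, 15), maxLevel=2)  # pyramidal Lucas-Kanade parameters (placeholders)

def track_flock(prev_gray, gray, pts, skin_mask, max_spread=60.0):
    """One Flock-of-Features-style update (illustrative sketch, not HandVu's code).

    pts: (N, 2) float32 feature positions; skin_mask: binary image of skin-colored pixels.
    """
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, pts.reshape(-1, 1, 2), None, **LK)
    new_pts, ok = new_pts.reshape(-1, 2), status.ravel() == 1
    median = np.median(new_pts[ok] if ok.any() else new_pts, axis=0)

    # Candidate relocation targets: skin pixels within the allowed spread of the median.
    ys, xs = np.nonzero(skin_mask)
    cand = np.stack([xs, ys], axis=1).astype(np.float32) if len(xs) else np.empty((0, 2))
    near = cand[np.linalg.norm(cand - median, axis=1) < max_spread]

    for i in range(len(new_pts)):
        strayed = np.linalg.norm(new_pts[i] - median) > max_spread
        if (not ok[i] or strayed) and len(near):
            # A lost or straying feature is moved back onto skin near the flock's
            # median, which keeps the flock loosely clustered over the hand.
            new_pts[i] = near[np.random.randint(len(near))]
    return new_pts.astype(np.float32), median
```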

These results for detection and tracking provide an important step towards easily deployable and robust vision-based hand gesture interfaces. Further, a qualitative measure is presented that amounts to an a priori estimate of "detectability" with Viola-Jones-like detectors, alleviating the need for compute-intensive training by yielding results immediately instead.

HandVu was successfully demonstrated controlling two wearable computer systems. The first system features a head-worn camera and display, demonstrating the VBI's capability to act as the sole input modality. The other is a registered Augmented Reality system that adds speech recognition and a custom control for multimodal input, permitting efficient operation of its complex functionality. The VBI was used by different people and with both a stationary and a head-worn camera. HandVu has evolved into a software toolkit that allows out-of-the-box use of vision-based hand gesture recognition as an interface modality.

This vision-based interface offers novel and unencumbered data acquisition capabilities that are important for new functionalities, new devices, and new ways to interact with computers. The contributions of this dissertation have shown that VBIs are ready to be taken out of the lab into real applications. Further progress on topics that bring together the fields of computer vision, human-computer interaction, and graphics is expected to harbor many opportunities for the computer of the future.

1.5 Dissertation overview

The remainder of this document is organized as follows. First, the literature related to the dissertation is discussed in the next chapter. In Chapter 3, we investigate the human factors and biomechanical sides of hand gestures, in particular the physical ranges for hand gestures.

Computer vision as an enabling technology for hand gesture recognition is covered in Chapters 4 through 7. First, the HandVu system for interface implementation is introduced in Chapter 4. It presents the HandVu library's API and how applications can make use of the out-of-the-box gesture recognition toolkit. The functional division into the three subsequent chapters is motivated in that chapter as well. Our robust hand detection method is discussed in Chapter 5. The fast "Flock of Features" tracking method is introduced and evaluated in Chapter 6. As the last purely vision-concerned part, Chapter 7 explains our hand posture classifier and presents evaluation results.

Our experiences with putting HandVu to use are detailed in Chapter 8. The vision-based hand gesture interface controlled three different applications, two of them wearable computer systems. The last chapter, Chapter 9, relates our work to a larger context, provides an outlook on its potential and implications, and concludes the dissertation.


Chapter 2

Literature Review

This chapter discusses the most relevant research and publications pertaining to this dissertation. The path that this literature review takes follows user interface (UI) construction from theory to practice. First, a definition and some possible classifications of gestures are given. Next, research is described that concerns biomechanical and human factors issues of hand gestures. The lion's share of this chapter deals with computer vision (CV) methods and their suitability for implementing user interfaces. Last but not least, two promising application areas for hand gesture-based user interfaces are covered: the non-traditional realms of augmented environments as well as mobile and wearable computing.

2.1 Hand gestures

In its most general meaning, a gesture is any physical configuration of the body, whether the person is aware of it or not, whether performed with the entire body or just the facial muscles, whether static in nature or involving a movement. In the computer vision literature, gesture usually refers to a continuous, dynamic motion, whereas a posture is a static configuration. In this dissertation, the term gesture pertains to static and dynamic, continuous hand gestures. Only when discussing computer vision methods will 'posture' be used to explicitly address the static aspects of hand gestures. For example, posture classification refers to the estimation of finger configurations, that is, the ability to distinguish a fist from a flat palm and so on.

Kendon [83] describes different kinds of gestures along what has become known as Kendon's Gesture Continuum:

gesticulation → language-like gestures → pantomimes → emblems → sign languages

From left to right, "language-like properties" increase and the presence of speech decreases. McNeill [115] proposes a typology of strongly speech-related "gesticulation" gestures based on semiotic properties, that is, properties that pertain to the symbolics of the gesture. This typology is usually preferred over other classification schemes owing to its practical value and close relation to the accompanying speech. Language and gestures are also the topic of a comprehensive McNeill-edited publication [116]. A classification from Quek [136] focuses on non-speech-related gestures from Kendon's Gesture Continuum and adds unintentional movements and manipulative hand motions aside from communicative ones. Cadoz [23] distinguishes further between ergodic gestures (gestural manipulation; "ergodic" is a mathematical term meaning space-filling) and epistemic gestures (tactile exploration).

According to McNeill [115], gestures are composed of three stages, namely pre-stroke, stroke, and post-stroke. The pre-stroke prepares the movement of the hand. The hand waits in this ready state until the speech arrives at the point when the stroke is to be delivered. The stroke is often characterized by a peak in the hand's velocity and distance from the body. The hand is retracted during the post-stroke, but this phase is frequently omitted or strongly influenced by the following gesture.
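A recognition system that wants to exploit this three-phase structure could, for example, segment a hand trajectory by its speed profile, treating the fastest contiguous portion as the stroke. The toy sketch below only illustrates that idea; the threshold and the single-stroke assumption are arbitrary choices, not taken from this dissertation.

```python
import numpy as np

def label_phases(positions, fps=30.0, speed_thresh=0.5):
    """Label each frame of a hand trajectory as pre-stroke, stroke, or post-stroke.

    positions: (N, 2) array of hand centers per frame; speed_thresh is in
    position units per second (both the rule and the threshold are illustrative).
    """
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1) * fps
    speed = np.concatenate([[0.0], speed])
    fast = speed > speed_thresh
    if not fast.any():
        return ["pre-stroke"] * len(positions)  # no stroke detected
    first = int(np.argmax(fast))                         # first fast frame
    last = len(fast) - 1 - int(np.argmax(fast[::-1]))    # last fast frame
    return (["pre-stroke"] * first
            + ["stroke"] * (last - first + 1)
            + ["post-stroke"] * (len(fast) - last - 1))
```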

2.2 Human factors and biomechanics

The field of human factors is the study of the human body's physical and mental properties and abilities. Biomechanics is a subfield that is exclusively concerned with the body as a kinematic, mechanical structure. Research in these areas has established guidelines for occupational motions and postures that humans can assume without harm. Good introductions to workspace design can be found in several comprehensive reference manuals [25, 148, 185].

A number of researchers have investigated the range of arm and hand motions in great detail: Chaffin [24] devised a method for measuring fatigue with physiological indicators, and Wiker et al. [183] surveyed arm movement capabilities. Grandjean [51] states that the best grasping distance is about two thirds of the maximum reaching distance. The best reaching height with an upright upper body is around stomach or elbow height (with hanging arms). In addition to the work considering reachability and comfort, other topics relevant to evaluating space as an interaction area include the temporal characteristics of reaching movements, as well as the precision and accuracy of such movements. Fitts' Law [44], and more recent work such as Fikes [43], show that a good interaction range is also characterized by the time it takes to reach the various locations in this area. Specific to computer input of spatial data is a survey by Hinckley et al. [58].
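For reference, the Shannon formulation of Fitts' Law commonly used in HCI predicts movement time from target distance and width with two empirically fitted constants; this formula is standard background rather than a result of the works cited above.

```latex
% Fitts' Law (Shannon formulation): predicted movement time MT for acquiring
% a target of width W at distance D; a and b are fitted per device and task.
MT = a + b \,\log_2\!\left(\frac{D}{W} + 1\right)
```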

2.2.1 Postural comfort

Discomfort and fatigue are two closely related feelings. Discomfort is generally assumed to be a subjective quantity, whereas fatigue can be measured objectively (see below). Bhatnagar et al. [9], Karwowski et al. [78], and Liao et al. [104] showed that the frequency of non-work-related posture shifts is strongly related to the perceived discomfort. They measured discomfort exclusively with questionnaires given to the participants during and after the study. The standard format for whole-body fatigue assessment is Borg's Rate of Perceived Exertion, or RPE scale [12]. Borg found a logarithmic relationship between the power achieved on a treadmill and the participant's subjective exertion. The RPE is obtained with a questionnaire on which participants mark their rate of exertion on a scale of 6-20, with some numbers having a descriptive name, ranging from 6, "least effort," over 13, "somewhat hard," to 20, "maximal effort." The RPE scale is sometimes used to assess discomfort as well, especially since the boundary between discomfort and fatigue is not clearly defined. For assessment of discomfort of specific body parts, the Body Part Discomfort (BPD) scale by Corlett et al. [31] is used. With BPD, participants are asked to indicate on a body chart the body parts that are experiencing the highest degree of discomfort and to rate the severity of that discomfort. These body parts are then covered up on the chart and the next most uncomfortable body parts have to be identified, and so forth, until no more body parts experience discomfort. Drury et al. [41] added BPD Frequency and BPD Severity to the evaluation method for these questionnaires. BPD Frequency describes the total number of body parts with some discomfort, and BPD Severity measures the mean severity of discomfort over only those body parts with discomfort.
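To make the two derived measures concrete, the small helper below computes BPD Frequency and BPD Severity from a dictionary of questionnaire ratings; the data format and example numbers are invented for illustration and are not taken from [41].

```python
def bpd_frequency_and_severity(ratings):
    """Compute BPD Frequency and BPD Severity from questionnaire ratings.

    ratings: dict mapping body part name -> discomfort severity (0 = no discomfort).
    Frequency counts body parts with any discomfort; Severity averages the
    ratings over those parts only (illustrative data format).
    """
    uncomfortable = [s for s in ratings.values() if s > 0]
    frequency = len(uncomfortable)
    severity = sum(uncomfortable) / frequency if frequency else 0.0
    return frequency, severity

# Example: three of the four reported body parts show discomfort.
freq, sev = bpd_frequency_and_severity(
    {"neck": 2, "right shoulder": 3, "lower back": 1, "left wrist": 0})
# freq == 3, sev == 2.0
```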

Fatigue can be objectively observed by degraded performance (speed and accuracy) in task execution. Fatigue might also be measurable directly by means of superficial sensors for voltage potentials. Electromyographic (EMG) amplitudes and frequencies reflect the nervous contraction signals to the muscles. Chaffin [24] suggests that localized muscular fatigue (LMF), measured by a frequency shift towards lower frequency bands, is correlated with experienced discomfort. Other studies, however, found little or even inverse correlation between objective EMG frequency shifts and subjective fatigue [122].

Table 2.1: Measuring various degrees of comfort. The table shows which feelings along the comfort dimension can be measured, and how. BPD is Body Part Discomfort, RPE the Rate of Perceived Exertion, and EMG are electromyographic signals. Our work is detailed in Section 3.1.

  measure     | comfort               | discomfort (unaware) | discomfort (aware) | fatigue
  direct      | no                    | no                   | BPD (RPE)          | EMG
  indirect    | no                    | our work             | posture shifts     | RPE, performance
  objective   | no                    | our work             | posture shifts     | EMG, performance
  subjective  | absence of discomfort |                      | BPD (RPE)          | RPE

The characteristics of the methods to assess feelings along the "comfort dimension" are summarized in Table 2.1. For example, EMG is a direct and objective measure for fatigue, and BPD is a subjective method for direct ascertainment of aware discomfort. "Direct" refers to measurement methods that can detect the feeling as such, while "indirect" methods have to rely on measuring a secondary effect caused by the feeling. From this chart it is apparent that no traditional method can assess very subtle feelings of discomfort, particularly not with objective measurements. Our work, detailed in Chapter 3, tries to fill this void.


Kee [82] built a human-physics model in order to compute an "isocomfort workspace," surfaces of uniform discomfort based on perceived discomfort. With Chung et al. [26], he had earlier collected subjective discomfort data which can now be used for purely observational discomfort estimation, without the need for additional questionnaires. For this method, the combination of discomfort scores for the joint angles of various limbs results in a full-body discomfort estimation, hinting at postural workload.

2.3 Computer vision

There is an extensive body of related computer vision research which could fill many books. Here, we summarize the major works that could fit the bill for real-time user interface operation through hand gesture recognition in a fairly unconstrained environment. For an independent overview, the reader is referred to a 1998 paper by Freeman et al. that surveys "computer vision for interactive computer graphics" [47] and to an assessment of the state of the art by Turk [177]. The next three sections (2.3.1–2.3.3) mention seminal works in their functional contexts, while the remaining sections (2.3.4–2.3.11) cover specific approaches in more detail.


Three common tasks for computer vision processing are 1) the detection of the presence of an object in an image, 2) the spatial tracking of a once-acquired object over time, and 3) recognition of one of many object types, that is, classification of the observation in the image into one of many classes. The vision system described in Chapter 4 is organized in these three stages, and this review starts with a brief overview of them.

2.3.1 Detection

The human visual system has the amazing ability to detect hands in almost any configuration and situation, and possibly a single "hand neuron" is responsible for recording and signaling such an event, as Desimone et al. showed in early neuroscientific experiments [35]. Computer vision research has not yet achieved this goal. However, it is crucial that a hand that is supposed to function as an input mechanism to the computer is robustly and reliably detected in front of arbitrary backgrounds, because all further stages and functionality depend on it. Detection of artificial objects, such as the colored or blinking sticks of Wilson and Shafer [184], can achieve very high detection rates together with low false positive rates. The same is not true for faces, and even less so for hands. Face detection has attracted a great amount of interest, and many methods relying on shape, texture, and/or temporal information have been thoroughly investigated over the years. The interested reader is referred to two surveys, by Yang, Kriegman, and Ahuja [190] and by Hjelmås and Low [60].

Little work has been done on finding hands in grey-level images based on their appearance and texture. Wu and Huang [188] surveyed a number of methods for their suitability to hand detection. Very recently, boosted classifiers have achieved compelling results for view- and posture-independent hand detection, as Ong and Bowden demonstrated [124]. However, most hand detection methods resort to less object-specific approaches and instead employ color information (see, for example, Zhu et al. [196]), sometimes in combination with location priors (for example, Kurata et al. [100]), motion flow (see Cutler and Turk [34]), or background differencing (for example, Segen and Kumar [152]). The focus of the methods developed for this dissertation is on reliable detection without placing severe constraints on the scenery. Such methods must use the maximum amount of information available in the image, combining texture and color cues for reliability.

2.3.2 Tracking

If the detection method is fast enough to operate at the image acquisition frame rate, it can be used for tracking as well. However, tracking hands is notoriously difficult since they can move very fast and their appearance can change vastly within a few frames. Mostly rigid objects, on the other hand, can be tracked with very limited shape modeling effort. Some of the most effective head trackers, for example, use a fixed elliptical shape model, which is fast and sufficient for the rigid head structure, as in Birchfield [10]. Similarly, more or less rigid hand models work well for a few select hand configurations and relatively static lighting conditions (see Isard and Blake [66]).

Since tracking with a rigid appearance model is not possible for hands in general, most approaches resort to shape-free color information or background differencing, as in the mentioned works by Cutler and Turk [34], Kurata et al. [100], and Segen and Kumar [152]. These methods are vulnerable to unimodal failure modes, caused, for example, by a non-stationary camera or by other skin-colored objects in the neighborhood. In Chapter 6, we describe the Flock of Features, our multimodal technique that can overcome these vulnerabilities. Other multi-cue methods integrate, for example, texture and color information and can then recognize and track a small number of fixed shapes despite arbitrary backgrounds (for example, Bretzner et al. [19]). Shan et al. [155] also integrate color information with motion data. In their work, a particle filtering method is optimized for speed with deterministic mean shift, and dynamic weights determine the blend of color with motion data: the faster the object moves, the more weight is given to the motion data, and slower object movements result in the color cue being weighted higher. Some of their performance is surely due to a simple yet usually effective dynamical model (of the object velocity), which we could add to our method as well. Depth information combined with color, as in Grange et al. [52], also yields a robust hand tracker, yet stereo cameras are more expensive and cumbersome than the single imaging device required for our monocular approach.

Object segmentation based on optical flow (for example, normalized graph cuts as proposed by Shi and Malik [157]) can produce good results for tracking objects that exhibit a limited amount of deformation during global motions and thus have a fairly uniform flow [137]. The Flock of Features method relaxes this constraint and can track despite concurrent articulation and location changes.

2.3.3 Recognition

Recognizing or distinguishing different hand configurations is, in its generality, a very difficult and largely unsolved problem. First attempts have recently been made by Ong and Bowden [124] with fairly good results. However, to meet the more stringent requirements of user interface quality, robustness and a low false positive rate are more important than recognition of the complete hand configuration space from the entire view sphere. The posture recognition task becomes more tractable for a few select postures from fixed views. Segen and Kumar's Shadow Gestures [153] demonstrated how heavy constraints on the scenery make computer vision a viable user interface implementation modality: they require a calibrated point light source, and the hand has to be at a certain distance from the camera and from a bright, even background. Even then, only three gestures are recognized and distinguished reliably. Other, traditional methods based on edge detection can achieve fairly good results, as Athitsos and Sclaroff [1] demonstrate. Wu and Huang [188] show good classification performance but require extensive processing time.

We use a method (described in Chapter 7) similar to Ong and Bowden's but are able to achieve the real-time performance that is necessary for user interfaces. Additionally, the result of our classification is a known posture, not a similarity to a cluster that was learned without supervision and can thus contain syntactically similar but semantically different hand postures.

2.3.4 Skin color

That compelling results can be achieved merely with skin color properties was shown early on, for example, by Schiele and Waibel, who used skin color in combination with a neural network to estimate gaze direction [151]. Kjeldsen and Kender [85] demonstrate interface-quality hand gesture recognition solely with color segmentation means. Their method uses an HSV-like color space, which is debatably beneficial to skin color identification.2

2 HSV stands for Hue, Saturation, and Value. This color space separates the chrominance (hue and saturation) of light from its brightness (value).


The appearance of skin color varies mostly in intensity while the chrominance remains fairly consistent (see Saxe and Foulds [150]). Therefore, and according to Zarit et al. [193], color spaces that separate intensity from chrominance are better suited to skin segmentation when simple threshold-based segmentation is used. However, their results are based on only a few images, while Jones and Rehg [73] examined a huge number of images and found excellent classification performance with a histogram-based method in RGB color space.3 It seems that very simple threshold methods or other linear filters achieve better results in HSV space, while more complex methods, particularly learning-based, nonlinear models, excel in any color space.

3 RGB stands for Red, Green, and Blue. Most digital cameras natively provide images in this format.

Jones and Rehg also state that Gaussian mixture models are inferior to histogram-based approaches. This is true as long as a large enough training set is available; otherwise, Gaussians can fill in for insufficient training data and achieve better classification results. Bradski [14] and Comaniciu et al. [28] showed that object tracking based on color information is possible with a method called CamShift, which is based on the mean shift algorithm. These methods dynamically slide a "color window" along the color probability distribution to parameterize a thresholding segmentation, and a certain amount of lighting change can be dealt with. Patches or blobs of uniform color have also been used, especially in fairly controlled scenes (for example, by Wren and Pentland [186]). Zhu et al. [195] achieve excellent segmentation with dynamic adaptation of the skin color model based on the observed image.
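As an illustration of the color-window idea, the following sketch builds a hue histogram from a region presumed to contain skin and lets OpenCV's CamShift follow its back-projection from frame to frame. It is a minimal approximation of the cited approaches, not their implementations, and the masking thresholds are assumptions.

    # Color-window tracking sketch using OpenCV's CamShift. The initial
    # region (x, y, w, h) is assumed to contain skin.
    import cv2
    import numpy as np

    def make_hue_histogram(bgr_frame, roi):
        x, y, w, h = roi
        hsv = cv2.cvtColor(bgr_frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
        # Ignore very dark or unsaturated pixels, where hue is unreliable.
        mask = cv2.inRange(hsv, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
        hist = cv2.calcHist([hsv], [0], mask, [180], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        return hist

    def track_step(bgr_frame, hist, track_window):
        hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
        # Back-projection: per-pixel skin-color probability under the histogram.
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        # CamShift slides and resizes the search window toward the mode
        # of the probability image.
        rotated_box, track_window = cv2.CamShift(backproj, track_window, term)
        return rotated_box, track_window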

2.3.5 Shapes and contours

In theory, the contour or silhouette of an object reveals a lot about its shape and orientation. If perfect segmentation is possible, similarity based on curve matching is a viable approach to object classification, for example, based on polar coordinates as in Hamada et al. [53]. One can benefit even more from curve descriptors that are invariant to scale variation and rigid transformations, such as those by Gdalyahu and Weinshall [50] and Shape Context descriptors [8]. For less-than-perfect conditions, however, more robust 2D methods must be employed. Those usually bank on finding enough local clues in the image to place a shape model close to where the most likely incarnation of this shape can be found in the image data. Iterative methods frequently try to minimize an energy term defined by image misalignment (distance from an edge) plus a second energy component defined by shape deformation. For example, snakes are "rubber bands" that are attracted to strong gradients in gray-level images (see Kass et al. [79]). Active Shape Models (ASM, introduced by Cootes and Taylor [30]) are rubber bands that, without image constraints, default to shapes other than the round ones snakes would assume. Those shapes are learned from training data. The approach is based on PCA,4 which extracts "modes of variation" from the data. For a hand in top view, these modes could conceptually be the flexing of each finger. ASM's iterative matching method then tries to find the lowest-energy deformation of the shape while achieving the lowest image mismatch energy. The limitations of ASMs are partially due to the PCA, which does not prevent creation of shapes that are implausible according to the training data, especially for silhouettes of highly articulated objects. All pure shape models also require very good initialization in close proximity to the object in question, or background artifacts will make good registration difficult. Statistical models of an object's 3D shape, often called "point clouds," can also be built (as did, for example, Heap and Hogg [56]), but they shall not be further surveyed since they exhibit limited speed performance.

4 For an explanation of principal component analysis (PCA) see Section 2.3.7.

Conventional contour- or silhouette-based methods stand and fall with the quality of the segmentation or edge finding methods. They are sensitive to background variation and – when used for tracking – they are generally unable to recover from tracking loss. Particle filtering methods (see Section 2.3.9) have built-in tolerance for temporary false positives and recover from tracking errors more easily. This comes at the cost of higher computational requirements. Also, temporal data is usually part of the modeled state, whereas conventional shape models have no notion of time.

Athitsos and Sclaroff [1] took an increasingly popular approach and had their recognition method learn from rendered hand images instead of from actual photographs. During testing, edge data between the observation and the learned database are compared, and 3D hand configurations can be estimated from 2D grey-level edges. According to their paper, matching takes less than a second for an approximate result, but too long for interactive frame rates. Thayananthan et al. [170] detect hands in distinctive postures despite cluttered backgrounds with chamfer matching. The chamfer distance between two curves or contours is the mean of the distances between each point on one curve and its closest point on the other curve.
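A direct transcription of this definition follows; the symmetric variant that averages both directions is a common convention but an assumption here.

    # Chamfer distance as defined above: mean distance from each point on
    # one contour to its nearest point on the other.
    import numpy as np

    def chamfer(curve_a, curve_b):
        """curve_a: (N, 2) contour points, curve_b: (M, 2) contour points."""
        diffs = curve_a[:, None, :] - curve_b[None, :, :]   # (N, M, 2) point pairs
        dists = np.sqrt((diffs ** 2).sum(axis=2))           # pairwise distances
        return dists.min(axis=1).mean()                     # nearest neighbor, averaged

    def symmetric_chamfer(curve_a, curve_b):
        return 0.5 * (chamfer(curve_a, curve_b) + chamfer(curve_b, curve_a))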

2.3.6 Motion flow

The motion field of a frame specifies for every sample point the direction and distance with which it moved in image coordinates from one frame to the next. Optical flow and motion fields (see Barron et al. [6]) can serve two main purposes in hand gesture interfaces. First, if the motion field for the entire image is computed, regions of uniform flow (in direction and speed) can aid segmentation and attract other methods to areas of interest. This is particularly effective for setups with static camera positions, where motion blobs direct other processing stages to attention areas: regions in the image of likely locations for the desired object (Cutler and Turk [34], Cui and Weng [33], Freeman et al. [47]). Background differencing assumes a scene composed of a fairly static background and a more vivid foreground; it can be seen as a specialized motion flow model in which additional information about some of the objects in the scene is used. While these methods can achieve good results for stationary cameras, they are unsuited to moving cameras and to large depth differences between the objects that make up the background.
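For illustration, background differencing in its simplest form can be sketched as follows, assuming a static camera and a previously stored background frame; the threshold value is arbitrary.

    # Background differencing: pixels that differ enough from a stored
    # background frame are labeled foreground (e.g., a moving hand).
    import cv2

    def foreground_mask(frame_gray, background_gray, thresh=25):
        diff = cv2.absdiff(frame_gray, background_gray)
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        return mask   # 255 where the scene changed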

Second, motion flow can be used to track a few select image features, usually corners or other areas with high image gradients. KLT features – named after their inventors Kanade, Lucas, and Tomasi – can be matched efficiently to similar regions in the following frame (see "good features to track" in Shi and Tomasi [158] and pyramid-based matching in Lucas and Kanade [110]). For example, Quek [137] used this technique in his often-cited gesture recognition paper. An improvement on the feature selection method was recently proposed by Toews and Arbel [173], but it has not yet been evaluated for its effect on practical tracking performance. Feature tracking of course has its limitations, due to the constancy assumption (no change in appearance from frame to frame), the match window size (aperture problem), and the search window size (speed of the moving object, computational effort). Yet our Flock of Features method, introduced in Chapter 6, is able to capitalize on the benefits and makes up for some deficiencies by integration of color, a second image cue.
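A minimal sketch of such KLT-style feature tracking with OpenCV is given below. It illustrates only the generic corner selection and pyramidal matching steps, with assumed parameter values; it is not the Flock of Features method of Chapter 6, which adds a skin-color cue and flocking constraints on top of such features.

    # Generic KLT-style tracking: pick corners worth matching, then match
    # them frame to frame with pyramidal Lucas-Kanade.
    import cv2

    def init_features(gray, mask=None, max_corners=50):
        # Shi-Tomasi "good features to track": high-gradient patches.
        return cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=5,
                                       mask=mask)

    def track_features(prev_gray, gray, prev_pts):
        # Pyramidal Lucas-Kanade matching, coarse to fine.
        next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, prev_pts, None, winSize=(15, 15), maxLevel=2)
        good = status.ravel() == 1
        return next_pts[good], prev_pts[good]   # keep only surviving matches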

2.3.7 Texture and appearance based methods

Information in the image domain must play an important role in every object recognition or tracking method. This information is extracted to form image features: higher-level descriptions of the observations. The features' degree of abstraction and the scale of what they describe (small, local image artifacts or large, global impressions) have a strong influence on a method's characteristics. Features built from local, small-scale image information such as steep gray-level gradients are more sensitive to noise, they need a good spatial initialization, and, if all but the simplest objects are to be detected, a frequently large collection of many features is mandatory (see also Section 2.3.5, Shapes and contours). Once the features are extracted from the image, they need to be brought into context with each other, oftentimes involving an iterative and computationally expensive method.

If instead the features are composed of many more pixels, cover a larger region in the image, and abstract to more complex visuals, the dependent methods are usually better able to deal with clutter and might flatten the hierarchy of processing levels, since these features already contain much more information than smaller-scale features. These methods focus on an object's appearance, which is a description of its color and brightness properties at every point as it appears to the observer. Since these attributes are view-dependent, it only makes sense to talk about appearance from a given view. Appearance is further caused by texture, surface structure, and lighting. The appearance of deformable and articulated objects is naturally also caused by the object's current form.

One of the most influential procedures uses a set of training images and the Karhunen-Loève transform [77, 108]. This transformation is an orthogonal basis transformation (or "rotation") of the training space that maximizes sample variance along the new basis vectors; it is frequently known in the computer vision literature as principal component analysis (PCA, see Jolliffe [71]) or singular value decomposition (SVD). In their seminal work, Turk and Pentland applied this method to perform face detection and recognition and called it Eigenfaces [178], building on work by Kirby and Sirovich [84] on image representation of faces. Matching an observation to a PCA-based appearance model is computationally expensive. Active Appearance Models (AAM, see Cootes et al. [29]) encode shape and appearance information in one model, built in a two-step process with PCAs. Matching to an AAM can be done with a relatively fast, iterative approximation method, allowing for some real-time applications, in face recognition for example. Direct Appearance Models, proposed by Hou et al. [64], exploit an observation that is true at least for face images: that the shape is entirely determined by the appearance. Thus, their method avoids some of the complexity of AAMs.
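The core of these PCA-based appearance models can be sketched in a few lines. The following is an illustrative reduction of the Eigenfaces idea (vectorized images, SVD of the centered data, nearest neighbor in the subspace), not the cited implementations; the array shapes and the choice of k are assumptions.

    # PCA appearance model sketch: learn an orthogonal basis of the training
    # images, then match by nearest neighbor in the reduced space.
    import numpy as np

    def fit_pca(images, k):
        """images: (n_samples, n_pixels) array of vectorized training images."""
        mean = images.mean(axis=0)
        # SVD of the centered data; rows of vt maximize sample variance.
        _u, _s, vt = np.linalg.svd(images - mean, full_matrices=False)
        return mean, vt[:k]

    def project(image, mean, basis):
        return basis @ (image - mean)           # k appearance coefficients

    def match(image, gallery_coeffs, mean, basis):
        """Index of the nearest training sample in the PCA subspace."""
        c = project(image, mean, basis)
        return int(np.argmin(np.linalg.norm(gallery_coeffs - c, axis=1)))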

Further ways to learn and then test for common appearances of objects make use of neural networks. However, their performance limits (in terms of speed and accuracy) seem to have been surpassed by various other approaches. The following section describes a particular method that overcomes most of the speed problems associated with spatially large features. A more complete review of appearance-based methods for detection and recognition of patterns (faces, in fact) can be found in Yang et al.'s survey on face detection [190].

2.3.8 Viola-Jones detection method

This section describes the object detection method that is the basis for our robust hand detector and for the posture classification method. Proposed by Viola and Jones [180], this extremely fast and almost arbitrarily accurate approach has taken the vision community by storm, and a number of extension and application papers have shown its advantages for detection and recognition tasks.

Very simple features based on intensity comparisons between rectangular image areas serve as weak classifiers. A weak classifier may have just better-than-guessing ability to separate the two classes. AdaBoost [48] then combines many weak classifiers into a strong classifier: given a set of positive and negative training samples, it iteratively selects the best weak classifier from a usually large pool of possible weak classifiers. The best classifier achieves the smallest number of misclassifications – the smallest training error. The samples are then weighted according to the classification result, such that an incorrectly classified sample gets a larger weight and correctly classified ones obtain smaller weights. In subsequent iterations, the quality of the weak classifiers is evaluated on the weighted samples, the now-best weak classifier is picked, the samples are re-weighted, and so on. A strong classifier is a linear combination of the selected weak classifiers, with scalar coefficients that correspond to each classifier's training error.
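The training loop just described can be sketched as follows. Weak classifiers are modeled abstractly as callables returning labels in {+1, -1}, and exhaustive selection from a fixed pool is a simplification of Viola and Jones' feature selection, not their implementation.

    # Discrete AdaBoost sketch following the description above.
    import numpy as np

    def adaboost(weak_pool, X, y, rounds):
        """X: samples, y: labels in {+1, -1}, weak_pool: list of h(X) -> {+1, -1}."""
        n = len(y)
        w = np.full(n, 1.0 / n)                      # uniform sample weights
        strong = []                                  # (coefficient, weak classifier)
        for _ in range(rounds):
            errors = [np.sum(w * (h(X) != y)) for h in weak_pool]
            best = int(np.argmin(errors))            # smallest weighted training error
            err = max(errors[best], 1e-10)
            alpha = 0.5 * np.log((1.0 - err) / err)  # coefficient from training error
            pred = weak_pool[best](X)
            w *= np.exp(-alpha * y * pred)           # misclassified samples gain weight
            w /= w.sum()
            strong.append((alpha, weak_pool[best]))
        return strong

    def classify(strong, X):
        # Strong classifier: sign of the weighted vote of its weak classifiers.
        return np.sign(sum(alpha * h(X) for alpha, h in strong))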

Freund and Schapire show that asymptotically perfect classification is possible if enough training samples are available. The number of weak classifiers that form one good strong classifier is usually in the hundreds. In their application to pattern recognition, Viola and Jones arrange a number of these strong classifiers in sequence to form a "cascade." Lazy cascade evaluation during detection allows early elimination of negative image areas, resulting in high speed performance. A typical number of cascade stages is on the order of tens. The detector cascades achieve excellent detection rates on face images with a low false positive rate.

The second major contribution of Viola and Jones' paper [180] is a clever technique for constant-time computation of the feature values: a single-pass precomputation step creates an "integral image," better known as a data cube in the database community. With it, the pixel values in arbitrarily large rectangular areas can be summed up with three additions. This is crucial, as other techniques usually require time proportional to the area of the rectangle. Some of these "rectangle features" are depicted in Figure 2.1. Due to the exhaustive-search component used to find the best weak classifier at each AdaBoost iteration, training a cascade takes on the order of 24 hours on a 30+ node PC cluster.

Figure 2.1: The rectangular feature types for Viola-Jones detectors.
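A minimal sketch of the integral image and of a constant-time rectangle sum follows; the two-rectangle feature shown is one illustrative instance of the feature types in Figure 2.1, not the exact definition used by Viola and Jones.

    # Integral image: one cumulative-sum pass, after which any rectangle sum
    # needs only three additions/subtractions of corner values.
    import numpy as np

    def integral_image(gray):
        # Zero-padded so border rectangles need no special cases.
        ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
        ii[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)
        return ii

    def rect_sum(ii, x, y, w, h):
        """Sum of pixels in the rectangle with top-left corner (x, y), size w x h."""
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

    def two_rect_feature(ii, x, y, w, h):
        # Left half minus right half of the window: responds to vertical edges.
        return rect_sum(ii, x, y, w // 2, h) - rect_sum(ii, x + w // 2, y, w - w // 2, h)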

At detection time, the entire image is scanned at multiple scales. For example, a template of size 25x25 pixels, swept across a 640x480 image pixel by pixel, then enlarged in size by 25%, swept again, enlarged, swept, and so on, yields 355,614 classifications. Every stage of the cascade has to classify an area as positive for an overall positive match. This lazy, successive cascade evaluation, together with the rectangular features' constant-time property, allows the detector to run fast enough for the low-latency requirements of real-time object detection.

Overall, the method's accuracy and speed performance, as well as its sole reliance on grey-level images, make it very attractive for hand detection.


Boosting, and particularly Viola and Jones' method, has become one of the most actively researched areas in computer vision. For example, Pavlovic and Garg [129] showed applications to skin color segmentation and to face detection based on other features. Jones and Viola extended their work to multi-view face detection [72]. The three feature types shown in Figure 2.1 were not sufficient to accurately separate the classes, so they introduced a feature type based on four rectangular areas. For hand detection, we conceived a similar feature type that is, however, even more expressive because it is not restricted to adjacent rectangular areas.

Viola and Jones [179] also addressed the issue of asymmetrical training data: usually, there are many more negative examples than positive ones. Yet the cost for misclassification of either of them is equal in AdaBoost, which does not reflect the importance of the few positive examples. In AsymBoost, they propose a way to dynamically and regularly adjust the weights such that positive examples have an increased importance. Their results show a reduction in false positive rates by up to a factor of two for a given detection rate.

Zhang et al. [194] also tackled multi-view face detection. They improve the core AdaBoost with a method they call FloatBoost. During training, weak classifiers are not only added but are also removed from a strong classifier if they no longer contribute progressively to its performance. This reduces the number of required feature evaluations during detection and/or improves the accuracy of the detector.

Lienhart and Maydt [105] show that not only grid-aligned rectangular features can be computed in constant time, but that diagonal, diamond-shaped arrangements are equally possible. Rectangular areas rotated by 45 degrees can be summed up in constant time; however, the precomputation step requires two passes over the image versus one for the original integral image, not counting creation of the squared integral image that is required for constant-time normalization.

A template is a knowledge-based description of the typical components of an object. Components can be locations of prominent intensity levels and their spatial relationship, image symmetries, certain frequency ranges in Fourier-transformed images, and so on. Machine vision and image processing often use image moments to describe an object template. Schweitzer et al. [7] extended the idea of integral images to compute moments of any order in constant time, again with one or several precomputation steps.

Last but certainly not least, Ong and Bowden's posture- and view-independent hand (shape) detector [124] must be mentioned, as it is the closest to our hand posture recognition method. With the aid of unsupervised training, they automatically group thousands of training images into 300 clusters based on their shape appearance (contour), using k-medoid clustering. In a first stage during detection, a generic hand detector finds hands in any pose and posture; a second stage performs the classification into one of the 300 shape clusters. Different from our recognition method, their second stage has no influence on whether an area is considered a hand or not. This is likely to involve redundant classifier work, causing a speed penalty that might be prohibitive for user interface requirements (no computation times are stated in their paper). Also, their clusters have no immediate relation to a hand posture, as they are solely contour-based. Our classes directly correspond to one known posture each.

2.3.9 Temporal tracking and filtering

This section concerns methods that go beyond motion flow, that is, methods that track at the object level, not at the pixel or pattern level. While our vision system currently does not include a dynamic model of hand motion, adding one would likely improve its performance significantly. Such a dynamic filter can easily be implemented at the application layer.

Kalman filtering [75] takes advantage of the smooth movement of the object of interest. At every frame the filter predicts the object's location based on the previous motion history, collapsed into the state at the prior frame and the error covariances. The image matching is initialized with this prediction, and once the object is found, the Kalman parameters are adjusted according to the prediction error.
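The predict/correct cycle can be sketched for a constant-velocity model of the 2D hand position as follows; the state layout and the noise covariances are illustrative assumptions, not those of any cited system.

    # Constant-velocity Kalman filter for a 2D position, state (x, y, vx, vy).
    import numpy as np

    dt = 1.0                                   # one frame
    F = np.array([[1, 0, dt, 0],               # state transition: x += vx*dt, ...
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],                # only the position is observed
                  [0, 1, 0, 0]], dtype=float)
    Q = np.eye(4) * 0.01                       # process noise
    R = np.eye(2) * 4.0                        # measurement noise

    def predict(x, P):
        return F @ x, F @ P @ F.T + Q          # predicted state and covariance

    def correct(x_pred, P_pred, z):
        innovation = z - H @ x_pred            # prediction error
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
        x = x_pred + K @ innovation
        P = (np.eye(4) - K @ H) @ P_pred
        return x, P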

One of the limitations of Kalman filtering is the underlying assumption of a single Gaussian probability. If this assumption does not hold, and the probability function is essentially multi-modal, as is the case for scenes with cluttered backgrounds, Kalman filtering cannot cope with the non-Gaussian observations. The particle filtering or factored sampling method, often called Condensation (conditional density propagation) tracking, makes no implicit assumption of a particular probability function but rather represents it with a set of sample points into this function. Thus, irregular functions with multiple "peaks" – corresponding to multiple hypotheses for object states – can be handled without violating the method's assumptions. Factored sampling, introduced to the vision community by Isard and Blake [67], has been applied with great success to tracking various fast-moving, fixed-shape objects in very complex scenes (see Laptev and Lindeberg [101], Deutscher et al. [36], and Bretzner et al. [19]). How multiple models, one for each typical motion pattern, can improve tracking is shown in another paper by Isard and Blake [66]. Partitioned sampling, introduced by MacCormick and Isard [111], is a technique that reduces the complexity of particle filters. The modeled domain is usually a feature vector that combines shape-describing elements, such as the coefficients of B-splines, with temporal elements such as the object velocities.
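A single Condensation-style iteration can be sketched as follows; the trivial diffusion dynamics and the externally supplied likelihood function are placeholders for the dynamic and measurement models of the cited systems.

    # One particle-filter step: the posterior over the object state is a set
    # of weighted samples, so several hypotheses survive cluttered observations.
    import numpy as np

    def condensation_step(particles, weights, likelihood, motion_noise=2.0):
        """particles: (N, d) states; weights: (N,) summing to 1; likelihood: state -> score."""
        n = len(particles)
        # 1) Factored sampling: draw particles proportionally to their weights.
        idx = np.random.choice(n, size=n, p=weights)
        particles = particles[idx]
        # 2) Predict: diffuse each hypothesis with a (here trivial) dynamic model.
        particles = particles + np.random.normal(scale=motion_noise, size=particles.shape)
        # 3) Measure: reweight by how well each hypothesis explains the image.
        weights = np.array([likelihood(p) for p in particles], dtype=float) + 1e-12
        weights /= weights.sum()
        return particles, weights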


2.3.10 Higher-level models

While our method makes no attempt at estimating the articulation of the hand's phalanges, this topic is still covered here due to its close relation to our work. Also, logical extensions of our work certainly include systems for full state estimation of complex hand models.

Higher-level models describe properties of the tracked objects that are not immediately visible in the image domain. For hands, there are structural, anatomical models and kinematic models. Both are 3D models that have explicit knowledge about the hand's physique. For example, knowledge about the limbs and the fact that they can occlude each other can help avoid problematic situations for the vision algorithms, as Rehg and Kanade showed in [141, 142]. A kinematic model furthermore describes interactions between the limbs; see, for example, Lee and Kunii [102].

Anatomically, the hand is a connection of 18 elements: the 5 fingers with 3 elements each, the thumb-proximal part of the palm, and the two parts of the palm that extend from the pinky and ring fingers to the wrist (see Figure 2.2). The 17 joints that connect the elements have one, two, or three degrees of freedom (DOF). There are a total of 23 DOF, but for computer vision (CV) purposes the joints inside the palm are frequently ignored, as is the rotational DOF of the trapeziometacarpal joint. In addition to these 20 DOF, the hand reference frame has 6 DOF (location and orientation). See Braffort et al. [15] for an exemplary anthropometric hand model. A hand configuration is a point in this 20-dimensional configuration space.

[Figure 2.2 diagram: hand skeleton showing the distal, middle, and proximal phalanges, the metacarpals, carpals, radius, and ulna, with the joints labeled as listed in the caption below.]

Figure 2.2: The structure of the hand: the joints and their degrees of freedom are: distal interphalangeal joints (DIP, 1 DOF), proximal interphalangeal joints (PIP, 1 DOF), metacarpophalangeal joints (MCP, 2 DOF), metacarpocarpal joints (MCC, 1 DOF for pinky and ring fingers), the thumb's interphalangeal joint (IP, 1 DOF), the thumb's metacarpophalangeal joint (MP, 1 DOF), and the thumb's trapeziometacarpal joint (TMC, 3 DOF).

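For bookkeeping, the reduced 20-DOF articulation model just described can be written out explicitly. The joint naming follows Figure 2.2, while the grouping into a dictionary is our own illustration.

    # Reduced articulation model: 20 DOF for the posture plus a 6-DOF
    # global hand frame (location and orientation).
    FINGERS = ("index", "middle", "ring", "pinky")
    REDUCED_DOF = {
        "thumb TMC": 2,   # trapeziometacarpal; its rotational DOF is ignored here
        "thumb MP": 1,
        "thumb IP": 1,
        **{f"{f} MCP": 2 for f in FINGERS},
        **{f"{f} PIP": 1 for f in FINGERS},
        **{f"{f} DIP": 1 for f in FINGERS},
    }
    ARTICULATION_DOF = sum(REDUCED_DOF.values())   # 20-D posture configuration space
    GLOBAL_POSE_DOF = 6
    assert ARTICULATION_DOF == 20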

The difficulties of searching such a high-dimensional space are obvious. Lin, Wu, and Huang [106] suggest limiting the search to the interesting subspace of natural hand configurations and motions. Type I constraints limit the extent of the space by considering only anatomically possible joint angles for each joint (see also earlier work by Lee and Kunii [102]). Type II constraints reduce the dimensionality by assuming direct correlation between DIP and PIP flexion. Type III constraints limit the extent of the space again by eliminating generally impossible configurations and unlikely transitions between configurations. Altogether, and after a PCA, a dimensionality reduction to seven dimensions was achieved while retaining 95% of all configurations observed in their experiments. Wu and Huang also published a good high-level survey of the state of the art of hand modeling, analysis, and recognition [189].

Bray et al. [18] introduce a special gradient descent method called "Stochastic Meta Descent." It has adaptive step sizes so that it is unlikely to get stuck in a local minimum. Together with an anthropometric model and stereo video, they achieve good results in 3D hand tracking. The processing time is not short enough for real-time deployment (4.7 seconds per frame on a 1 GHz Sunfire). Another paper by the same authors [17] combines the Stochastic Meta Descent with a particle filtering method, thus making it easier for the deterministic component to march out of local minima. This improvement is bought with increased computation time.

Analysis by synthesis describes the estimation of model parameters by back-projection of the model into the image domain, then iteratively adjusting it toward the closest match between back-projection and observation (see, for example, Kameda et al. [76], Shimada [159], and Ström et al. [166]). These methods lack the capability to deal with singularities that arise from ambiguous views (see Morris and Rehg [118]). When more complex hand models are used that allow methods from projective geometry to generate the synthesis, self-occlusion is again modeled and can thus be dealt with. Stenger et al. [165], for example, use quadrics to model the hand geometry and show that good tracking of a few hand postures with a fairly unambiguous contour is possible. They use an "unscented Kalman filter" (UKF) to align the model with the observations. The UKF (introduced by Julier et al. [74]) improves on the widely used Extended Kalman filter (EKF) for nonlinear systems by not assuming an underlying Gaussian distribution of the parameters, by being faster to implement and to apply, and by achieving superior performance due to a more accurate estimation of the covariance matrix. It samples the distribution with a few, well-selected samples which are propagated through every estimation step. One of the drawbacks of all Kalman-based filters, however, and the reason why the proposed hand posture tracking method will most likely not generalize to more complex postures, is that these filters assume a unimodal distribution. This assumption is most likely violated by complex, articulated objects such as the hand. It remains to be noted that the UKF is faster than particle filtering, but that the combination with the quadrics representation slows the approach down below real-time performance.


Lu et al. [109] also employ an articulated hand model, parameterized with image edges, optical flow, and shading information, to estimate a hand posture in 3D. In an earlier system, Nolker and Ritter [120] start from the fingertip locations (found with prior work) and deduce the possible 3D hand postures. Under the hood, a neural network parameterizes an articulated hand model.

2.3.11 Temporal gesture recognition

When it comes to modeling and extracting the temporal movements of the hands, general physics-based motion models are called for. Kalman filtering in combination with a skeletal model can easily resolve simple occlusion ambiguities, as Wren and Pentland [186] demonstrated. Dynamic gesture recognition, that is, recognition of the continuity aspects of gestures, and especially their semantic interpretation, is tangential to this dissertation. Readily available mathematical tools can easily and independently be applied to the data produced by the proposed methods, should the need arise. Therefore, this part of the related work description is kept very brief.

Many methods in use are borrowed from the more evolved field of speech recognition due to the similarity of the domains: temporal and noisy. Hidden Markov Models (HMMs, see Bunke and Caelli [21] and Young [192]) are frequently employed to dissect and recognize gestures due to their suitability for processing temporal data. More discrete approaches also perform well at detecting spatio-temporal patterns; see work by Hong et al. [63]. The learning methods of traditional HMMs cannot model some structural properties of moving limbs very well.5 Brand [16] uses another learning procedure to train HMMs that overcomes these problems: it allows for estimation of 3D model configurations from a series of 2D silhouettes and achieves excellent results. The advantage of this approach is that no knowledge has to be hard-coded; instead, everything is learned from training data. This of course has its drawbacks when it comes to recognizing previously unseen motion.

With much more modeling effort, strong enough priors can be placed on the observations, and simple motions can be recognized even though they are not close to the training data. DiFranco et al. [38] showed that – when combined with manually annotated key frames – even very complex movements can be extracted successfully. They model the body as a kinematic chain and add a model for link dynamics plus angular restrictions on the joints.

5 Brand notes that the traditional learning methods are not well suited to modeling state transitions since they do not improve much on the initial guess about connectivity. Estimating the structure of a manifold with these methods thus yields extremely suboptimal results. See [16] for details.


2.4 User interfaces and gestures

This section covers literature related to issues of gesture-based user interfaces (UI). It sheds light on the suitability of gestures as an input modality. Approaches to matching gestures to application needs differ qualitatively; this is discussed first, followed by an introductory look at vision-based gesture recognition for human-computer interaction. Another detailed review of gesture recognition technology can be found in a book chapter by Turk [176].

2.4.1 Gesture-based user interfaces

Gesture interfaces can offer natural and easy ways to communicate with a computer and its programs. Hummels and Stappers [65] describe two experiments that show how well gestures are suited to convey geometric design information without requiring pre-defined semantics. For anything but geometric constructs, of course, the semantics have to be defined. Hauptmann showed early on that multi-modal interfaces combining gestures and speech are highly intuitive and the preferred mode of interaction (given speech-only and gestures-only as the other choices). In a Wizard-of-Oz study [55] he had participants manipulate graphical objects on a screen. Other recommendations based on his experience are not to try to substitute gestures for mouse input, since people want to use their hands in more than just one rigid configuration. He also stresses the importance of immediate feedback to allow for compensation of alignment errors when pointing. A study by Krum et al. [97] of applications of the Gesture Pendant (see below and [161]) confirmed these recommendations: their system maximizes gesture recognition robustness by allowing only four static hand configurations, followed by relative movements from the starting positions. This severely restricts the expressiveness and intuitiveness of the interface – which resulted in limited performance and in users disliking the gesture interface.

UI issues of hand gesture interfaces are usually handled in an ad-hoc manner without a systematic approach. That is to say, current research often limits the choice of gestures to those that are easily recognizable, or it employs habitual gestures out of convenience. Few projects take a systematic approach; exceptions can be found in Pausch et al. [127], Buxton et al. [22], and Leganchuk et al. [103], the latter explaining some of the benefits that two-handed interfaces have over single-handed interfaces. A high-level design method for gesture interfaces was developed by Sturman et al. [167], complete with an evaluation guide and implementation recommendations. They distinguish gestures by whether they are continuous or discrete. Gesture interpretations are classified according to whether the actions in the application domain have a direct relationship to the hand gestures, whether the gestures are mapped to arbitrary commands, or whether gestures are interpreted in a more context-sensitive, symbolic way.

The XWand [184] is a UI device in the shape of a wand or stick. It enables natural interaction with consumer devices through gestures and speech while overcoming many of the problems of hand gesture recognition. It closely matches "perfect" interaction with the environment: point at an object and gesture or speak commands to it. While none of the techniques for realizing the UI are new, their combination into such a package is novel. Earlier, Kohtake et al. [86] showed a similar wand-like device, the "InfoPoint," that enabled data transfer between consumer appliances such as a digital camera, a computer, and a printer by pointing it at them. A built-in camera is used for fiducial recognition.

Since the focus of this dissertation is on recognition of hand configurations from single-frame appearances (not hand motions as in Section 2.3.11), sign language recognition, given its strongly dynamic character, will not be covered in this literature review. The interested reader is referred to a review of some recognition methods for dynamic gestures, written in 1999 by Wu and Huang [187].

2.4.2 Vision-based interfaces

An early survey of methods and applications of vision-based hand gesture interfaces was written by Pavlovic, Sharma, and Huang [128]. One of the first papers that used finger tracking as real-time input was by Fukumoto et al. [49]. There, a cursor can be moved around on a projection screen by making a pointing hand posture and moving the hand within a space observed by two cameras. Two different postures can be distinguished (thumb up and down), with various interpretations used to control a VCR and to draw. The paper also deals with the problem of estimating where a person points. Crowley et al. [32] showed another vision-based interface that provides finger tracking input to an augmented reality (in its wider sense) painting application. While the CV method presented in that paper is not very robust and does not distinguish between different postures, they nicely showed correlations between tracking parameters (template area and search window sizes) and their effects on the UI (maximal speeds). A paper by Sato et al. [149] uses infrared images and similar methods to achieve simple interaction functionality such as browsing the web with finger gestures.

A background-robust system that uses shape information and color probabilities was demonstrated at various public venues by Bretzner et al. [19]. Their person-independent methods can distinguish five view-dependent hand configurations and control consumer electronics with the recognition data. In order to make the system real-time despite the compute-intensive particle filters employed, they use hierarchical sampling and a dual-processor machine.


With a camera positioned above the keyboard, Mysliwiec et al.’s FingerMouse project [119, 138] is able to detect the hand in one posture. Once it is detected, the extended index finger is found and its tip acts as a mouse pointer. A probability-based color-segmented image is the input to a finite state machine (FSM) which performs hand detection. The fingertip is found with a simple heuristic that looks for the north-most hand part.
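A minimal sketch of such a north-most fingertip heuristic, operating on a binary skin mask, could look as follows; this is only an illustration of the idea, not the FingerMouse implementation, and the function name and camera orientation are assumptions:

    import numpy as np

    def northmost_fingertip(skin_mask):
        # skin_mask: 2D boolean array, True where a pixel is classified as skin.
        # Returns the (row, col) of the top-most skin pixel, i.e. the "north-most
        # hand part", which then serves as the mouse-pointer position.
        rows, cols = np.nonzero(skin_mask)
        if rows.size == 0:
            return None                     # no hand visible in this frame
        top = np.argmin(rows)               # smallest row index = closest to the image top
        return int(rows[top]), int(cols[top])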

Twelve gestures are distinguished in a paper by Zhu et al. [197]. The gestures are a combination of motions and configurations, recognized with a trained model that combines affine transformations and ellipse fitting. A variation of dynamic time warping handles temporal variation. With a camera mounted opposite the user, a map browser interface is demonstrated to work fairly reliably. They observed that their measure of shape is less well recognized (just above seventy percent of the time) than their measure of motion (almost ninety percent of the time), which is not surprising given their choice of gestures and CV methods.
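For readers unfamiliar with the technique, the textbook form of dynamic time warping aligns two feature sequences of different lengths by minimizing accumulated frame-to-frame distances. The sketch below shows only this standard formulation; Zhu et al. use their own variation, whose details are not reproduced here:

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        # seq_a, seq_b: lists of per-frame feature vectors (e.g. shape or motion features).
        # Returns the accumulated cost of the best temporal alignment, so two
        # executions of the same gesture at different speeds compare as similar.
        a = [np.asarray(x, dtype=float) for x in seq_a]
        b = [np.asarray(x, dtype=float) for x in seq_b]
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]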

The Gesture Pendant [161] is a clever combination of hardware and software to implement a robust gesture recognition interface. An array of infrared LEDs illuminates the area before the device, which is worn like a necklace and ends up in front of the sternum. Hand gestures performed in front of it are easily segmented from the background with the aid of an infrared filter on a camera co-located with the LEDs. The system can recognize four postures and their movements very reliably in indoor settings. The authors note that the limited expressiveness and outdoor performance might be remedied with a laser beam that produces structured (grid-patterned) light instead of the constant illumination of the LEDs.

Computer vision can provide services to many different applications. Because it requires no physical devices (aside from the camera, of course) and because most of its implementation is in software rather than hardware, reconfiguration and in particular adaptation to the individual are a matter of software adjustments, as opposed to cumbersome or impossible device changes. The interested reader is referred to a fairly recent survey by Porta [135]. Also, Turk [177] recently evaluated the state of computer vision as an interface modality and gives advice as to which challenges must be overcome before widespread vision interfaces can become possible.

2.5 Virtual environments and applications

This section discusses some general issues of Virtual Environments (VEs) and Geographic Information Systems and then continues the previous section by looking at ways to perform vision-based human-computer interaction with VEs.


2.5.1 Virtual environments and GISs

Virtual Environments (see Isdale [68]) have many advantages over conventional desktop or other 2D displays. The two main capabilities that make them appealing to many other disciplines are their inherent 3D characteristics and the increased spatial extent of the display volume. Even though the actual display unit might not have a large field of view, viewpoint navigation can effectively extend the display into an unbounded space. Geographic Information Systems (GISs) are therefore a natural application area for VEs.

The Virtual GIS in Koller et al. [87] is arguably the first and most extensive combination of state-of-the-art GIS, visualization, and interaction techniques. System design for these complex applications poses many challenges (see Lindstrom et al. [107] for a project update). Interaction with VEs is also a very challenging problem and many researchers are working on solving it. Bowman’s thesis [13] is a comprehensive starting point to learn about methods and devices for interaction techniques for VEs. Shneiderman [160] summarizes that direct manipulation – which our vision-based interface inherently provides – is well suited for interaction with VEs due to at least the following reasons: “control-display compatibility, less syntax reduces error rates, errors are more preventable, faster learning and higher retention, encourages exploration.”


One frequently taken path leads to multimodal interfaces, usually involving speech and gesture recognition. These interfaces mimic natural spatial interaction as demonstrated in the seminal “Put that there” paper by Bolt [11]. Krum et al. [98] show a fairly simple yet robust multimodal interface to navigate a virtual earth model. They use the Gesture Pendant wearable gesture recognition device [161] and thus limit the gesture types to a body-relative reference frame. In contrast, Rauschert et al. [139] cannot leverage the users’ proprioceptive sense. On the other hand, their system allows for direct, absolute pointing to objects in the virtual world. That paper also briefly discusses “step in and use” operation, which refers to distinguishing multiple people without requiring person-specific training.

On a related note, Foxlin and Harrington [45] say about Mine et al. [117]: “Mine et al. have discussed very convincingly the benefits of designing virtual environment interaction techniques that exploit our proprioceptive sense of the relative pose of our head, hands and body. A variety of ‘preferred’ techniques were presented, such as: 1. Direct manipulation of objects within arms reach 2. Scaled-world grab 3. Hiding tools and menus on the user’s body 4. Body-relative gestures.” This is very encouraging for the style of interfaces that we propose since the proprioceptive sense is conveniently and automatically exploited with vision-based hand tracking.


Most of the VE-enhanced GIS systems are designed for indoor use; a project at UCSB, however, attempts to take the entire GIS and some of the novel visualization techniques outdoors. Clarke et al. [27] describe the UI design issues of such an approach. Nusser et al. [121] discuss current technological advances in various research areas and how they could affect the future of data sampling in the field. A good collection of wearable technology is maintained by Sutherland on his Wearables web site. 6 The demands on a wearable GIS are even higher than they are on traditional VEs: full immersion of the wearer into the virtual world is usually not desired, but instead a mixed reality that places virtual objects into the real world.

Augmented Reality (AR) is considered by many to be the new exciting frontier in VE research. 7 It is a fairly young field that still has many opportunities for solving technology problems. For an in-depth overview of AR see Azuma’s survey [4] and a more recent update [2]. One of the earliest attempts to take AR to the outdoors was by Starner [162]. This paper also briefly deals with user input to their wearable AR system, for which they employed a chording keyboard. These devices take a long time to learn, and familiarity with them in the general population is extremely low. Another early, more elaborate outdoor AR system is the Touring Machine from Columbia University (see Feiner et al. [42]).

6 http://home.earthlink.net/~wearable/

7 Dieter Schmalstieg (TU Wien) suggested on a panel discussion at the VR 2003 conference to focus on AR research for now, because purely virtual reality (VR) research seems to have stalled a little and could definitely benefit from most AR advances as well. He considers AR to be harder, so there would be more room for good research. As one of the steps that could bring the field on the right track he proposed to integrate camera and video support into existing VR toolkits, almost instantly yielding AR toolkits.

A paper by Höllerer et al. [62] mostly concerns the design of interaction elements, but it also describes the implementation of various ways of user input. They use, for example, selection by gaze (head) direction and following menus either on the HMD or a hand-held computer. Indoors, wireless InterSense position sensors and wireless trackballs take advantage of more available infrastructure. In another paper [61], the application of strategies for designing interface techniques is advocated. When displaying information, it should not only be relevant to the user at the respective time; it should also come in a package appropriate for the current environment and be visually placed in a sensible spot.

Head-worn cameras inspire many novel applications. The Remembrance Agent by Rhodes [146], for example, associates computer vision-recognized faces with learned names and can thus augment the wearer’s memory. Rekimoto and Nagao proposed similar augmentations with their NaviCam system [144]. A built-in video recorder that operates on the wearer’s commands, such as in Jebara et al.’s “DyPERS” system [69], is conceivable, as is using the camera to track his or her head orientation or even 6DOF location in the environment. The latter can be done in the unprepared outdoors (see Azuma et al. [3]) or in infrastructured settings (see, for example, Park et al.’s take on vision-based pose estimation [126], Welch et al.’s paper describing the HiBall Tracker [182], Rekimoto’s Matrix [143], and Kato and Billinghurst’s AR Toolkit [80]). Purely vision-based ego-pose tracking without any infrastructure is very hard, but hybrid tracking can yield better accuracy and precision than each of the involved methods in isolation. This has been demonstrated for a combination of magnetic trackers and vision-based tracking of artificial fiducials by State et al. [164], with a database of known images by Kourogi and Kurata [96], as well as for a combination of compass and inertial trackers with CV-based methods that dynamically identify and track features in the environment, see You et al. [191]. The current trend for commercial trackers certainly seems headed that way since InterSense is developing a hybrid tracker with exciting performance characteristics (see Foxlin and Naimark [46]). A related project from InterSense [45] had previously looked at ultrasound-based hand tracking from a head-worn reference unit. This would surpass the accuracy of current computer vision approaches that also attempt hand tracking in a head-centered reference frame, but a hand-worn unit is required and no finger configuration information can be acquired.

Another well-known outdoor AR project, ARQuake by Piekarski et al. [130, 171], accomplishes registration mainly with huge fiducials in the environment, tracked with the AR Toolkit. Whenever the vision-based tracking delivers unreliable results, a differential GPS and compass tracker take over to supply the 6DOF spatial parameters to the application. They use these fiducials also for hand tracking [132], which achieves performance somewhere in between an ultrasound tracker and purely vision-based tracking. Again, no finger configurations can be distinguished. Piekarski and Thomas [133, 131] also strongly advocate development of standard interfaces to both virtual and augmented realities, and tools to make application content creation a process that can be executed within the augmented environment to support immediate visualization and interaction capabilities.

2.5.2 Vision-based interfaces for virtual environments

Broll et al. [20] state that the “absence of tether-free solutions” to tracking is one of the big problems of interfacing in a convenient manner with VRs. The Studierstube [168] and a precursor project of the AR Toolkit by Kato et al. [81] use a plain object with fiducials on it to visually track one tool that can virtually assume many functionalities. In another application of the same technology, Hedley et al. [57] investigate how AR can be used in simple geography applications on the desktop. In addition, a simple hand gesture recognition system allows for free-hand input which is interpreted similarly to a mouse pointer. Due to its strong relation to the above projects, one non-vision-based application shall be mentioned: the RV-Border Guards game [123] also uses one device (an electromagnetically tracked glove) for various purposes, depending on the performed gestures: shooting and shielding.

Dorfmüller-Ulhaas and Schmalstieg use extensive amounts of special equipment [40]: users must wear gloves with infrared-reflecting markers, the scene is infrared-illuminated, and a pair of cameras is used for stereo vision. However, the system’s accuracy and robustness are quite high even with cluttered backgrounds. It is capable of delivering the accuracy necessary to grab and move virtual checkerboard figures. The tradeoff between special equipment and constraints on the environment is illustrated by comparing that paper with one presented by Segen and Kumar [154]: their techniques also operate with two cameras, but they do not require the user to wear gloves. As a direct result, the background has to be very uniform and easy to segment from the hand. The recognized gestures constitute input to a simple flight simulator or a robot arm.

Hardenberg and Bérard [181] show three applications: painting, presentation control, and moving of virtual objects. They focus on usability, and the paper discusses the requirements for a usable interface. An implementation is shown that is based on a frame differencing algorithm with a decaying average. Assuming good segmentation, they first extract the fingertip locations. From those, the hand configuration is estimated.
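To make the segmentation step concrete, the following sketch differences each frame against an exponentially decaying running average of past frames. It is only a generic illustration of that idea with illustrative parameter values, not Hardenberg and Bérard’s implementation:

    import numpy as np

    def decaying_average_foreground(frames, alpha=0.05, threshold=25.0):
        # frames: iterable of grey-level images (2D numpy arrays).
        # alpha: how quickly the background estimate decays toward the current frame.
        # threshold: minimum per-pixel difference to count as foreground.
        background = None
        for frame in frames:
            f = frame.astype(np.float32)
            if background is None:
                background = f.copy()                        # initialize with the first frame
            mask = np.abs(f - background) > threshold        # pixels that changed recently
            background = (1.0 - alpha) * background + alpha * f   # decaying average update
            yield mask                                       # candidate hand/finger pixels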


Chris Hand [54] took on the tremendously important task of investigating interaction techniques for 3D environments without limiting himself to a specific implementation technology. He classifies the uses of gestures in virtual environments into viewpoint control, selection, manipulation, and system control. Only by establishing device-independent metrics for which gestures are suited to what task can the development of applications and implementation technologies progress without slowing each other down.

2.5.3 Mobile interfaces

This section reviews VBIs for mobile computers and applications.

Starner et al. pioneered mobile VBIs for American Sign Language recognition [163]. A cap-mounted camera tracked skin-colored blobs whose spatial progression was analyzed over time with Hidden Markov Models. Their system worked with non-instrumented hands, just as ours does. However, our system integrates multiple image cues (skin color and texture information) to overcome the robustness limitations associated with relying on the accuracy of single-cue image segmentation. Recognizing a set of communicative gestures (which frequently exhibit distinct spatial trajectories) requires more semantic post-processing, but manipulative and discrete postures as recognized by our methods are more demanding on the CV methods. Another color-based VBI was shown by Dominguez et al. [39], who implemented a compelling wearable VBI that enabled the user to encircle objects in view with a pointing gesture.

In a later project, Krum, Starner, et al. built the Gesture Pendant, a previously mentioned mobile system for recognizing gestures and speech [98] (see also Section 2.5.1). It employed specialized imaging hardware with active infrared illumination and provided a small interactive area at sternum height in front of the wearer’s body. Our vision hardware is entirely passive; that is, it does not include light sources. A related user study [97] found that the relatively static hand position for extended periods of time caused fatigue. We hope to avoid fatigue and discomfort symptoms even for long-term interaction through use of a much larger interaction area and less rigid hand postures. The reader is referred to Section 2.2 and Chapter 3 for more on fatigue and discomfort.

Kurata et al.’s HandMouse [100] is a VBI for mobile users wearing an HMD and camera very similar to ours, allowing for the registered manipulation technique (see Section 8.3). It differs in that the hand has to be the visually prominent object in the camera image and that it relies solely on skin color. The robustness gained with our multi-modal approach makes it possible for the image of the hand to be much smaller. Via wireless networking, they employ a stationary cluster for processing [99], whereas our vision methods run on a single laptop without sacrificing real-time performance. Going beyond the interaction methods they demonstrated, we characterize additional techniques and their suitability for mobility and the outdoors. Our system then shows how this improves the mobile user interface’s usability and effectiveness.

A few recent research projects use the ARtoolkit [80] software to obtain the hand’s 6 degree-of-freedom (DOF) position purely by means of grey-level image processing. For example, for the previously mentioned Outdoor Tinmith Backpack Computer [172], Thomas and Piekarski attached a fiducial to the back of the hand. Our system requires no markers, tracks without restrictions on rotation, and can obtain posture information in addition to 2D location. Their system is an excellent example of a high-fidelity wearable computer, but also of the amount of equipment required to facilitate this functionality. We designed our system to minimize extraneous hardware requirements and instead make the computer disappear as much as possible. Only the head-worn devices are exposed; everything else is carried in a small backpack.

Wearable Augmented Reality systems such as described in Feiner et al. [42] are related in that they are a prime recipient for our interaction methods, as they lack sufficient interface capabilities and are thus still limited in their application.

System usability evaluation is becoming more and more important, as witnessed by a few projects. Dias et al. [37] found that participants “positively adhered to the concept of Tangible Interaction in” augmented reality environments. They advocate the development of standardized tests for usability as well as standard metrics to that end. A first step in that direction is taken by Paelke et al. [125], who have started creating an easy-to-use testbed for interaction techniques in augmented and virtual environments. Koskela et al. [94, 95] show in their concept papers that wearable augmented reality technology has created interest. They have approached user evaluation on proof-of-concept systems as a next step.


Chapter 3

Hand Gestures in the Human Context

Hands are our most versatile tool for accomplishing everyday tasks, from preparing a cup of tea or coffee in the morning to switching off the light at the end of the day. We use our hands and arms without thinking about them, and instead focus on the task they are to do: when writing a letter, we concentrate on the words and sentences and their meanings, not on how our fingers guide the pencil on the paper. This is the holy grail of user interfaces: controls and interaction metaphors should not occupy foreground brain processing resources but recede to unawareness in order to make space for task-related thinking. Three rules of thumb help to achieve this goal for gesture interfaces.

• First, the gesture should be as natural as possible to control the task at hand. If the task resembles an action that we perform in other situations, the gesture should also be similar.

• Second, the gesture should not involve uncomfortable or even painful postures or motions, which would attract one’s attention and distract from the task.

• Third, the tradeoff between learning curve and efficacy of the gesture must be considered carefully. A gesture that requires more learning, as keyboard shortcuts do, only pays off over a generic graphical user interface for the frequent and experienced user.

In this chapter, we address issues from the second bullet: comfort of hand postures and motions. We investigated the interaction range in front of a standing human in the transverse plane at about stomach height. This is of particular interest for our envisioned scenario in which a head-worn camera and display provide input and output capabilities to a wearable computer, for example, while walking or standing. The questions that directed our research were the following:

- What range is the hand likely to operate in?

- What percentage of people appreciate a more expansive interaction area?

- Does this range change with longer interaction times?

- When assuming direct, registered manipulation (versus the unregistered mouse- and pointer-style interaction), where shall items be placed to allow for convenient access with hand gestures?

- Where in the image should the computer vision methods search for the hand?

These issues are important at two levels of building a vision-based interface. At the implementation level, the computer vision methods can be optimized in a number of ways, for example, by restricting the search region for the hand. The hand detection stage of our HandVu vision interface takes these optimizations into account. At the user level, general acceptance depends to a large degree upon the ease of use and sustained positive experience with the interface. This satisfaction in turn depends on hand postures and motions that are convenient and comfortable to perform for a majority of the users. Our vision-based gesture interfaces described in Chapter 8 were designed with these recommendations in mind.

To find answers to these questions, we conducted a user study that employed a novel method to measure comfort. That method does not require the participants to fill in questionnaires but enables fine-grained, objective comfort assessment instead. The main idea is that study participants will naturally and intuitively select comfortable postures if they are given the chance to do so. This chapter summarizes our results, which are also available in two publications that were presented at the Human Factors Society’s annual meeting in 2003 [88, 89].

3.1 Postural comfort

This section describes the steps that lead to a quantifiable indication of postural comfort. The main concept is to allow compensation for uncomfortable postures through alternative body motions or postures and to measure under what conditions the compensation takes over. The goal of this operational definition is to lay out a generic method which can detect a sub-range corresponding to comfort within the full base range of what is physically possible. Observing the range of postures that participants intuitively assume will give an indication of their comfort zone.

3.1.1 Operational definition of comfort

Base: Given is an object under investigation (“OI,” a body part or joint) for which a comfortable sub-range of positions is to be determined during execution of a certain task. Select a base range of the OI’s possible positions, which can be either the anthropometrically feasible range (that is, the physical reach limit, for example) or an ergonomically sensible range (for example, the maximum non-fatiguing reach) for this body part or joint.

Compensation: A way to adjust the OI must be identified such that the OI can assume any location within its full base range without hampering task execution. This compensation can either be another body part or joint that can substitute for the movement of the OI, or it can be a way to adjust the experiment settings. For example, assume the OI is sideways head rotation (around the longitudinal axis). Then, compensation by another body part could be full-body rotation around the same axis. Or, assume users are performing a visual display terminal task (VDT task, reading on a monitor). A monitor that can be rotated around the body’s longitudinal axis is a device that can compensate for the head’s rotation by adjusting the experiment settings. It provides an alternative for head rotation in the longitudinal axis and, thus, is a compensation for the OI. Care must be taken to choose a compensation that does not require considerable effort from the participant. That is to say, it is important that these alternatives are not anthropometrically more “expensive” than the adoption of an uncomfortable posture.


Comfort Zone: In an experiment where the participants are free to use both the OI and the compensation as they desire, the sub-range which corresponds to the comfort zone will be determined. Design a set of tasks such that, if compensation were not permitted, areas in the base range of the OI’s positions would have to be assumed at sufficient sample density. The tasks must not evoke the participant’s desire to compromise posture for task performance. Now allow for compensation and observe which sub-range of the OI’s positions is still assumed, and which complementary region is compensated for by the alternative motion or posture. This boundary delineates the comfort zone, which is the main result of the procedure. Frequently, a second range can be observed: the positions assumed by the OI after compensating motion. Strictly speaking, this range is independent of the comfort zone. However, we generally expect it to fall within the limits of the comfort zone. This in fact validates and reinforces the result of the comfort zone and was observed in the experiment described in the following section.

The concept of devising experiments to determine comfort zones is illustrated below with two potential experiments.

Example 1: A potential experiment could include displaying text at locations around the head and having the participants read it aloud. For some locations, the participants will be more likely to turn their heads towards the text without moving their bodies, and for other locations they will compensate for extreme head rotations by rotating their entire bodies. The range of assumed rotational head-to-body offsets corresponds to the comfort zone for this OI.

Example 2: A comfortable arm and hand motion is one that requires little or no trunk motion, provided that the trunk is free to move. Presenting a set of targets at various locations will elicit trunk motions for some locations, while others will not. The comfort zone is the range of targets which elicits little or no trunk motion.

3.2 The comfort zone for reaching gestures

We determined the physical range in which humans prefer to perform fine motor hand motion through a user study that is described in the following. The important distinctions between this study and those in previous human factors research are the following. First, the participants had to sustain only the weight of their arms and hands, as opposed to carrying additional weight. Second, as our participants were naïve to the study’s purpose, we were able to expect that demand characteristics of the study itself did not play a significant role in influencing our results. In fact, we measured an objective quantity as opposed to acquiring subjective, questionnaire-based data as was done traditionally.


3.2.1 Method and design

According to our definition of comfort, a comfortable arm and hand motion is one that requires little or no trunk motion, provided that the trunk is free to move. Presenting a set of targets at various locations will elicit trunk motions for some locations, while others will be reached solely by arm and hand gestures. The comfort zone is indicated indirectly as the range of targets which elicits little or no trunk motion. We hypothesized that there is such a range that satisfies most users; that is, users would prefer to operate (are comfortable) therein.

We used a 13 x 2 within-participants design that studied the effect of 13 target locations and 2 trial durations. The locations were chosen to optimally sample the frontal transverse plane at stomach height across angle and distance with respect to the right shoulder joint. These locations are depicted as little circles in Figure 3.1, which shows a sketch of the experiment area from a bird’s eye view. In every trial, a trackball mounted on a tripod was placed at one of the 13 locations. The participants had to perform a task that required their hand to remain suspended in this position for five or twenty seconds. In Figure 3.2 a participant can be seen performing the skill task on the trackball.

The dependent variables were the locations of the participants’ shoulder joint and of the hand’s palm top. Both trajectories were measured in 3D space with electromagnetic trackers throughout each trial. Of particular interest were two derived measures from these variables: the shoulder motion from the initial starting position to the location at the end of each trial, and the distance between hand and shoulder.

Figure 3.1: Plan view of the experiment setup: shown are the 13 target locations (open circles at line intersections) and the size of the displayed object for the skill task on the right. Rays indicate the six azimuthal directions (-55°, -30°, -5°, 20°, 45°, and 70°); concentric partial circles depict the four radial distances (20, 33, 46, and 60 cm). Also shown are two measured values: the initial shoulder locations (dot cluster at the coordinate system origin) and the hand locations during the trials (dot clusters left of the targets).

3.2.2 Participants

Three females and four males from the campus community, ranging in age between 21 and 30 years, participated in this study. All were naïve to the purpose and predictions of the study. Prior computer exposure varied from “email and web” to computer science majors. All participants reported being right-hand dominant, and we recorded their body, shoulder, and elbow heights from the floor. The participants were compensated for their efforts with material goods valued at approximately USD 5.

Woodson [185] delineates reaching distances for arms in seated positions. After converting his head-centric coordinate system to our shoulder-centric one, we obtain the following percentiles: the 5th percentile of reaching distance is at 61cm, the 50th at 66cm, and the 95th at 70cm. Taking into account about 7cm for the distance between the target reached with the curled fingertips and the tracker location on top of the palm, our participants’ physiques spread from the 10th to the 90th percentile, thus being a representative sample of the population.


3.2.3 Materials and apparatus

The participants had one Ascension electromagnetic six-degrees-of-freedom tracker attached to the right palm top and one above the right shoulder joint. For each participant, a starting foot position was established and marked on the floor such that the shoulder joint was at approximately the same horizontal location for all participants. This is the origin of the coordinate system for all further discussion. The target object was the ball of a conventional computer trackball. The choice of this device is irrelevant to the study because its only objectives were a) to require the participants to keep their hands suspended in the air and b) to distract the participants from this fact. Furthermore, various input devices fare very similarly with regard to perceived discomfort of use (see Kee [82]).

The skill task required participants to turn the trackball in both directions around the horizontal axis, that is, forward and backward from the participant’s point of view (roll). The goal was to keep a virtual object from rotating beneath a virtual ground level. The object exhibited a randomly changing velocity and acceleration of its rotational speed around an anchor point on the ground. This rotation had to be counteracted by the participant. The nature of the skill task itself is not relevant to the study; its purpose was merely to keep the participants occupied and distracted from the study objective. The object was projected with a video projector on the wall about 2.5 meters from the participant’s initial position. Its size was 80 centimeters in diameter. It was rendered in real-time with OpenGL at 60Hz. A countdown timer for the time remaining in each trial was shown to the right and an error counter to the left of it. A few pictures from the setup can be seen in Figure 3.2.

Figure 3.2: A participant performing the skill task: note one tracker attached to the participant’s shoulder and one on his hand. Projected in front of him is the virtual object, a spiky sphere, which had to be balanced with the trackball.
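The disturbance that participants had to counteract can be pictured as a simple random walk on the object’s angular velocity. The following sketch illustrates that idea only; all constants are made up, since the actual parameters of the OpenGL task are not reported here:

    import random

    def rotational_disturbance(duration=20.0, dt=1.0 / 60, sigma=0.5):
        # Yields (time, angle) pairs of a randomly accelerating rotation around
        # the anchor point; the participant's trackball input would have to
        # cancel this rotation to keep the object above the virtual ground level.
        t, angle, velocity = 0.0, 0.0, 0.0
        while t < duration:
            velocity += random.gauss(0.0, sigma) * dt   # randomly changing acceleration
            angle += velocity * dt                      # integrate to the rotation angle
            yield t, angle
            t += dt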

To prevent participants from pre-planning their movements, they were required to close their eyes during setup between all trials. This was mandated since pre-planning can change the way motions are performed, see Rosenbaum et al. [147]. To also eliminate sound cues to target placement, participants wore mono headphones that rendered spatial sound cues ineffective.


Two constants were defined. First, body sway and body rotation cause shoulder movements. Participants moved their shoulders no more than 9cm without taking a step forward or backward. We defined this as the threshold for significant compensating movement, tnc = 9cm. Second, the offset between the tracker location on the hand and the fingertips (which had to reach the target) was also constant: we placed the tracker at a distance of dtf = 7cm. This amount was also observed in the data.

3.2.4 Procedure

A trial consisted of one rotate task executed for 5 or 20 seconds at one of the 13 trackball locations. After concluding a trial, the participants were told to momentarily walk to a location about one meter from the starting location, and thereafter go back to the initial position and close their eyes. This was implemented after pilot data showed a tendency for participants to move less and less over the course of subsequent trials. We speculated that promoting a general level of walking would help reduce this ostensible impedance. The trackball was repositioned to the next location and the countdown timer reset to either 5 or 20 seconds. The start of a new trial was indicated by vocal announcement. The participants would then open their eyes, move towards the target (trackball), and execute one rotate task. The target locations and motion durations were randomized.


During each trial, after opening their eyes, participants were free to move around. In particular, they were allowed to move their bodies towards or away from the target. To reinforce this possibility for compensating motion, we positioned the trackball well out of arm’s reach for a few test trials before the beginning of data collection. The entire experiment lasted about 1.5 hours per participant for three repetitions, including a 5-10 minute break after 50 minutes.

3.2.5 Instructions to participants

Participants were told that the study tested motor skill levels at various locations and that they should focus on the rotate task and produce as accurate a performance as possible. In the pilot study, participants had slid their palms back and forth on the trackball, causing arm weight offload onto the trackball mounting structure. Therefore, they were also instructed to use a “finger-walking” motion on the trackball to minimize the detrimental effect of weight offload. It was made explicit that the time between the instructor’s start command and the participants starting to rotate the trackball did not matter. The intention was to avoid participants compromising comfort for speed. It was in fact observed that participants took different amounts of time to “settle” before they touched the trackball.

The dialog for each trial was:

(after having completed the previous trial)

instructor: “please move out of the tracker range, then go to the starting position and close your eyes”

participant: (does so) “eyes closed”

instructor: (repositions trackball) “stand still please”

participant: (stands motionless)

instructor: (starts recording) “go!”

participant: (opens eyes and starts the trial)

3.2.6 Results

Fairly consistently, the participants’ body positions remained constant shortly (500ms) after their hands had reached the target. Thereafter, very little motion of hand and body was observed until the end of the trial. We now define a few symbols for the sake of a thorough discussion. Let tE be the time at the end of each trial. Let db be the amount of body movement, that is, the distance between the initial position of the shoulder and the shoulder’s position at time tE. More precisely, db is defined as only the component of this movement along the vector from the origin to the target location. Negative numbers indicate a movement backwards, away from the target.
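Computed from the tracker data, db is simply a signed projection. The following sketch shows one way to compute it, assuming the initial shoulder position is used as the coordinate origin; it is an illustration, not the analysis code used for this study:

    import numpy as np

    def body_movement_db(shoulder_start, shoulder_end, target):
        # All arguments are 3D positions in cm from the electromagnetic trackers.
        # Returns the signed component of the shoulder displacement along the
        # direction from the origin (initial shoulder position) to the target;
        # negative values mean the participant moved away from the target.
        s0 = np.asarray(shoulder_start, dtype=float)
        direction = np.asarray(target, dtype=float) - s0
        direction /= np.linalg.norm(direction)               # unit vector toward the target
        displacement = np.asarray(shoulder_end, dtype=float) - s0
        return float(np.dot(displacement, direction))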

Figure 3.3 shows contour lines (isolines or “isocomfort”) for medians of |db| over the entire study area, interpolated with Matlab’s v4 algorithm, which produces smoother results than cubic or spline interpolation. The central region about 35-45 cm from the shoulder joint is clearly visible as the region of least body movement (1cm isoline). The participants engaged in compensational motion of less than 1cm for targets positioned in this area. Thus it is the most comfortable region for gesture interaction in the horizontal plane at about elbow or stomach height. The entire area at this radial distance around the shoulder is also highly preferred: the median body movement was less than 2cm (2cm isoline). The variation in compensational motion was not significant across different angles (p=0.46). The standard deviation of absolute body movement shown in Figure 3.3 confirms a high consistency for all participants in this area. Results were consistent (p > 0.5) throughout the repetitions and naturally showed a high significance in the target distance parameter (p < 0.001).

Figure 3.3: Mean and standard deviation of body movement: absolute body movement |db| in cm is considered. The rays and coordinate system are identical to Figure 3.1.
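Readers who wish to reproduce such isoline plots from the 13 sampled locations can interpolate the per-target medians onto a regular grid. Note that SciPy offers no exact counterpart of Matlab’s v4 (biharmonic) method, so cubic interpolation is used below as a stand-in; this is a sketch only:

    import numpy as np
    from scipy.interpolate import griddata

    def isocomfort_grid(target_xy, median_abs_db, resolution=100):
        # target_xy: (13, 2) array of target locations in the transverse plane (cm).
        # median_abs_db: median absolute body movement |db| per target (cm).
        pts = np.asarray(target_xy, dtype=float)
        xs = np.linspace(pts[:, 0].min(), pts[:, 0].max(), resolution)
        ys = np.linspace(pts[:, 1].min(), pts[:, 1].max(), resolution)
        gx, gy = np.meshgrid(xs, ys)
        gz = griddata(pts, np.asarray(median_abs_db, dtype=float), (gx, gy), method="cubic")
        return gx, gy, gz   # contour with e.g. matplotlib: plt.contour(gx, gy, gz, levels=[1, 2, 5])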

One could expect that the absolute size of the comfort zone for a particular person depends only on the person’s physical reach, that is, the person’s actual arm length (as measured with a tape). However, our experiments could not confirm this. Similarly, the observed greatest or median reaching distances for a particular participant did not play a significant role either. On the contrary, the comfort zones that we measured seem to be a function of personal preference or habit.

Let dhs be the distance between hand and shoulder at time tE. Note that dhs is measured from the shoulder to the tracker attached to the palm top, not to the location of the fingertips. Figure 3.4 shows, for all trials at -30° azimuth, the distance dhs on the y-axis, plotted over the distance db on the x-axis.

Figure 3.4: Hand-to-shoulder distance over body movement: for four target distances at -30° azimuth (to the right). The comfort zone is defined only for dhs, as cz− ≤ dhs ≤ cz+.

The linear relationship between body movement and hand-to-shoulder distance is clearly visible in the diagonal arrangement of the data for each target distance dt: either the hand reaches further towards the target (larger dhs) and the compensating body movement db is smaller, or vice versa. In every case, however, the sum db + dhs roughly equals the target distance (minus the tracker-to-fingertip offset dtf). Figure 3.4 also shows that some participants chose to use the compensating motion, that is, to take a step forward or backward, for target distances dt ∈ {20, 46, 60}. The respective clusters of data points spread to the left and right outside of the range of body movements which are possible without compensating stepping, |db| < tnc.

There are two important observations to make. First, these participants take a step if the target is outside their comfort zone for hand-to-shoulder distances, cz− ≤ dhs ≤ cz+. This range marks the limits of the participant’s comfortable reaching distance which, when exceeded, is compensated for by body movement. Second, the hand-to-shoulder distance dhs that these participants assume thereafter is again within a tight range. This can be observed in the data points to the left and right of the tnc threshold lines: they are confined to within the two dashed horizontal lines. It is critical to note that, strictly speaking, these two ranges are independent of each other, yet they seem to correspond strongly. This indeed indicates the existence of a “comfortable interaction range.” In other words, if a participant takes steps for dt ∉ K for some K ⊂ ℜ, then dhs ∈ K is assumed thereafter. For reaching straight forward (-5° azimuth) we found cz− = 23cm and cz+ = 38cm.
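One straightforward way to extract such a range from the recorded trials is to look at the hand-to-shoulder distances of the no-step trials only. The sketch below does exactly that; the percentile bounds are an illustrative choice, not the procedure actually used to derive the numbers above:

    import numpy as np

    TNC = 9.0   # no-step threshold for shoulder movement in cm (Section 3.2.3)

    def estimate_comfort_zone(db, dhs, lower_pct=5, upper_pct=95):
        # db:  per-trial body movement in cm (signed, along the target direction).
        # dhs: per-trial hand-to-shoulder distance in cm at the end of the trial.
        # Returns (cz_minus, cz_plus), the spread of dhs over trials in which the
        # participant did not take a compensating step (|db| < TNC).
        db = np.asarray(db, dtype=float)
        dhs = np.asarray(dhs, dtype=float)
        no_step = np.abs(db) < TNC
        return tuple(np.percentile(dhs[no_step], [lower_pct, upper_pct]))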

The two interaction durations turned out to be too similar: no significant difference in the participants’ behavior was found between 5 and 20 second trials (p=0.60). However, our guess is that with interactions longer than a few minutes participants will start changing their postures slightly during the trials.

Figure 3.5: The comfort ratings in front of the human body: the isolines depict the percentiles of all study participants’ trials in which the absolute body movement was less than the no-step threshold tnc = 9cm.

The main results of this user study can be explained with Figure 3.5. It shows, again in bird’s-eye view, the 13 target locations as points denoted by little circles along the six rays emanating from the coordinate system’s origin. The origin of the coordinate system is the shoulder location at the beginning of each trial. The participants are facing to the right. The isolines depict the percentage of all participants’ trials that did not evoke compensating body motion, that is, in which participants did not step from their starting position (thus, |db| < tnc). According to our operational definition, this behavior corresponds to the participants’ comfort zone for hand and arm postures. For example, to reach a target located straight forward from the right shoulder joint at a distance of about 45 cm, about 85 percent of the time the participants did not take a step to alleviate an uncomfortable reaching gesture and were thus within their comfort zone.
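The quantity plotted in Figure 3.5 is easy to recompute from the trial data. A minimal sketch, again assuming the tnc threshold from Section 3.2.3 and a simple per-target aggregation rather than the exact plotting pipeline, follows:

    from collections import defaultdict

    TNC = 9.0   # cm, no-step threshold

    def no_step_percentage(trials):
        # trials: iterable of (target_id, db) pairs, with db in cm.
        # Returns, per target location, the percentage of trials in which the
        # participant stayed within the no-step threshold (|db| < TNC).
        counts = defaultdict(lambda: [0, 0])          # target -> [no-step trials, all trials]
        for target, db in trials:
            counts[target][0] += abs(db) < TNC
            counts[target][1] += 1
        return {t: 100.0 * ok / total for t, (ok, total) in counts.items()}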

3.3 Discussion

3.3.1 The meaning of comfort

While our definition of comfort clearly describes how to measure a quantity, the relationship between this quantity and (dis)comfort is not inherent. In other words, the comfort zone, defined as the area of the most comfortable motions or postures for a given task, does not predicate an absolute measure of well-being. Further experiments have to be designed to study this relationship. For the purpose of evaluating biomechanical postures with respect to their relative sustainability, however, our definition delivers the desired results: users within their comfort zone are unlikely to change into other postures.

Similarly, com<strong>for</strong>t does not effectuate a risk-free posture. Anthropometric<br />

soundness of a posture or motion has to be established with complementary means.<br />

Also, personal differences might cause a generally com<strong>for</strong>table <strong>and</strong> safe posture<br />

78


Chapter 3. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in the Human Context<br />

to be uncom<strong>for</strong>table or even risk-afflicted <strong>for</strong> select individuals. This is inherent<br />

in results that can only be stated as population percentiles. Furthermore, a gen-<br />

eral impairment in a person’s mobility might negatively affect the quality of our<br />

com<strong>for</strong>t measure.<br />

3.3.2 Comfort results and related work

Here, we will put our results in the context of two previous studies on postural workload and discomfort. Note, however, that our method does not usually evoke a conscious experience of discomfort, as is essential for all other methods. Please refer to the related work in Section 2.2 for a more general embedding of our research into the body of literature.

After adding the tracker-to-fingertip offset dtf = 7 cm to cz− = 23 cm and cz+ = 38 cm (at -5° azimuth, almost straight forward), our findings correspond to recommendations on workspace design and tool or materials positioning: Grandjean [51] suggests an optimal placement within a 35-45 cm radius from the lowered elbow. Our study goes further as it not only allows recommendation of an optimal reaching distance, but also quantification of how well this and other areas are likely to be experienced by the human. Figure 3.5 details this result of our study. To build comfortable gesture interfaces, designers should stay within the 95th or at least the 90th percentile to accommodate most users.


If the participants were not allowed the compensating motion, the arm motions necessary to reach all target locations in our study would provoke arm-joint discomfort scores in the range of 1 to 8 according to Chung et al. [26], where a higher number reflects greater discomfort. After allowing for compensating motion, no participant assumed postures with a score higher than 1. This shows that our findings are in line with these previous results as well. Our study delivered more fine-grained results, however, as those scores are discrete values that stem from discrete intervals of joint angles: 0-45 degrees produce a score of one, 45-90 degrees a score of three, and so on. Thus, the comfort zone is a true subspace of the area of no experienced discomfort.

Our definition does not rely on physiological data about muscle fatigue but instead considers postural comfort, which, to the best of our knowledge, is not physiologically manifested. Also, most human factors work is targeted towards decreasing the risk of musculoskeletal injuries resulting directly from the position actually assumed during task performance. Our work, however, aims at finding postures that do not motivate the desire to change posture, thus eliminating the risk of postures that the task designer has not anticipated.


3.3.3 Miscellaneous

We did not collect subjective, questionnaire-based user discomfort data for three reasons. First, we did not expect them to deviate from the previous studies' results. Second, the information gathered with conventional methods would have been too coarse for a meaningful comparison. Third, it is essential to the comfort definition that participants are oblivious to the study objective. Intermittent data collection is therefore prohibitive. Post-trial data collection would require either recall of 26 experiment configurations (an unlikely feat) or re-execution of them with interleaved evaluation, which again would interfere with the participants' naïveté towards the purpose of the compensational motion.

User interfaces that utilize both hands provide many benefits over single-handed interaction [59, 134]. The comfort zone for the non-dominant hand is expected to be very similar to the mirrored image of the comfort zone that we observed for the dominant hand. However, when both hands are to be used concurrently in a user interface, special considerations might be necessary.

3.3.4 Open issues

Our definition of comfort represents a novel evaluation tool for the detailed investigation of human postures and motions. The following are directions for further research that are, however, outside the scope of this dissertation.


• The relation between the effort for the compensating motion/posture and the effort for the primary motion/posture might influence the extent of the comfort zone. For example, if the compensation is physically too costly, the study participant might choose to put up with the uncomfortable primary motion or posture. A related question is how the effort of assuming a certain posture compares to the effort of executing a certain motion, and what implications this has for compensating for one with the other.

• Postural shifts during long-term stationary postures such as sitting indicate some degree of discomfort; see Liao and Drury [104]. More frequent shifts indicate an increase in discomfort. If shifts occurred more frequently for postures outside the comfort zone than for "comfortable" postures, that would be an independent indication of the validity of our comfort measure.

• The temporal evolution of comfort and discomfort is largely unknown. It is unclear for how long people can be comfortable within a certain motion range or posture, as well as what the preferable remedies are for this temporally acquired discomfort. The size of the comfort zone for very long-term postures and motions could remain unchanged or it could shrink. In any case, the validity of the method to determine the comfort zone would be assured if the aforementioned increase in postural shifts happened with different magnitudes for areas inside the comfort zone than for areas outside.

• The link between our definition of comfort and observed postures that pose a health risk is not yet established. An experiment designed to show this link for certain postures would make the proposed theory an even more important contribution. The objective is to determine whether participants who perform a task that is outside their comfort range will assume postures that are known to compromise their health. On the other hand, for a positive link to be established, they would have to perform a task that is within their comfort range without assuming those postures.

3.4 Conclusions

The comfort assessment method described in Section 3.1.1 provides a principled approach for identifying the comfort zone of bodily motions and postures. The particular study example of hand reach (Section 3.2) was chosen because it surveyed an important aspect of the space that is likely to be chosen for a hand gesture interface. The study results define a fine-grained, two-dimensional comfort function over the area in front of the body: the optimal distance for hand placement is within a half-moon-shaped area about 35 to 45 centimeters from the shoulder joint, at an angular range from 70 degrees adduction to 50 degrees abduction (away from the body center). Researchers and designers of gesture interfaces should primarily make use of this comfort zone. In general, novel interfaces should be evaluated for comfort or they could expose users to risk-fraught, unanticipated use patterns.

The results of this study were used in the remainder of this dissertation work in the following ways. 1) The camera's field of view includes the entire comfort zone in front of the human body. 2) Hand detection occurs in the most comfortable interaction area. 3) The pointer-based interaction style is preferred whenever possible, allowing hand movements in a dynamically determined area due to the input-to-output coordinate translation. 4) Also for the pointer-based interaction style, the input range is scaled to a larger output range, thus allowing smaller hand movements that do not exit the comfort zone and still reach all interaction elements with the pointer.



Chapter 4

HandVu: A Computer Vision System for Hand Interfaces

Hand gestures can be recognized with various means and varying fidelity. Data gloves, for example, are gloves equipped with bend-sensing elements that can accurately report the intrinsic parameters of the hand: flexion/extension and abduction/adduction of the various joints. A position- and orientation-sensing device, mounted on the wrist of the glove, can track the hand's extrinsic parameters with six degrees of freedom (DOF). These devices set the high bar for body posture estimation. However, they require gear to be worn on the hand and usually some fixed-mounted infrastructure.

Computer vision-based approaches hold the promise of avoiding these disadvantages. While the presence of a camera is inevitable, be it mounted on the ceiling or strapped to the wrist, the observed body part is unhindered by cloth or worn gear. In addition, cameras worn on the body allow for mobility of the recognition system and thus for use of the data as input to wearable devices.

This chapter presents the structure and main characteristics of the computer vision system that we built. This software system is capable of detecting the human hand in monocular video, tracking its location over time, and recognizing a set of finger configurations (postures). It operates in real time on commodity hardware and its output can thus function as a user interface. We will first describe the physical setup, then the software organization, and finally the characteristics of the system as a whole.

4.1 Hardware setup

A camera is assumed to be worn on the forehead, facing forward and downward to cover the lower quarter-sphere in front of the body that the hands operate in. This configuration is advantageous because gestures performed in the observed space include the most convenient hand/arm postures, as detailed in Chapter 3. Furthermore, that location allows mounting the camera atop a head-worn display, as is frequently employed for virtual and augmented (mixed) reality applications (see Chapter 8). The co-location can in turn be exploited to realize video see-through capabilities by feeding the recorded video to the display in real time. This is a popular way of facilitating augmented reality.

Figure 4.1: Our mobile user interface in action: all hardware components aside from display and camera are stowed in the backpack.

Output is realized through a head-worn display (HMD, in our case Sony Glasstron LDI-A55 glasses), atop which we mounted a small digital camera (FireFly, Point Grey Research); see Figure 4.1. The camera has a horizontal field of view (FOV) of 70 degrees. The live video stream, augmented with the application overlay described in Chapter 8, is fed into the display to achieve video see-through mixed reality. This alleviates problems with the HMD's small 30 degree FOV because it makes a 70 degree FOV available to the wearer. The resulting spatial compression takes users a few minutes to get used to, but seems quite natural after that time. Use of this fish-eye-style lens reduced the tunnel effect that most optical see-through mixed reality displays exhibit. The high FOV is also important for interface functionality because both the hands and a more forward-facing view direction are within the FOV, which allows direct feedback as well as a registered interaction style (see Section 8.3). Furthermore, the FOV encompasses the entire comfort zone as discussed in the previous chapter, leaving the possibility for the interface designer to leverage the full area of convenient hand motions.

Note that no other input device such as a Twiddler keyboard or 3D mouse has to be used. Instead, the input and output interface is combined into a single head-worn unit. The other logical component of the system, a laptop plus a few adapters and batteries, is stored away in a conventional backpack. Overall, this makes for a fairly easy-to-assemble and relatively inexpensive mobile computer.

4.2 Vision system overview

The software system that realizes the vision-based hand gesture recognition and allows for its utilization as a user interface consists of a number of software components that will be described in the following. HandVu (pronounced "hand-view") is a library and the core gesture recognition module that implements all of the computer vision methods for detection, tracking, and recognition of hand gestures. This module receives the video feed from a DirectShow pipeline and supplies the gesture results to an MFC application. This application, called HandVu WinTk, handles pipeline initialization and implements convenience functions. In addition to these runtime components, there is also an offline module that implements AdaBoost training for the detection and recognition components. The training module is described in Chapter 5.

4.2.1 Core gesture recognition module

The core module is a combination of recently developed methods with novel algorithms to achieve real-time performance and robustness. Careful orchestration and automatic parameterization are largely responsible for the high-speed performance, while multi-modal image cue integration guarantees robustness.

There are three stages: the first stage detects the presence of the hand in one particular posture. (It is undesirable to have the vision interface always active since coincidental gestures may be interpreted as commands. Also, processing is faster and more robust if only one gesture is to be detected.) After this gesture-based activation, the second stage serves as an initialization to the third stage, the main tracking and posture recognition stage.

This multi-stage approach makes it possible to take advantage of less general situations at each stage. Exploiting spatial and other constraints that limit the dimensionality and/or extent of the search space achieves better quality and faster processing speed. We use this at a number of places: the generic skin color model is adapted to the specifics of the observed user (see Section 5.9 in Chapter 5), and the search window for posture recognition is positioned with fast model-free tracking (see Chapter 6). However, staged systems are more prone to error propagation and failures at each stage. To avoid these pitfalls, every stage makes conservative estimations and uses multiple image cues (grey-level texture and local color information) to increase confidence in the results.

HandVu, the core vision component, is entirely platform independent and its only necessary dependency is the OpenCV library.¹ The Maintenance Support application (see Chapter 8) that is built into the HandVu WinTk MFC component requires the Magick image library for loading the icon overlays; it is started on demand only. Image operations are kept scalable to different frame sizes as much as possible.

HandVu serves as a library for gesture recognition that can be built into any application that demands a hand gesture user interface. However, it does not handle any platform-specific operations such as image acquisition or display, and thus requires some programming before it can be used. Section 4.2.7 describes the WinTk application that embeds the library to provide a set of versatile and easy-to-access interfaces to the gesture recognition results.

¹ However, using Intel's Integrated Performance Primitives (in particular the Image and Video Processing part, formerly IPL) as the OpenCV subsystem increases performance on Intel platforms.

Figure 4.2: Arrangement of the computer vision methods: only on successful hand detection will the tracking method start operating. Posture recognition is attempted after each tracking step. If successful, features and color are re-initialized.

The final output of the vision system indicates for every frame the 2D location of the hand if it is tracked, or that it has not been detected yet. Chapter 6 defines what exactly is meant by the "hand location." The location of a second hand within view can also be determined in certain cases. If the dominant hand's posture is recognized, it is described with a string identifier as a classification into a set of predefined, recognizable hand configurations. HandVu's API and the various ways of obtaining output from the system are described in Section 4.2.5.

The diagram in Figure 4.2 and the following paragraphs briefly introduce the components of our vision system and their interactions. More detail can then be found in the following chapters.


Hand Detection

The initial stage of the vision system attempts detection of the hand in a particular posture. Since hands are frequently over-exposed in comparison to the background, the vision system performs automatic exposure correction for an area smaller than the entire image (see Section 4.2.2). To facilitate this, the hand is only detected in a rectangular region that can be specified by the interface designer. For our applications, we chose the central part of the comfort zone as discussed in Chapter 3 to function as the hand detection area. If the detection was not successful, vision processing for the current frame ends here. If the hand was detected, the location of a 2D bounding box is sent to the tracking initialization stage. More information on how the detection is facilitated can be found in Chapter 5.

Tracking initialization

The observed hand color is then learned in a color histogram, given the bounding box location and a probability map that specifies the likelihood that pixels within that area belong to the hand. This color is contrasted with a reference set of background image areas. Next, a "Flock of Features" is placed on what is believed to be the hand. (See Chapter 6 for tracking details.) No further processing is done on the current video frame, and the following frames are sent to the third stage, introduced in the next paragraph.

Figure 4.3: A screen capture with verbose output turned on: the image is partially color-segmented, illustrating how skin color by itself is not a reliable modality. In color prints, the green KLT features are also visible. This shot was taken while walking.

Tracking and recognition

The Flock of Features follows small grey-level image artifacts. A weak global constraint on the features' locations is enforced, keeping the features tightly together. Features that are not likely to still be on the hand area are relocated to the close proximity of the remaining features, onto an area with high skin color probability. This technique integrates grey-level texture and dimensionless color cues, resulting in more robustness towards tracking disturbances caused by background artifacts. From the feature locations, a small area is determined and scanned for the key postures for which recognition is attempted. Posture recognition is described in detail in Chapter 7, while all tracking aspects are covered in Chapter 6. If the posture recognition succeeds, the feature locations and the color lookup table are re-initialized as described in the previous paragraph.
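Schematically, this staged control flow can be summarized as in the sketch below. The state and function names are illustrative placeholders rather than HandVu's internal identifiers; the actual detection, color learning, tracking, and recognition steps are the subject of Chapters 5 through 7.

    // Illustrative control-flow sketch of the staged recognition; the state and
    // function names are placeholders, not HandVu's internal identifiers.
    typedef struct _IplImage IplImage;                     // OpenCV image type, forward-declared

    bool DetectHandInActivationPosture(IplImage* frame);   // detection, Chapter 5
    void LearnForeAndBackgroundColor(IplImage* frame);     // color histogram, Section 5.9
    void PlaceFlockOfFeatures(IplImage* frame);            // tracking initialization, Chapter 6
    void TrackFlockOfFeatures(IplImage* frame);            // tracking, Chapter 6
    bool RecognizeKeyPosture(IplImage* frame);             // posture recognition, Chapter 7

    enum Stage { DETECT, TRACK_AND_RECOGNIZE };

    void ProcessOneFrame(IplImage* frame, Stage& stage) {
      if (stage == DETECT) {
        // Stage 1: wait for the hand in the activation posture.
        if (DetectHandInActivationPosture(frame)) {
          // Stage 2: initialize tracking on the same frame, ...
          LearnForeAndBackgroundColor(frame);
          PlaceFlockOfFeatures(frame);
          stage = TRACK_AND_RECOGNIZE;                     // ... then hand subsequent frames to stage 3.
        }
      } else {
        // Stage 3: track the flock, then attempt posture recognition near it.
        TrackFlockOfFeatures(frame);
        if (RecognizeKeyPosture(frame)) {
          LearnForeAndBackgroundColor(frame);              // successful recognition re-initializes
          PlaceFlockOfFeatures(frame);                     // the color model and the feature locations
        }
      }
    }

Collapsing detection and tracking initialization into one branch reflects that the initialization operates on the same frame in which the hand was detected.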

4.2.2 Area-selective exposure control

The automatic exposure control that most digital cameras perform does not suit the purposes of a vision-based hand gesture interface. Ideally, the hand would always be perfectly exposed. Yet the cameras optimize exposure for the entire image area, not just where the hand is located. We therefore designed and implemented a software-based exposure correction function that only considers a sub-area of the frame. For it to work, the camera (or another component in the video pipeline) must provide a simple interface to read and set its exposure level.

    class CameraController {
    public:
      virtual double GetCurrentExposure() = 0;               // [0..1]
      // true if change has an effect, false if step is too small
      virtual bool SetExposure(double exposure) = 0;          // [0..1]
      virtual bool SetCameraAutoExposure(bool enable=true) = 0;
      virtual bool CanAdjustExposure() = 0;
    };

For example, the DirectShow interface IAMCameraControl that many camera source filters expose can fulfill these demands with only minute changes. The correction algorithm runs periodically, currently every 500 or 1000 milliseconds. It counts the number of highly exposed pixels within the current scan area, where a highly exposed pixel has 80% or more of the maximum brightness scale. If that number is greater than max_frac = 30% of the total number of pixels within the scan area, the image area is considered over-exposed. In that case, the exposure time per frame is corrected by a factor of max_frac over the current percentage, bounded to a factor greater than 0.75 to avoid abrupt changes.

Under-exposure is corrected for if less than min_frac = 10% of the pixels are within the upper 20 percent of the brightness scale. Exposure time is then extended by a factor of 1.1 over the current percentage of pixels in the top 20 percent, bounded to a maximum correction factor of 1.25. See Figure 4.4 for a pseudo-code notation of the algorithm.

Setting a new exposure level might or might not result in an actual adjustment to the camera because of the camera's exposure level resolution. To obtain the exact effect, the new exposure setting has to be checked after a call to SetExposure. However, as a coarse measure, the function returns true if the CameraController changed its exposure.

    input:
        bbox    : scan area
        img     : grey-level image, values [0..1]
        exposure: last computed exposure level
    constants:
        bright_exp = 0.8
        max_frac   = 0.3
        min_frac   = 0.1
    algorithm:
        bright_pixels = num pixels in bbox with brightness >= bright_exp
        bright_pixels = bright_pixels / area(bbox)
        correction_factor = 1.0
        if (bright_pixels > max_frac)
            correction_factor = max(0.75, max_frac/bright_pixels)
        else if (bright_pixels < min_frac)
            correction_factor = min(1.25, 1.1*min_frac/bright_pixels)
        exposure = exposure * correction_factor

Figure 4.4: Pseudo-code of the area-selective exposure correction algorithm.

While no formal experiment has been conducted, subjective image results show a large reduction of strongly over-exposed pixels within the area of adjustment. As a direct result of our area-selective software adjustment, the detection method consistently (over time and with different cameras) succeeds in lighting conditions in which it consistently fails when relying on the cameras' built-in exposure control mechanisms.
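For concreteness, one periodic correction step could be driven through the CameraController interface roughly as follows. This is only a sketch: the helper names and the plain grey-level buffer are assumptions, and the under-exposure expression mirrors the reconstructed branch of the pseudo-code above rather than HandVu's actual implementation.

    #include <algorithm>

    // Illustrative translation of the pseudo-code above into calls against the
    // CameraController interface; names, the grey-level buffer layout, and the
    // under-exposure expression are assumptions, not HandVu's implementation.
    struct ScanArea { int left, top, right, bottom; };

    void CorrectExposureOnce(const unsigned char* gray, int img_width,
                             const ScanArea& box, CameraController* cam) {
      if (cam == 0 || !cam->CanAdjustExposure()) return;

      const double bright_exp = 0.8, max_frac = 0.3, min_frac = 0.1;
      int bright = 0, total = 0;
      for (int y = box.top; y < box.bottom; ++y)
        for (int x = box.left; x < box.right; ++x, ++total)
          if (gray[y * img_width + x] >= bright_exp * 255) ++bright;

      const double frac = (total > 0) ? double(bright) / total : 0.0;
      double factor = 1.0;
      if (frac > max_frac)                        // over-exposed: darken, bounded below by 0.75
        factor = std::max(0.75, max_frac / frac);
      else if (frac > 0.0 && frac < min_frac)     // under-exposed: brighten, bounded above by 1.25
        factor = std::min(1.25, 1.1 * min_frac / frac);

      if (factor != 1.0)
        cam->SetExposure(std::min(1.0, cam->GetCurrentExposure() * factor));
    }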

4.2.3 Speed and size scalability

The vision module adapts to the available processing power of the hardware it is running on. This is important for the responsiveness of the user interface, and necessary because some computations can take longer than the time between two successive frames.

Incoming frames are tagged based on their latency, which is the time that has passed between frame capture and the frame's arrival at HandVu's processing module. If this time is less than a threshold t_max_normal_latency, the frame is tagged to be fully processed. For the most part, this turns on the detection and recognition subroutines. If the frame's latency is greater than that but less than a second threshold t_max_abnormal_latency, the frame is tagged to be "skipped." This means that partial processing is done and the HandVu-wrapping application is recommended to also perform only minimal processing steps before displaying the video frame. If HandVu is tracking the hand, it will perform an update step on the Flock of Features. If the hand has not been detected, HandVu will not do any processing. Lastly, if the frame arrives with more than t_max_abnormal_latency delay, only Flock of Features tracking is performed and the frame should be dropped by the wrapping application. This behavior is transparent to applications that are connected through the event server.
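The tagging decision itself is a simple threshold comparison; a schematic sketch, using the RefTime type and HVAction values introduced in Section 4.2.5 and purely illustrative threshold values, might look as follows.

    // Schematic sketch of the latency-based frame tagging; RefTime and the HVAction
    // values are introduced in Section 4.2.5, and the two thresholds below are
    // illustrative numbers, not HandVu's actual settings.
    HandVu::HVAction TagIncomingFrame(RefTime capture_usec, RefTime now_usec) {
      const RefTime t_max_normal_latency   = 100 * 1000;  // e.g. 100 ms (assumed)
      const RefTime t_max_abnormal_latency = 400 * 1000;  // e.g. 400 ms (assumed)
      const RefTime latency = now_usec - capture_usec;

      if (latency < t_max_normal_latency)
        return HandVu::HV_PROCESS_FRAME;   // fully process: detection and recognition enabled
      if (latency < t_max_abnormal_latency)
        return HandVu::HV_SKIP_FRAME;      // partial processing: display, but skip the heavy steps
      return HandVu::HV_DROP_FRAME;        // tracking update only; frame should not be displayed
    }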

The described method results in smooth scaling down to machines with about 1 GHz CPU speed, but becomes increasingly choppy and uneven for slower machines.

A minimum resolution of 320x240 pixels is highly recommended. Beyond that, all functionality is independent of the video scale. However, larger video frames on slower machines require longer processing times per frame.

4.2.4 Correction for camera lens distortion

Most camera lenses introduce spatial distortions into the video frames. The HandVu system optionally corrects for those artifacts with the help of a function from the OpenCV library, turning every frame into a correct perspective projection. This is important for many applications, particularly for augmented reality applications that need to draw aligned, perspectively correct geometry over the real world as seen through the camera. All frames are undistorted except those that are dropped. Since this operation takes a considerable amount of time (on the order of the Flock of Features tracking), it can be turned on and off without affecting other settings.


Features are tracked in every frame, even in frames with such a high latency that they are to be dropped by the wrapping application and no undistortion is performed on them. Thus, features are tracked in the original, distorted image. To be compatible with the image output, the locations reported by the event server are converted to their respective coordinates in the undistorted frame as a last step of processing.

4.2.5 Application programming interface

HandVu is primarily a library and this section describes its application programming interface (API). It also introduces ways to connect to the HandVu WinTk, described in Section 4.2.7. To keep the core as platform-independent as possible, HandVu requires an application that uses the library to implement one or two functionalities. The camera controller that interfaces to specific camera functions was already introduced in Section 4.2.2 and is in fact optional. The other, mandatory functionality to be provided by the wrapping application is a clock that can report both the time when an image frame was captured and the current time, in microseconds:

    typedef long long RefTime;

    class RefClock {
    public:
      virtual RefTime GetSampleTimeUsec() const = 0;
      virtual RefTime GetCurrentTimeUsec() const = 0;
    };
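For illustration, a minimal RefClock could be backed by std::chrono as sketched below; in the real system the sample time comes from the capture pipeline, so equating it with the current time here is an assumption made only to keep the sketch self-contained.

    #include <chrono>

    // Minimal illustrative RefClock implementation; not part of HandVu itself.
    class SteadyRefClock : public RefClock {
    public:
      virtual RefTime GetCurrentTimeUsec() const {
        using namespace std::chrono;
        return duration_cast<microseconds>(
            steady_clock::now().time_since_epoch()).count();
      }
      virtual RefTime GetSampleTimeUsec() const {
        // Assumption: no capture timestamp is available here, so report "now";
        // a real pipeline would return the frame's capture time instead.
        return GetCurrentTimeUsec();
      }
    };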


The state of an "object" (currently the right hand, object ID 0) is accessed via the state data structure HVState; see Section 4.2.8 below. Given these two definitions, we can introduce the main HandVu class. The most important functions of HandVu's API are shown below.

    class HandVu {
    public:
      enum HVAction {      // specify recommendations to application:
        HV_PROCESS_FRAME,  // fully process and display the frame
        HV_SKIP_FRAME,     // display but do not further process
        HV_DROP_FRAME      // do not display the frame
      };

      void Initialize(int width, int height, RefClock* pClock,
                      CameraController* pCamCon);
      void LoadConductor(const string& filename);
      void StartRecognition(int obj_id=0);
      void StopRecognition(int obj_id=0);
      HVAction ProcessFrame(IplImage* inOutImage);
      void GetState(int obj_id, HVState& state) const;
      void SetOverlayLevel(int level);
      void CorrectDistortion(bool enable=true);
      void SetAdjustExposure(bool enable=true);
    };

The operation is fairly straightforward: first, HandVu is initialized with the width and height of the video stream, and the RefClock and camera controller are supplied. After a conductor configuration file (see Section 4.2.9 below) has been loaded with LoadConductor, recognition can be started and stopped at will. Every video frame needs to be passed to HandVu via the ProcessFrame function, which returns a recommendation on what to do with that frame. The main result, the state of the hand gesture recognition, is available via the GetState function call. For CorrectDistortion to work, the conductor configuration file has to specify a valid camera calibration file. For the exposure adjustment to be possible (turned on via SetAdjustExposure), HandVu must have been initialized with a non-NULL CameraController.
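A hypothetical embedding of the library might therefore look roughly like the sketch below. GrabNextFrame, Display, and HandlePosture stand in for the wrapping application's own frame source, renderer, and gesture handler, and the frame size and conductor file name are merely example values.

    // Hypothetical use of the HandVu API; GrabNextFrame, Display and HandlePosture
    // are placeholders for the wrapping application's own functionality.
    IplImage* GrabNextFrame();                 // assumed: returns NULL when the stream ends
    void Display(IplImage* frame);             // assumed: renders the (possibly annotated) frame
    void HandlePosture(const HVState& state);  // assumed: reacts to a recognized key posture

    void RunGestureLoop(RefClock* clock, CameraController* cam) {
      HandVu hv;
      hv.Initialize(640, 480, clock, cam);             // example frame size
      hv.LoadConductor("config/default.conductor");    // placeholder conductor file
      hv.StartRecognition(0);                          // object 0: the right hand

      while (IplImage* frame = GrabNextFrame()) {
        HandVu::HVAction action = hv.ProcessFrame(frame);
        if (action == HandVu::HV_DROP_FRAME) continue; // too late: do not display

        HVState state;
        hv.GetState(0, state);
        if (state.m_recognized) HandlePosture(state);  // one of the key postures was seen
        Display(frame);                                // HV_PROCESS_FRAME or HV_SKIP_FRAME
      }
    }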

User applications (such as those described in Chapter 8) can also connect to the "HandVu WinTk" toolkit, a stand-alone Windows MFC application that is built on top of the library using DirectShow (see Section 4.2.7). Applications are either embedded in the pipeline as an extension to the recognition module's DirectShow (DX) filter, inserted into the pipeline by the MFC application as a separate DX filter, or connected through one of the two networked interfaces, the Gesture Server or the OSC protocol.

Independent of the connection channel that allows access to the recognition state, a separate channel can be opened to transfer the actual video data from HandVu's WinTk to applications that do not have direct access to the pipeline's video stream. This is essential in case the application requires access to the video for input or for its own output capabilities. It can request that the frames are not displayed through DX but instead saved to a shared memory location that resides in a DLL called FrameDataLib.² Its API is shown below.

² Many thanks to Ryan Bane, who implemented the DLL.


    void FDL_WaitForInitialization();
    void FDL_GetDimensions(int* width, int* height, int* channels);
    void FDL_GetImage(unsigned char** img);

The application calls FDL_WaitForInitialization, a blocking call that returns when HandVu's WinTk has initialized the buffer. FDL_GetDimensions returns the frame size and the number of color channels. Finally, FDL_GetImage obtains a pointer to the library-internal shared memory data structure that contains the latest frame that HandVu has processed.
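A client might consume this API roughly as in the following sketch; the interleaved 8-bit pixel layout and the decision to copy the frame before use are assumptions made for the example.

    #include <vector>

    // Illustrative FrameDataLib client; the interleaved 8-bit pixel layout is an assumption.
    void FetchLatestFrame(std::vector<unsigned char>& out) {
      int width = 0, height = 0, channels = 0;
      FDL_WaitForInitialization();                       // blocks until WinTk has set up the buffer
      FDL_GetDimensions(&width, &height, &channels);

      unsigned char* img = 0;
      FDL_GetImage(&img);                                // pointer into the shared memory buffer
      out.assign(img, img + width * height * channels);  // copy before the next frame overwrites it
    }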

4.2.6 Verbosity overlays

One of the most important aspects of user interfaces is immediate feedback to the user once a command, or even a slight change in the input vector, is recognized. A lack thereof decreases usability, in particular the speed with which the interface can be used.

HandVu can give feedback about its gesture recognition state by overlaying information on the processed video frame. The amount and verbosity of the overlay can be selected based on the application programmer's and the application user's needs. HandVu provides timely and direct feedback about the most important vision-level information: whether detection and tracking of the hand were successful and whether one of the key postures was recognized. The following is a detailed explanation of HandVu's different verbosity levels. Each level displays its own information in addition to all lower levels' information.

Level 0: No overlay; only the (possibly distortion-corrected) video stream is rendered.

Level 1: A textual display in the upper right corner shows the frames per second that HandVu achieved and, in parentheses, the frames per second at which posture recognition was attempted. The fastest and slowest processing times within the last second are stated in milliseconds. A single dot on the tracked hand shows the Flock of Features' mean location.

Level 2: The frames' incoming latency is shown, along with a white rectangle around the current scan area. A large dot identifies the hand's mean location, and little dots mark each of the features in the flock. Recognized hand postures have a green box drawn around them.

Level 3: During tracking, the scan area is color-segmented with a 0.5 probability threshold. This back-projection turns non-skin pixels black and leaves skin pixels unchanged. The number of individually detected areas is shown.

In addition to HandVu's feedback mechanism, applications can implement their own ways to signal event recognition to the user. Section 8.4 on page 212 explains the implications with some examples.


4.2.7 HandVu WinTk: video pipeline and toolkit

For convenient access to the hand tracking and recognition results, we built a prototypical application for the Microsoft Windows platform that embeds the core vision components. It provides true out-of-the-box utilization of hand gestures as a user interface. It is a stand-alone application that leverages Microsoft's DirectShow API to support almost any video source, be it a camera or file. The recognition results are made available in two network protocol formats so that client applications can run on the same or another machine. Figure 4.5 is a schematic diagram of the data flow and interfaces.

Figure 4.5: The vision module in the application context: embedding of the vision module into the video pipeline and stand-alone application.


The DirectShow filter serves three purposes. First, it prepares each frame for processing by the vision module. This mostly amounts to creating the appropriate image structure (an IplImage) and possibly flipping the image upside down to accommodate different video source properties. Second, it takes the vision system's recommendation on subsequent frame processing into account and forwards or drops the frame (see Section 4.2.3). Third, it is a thin interface wrapper that allows COM access to the vision module's functionality. This is necessary because the main application lives in a different process space than the DX filter. The Maintenance Support application (see Chapter 8) and two user studies are built into the filter and controlled partially through input to the MFC application. If active, buttons and other interactive visualizations are overlaid over the rendered video.

The application has a number of responsibilities. First, it has to build the DirectShow graph, including a video source, the DX filter that wraps the vision module, and a rendering filter. Second, it implements the RefClock and CameraController interfaces with the help of a few DirectShow COM interfaces, and announces their availability to the DX filter. Third, it spawns the gesture event server's thread and initiates the sending of events after every video frame. Fourth, if desired, it also makes the entire image frame available to the FrameDataLib DLL (see page 101) so other applications can display the video in a custom format. Lastly, some keyboard and mouse input is interpreted as commands to the toolkit or the HandVu library, and other traditional input is forwarded unprocessed to the DX filter.

4.2.8 Recognition state distribution

There are three main ways to obtain the current state of HandVu. The first is through a library call to the GetState function, the second is through a TCP/IP client-server connection, and the third uses a UDP packet format frequently used in the music and arts community. They are described below.

GetState

The common way for applications to obtain the result of processing a frame with HandVu is to call the following function:

    void HandVu::GetState(int obj_id, HVState& state) const;

where

    class HVState {
    public:
      int m_obj_id;
      bool m_tracked;
      bool m_recognized;
      double m_center_xpos, m_center_ypos;
      string m_posture;
    };


The obj_id is currently fixed to a value of zero, identifying the right hand. The two boolean member variables indicate whether the object is successfully tracked and whether one of the key postures was recognized, respectively. The location of the tracked object is reported in relative image coordinates, with the image origin in the upper left corner of the image. If one of the key postures was recognized, the posture string contains the identifier string of the detecting cascade (see also Section 4.2.9).
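Because the coordinates are relative, a client that needs pixel positions scales them by its own frame dimensions, for example:

    // Converting the relative coordinates to pixels (illustrative); frame_width and
    // frame_height are the application's own values for the displayed frame.
    int px = int(state.m_center_xpos * frame_width);
    int py = int(state.m_center_ypos * frame_height);   // origin: upper left corner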

Gesture events

The client-server architecture sends events from the computer vision module to any connected gesture event listener, somewhat similar to the VRPN server [169]. VRPN is a VR periphery system that makes device differences between conventional UIs and trackers transparent to the clients.

The gesture event server component is currently implemented within the platform-specific WinTk, but plans are for its inclusion into the library in a platform-independent manner. The server opens a TCP/IP port (the default port is 7045) and runs the accept loop in its own thread, accepting at most five concurrent clients (an arbitrary limit). The blocking send commands are invoked from the main application thread; thus, client applications should read events promptly from their sockets.


The protocol is a unidirectional stream of events. Each event is a string of ASCII characters, delimited by a carriage return and a line feed. The current protocol version is 1.2, which has the following format (an example event follows the field list below):

    1.2 tstamp id: t, r, "posture" (x, y) [s, a]\r\n

where

• 1.2 is the protocol version number,

• tstamp is a long integer timestamp of the respective image capture time, in milliseconds starting with the first seen frame,

• id is an identifier for the object this event belongs to, currently fixed to 0,

• t is 1 if the object is being tracked, 0 otherwise,

• r is 1 if one of the key postures was recognized, 0 otherwise,

• posture is a string identifier of one of the six recognized postures, or the empty string "",

• x, y are the tracked location in relative image coordinates, with the image origin in the top left,


• s, a are currently unused but will eventually contain a scale identifier and a rotation angle.
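As mentioned above, a hypothetical event for a tracked right hand whose "open" posture was just recognized near the image center might read as follows; all field values, including the placeholders for the unused s and a, are invented for illustration.

    1.2 154230 0: 1, 1, "open" (0.52, 0.47) [0, 0]\r\n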

Open Sound Controller interface

The format of the Open Sound Controller (OSC) packets is very similar to the custom packet format described above:

    gesture_event, siiiisffff, tstamp, id, t, r, posture, x, y, s, a

Note that the OSC identifier is gesture_event and that the cryptic siiiisffff encodes the type information for all arguments: the following four arguments are integers, posture is a string argument, and the last four arguments are float numbers.

4.2.9 The vision conductor configuration file

For the HandVu application programmer and user who desires more control over the interface operation, the vision module's main settings are stored in and read from a configuration file. This file can be conveniently modified to fit specific needs. Due to the orchestrating nature of the settings, we termed it a "vision conductor" file. We will briefly describe its format and refer to the respective places in the dissertation that cover the details. The following is a typical example of a conductor configuration file.


    HandVu VisionConductor file, version 1.5
    camera calibration: -
    #camera calibration: config/FireFly4mm_calib.txt
    camera exposure: software
    #camera exposure: camera
    detection params: coverage 0.3, duration 0, radius 10.0
    tracking params: num_f 30, min_f 10, win_w 7, win_h 7, \
      min_dist 3.0, max_err 400
    tracking style: OPTICAL_FLOW_COLORFLOCK
    #tracking style: CAMSHIFT_HSV
    #tracking style: CAMSHIFT_LEARNED
    recognition params: max_scan_width 0.4, max_scan_height 0.6
    1 detection cascades
    config/closed_30x20.cascade
    area: left 0.6, top .2, right 0.94, bottom .84
    params scaling: start 1.0, stop 8.0, inc_factor 1.2
    params misc: translation_inc_x 2, translation_inc_y 3, \
      post_process 1
    0 tracking cascades
    1 recognition cascades
    config/all_hands_combined.cascade
    area: left 0.47, top .2, right 0.94, bottom .84
    params scaling: start 1.0, stop 8.0, inc_factor 1.2
    params misc: translation_inc_x 2, translation_inc_y 3, \
      post_process 0
    7 masks
    config/Lpalm.mask
    config/Lback.mask
    config/sidepoint.mask
    config/closed.mask
    config/open.mask
    config/victory.mask
    config/closed_30x20.mask

The backslash \ at line endings in the above printout indicates that there must not be a line break in the actual configuration file. All configuration settings must be present in the order shown above. Blank lines and comment lines, prefixed with a pound sign #, are ignored.

• camera calibration specifies whether a correction for lens distortion is to be performed and which file holds the calibration information. See Section 4.2.4 for details. A dash - indicates that no calibration is desired.

• camera exposure can be either camera or software and specifies whether the camera's automatic exposure control is to be used or the software-based, area-selective exposure control introduced in Section 4.2.2.

• detection params are three general settings pertaining to hand detection. The coverage specifies the relative amount of the masked hand area that has to have skin color, as determined with the fixed color histogram method from Section 5.7. The duration specifies for how many milliseconds a hand must be detected in every successive frame for it to be considered a match and a valid system initialization; a value of 0 prompts acceptance after only one frame. The radius parameter is only used for durations greater than 0 and delimits the radius in pixels within which subsequent hand detections must lie from the first one to be considered a match. The discussion in Section 5.10 explains when these settings might be helpful.

• tracking params are used exclusively for the Flock of Features tracking style and specify: num_f, the target number of features that is maintained; min_f, the minimum number of features that must be successfully tracked from one frame to the next before tracking is considered lost; win_w, the width of the search window for KLT features; win_h, the window height; min_dist, the minimum-distance flocking constraint; and max_err, the maximum area mismatch before a KLT feature is considered lost. All units but the last are in pixels. More details about the meaning of these parameters can be found in Chapter 6.

• tracking style determines the method to be used for tracking a once-detected hand: OPTICAL_FLOW_COLORFLOCK causes tracking with a Flock of Features, CAMSHIFT_HSV with CamShift based on a fixed HSV skin color distribution, and CAMSHIFT_LEARNED with CamShift based on a color distribution learned at detection time. Again, please see Chapter 6 for more.


• recognition params limit the maximum size of the area that is scanned for hand postures during tracking to the width and height specified through max_scan_width and max_scan_height, relative to the video size.

• n detection cascades is a list of length n of detector cascades and their detection parameters. The first line of a list entry points to a file that describes a detector cascade. In addition to all weak classifiers, each cascade file contains a textual identifier (a fanned detector contains multiple identifiers). This name is used for associating the correct masks (probability maps) and for giving detected appearances a name, for example when reporting detected postures (see Section 4.2.8). Specifics about the detection method and cascades can be found in Chapter 5. The remaining three lines of a list entry are described in the following.

• area defines a rectangular region that is to be scanned with the respective cascade, in relative coordinates.

• params scaling specifies the scales at which the respective cascade is to be scanned across the area. For example, a start scale of 1.0 is the minimum template resolution, a stop scale of 8.0 says to increase the scale incrementally while it is smaller than eight times the template resolution, and an inc_factor of 1.2 asks for scale increase steps of 20% over the previous size.

• params misc specifies the translation of the cascade during scanning in pixel-sized increments, both in the horizontal and the vertical dimension. The increments are for the smallest scale and are scaled with the cascade size thereafter. post_process can be 0 or 1, where 1 means that all intersecting matches found in a single frame are to be combined into a single rectangular area as suggested by Viola and Jones in [180], and 0 causes all individual matches to be reported. See Section 2.3.8 for more details on detector scanning.

• n tracking cascades is currently not used and n must be 0.

• n recognition cascades are the cascades used for recognizing different postures as described in Chapter 7. The list of cascades has the same format as for the detection cascades, but the area line is ignored; only params scaling and params misc are used.

• n masks are the names of n files that contain the hand pixel probability maps as described in Section 5.8. Each of these files contains a textual posture identifier that is used to match a map to its cascade, and a template-sized matrix of probabilities for the respective pixel to belong to the hand area.


A file that follows these specifications can be read with the LoadConductor API call. Upon successful parsing, the changes take effect immediately. A small sketch of how one such parameter line can be parsed follows.
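As an illustration of the line format only – not HandVu's actual configuration parser – the following hypothetical C++ snippet reads the "detection params" line shown in the example file above.

// Hypothetical sketch: parsing the "detection params" line of a vision
// conductor file. For illustration of the format only; this is not the
// parser used by HandVu itself.
#include <cstdio>

int main() {
  const char* line = "detection params: coverage 0.3, duration 0, radius 10.0";
  double coverage = 0.0, radius = 0.0;
  int duration_ms = 0;
  if (std::sscanf(line,
                  "detection params: coverage %lf, duration %d, radius %lf",
                  &coverage, &duration_ms, &radius) == 3) {
    std::printf("coverage=%.2f duration=%d ms radius=%.1f px\n",
                coverage, duration_ms, radius);
  }
  return 0;
}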

4.3 Vision system performance

The quality and usability of any vision-based interface is determined by four main aspects of the computer vision methods: speed, accuracy, precision, and robustness. In addition, the usability of the application interface is of course an important factor, but it is not considered here. While the main results of user studies and runtime data are reported in the following chapters, this section summarizes the performance as it pertains to the entire vision system.

The tracking component requires 2-18 ms processing time on a 3 GHz Xeon CPU. The combination of tracking, recognition, and color re-learning takes between 50 ms and 90 ms total time, with C++ code compiled at the -O2 optimization level. On a 1.13 GHz laptop, the respective times are 18-33 ms and 50-140 ms. The latency from frame capture to render completion time as reported by DirectShow is a few milliseconds higher.

Sheridan and Ferrell found a maximum latency of 45 ms between event occurrence and system response to be experienced as “no delay” [156]. While HandVu does not achieve that end-to-end latency for all methods in combination, together they are well below the 300 ms threshold at which interfaces start to feel sluggish, might provoke oscillations, and cause the “move and wait” symptom [156]. With the cameras that we used, the tracking always runs at capture rate (up to 15 Hz), while recognition is interlaced at 6-10 Hz, time permitting. In comparison to other mobile VBIs, our method is significantly more responsive than the Hand Mouse [100], judging from a video available on the authors' web site. The following chapters will detail the processing time for each of the vision module's components.

The combination of vision methods is generally robust to different environmental conditions, including different lighting, different users, cluttered backgrounds, and non-trivial motion such as walking. The methods are largely camera-independent and can cope with the automatic image quality adjustments of digital cameras. No performance degradation was observed even for severely distorting camera lenses. Two conditions will still violate the system's assumptions and might impact recognition and tracking negatively. First, an extremely over- or under-exposed hand appearance does not contain a sufficient amount of skin-colored pixels for successful detection. Second, if the color changes dramatically in between two consecutive successful posture classifications, the tracking degenerates into single-cue grey-level KLT tracking with flocking constraints. Since the system updates its color model periodically, however, it is able to cope with slowly changing lighting conditions.


The detection and posture recognition classifiers were trained with images taken with different still picture cameras, while the system was successfully tested with three different digital video cameras. In addition, none of the training images was shot with as short a focal length lens as our mobile camera has. These facts suggest that the entire system will run with almost any color camera available. No user calibration is necessary; the methods are largely person-independent.

4.4 Delimitation

Motions of the head-mounted camera are not treated explicitly. Only the AR application has independent means to obtain the extrinsic camera parameters (location and orientation). Given these parameters, it is also possible to transform the hand location from image coordinates into an absolute world reference frame, save for the distance from the camera, which is not available from HandVu.

The gesture recognition facilitated by the described system is in fact a more challenging achievement than recognition of events that have a temporal extent, such as hand waving. Frame-based methods can easily be extended to recognition of dynamic, continuous motions with an independent module that analyzes sequences of single-frame results. It is up to the application designer to include Hidden Markov Models or similar techniques to recognize dynamic gestures.

No 3D model of the hand was built, due to the time constraints of the fastest currently known parameterization methods. However, such a model might prove helpful in order to enforce posture consistency over time and, of course, to facilitate more fine-grained gestural commands. It is therefore a good candidate for a possible extension of this dissertation work.

None of the system's functionality explicitly detects or models hand occlusions. Yet brief occlusions of the tracked hand by foreign objects or the other hand generally do not cause all KLT features to be lost, and tracking might continue.



Chapter 5

Hand Detection

Hand detection for user interfaces must favor reliability over expressiveness: false positives are less tolerable than false negatives. Since detecting hands in arbitrary configurations is a largely unsolved problem in computer vision, the detector for HandVu allows reliable and fast detection of the hand in one particular pose from a particular view direction. Starting the interaction from this initiation pose is particularly important for a hand gesture interface that serves as the sole input modality, as it functions as a switch to turn on the interface: without it, with an always-on interface instead, any gesture might inadvertently be interpreted as a command. The output of the detection stage amounts to the extent of the detected hand area in image coordinates.

This chapter describes the methods used for HandVu's robust hand detector: a combination of an adapted Viola-Jones detection method and skin color verification. The particularities of hands for the purpose of reliable detection were researched. For training, a large set of hand images was collected in various configurations and views. Detector training was performed with an MPI-parallelized training program on Linux clusters. Different hand configurations and views were compared for their suitability to be detected before arbitrary backgrounds. The best one was chosen as the initialization posture for the vision system described in the previous chapter. The detection parameters were then optimized for this and other postures; in particular, the amount of in-plane rotation was tuned during training to allow for detection of the widest range of in-plane rotations. Another training parameter modification reduced the training time, yet another increased detection performance. Lastly, a new rectangular feature type that allows comparison of non-adjacent areas was conceived and its superior performance on hand appearances demonstrated.

5.1 Data collection

We collected over 2300 images of the right hands of ten male and female students with two different digital still cameras. The pictures were taken indoors and outdoors with widely varying backgrounds and lighting conditions, but without direct sunlight on the hands. The rectangular bounding boxes of the areas containing hand posture appearances were manually marked and rotated to a standard orientation. Figure 5.1 shows five examples for each of the six postures for which we trained detectors.

Figure 5.1: Sample areas of the six hand postures (closed, sidepoint, victory, open, Lpalm, Lback):
they are shown in the smallest resolution necessary for detection (25x25 pixels).

The posture closed is a flat palm with all fingers extended and touching each other; open is the same but with the fingers spread apart. Sidepoint is a pointing posture with only the index finger extended, seen from the thumb side. The victory or peace posture has index and middle finger extended. The “L” posture involves an abducted thumb and an extended index finger and can be seen from the Lpalm side and the Lback side of the hand. Two additional gestures were investigated but no detectors were trained for them: the grab gesture is suited to picking up coffee mugs, seen from the top, and the fist posture is viewed from the back of the hand.

Table 5.1: The hand image data collection:
the number of training images and the bounding box ratios (width over height) for each posture. Template size and bounding box ratio determine the template resolution along the vertical and horizontal dimensions.

                           closed  sidepoint  victory  open  Lpalm  Lback
  number of images            389        331      341   455    382    433
  bounding box ratio (w/h) 0.6785        0.5      0.5   1.0    0.9    0.9

The rectangular areas had different but fixed aspect ratios for each of the postures (Table 5.1). Since we wanted uniform template sizes for all postures for better comparability, this resulted in varying resolutions for the interpolation step. For example, the posture sidepoint with a template of size 25 by 25 pixels has twice the sample density along the horizontal dimension compared with its resolution in the vertical dimension. Similarly, during matching of each detector, different scale factors have to be applied. The effect of this is covered in Section 5.4 of this chapter.

The non-cascaded detectors were trained with more than 23000 negative examples: randomly selected areas from the pictures containing the hand images, but not intersecting the hand areas. To avoid over-training, AdaBoost was performed on one half of the hand images and error-rate validation on the other half. For the cascaded detectors, 180 random images not containing hands were scanned to periodically increase the negative training set, as explained in Section 2.3.8 and in [180]. Again, half of them were added to the training set and the other half was used for validation.

5.2 Parallel training with MPI

Our implementation of the heavily compute- and memory-intensive AdaBoost for a Viola-Jones detection method is a processor-scalable parallel program. It uses MPI (Message Passing Interface) for remote process instantiation and communication. It was run on two Linux clusters, one with 16 nodes and one with 32 nodes and two CPUs per node. The workload was split such that each CPU evaluated some instances of a single feature type on all examples. The disadvantage compared with splitting across the examples¹ is that every CPU needs all examples for processing, which can require a large amount of memory. For most of the experiments, however, the image areas in question fit into each CPU's 2 GB memory, almost eliminating any performance penalty. On the other hand, the advantage is that much less information needs to be communicated during synchronization (which occurs with the same frequency in both cases, once for each weak classifier).

Each process can determine the feature instance with the smallest cumulative error over all images and needs only to send this information back to a root process. If the work were split across image examples, every feature instance's partial error sum would have to be communicated, and there are hundreds of thousands of feature instances to be considered.

¹ Michael Jones has implemented his improved training method in that manner [private conversation].
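A minimal sketch of this communication pattern – not the dissertation's actual training code – is shown below. It assumes each MPI rank has already computed the cumulative error of its best locally evaluated feature instance; the placeholder values are hypothetical.

// Minimal sketch of the reduction described above: each rank evaluates its
// share of feature instances on all examples, and only the best
// (error, index) pair is sent to the root process per weak classifier.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Assume this rank already found the smallest cumulative error among its
  // assigned feature instances (the values here are placeholders).
  struct { double error; int index; } local, best;
  local.error = 0.1 * (rank + 1);   // placeholder cumulative error
  local.index = rank * 1000;        // placeholder global feature instance id

  // One MINLOC reduction replaces communicating every instance's
  // partial error sums.
  MPI_Reduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, 0, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("best feature instance %d, error %f\n", best.index, best.error);
  MPI_Finalize();
  return 0;
}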

Depending on the feature types utilized, and particularly on the desired ratio of negative to positive example images, training for one detector took between a few hours and two days.

5.3 Classification potential of various postures

Hand appearances – the combinations of postures and the directions from which they are viewed – differ in their potential for classification from background and other objects, their “detectability.” In order to pick the appearance with the best separability from background (that is, the one that allows detectors to achieve the best performance), one could train a detector for each combination and analyze their performance a posteriori. As previously mentioned, training for the Viola-Jones detection method takes far too long to explore all possible hand posture and view combinations for their suitability for detection.

This section presents a frequency analysis-based method for instantaneous estimation of class separability, without the need for any training, based on only a few training images for each posture. The receiver operating characteristics of detectors for our postures confirm the estimates. This estimator contributes to a systematic approach to building an extremely robust hand appearance detector, providing an important step towards easily deployable and reliable vision-based hand gesture interfaces. This research was also published in [92].

5.3.1 Estimation with frequency spectrum analysis

We investigated eight postures from fixed views, which were selected based on their different appearances and because they can be performed easily. A prototypical example for each posture is shown in Figure 5.2.

Figure 5.2: Mean hand appearances and their Fourier transforms:
larger s-values (see Equation 5.4) indicate more high-amplitude frequency components being present, suggesting better suitability for classification from background. The bottom row of images shows the “artifact-free” Fourier transforms. The s-values are: closed 0.435339, sidepoint 0.38612, victory 0.323325, open 0.391111, Lpalm 0.335228, Lback 0.315761, grab 0.263778, fist 0.202895.

The separability of two classes depends on many factors, including feature dimensionality and the method of classification. In particular for AdaBoost classifiers, it is desirable to predict a priori the potential for successful classification of hand appearances from background, due to the detector's computationally expensive training phase. The estimator presented here is the first method to approximate the performance of the Viola-Jones detection method. It is based on the intuition that appearances with a prominent pattern can be detected more reliably than very uniformly shaded appearances. The advantage of the estimator is that it only requires a single prototypical example of the positive class. There is no need for an explicit or formal representation of the negative class, “everything else.”

We collected up to ten training images of each of the eight hand postures from similar views and computed their mean image (top row in Figure 5.2). Due to limited training data for the fist posture we took only one image and manually set non-skin pixels to a neutral grey. The areas of interest were resized and rescaled to 25x25 pixels (see Table 5.1). The higher-frequency components of a Fourier transform describe the amount of grey-level variation present in an image – exactly what we are looking for. However, the transformation F (Equation 5.1) introduces strong artificial frequencies, caused by the image's finite and discrete nature. Choosing a power-of-two image side length would avoid some artifacts; however, we did not want to deviate from the template size that the detectors would be built for.


F(u, v) = \frac{1}{25 \cdot 25} \sum_{m=0}^{24} \sum_{n=0}^{24} I(m, n) \, e^{-i 2\pi \left( \frac{m u}{25} + \frac{n v}{25} \right)}   (5.1)

The Fourier transform P of a neutrally colored 25x25-sized image patch is therefore subtracted from F. This ensures that frequencies resulting from image cropping are eliminated, yielding an artifact-free difference-transform D:

D(u, v) = \log \left| F(u, v) - P(u, v) \right| ,   (5.2)

where

P(u, v) = \frac{1}{25 \cdot 25} \sum_{m=0}^{24} \sum_{n=0}^{24} \frac{1}{2} \, e^{-i 2\pi \left( \frac{m u}{25} + \frac{n v}{25} \right)} .   (5.3)

In the last step (Equation 5.4), the sum of all frequency amplitudes is computed, normalized by the Fourier transform's resolution. This sum is the sought-for estimator, giving an indication of the amount of appearance variation present in the image:

s = e^{\frac{1}{k} \sum_{u,v} D(u, v)} .   (5.4)
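For concreteness, the following sketch computes the estimator directly from Equations 5.1-5.4. It assumes a 25x25 grey-level mean image with values in [0, 1], a neutral grey of 1/2 as in Equation 5.3, and k equal to the number of frequency components (25*25, i.e. the transform resolution); it is an illustration, not the original implementation.

// Sketch of the s-value estimator (Equations 5.1-5.4). Assumes a 25x25
// grey-level mean image in [0,1], neutral grey 1/2, and k = 25*25.
// A direct O(N^4) DFT is fast enough at this small size.
#include <cmath>
#include <complex>

const int N = 25;
const double PI = 3.14159265358979323846;

static std::complex<double> Dft(const double img[N][N], int u, int v) {
  std::complex<double> sum(0.0, 0.0);
  for (int m = 0; m < N; ++m)
    for (int n = 0; n < N; ++n)
      sum += img[m][n] *
             std::polar(1.0, -2.0 * PI * (m * u + n * v) / double(N));
  return sum / double(N * N);                        // Equation 5.1
}

double SValue(const double img[N][N]) {
  double neutral[N][N];
  for (int m = 0; m < N; ++m)
    for (int n = 0; n < N; ++n) neutral[m][n] = 0.5;  // patch for Equation 5.3

  double sum_d = 0.0;
  for (int u = 0; u < N; ++u)
    for (int v = 0; v < N; ++v)
      // D(u,v) = log |F(u,v) - P(u,v)|; the tiny epsilon only guards
      // against log(0) when the two transforms coincide exactly.
      sum_d += std::log(std::abs(Dft(img, u, v) - Dft(neutral, u, v)) + 1e-12);
  return std::exp(sum_d / double(N * N));             // Equation 5.4, k = N*N
}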

The bottom row in Figure 5.2 presents the postures' artifact-free Fourier transforms D, annotated with s, the sums of their log amplitudes over the entire frequency spectrum. The sums' absolute values have limited meaning; they are to be regarded in relation to each other. As expected after visual inspection, the closed hand appearance has the largest amount of grey-level variation, reflected in a high amplitude sum. The fist, being mostly a uniformly grey patch, has the least amount of appearance variation and thus also a low s-value.

In the following section, a comparison of the estimates with actual detectors' performances will confirm the hypothesis that appearances with larger s-values can be detected more reliably. Computing s-values therefore alleviates the need for the compute-intensive training of many detectors in order to gauge their performance potentials.

5.3.2 Predictor accuracy

To evaluate predictor accuracy, we built detectors with unmodified AdaBoost, which produces a single set of weak classifiers for each detector. Cascaded detectors – composed of multiple, staged sets of weak classifiers – are covered in the following sections; their performance also follows the estimator's prediction. Here, the three traditional feature types (see Figure 2.1 on page 30) were used.

The detectors were evaluated for their false positive rates by scanning a test set of 200 images of varying sizes not containing hands, some obtained from a web crawl and some taken in and around our lab. Note that the false positive rate is relative to all detector evaluations, and that 355,614 evaluations are required to scan a VGA-sized image (see Section 2.3.8).

Figure 5.3: ROC curves for monolithic classifiers:
the curves for all hand postures are shown, trained on integral images with 25x25 pixel resolution. Each of the six detectors consists of 100 weak classifiers. The x-axis is in log scale.

Results: The receiver operating characteristic (ROC, see Trees [175]) curves in Figure 5.3 show the results of evaluating the six detectors. The posture closed fares much better than its competitor hand postures, in that it achieves a higher detection rate for a given false positive rate. This is in line with the prediction of the spectrum-analysis estimator. The sidepoint posture does second-best for high detection rates, but then deviates from the prediction. We will later see, however, that it again does comparatively much better for very low false positive rates with the cascaded detector. Another prediction failure can be observed for the Lback and Lpalm curves: the more structured Lpalm appearance should achieve better class separability. Again, the more expressive features in the cascaded detector actually do bring out this advantage and are in line with the prediction.

5.4 Effect of template resolution

Before training detectors with more expressive but also more expensive feature types (on the order of two magnitudes more computational effort during training), we wanted to make sure the integration templates did not contain any redundant information. Therefore, we varied the size of the template area for the best-faring appearance, hoping for a resolution reduction without sacrificing accuracy. The impact of different integral image resolutions on a monolithic detector can be seen in Figure 5.4. These results are also published in [92].

Figure 5.4: ROC curves for different template resolutions:
ROC curves for the closed posture detector with 100 weak classifiers. The detectors with higher resolution in the horizontal (35x35 and 30x20) outperform the other two.

Unsurprisingly, the finest-resolution integral (35x35 pixels) achieves the best performance. Remember that the observed image area is constant; only the sampling resolution differs. But higher resolution in the vertical dimension contributes little to this improvement, as witnessed by the lower detection rates of the 20x30 curve. On the other hand, the 30x20 curve has high resolution along the dimension for which the estimator's frequency analysis showed more high amplitudes – see the bright horizontal extent in the frequency image for the closed posture in Figure 5.2. This seems to enable the detector to capitalize much more on appearance peculiarities and rewards us with detection rates comparable to the highest-resolution detector.

It is interesting to note that the detector with 30x20 templates performs better for low false positive rates, while the 20x30 resolution performs better for higher false positive rates. We speculate that the stretch in the vertical produces large, uniform areas that allow for easy distinction between hands and many other appearances. However, the lack of horizontal resolution compresses away the fine finger structures that are required for separation from most other appearances.

5.5 Rotational robustness

The research described in this section, published in [90], analyzes the in-plane rotational robustness of the Viola-Jones detection method when used for hand appearance detection. This is necessary because the object detection method is not inherently invariant to in-plane object rotations. When trained with only strictly aligned data and then used for gesture interfaces, it would require the users to perform very precise gestures – a daunting task with a head-worn camera. Viola and Jones' face detectors handle about 30 degrees of in-plane rotation of frontal and profile views, 15 degrees in either direction [72]. However, we found detectors for hands to be much more sensitive to in-plane rotations. This prompted the research presented here.

Viola and Jones recently extended their method to detect objects exhibiting arbitrary in-plane rotations as well as side views of faces [72]. This extension requires additional effort algorithmically, during training, and during detection: in a first stage of classification, implemented with a decision tree, one of twelve detectors is selected. Each of these handles detection of faces within about 30 degrees of in-plane rotation. While this approach is still very fast, it adds training time and about doubles detection time.

Similarly, we investigated detection of in-plane rotations of various hand postures. However, our focus was not on covering the entire 360-degree range of rotations. Instead, we wanted to increase each detector's range of detected rotations without adding any computational overhead and without negatively affecting the false positive rate, that is, without incurring a performance penalty. Objects other than faces have different appearance characteristics that warrant specific treatment. This is motivated in the following section.


5.5.1 Rotation baseline

First, a baseline was established against which the subsequent results could be compared. For the closed posture, both the training and validation sets were rotated by various amounts around the image area's center. Then, one detector was trained for each angle. Consistent parameters for the training caused equally-complex cascade stages throughout all experiments in this section. The evaluation (Figure 5.5) shows that there are no large differences in the accuracy of the detectors, especially for low false positive rates. Establishing this baseline is important because some rotations could be intrinsically harder to detect than others – these experiments dismiss this possibility.

5.5.2 Problem: rotational sensitivity

To demonstrate the sensitivity of the detection method when used for hand appearances, a detector that had been trained on well-aligned examples was tested for its accuracy. In contrast to the detector's application to face detection, for hands it achieved poor accuracy on test images rotated by as little as 4 degrees (Figure 5.6). The performance decrease is roughly symmetric for clockwise (negative angles) and counter-clockwise (positive angles) rotations.

A second set of experiments shows that this is not caused by peculiarities of the unrotated appearance of the particular hand posture. Eight detectors were built, each trained on examples rotated by a certain, fixed amount. They were then tested with examples rotated randomly between 0 and 15 degrees. The results in Figure 5.7 demonstrate their high rotational sensitivity in contrast to a detector (top curve) that was trained on examples that also exhibited varying degrees of rotation. The difference is even larger for false positive rates below 10⁻⁴.

Figure 5.5: ROC curves for various training data rotations:
these curves constitute the baseline for our experiments and are from detectors trained and evaluated on the same rotation angle.


The data in Figure 5.8 stems from a very similar experiment, differing only in that the detectors were evaluated on a test set whose examples exhibited rotations by discrete amounts, not random on a continuous scale. The graph shows that smaller deviations in rotation from the training data achieve better detection rates: detectors trained for angles “in the middle” of the rotation spectrum, 6 and 9 degrees in particular, fare better than those trained on angles 0 and 15.

5.5.3 Rotation bounds for undiminished performance

The objective of this set of experiments was to determine the angles by which the training examples could be rotated while still achieving good detection performance on the equally-rotated test set. Four repetitions of the original training set for the closed posture were rotated by 0, 15, 30, and 45 degrees, respectively, and joined into one large training set. The Viola-Jones detection method over time keeps the positive examples that are reliably detectable, while it successively ignores those that would require an unacceptably high false positive rate. The experiment's assumption is that well-detectable examples will be retained and all others sacrificed in order to achieve a low false positive rate. The evaluation in Figure 5.9 shows this effect. It suggests that the examples with 0 and 15 degrees of rotation are more consistently recognizable than those with 30 and 45 degrees, the latter ones being sacrificed to achieve a low false positive rate. Therefore, the bounds for rotating the training examples were set to within 15 degrees.

5.5.4 Rotation density of training data

Next, we were interested in the influence that different rotation angle densities have on training and detection performance. Three detectors were trained; their training and validation sets contained examples rotated in varying steps: A = {0, 5, 10, 15}, B = {0, 3, 6, 9, 12, 15}, and C = {0..15} with random angles. They consisted of 198, 190, and 239 weak classifiers, respectively. The detectors were evaluated on examples randomly rotated between 0 and 15 degrees.

No significant accuracy variation can be observed in Figure 5.10, leading to the conclusion that detector accuracy is not affected by the rotation angle density for step sizes of 5 degrees or less. This is an important result because wider steps allow for fewer training examples, reducing both the data collection effort and the computational training cost.

5.5.5 Rotations of other postures

Finally, we confirmed the applicability of the main results that we had obtained for the closed posture to the other five postures, shown in Figure 5.11. Plotted in Figure 5.12 are the detection rates of detectors built with rotated training sets (0-15 degrees random) divided by those of detectors built with unrotated training sets. Both were evaluated with a test set with all examples rotated by 15 degrees. The detectors trained on rotated examples achieve at least equal performance, and for low false positive rates they outperform the detectors trained on fixed examples by about one order of magnitude. They also have a lower minimum false positive rate while still detecting some hand appearances.

5.5.6 Discussion

The number of weak classifiers required for a certain accuracy did not differ significantly between detectors trained on rotated and unrotated training images. Since consistent training parameters (number of weak classifiers per cascade stage and their accuracy) had been used for all detectors, the resulting detection speed of detectors for 0-15-degree-rotated images is about equal to that of detectors that detect unrotated images only.

The results presented in this section pertaining to rotational robustness are likely to generalize to other objects because the surveyed hand appearances exhibit very different characteristics, such as their convexity (open versus closed), their texture variation (Lback versus Lpalm), and the background-to-foreground ratio (closed versus sidepoint). As detailed in Section 5.5.2, presenting training images that are rotated within these bounds is crucial to good accuracy for object appearances other than faces.

In summary, the result is that only about 15 degrees of rotation can be efficiently detected with one detector, in contrast to the method's performance on faces (30 degrees total). The difference from faces stems from the hand's smaller features (fingers) being more sensitive to correct alignment during training, as well as from less inter-person appearance variation for a certain posture and view. Most importantly, the training data must contain rotated example images within these rotation limits. Detection rates on rotated appearances then improve by about one order of magnitude without algorithmic modifications. This also has no negative impact on detection speed. These results are consistent for a number of hand postures and appearances. The implications of the results include both savings in training costs and increased naturalness and comfort of vision-based hand gesture interfaces.

We employed the improved detectors in our mobile vision interface and can report better and faster initialization due to the more natural and less rigid hand postures required for detection.


Figure 5.6: ROC curves showing the rotational sensitivity:
the classifiers in this figure were trained on unrotated training images and evaluated on test images rotated by various angles. There is a sharp decrease in detection accuracy for in-plane rotations of 4 degrees or more. Note the symmetry for rotations to the left and right. Also note the scale of the y-axis; unlike in the other graphs it starts at 0.5.


Figure 5.7: ROC curves for detection of randomly rotated images:
the classifiers were trained for the stated angle and evaluated on a randomly rotated test set. None of the fixed-angle detectors achieves accuracy close to that of the detector trained for various angles.


Figure 5.8: ROC curves for detection of discrete-rotated images:
these classifiers were trained for the stated angle and evaluated on rotated examples with various angles. The detector favors angles “in the middle.”


Figure 5.9: ROC curves for the bounds of training with rotated images:
the curves show that a detector created on a training set with multiple rotations does not treat all angles equally. Instead, examples rotated by 30 degrees and 45 degrees are more likely to be dropped in favor of examples with smaller rotations.


Figure 5.10: ROC curves for different rotation steps:
the classifiers were trained for various rotational densities and evaluated on examples randomly rotated between 0 and 15 degrees.


Figure 5.11: The six hand postures and rotated images:
shown are typical images of the postures in 25x25 pixel resolution; the bottom row is rotated by 15 degrees. From left to right: closed, open, sidepoint, victory, Lpalm, and Lback.


Figure 5.12: Overall gain of training with rotated images:
shown is the ratio of detection rates for “trained on rotated” over “trained on unrotated,” evaluated on areas rotated by 15 degrees. There are no data points where the unrotated detectors have a detection rate of zero.



5.6 A new feature type

In this section, we show that the particular choice of feature types influences the relative detectability of hand appearances. For each posture, a cascaded detector was trained that could select its weak classifiers from a set of four feature types – instead of from only the three types that were used by Viola and Jones [180] and are shown in Section 2.3.8. The novel feature type, called the “Four Box” feature, is a comparison of four rectangular areas. This type is similar to the “diagonal” filters proposed in an extension of their work in [72]; however, our filters can compare non-adjacent rectangular areas. During training, the areas can move about relative to each other with few strings attached, even partially overlapping each other; only their sizes are restricted. These more powerful features allow the detector to achieve better accuracy, as demonstrated in Figure 5.15.

5.6.1 Four Box feature instance generation

It is important to note the difference between a feature type and a feature instance. The feature type describes general properties, while a feature instance is one particular constellation of rectangular areas that has these properties. Instances can differ in the size of their rectangular areas, the location of the areas in the sample images, or both. An example type would be described by “two rectangular areas of identical size that share one vertical edge.” An example instance of this type would be the area that contains the pixels (x=5, y=8) and (5, 9) and its adjacent area with pixels (6, 8) and (6, 9). During AdaBoost training, all possible weak classifiers must be tested on all sample images. This requires that all possible feature instances for all feature types are evaluated on every image. Since this pool of feature instances is large and instance generation is cheap, we do not cache instances but instead generate them anew at every iteration of AdaBoost.

The following is an algorithmic description of how all instances of the Four Box feature type are generated sequentially.

First a brief definition: a rectangular area is described by four coordinates, (left, top, right, bottom). It is defined to contain pixels (x, y) with left < x ≤ right and top < y ≤ bottom. Thus, the area with coordinates (-1, -1, 0, 0) contains exactly one pixel, the one at (0, 0). This is the actual leftmost and topmost pixel in a sample image.
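This coordinate convention can be checked with a few lines of code, and it maps directly onto the usual integral-image lookup of Section 2.3.8 (the -1 coordinates index a zero padding row and column). The helper names below are purely illustrative.

    import numpy as np

    def contains(area, x, y):
        # (left, top, right, bottom): (x, y) is inside iff left < x <= right and top < y <= bottom.
        left, top, right, bottom = area
        return left < x <= right and top < y <= bottom

    def num_pixels(area):
        left, top, right, bottom = area
        return (right - left) * (bottom - top)

    def area_sum(ii, area):
        # ii[y + 1, x + 1] holds the sum of all pixels (x', y') with x' <= x and y' <= y.
        left, top, right, bottom = area
        return (ii[bottom + 1, right + 1] - ii[top + 1, right + 1]
                - ii[bottom + 1, left + 1] + ii[top + 1, left + 1])

    patch = np.arange(25 * 25, dtype=float).reshape(25, 25)     # arbitrary test patch
    ii = np.zeros((26, 26))
    ii[1:, 1:] = patch.cumsum(axis=0).cumsum(axis=1)

    assert contains((-1, -1, 0, 0), 0, 0) and num_pixels((-1, -1, 0, 0)) == 1
    assert area_sum(ii, (-1, -1, 0, 0)) == patch[0, 0]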

The starting point is the first, initial instance of the Four Box feature type, together with the width and height of all sample images (also called the template resolution). The initial instance is shown on the left in Figure 5.13. The leftmost and topmost edges (numbers 5 and 11) are at the pixel coordinates -1 in x and -1 in y; edges 4 and 10 are at coordinates 0, edges 3 and 9 at coordinates 1, and so forth.


In every iteration of the sequential generation of all instances, one or more edges are moved. To avoid repeated construction of the same instance, an order is defined on the edges, represented by their numbers in Figure 5.13. Edges with smaller numbers are moved first. If an edge would move beyond the size of the sample images, the next-higher numbered edge is moved. After an edge has been moved, all its dependent edges are reset to a coordinate that is in a fixed relation to its ancestor. Edge dependency is indicated through an arrow: dependent edges connected through a double-lined arrow are placed at the same coordinate as their ancestor, and those connected through a solid arrow are placed one coordinate further than the ancestor edge. For example, after placing the leftmost and topmost edges, all other edges are updated because – directly or indirectly – they all depend upon those two edges’ locations. This results in the initial instance; all four rectangular areas are 2 pixels wide and 2 pixels high.

All further feature instances are created successively. To obtain the next instance from a certain instance, the following procedure is applied (a generic code sketch of this odometer-like enumeration follows the list).

• The edge numbered 1 is moved one pixel to the right. If this is a valid coordinate, this is the next instance.

• If the coordinate is not valid and instead exceeds the template area’s dimensions, the right edge of the topmost box (numbered 2) is moved one pixel to the right.


[Figure 5.13 drawing: the edge numbers 1-11 label the rectangle edges of the two feature types; “+1”, “==”, and “<” mark the dependency and ordering relations between edges.]

Figure 5.13: The Four Box and Four Box Same feature types:
the Four Box feature type is shown on the left and the Four Box Same feature type is shown on the right. Note that the two less-than conditions must always be met, while the “+1” and equality dependencies are only enforced when the respective ancestor edge is moved. In the right feature type, the width and height of all boxes is the same.

• All edges that depend on edges numbered 2 are updated; those are the right edge of the bottom box (which is set to the same pixel) and the right edge of the rightmost box (which is set one pixel further to the right).

• If the rightmost box’s right edge has a valid pixel coordinate, this is the next instance.

• If the coordinate is not valid, edge 3 is moved one pixel to the right.

• All edges depending on edge 3 are updated; those are edges numbered 2 and 1.


• ... and so on for edges 4 and 5.

• If (after moving edge 5 and updating all dependent edges) the edge numbered 1 is on a valid pixel, this is the next instance.

• If edge 1 is not on a valid pixel, edge number 6 is moved downwards one pixel and edge number 5 is reset to coordinate -1. All dependent edges are updated.

• If edge 6 is on a valid pixel, this is the next instance.

• If edge 6 is not on a valid pixel, edge 7 is moved down one pixel and all dependent edges (number 6) are updated.

• ... and so on for edges 8, 9, 10, and 11.

• Eventually, after moving down edge 11 and updating all dependent edges, edge 6 will not be on a valid pixel coordinate. This indicates that all instances have been produced and no new ones can be created.
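The control flow above is essentially an odometer with resets: advance the lowest-numbered edge, and when it overflows, advance the next-higher edge and re-place its dependents. The generator below sketches only this generic pattern; the concrete edge dependencies, reset rules, and validity conditions of Figure 5.13 would have to be supplied in reset_dependents and are not reproduced here, so neither the exact ordering nor the instance counts of the actual Four Box type are claimed.

    def enumerate_instances(limits, reset_dependents):
        # limits[i] is the largest valid coordinate of edge i;
        # reset_dependents(positions, i) re-places every edge that depends on edge i.
        n = len(limits)
        positions = [-1] * n
        for i in reversed(range(n)):         # produce the initial instance
            reset_dependents(positions, i)
        while True:
            yield tuple(positions)
            i = 0
            while i < n and positions[i] + 1 > limits[i]:
                i += 1                       # edge i would overflow; try the next-higher edge
            if i == n:
                return                       # even the highest edge overflowed: enumeration done
            positions[i] += 1
            reset_dependents(positions, i)   # moving an edge resets all of its dependents

    # Trivial smoke test: three independent "edges" in a 5-pixel template, no dependencies.
    first = next(enumerate_instances([4, 4, 4], lambda p, i: None))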

5.6.2 Four Box Same feature type

The number of instances that the Four Box feature type can produce is immense (15,144,529,400 ≈ 1.5 ∗ 10^10 instances in a 25x25 template), and AdaBoost’s exhaustive search component takes many hours to create a single weak classifier, even on a cluster with over 60 processors. A simplified version of this feature type was therefore conceived, called “Four Box Same.” It produces only 8,233,632 ≈ 8.2 ∗ 10^6 instances in a 25x25 template and cuts computation time by more than three orders of magnitude. Three additional constraints are enforced: first, all rectangles have the same size in any particular instance. Second, rectangles move in pairs from one instance to the next. The circled numbers in the right drawing of Figure 5.13 indicate which edges move concurrently. Third, the “less-than” condition is enforced for every instance. If it is violated after an edge labeled n has been updated, the instance is considered invalid and the edges numbered n + 1 are moved (prompting their dependent edges to be updated in turn).

The procedure to get the next instance is slightly different for the Four Box Same feature type. As before, edges are dependent upon edges with higher numbers. However, updating dependent edges happens in a different manner: the new edge location is not a function of the ancestor edge’s location but instead is always set to a fixed initial value. This value is indicated by the numbers on the arrows. Together, these conditions allow for instances as they are shown in Figure 5.14. Note that by adding and subsequently subtracting partial rectangles, fairly irregular areas can be compared. Also note that the rectangular areas need not be adjacent to each other as for Jones’ and Viola’s feature type in [72].


Figure 5.14: Example instances of the Four Box Same feature type:
the framed areas are subtracted from solid-black areas.

5.6.3 Results

The relative performance of detectors for different postures stays roughly the same, even though the curves are not as smooth as with non-cascaded detectors due to the staged cascading and the resulting evaluation method (details in Section 2.3.8 and [180]). Of particular interest are the left parts of the curves since a fail-safe hand detection for vision-based interfaces must be on the conservative side with very few false positives. There, the cascaded detectors show ROCs along the lines of the performance predicted in Section 5.3.1: closed outperforms all others, sidepoint is second-best, and the more structured appearance Lpalm now does better than the more uniform Lback.

Extrapolating from the results of this study, we suggest that mostly convex appearances with internal grey-level variation are better suited to the purpose of detection with the Viola-Jones detection method. The open posture, for example, already has a lower Fourier structure “s” value, hinting that background noise hinders extraction of consistent patterns. The detector’s accuracy confirms the difficulty of distinguishing hands from other appearances.

[Figure 5.15 plot: ROC curves, detection rate (0.75-1.0) versus false positive rate (log scale), for the detectors closed (30x20), sidepoint (20x25), victory (25x25), open (25x25), Lpalm (25x25), and Lback (25x25).]

Figure 5.15: ROC curves for detectors with Four Box Same features:
this feature type is less constrained and the areas need not be adjacent. Note that the scale on the y axis is different from previous figures.

The final hand detector that we chose for our application detects the closed posture. For scenarios where we desire fast detection, we picked the parameterization that achieved a detection rate of 92.23% with a false positive rate of 1.01 ∗ 10^−8 in the test set, or one false hit in 279 VGA-sized frames. For most scenarios it is sufficient, however, to pick a parameterization that had a detection rate of 65.80%, but not one false positive in the test set. The high frame rate of the algorithm almost guarantees that the posture is detected within a few consecutive frames.
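As a rough consistency check on these numbers (the sub-window count per frame is back-derived here, it is not stated in the text):

    fp_rate = 1.01e-8                    # false positives per scanned sub-window (test set)
    frames_per_false_hit = 279           # reported for VGA-sized frames
    windows_per_frame = 1 / (fp_rate * frames_per_false_hit)
    print(round(windows_per_frame))      # roughly 3.5e5 scanned sub-windows per 640x480 frame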

5.7 Fixed color histogram

Upon detection of a hand area, it is tested for the amount of skin-colored pixels that it contains. To this end, we built a histogram-based statistical model in HSV space from a large collection of hand-segmented pictures from many imaging sources, similar to Jones and Rehg’s approach [73]. We used a histogram-based method because such methods achieve better results in general, user-independent cases. If a sufficient fraction of the area’s pixels is classified as skin pixels, the hand detection is considered successful and control is passed to the second stage. This coverage threshold can be set in the vision conductor configuration file, see Section 4.2.9. For good performance in this step, the hand must not be vastly over- or under-exposed. The software exposure control can correctly expose a selective area in the video (see Section 4.2.2).
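A minimal sketch of this test, assuming a pre-computed HSV histogram that stores per-bin skin probabilities; the bin counts, the 0.5 cut-off, and the default coverage threshold are illustrative values, not HandVu’s actual parameters (the HSV channels are assumed to be scaled to 0-255).

    import numpy as np

    def skin_coverage(hsv_patch, skin_hist, bins=(32, 32, 16)):
        # hsv_patch: detected area in HSV (uint8); skin_hist: H x S x V array of P(skin | bin).
        h = np.minimum(hsv_patch[..., 0].astype(int) * bins[0] // 256, bins[0] - 1)
        s = np.minimum(hsv_patch[..., 1].astype(int) * bins[1] // 256, bins[1] - 1)
        v = np.minimum(hsv_patch[..., 2].astype(int) * bins[2] // 256, bins[2] - 1)
        skin_prob = skin_hist[h, s, v]               # per-pixel skin probability
        return float(np.mean(skin_prob > 0.5))       # fraction of pixels labeled as skin

    def verify_detection(hsv_patch, skin_hist, min_coverage=0.3):
        # The detection counts as a match only if enough of the area is skin colored.
        return skin_coverage(hsv_patch, skin_hist) >= min_coverage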

The “area of the hand” for this and other postures is defined with the help of a probability map. These maps have the same scale and resolution as the corresponding detectors and state for every pixel the probability that it belonged to the hand versus to the background in the training data.

No grey-level fiducials or colored markers are employed, and still a good detection accuracy is achieved. This is possible due to the multi-cue integration of texture and skin color. Unlike in the tracking method explained in the following chapter, both modalities must report a positive match for the detection to be successful. This makes sense as a false positive is potentially more harmful than a false negative, assuming that subsequent frames will eventually correctly detect the initialization posture.

5.8 Hand pixel probability maps

The grey-level appearance-based hand detector finds rectangular areas that contain hands. Not every pixel within those areas is likely to belong to a hand, however, and some pixels will belong to the background. This spatial probability distribution is defined for each posture and estimated from training images. Figure 5.16 shows the probability maps for six postures. A brighter pixel indicates a higher probability that the respective pixel in the detected area belongs to the hand, that is, is of skin color.


The maps were constructed by averaging a number of grey-level training images. Since the hand is usually brighter than the background area, pixels belonging to the hand showed up in the mean image as brighter pixels. This mean image was normalized to have values between zero and one. Areas that have skin color but are darker, for example, those between two adjacent fingers, were manually set to high probability values. Similarly, high-value pixels that were known to be from the background were set to low values.
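A minimal sketch of this construction, assuming a stack of aligned grey-level training patches and optional hand-drawn correction masks; the variable names and the 0.95/0.05 correction values are illustrative.

    import numpy as np

    def build_probability_map(patches, force_high=None, force_low=None):
        # patches: N x 25 x 25 array of aligned grey-level training images.
        mean_img = patches.mean(axis=0)
        prob_map = (mean_img - mean_img.min()) / (mean_img.max() - mean_img.min() + 1e-9)
        if force_high is not None:        # e.g. dark skin areas between adjacent fingers
            prob_map[force_high] = 0.95
        if force_low is not None:         # bright pixels known to belong to the background
            prob_map[force_low] = 0.05
        return prob_map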

Figure 5.16: The probability maps for six hand postures:
the maps are shown in 25x25 pixel resolution. A brighter pixel indicates a higher probability for a pixel observed at that location to belong to the hand appearance and thus to be of skin color.

The probability maps are used in two places in HandVu: first, the color of the detected hand appearance itself is learned upon detection, based on pixels with high probability values in the map. To this end, the map is scaled to the actual detection area’s size. Second, the “Flock of Features” tracker (see Chapter 6) favors high map values when initially placing the features.


5.9 Learned color distribution

At hand detection time, the observed hand color is learned in a normalized-RGB histogram and contrasted to the background color as observed in a horseshoe-shaped area in the image around the hand, see Figure 5.17. This assumes that no other exposed skin of the person whose hand is to be tracked is within that background reference area. Since our applications mostly assume a forward- and downward-facing head-worn camera, this assumption is reasonable. We ensured that it was met for our test videos, which also included other camera locations. The segmentation quality that this dynamic learning achieves is very good as long as the hand’s lighting conditions do not change dramatically and the reference background is representative of the actual background. For example, wooden objects that are not within the reference background area during learning will frequently be classified wrongly as foreground color.
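A sketch of this dynamic learning step, assuming binary masks for the detected hand area and the horseshoe-shaped background region; the 32-bin normalized-rg histograms and the simple ratio-based probability are illustrative choices, not necessarily HandVu’s exact formulation.

    import numpy as np

    def rg_histogram(img, mask, bins=32):
        # Normalized-RGB (chromaticity) histogram over the masked pixels of an RGB image.
        rgb = img[mask].astype(float) + 1e-6
        chrom = rgb / rgb.sum(axis=1, keepdims=True)           # (r, g, b) sums to 1
        r_idx = np.minimum((chrom[:, 0] * bins).astype(int), bins - 1)
        g_idx = np.minimum((chrom[:, 1] * bins).astype(int), bins - 1)
        hist = np.zeros((bins, bins))
        np.add.at(hist, (r_idx, g_idx), 1)
        return hist / max(hist.sum(), 1)

    def learn_hand_color(img, hand_mask, horseshoe_mask):
        hand_hist = rg_histogram(img, hand_mask)                # foreground (hand) colors
        bg_hist = rg_histogram(img, horseshoe_mask)             # presumed background colors
        # Per-bin probability that a color belongs to the hand rather than the background.
        return hand_hist / (hand_hist + bg_hist + 1e-9)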

5.10 Discussion

In this section we will discuss practical considerations for the detection module. To achieve the detection rates that are reported in this chapter, the following conditions must be met. First, there should be no intense, direct sunlight on the hand, particularly no hard, cast shadows that cover parts of the hand. Second, the exposure of the hand area should be approximately correct; this can be corrected for automatically with the method proposed in Section 4.2.2. Third, the hand area in the image must be at least as big as the properly scaled recognition template. For example, the 30x20 resolution template with the 0.6785 aspect ratio (the best detector) has a minimum size of 30x37 pixels in width and height. Fourth, camera and hand must not both be held static at the same time since this produces unchanging video frames. This is problematic, as discussed in the following.

Figure 5.17: The areas for learning the skin color model:
After the hand was detected, the color in the hand-masked area (white) is learned in a histogram. The pixelized look stems from scaling the 30x20 sized maps to the detected hand’s size. A second histogram is learned from the horseshoe-shaped area around the hand (black); it is presumed to contain only background.

The detection probabilities in two consecutive frames are not independent. In particular, this means that the chances for hand detection are reduced if it did not succeed in the previous frame. This is an inherent property of video processing and not a shortcoming of any particular computer vision method. The results of this chapter were obtained with still-image cameras, where this is much less of an issue. While there is no technical solution to this problem beyond improving the per-frame performance, a user of a vision-based interface is expected to adapt to these characteristics over time. In HandVu’s case this means that after a few unsuccessful detections a user could move the hand, the camera, or both in order to present slightly different images to the vision methods. This discussion also applies to the posture recognition method that is covered in Chapter 7.

Some parameters of the detection module can easily be changed in the vision conductor configuration file, see Section 4.2.9. In particular, the relative amount of masked hand area that must be of skin color (determined with the fixed histogram) in order to regard a detection as a match can be used to adapt VBI performance to different environments: if the hand area is expected to always be well exposed and the lighting is such that it does not result in many specularities, the parameter can be turned up to about 80%. This is because the skin color can then be segmented much more reliably, which reduces the number of false positives, for example, in well-lit indoor environments. Outdoors, on the other hand, the parameter should not be higher than 30% since the apparent skin color can vary a lot more and more weight should be given to the grey-level information.


If the application designer wishes to make hand detection a more distinct event and thus distinguish it more strongly from undesired activations, the duration parameter in the configuration file can help to avoid inadvertent VBI initializations: the value of this parameter specifies in milliseconds the time that a posture has to be recognized continuously before a match is reported. Similarly, initialization can be restricted to postures performed within a certain pixel radius from the first detection. This allows distinguishing moving from fixed-pose hands that are in the same posture.

Also, we give the HandVu user the opportunity to detect different postures for initialization by specifying more than one entry in the “detection cascades” list. The additional cascades can be scanned over the same detection area as the cascade for the first posture or over a different one. For example, this can be conveniently exploited for an extension to HandVu that can be initialized with both the left hand and the right hand in separate locations, with applications for left-handed users.

Lastly, the processing time of the detection module by itself shall be mentioned, using the most accurate detector for the closed hand posture as detailed in Section 5.6.3. To scan an entire VGA-sized video frame (640x480 pixels), with initial translation increments of two pixels in the horizontal and three pixels in the vertical, between the scale 1.0 and the maximum that the frame size allows, with a scale increment factor of 1.2, it takes between 114ms and 211ms with a mean of 118.625ms and a median of 118ms on a 3GHz Xeon running Windows XP. For the 218x308-sized area that we most frequently used (for example, in the Maintenance Application described in Section 8.6), with the detector scaled from its minimum size (scale factor 1.0) to scale factor 8.0 with an increment factor of 1.2, the processing time is between 21ms and 27ms with a mean of 22.373ms and a median of 22ms.
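The following sketch enumerates the scales and translations of such a scan and counts the evaluated sub-windows. It is a rough illustration of the scan loop only: the actual scanner may adjust the translation increments with scale and stops early in most cascade stages, and the 20x30 template size (width by height) is an assumption, so the counts are not meant to reproduce the measured timings.

    def count_windows(frame_w, frame_h, templ_w, templ_h, dx, dy, scale_inc):
        # Enumerate scales from 1.0 up to the largest that still fits the frame and
        # count the template placements at each scale.
        total, scale = 0, 1.0
        while True:
            w, h = int(round(templ_w * scale)), int(round(templ_h * scale))
            if w > frame_w or h > frame_h:
                break
            nx = (frame_w - w) // dx + 1
            ny = (frame_h - h) // dy + 1
            total += nx * ny
            scale *= scale_inc
        return total

    # VGA frame with the translation and scale increments given above.
    print(count_windows(640, 480, 20, 30, dx=2, dy=3, scale_inc=1.2))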



Chapter 6

Tracking of Articulated Objects

The objective of the vision component described in this chapter is to follow the hand as robustly as possible after it has been detected. Preliminary studies showed that shape-based methods are unsuited for non-rigid objects because the variety of contours would have to be dealt with in an explicit and thus high-dimensional manner. Color-based methods work well only as long as the hand is the predominant skin-colored object in view. To overcome these problems, this chapter introduces the “Flock of Features,” a fast tracking method for non-rigid and highly articulated objects such as hands. It combines optical flow and a learned color probability distribution to facilitate 2D position tracking of the object as a whole (not each articulation) from a monocular view. The tracker’s benefits include its speed and its ability to track rapid hand movements despite arbitrary finger configuration changes (postures). It can deal with arbitrary and dynamic backgrounds, significant camera motion, and some lighting changes. It does not require a shape-based hand model, thus it is in principle applicable to tracking any deformable or articulated object. A more distinct and uniform object color increases performance but is not essential. Tracker performance is evaluated on hand tracking with a non-stationary camera in unconstrained indoor and outdoor environments. The main results of this work were published in [91].

6.1 Preliminary studies

Shape or contour-based tracking is not very robust. Even for very rigid, optimally distinct, and optimally concave/convex objects as in Figure 6.1, tracking is very sensitive to background noise. The shown example was implemented with an Active Shape Model on image pyramids. It requires good initialization and small frame-to-frame differences for tracking. As can be seen, the tracking is disturbed by intensity variations on the hand (third, top right image). Eventually the shape is attracted to high gradients not caused by the hand but by the keyboard and the display border (bottom right image).

A combination of ASMs with predictive filters such as Kalman or particle filters improves the tracking performance, but does not eliminate the sensitivity to noise. Some texture- or appearance-based methods, on the other hand, track more robustly and do not require a model learned during training.


Figure 6.1: Tracking a hand with an Active Shape Model:
these images of tracking with the ASM were taken every 2 seconds. The thin, static line is only used for initialization purposes. The thick line represents the ASM-estimated shape.

6.2 Flocks of Features

The tracker’s core idea is motivated by the seemingly chaotic flight behavior of a flock of birds such as pigeons. While no single bird has any global control, the entire flock still stays tightly together, a large “cloud.” This decentralized organization has been found to mostly hinge upon two simple constraints that can be evaluated on a local basis: birds like to maintain a minimum safe flying distance to the other birds, but desire not to be separated from the flock by more than another threshold distance; see, for example, Reynolds [145].


Figure 6.2: The Flock of Features in action:
tracking despite a non-stationary camera, hand articulations, and changing lighting conditions. The images are selected frames from sequence #5.

The hand tracker consists of a set of small image areas, or features, moving from frame to frame in a way similar to a flock of birds. Their “flight paths” are determined by optical flow, and then constrained by observing a minimum distance from all other features and by not exceeding a maximum distance from the feature median. If these conditions are violated, the feature is repositioned to a location that has a high skin color probability. This fall-back on a second modality counters the drift of features onto nearby background artifacts that exhibit strong grey-level gradients.

The speed of pyramid-based KLT feature tracking (see Section 6.2.1 below) allows our method to overcome the computational limitations of model-based approaches to tracking, easily achieving the real-time performance required for vision-based interfaces.¹ It delivers excellent results for tracking quickly moving rigid objects. The flocking feature behavior was introduced to allow for tracking of objects whose appearance changes over time, that is, to make up for features that are “lost” from one frame to another because the image mark they were tracking disappeared. Since mere feature re-introduction within proximity of the flock cannot provide any guarantees on whether it will be located on the object of interest or some background artifact, color as the second modality is consulted to aid in the choice of location. An overview of the entire algorithm is given in Figure 6.3.

¹ The color distribution can be seen as a model, yet it is not known a priori but learned on the fly.

6.2.1 KLT features and tracking initialization

KLT features are named after Kanade, Lucas, and Tomasi,² who found that a steep brightness gradient along at least two directions makes for a promising feature candidate to be tracked over time (“good features to track,” see [158]). In combination with image pyramids (a series of progressively smaller-resolution interpolations of the original image [110]), a feature’s image area can be matched efficiently to the most similar area within a search window in the following video frame. The feature size determines the amount of context knowledge that is used for matching. If the feature match correlation between two consecutive frames is below a threshold, the feature is considered “lost.”

² KLT trackers are not to be confused with the Karhunen-Loeve Transform, often abbreviated KLT as well.

    input:
        h_size  - rectangular area containing hand
        mindist - minimum pixel distance between features
        n       - number of features to track
        winsize - size of feature search windows

    initialization:
        learn color histogram
        find n*k good-features-to-track with mindist
        rank them based on color and fixed hand probability maps
        pick the n highest-ranked features

    tracking:
        update KLT feature locations with image pyramids
        compute median feature
        for each feature
            if less than mindist from any other feature
               or outside h_size, centered at median
               or low match correlation
            then relocate feature onto good color spot
                 that meets the flocking conditions

    output:
        the average feature location

Figure 6.3: The Flock of Features tracking algorithm:
k is an empirical value, chosen so that enough features end up on good colors; we use k = 3. The fixed hand probability map is a known spatial distribution for pixels belonging to some part of the hand in the initialization posture.

Recently, Toews and Arbel [173] proposed a method for finding good candidates for tracking and claim better performance than KLT features achieve. Picking our features based on their criterion might in fact improve our Flock of Features tracking even further.

The hand detection component, described in the previous chapter, supplies both a rectangular bounding box and a probability distribution to initialize tracking. This probability “map” is particular to the recognized gesture and was learned offline. It states for every pixel in the bounding box the likelihood that it belongs to the hand and is described in Section 5.8. A set of approximately 100 features is chosen according to the goodness criterion and observing a pairwise minimum distance. A subset of the features is then selected based on the map and color probability. The subset’s cardinality is the target number of features, which will be maintained throughout tracking by replacing lost features with new ones.

Each feature is tracked individually from frame to frame. That is, its new location becomes the area with the highest match correlation between the two frames’ areas. The features will not move in a uniform direction; some might be lost and others will venture far from the flock.
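A condensed sketch of the initialization and the per-frame feature update using OpenCV’s pyramidal Lucas-Kanade implementation; the parameter values and the ranking heuristic are illustrative, not the exact ones used in HandVu.

    import cv2
    import numpy as np

    def init_features(gray, bbox, prob_map, color_prob, n=50, k=3, mindist=3):
        # bbox = (x, y, w, h) from the hand detector; prob_map is the posture's
        # hand-pixel probability map scaled to the bbox, color_prob a per-pixel
        # skin probability image (both in [0, 1]).
        x, y, w, h = bbox
        mask = np.zeros(gray.shape, np.uint8)
        mask[y:y + h, x:x + w] = 255
        pts = cv2.goodFeaturesToTrack(gray, maxCorners=n * k, qualityLevel=0.01,
                                      minDistance=mindist, mask=mask)
        pts = pts.reshape(-1, 2)
        # Rank candidates by hand probability and skin color at their location.
        scores = [prob_map[int(py) - y, int(px) - x] * color_prob[int(py), int(px)]
                  for px, py in pts]
        order = np.argsort(scores)[::-1][:n]
        return pts[order].astype(np.float32).reshape(-1, 1, 2)

    def track_features(prev_gray, gray, features, winsize=11):
        # Pyramidal KLT update; features the tracker could not follow are flagged as lost
        # (err could additionally be thresholded to mimic the match-correlation test).
        new_pts, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, features, None,
            winSize=(winsize, winsize), maxLevel=2)
        lost = (status.ravel() == 0)
        return new_pts, lost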


6.2.2 Flocking behavior

The flocking behavior is a way of enforcing a loose global constraint on the feature locations that keeps them spatially confined. During tracking, the feature locations are first updated just like regular KLT features as described in the previous section, and their median is computed. Then, the two flocking conditions are enforced at every frame: no two features must be closer to each other than a threshold distance, and no feature must be further from the feature median than a second threshold distance. Unlike birds that will gradually change their flight paths if the flocking conditions are not met, the tracking method abruptly relocates affected features to a new location that fulfills the conditions. The flock of features can be seen in Figure 6.4 as clouds of little dots.

The effect of this method is that individual features can latch on to arbitrary artifacts of the object being tracked, such as the fingers of a hand. They can then move independently along with the artifact, without disturbing most other features and without requiring the explicit updates of model-based approaches, resulting in flexibility and speed. Overly dense concentrations of features that would ignore other object parts are avoided because of the minimum-distance constraint. Stray features that are likely to be too far from the object of interest are brought back into the flock due to the maximum-distance constraint.


Figure 6.4: Images taken during tracking:
these images are individual frames from sequence #3 with highly articulated hand motions. 200x230 pixel areas were cropped from the 720x480-sized frames. The cloud of little dots represents the flock of features, the big dot is their mean. Note the change in size of the hand appearance between the first and fifth image and its effect on the feature cloud.

The median was chosen over the mean location to enforce the maximum-distance constraint because of its robustness towards spatial outliers. In fact, the furthest 15% of features are also removed from the median computation to achieve temporally more stable results. However, the location of the tracked object as a whole is considered to be the mean of all features since this measure changes more smoothly over time than the median. The gained precision is important for the vision-based interface’s usability.
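A sketch of this per-frame enforcement step under the assumptions just described (trimmed median, minimum pairwise distance, maximum distance from the median); relocate_to_color stands in for the color-based relocation of Section 6.2.3, and the function names and trimming fraction mirror the text rather than HandVu’s exact code.

    import numpy as np

    def trimmed_median(points, trim=0.15):
        # Median over the features, ignoring the 15% that are furthest from it.
        med = np.median(points, axis=0)
        dist = np.linalg.norm(points - med, axis=1)
        keep = dist <= np.quantile(dist, 1.0 - trim)
        return np.median(points[keep], axis=0)

    def enforce_flock(points, lost, mindist, maxdist, relocate_to_color):
        # points: n x 2 array of updated KLT feature locations; lost: boolean flags.
        med = trimmed_median(points)
        for i in range(len(points)):
            too_close = any(j != i and np.linalg.norm(points[i] - points[j]) < mindist
                            for j in range(len(points)))
            too_far = np.linalg.norm(points[i] - med) > maxdist
            if lost[i] or too_close or too_far:
                points[i] = relocate_to_color(points, med, i)
        # The reported hand position is the mean, which varies more smoothly than the median.
        return points, points.mean(axis=0)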


6.2.3 Color modality and multi-cue integration

A histogram-based probability for a pixel’s color to be of hand color is obtained as described in the previous chapter, Section 5.9. The color information is used as a probability map (of a pixel’s color belonging to the hand) in three places. First, the CamShift method, which the tracker was compared to, solely operates on this modality. Second, at tracker initialization time, the KLT features are placed preferably onto locations with a high skin color probability. This is true even for the two tracking styles that did not use color information in subsequent tracking steps, see Section 6.3.

Third, the new location of a relocated feature (due to low match correlation or violation of the flocking conditions) is chosen to have a high color probability. If this is not possible without repeated violation of the flocking conditions, the location is chosen randomly. The goodness-to-track criterion is not taken into account at this point anymore, but doing so would probably not improve tracking because the features quickly move to such locations anyway.

This method leads to a very natural multi-modal integration, combining cues from feature movement based on grey-level image texture with cues from textureless skin color probability. The relative contribution of the modalities can be controlled by changing the threshold of when a KLT feature is considered lost between frames. If this threshold is low, features are relocated more frequently, raising the importance of the color modality, and vice versa. The threshold can be set in the vision conductor configuration, see Section 4.2.9.
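A possible implementation of the color-guided relocation described above (and of the relocate_to_color stand-in used in the earlier sketch): candidate positions are sampled near the trimmed median, the flocking conditions are checked, and the highest skin probability wins, with a random fall-back. The sampling scheme and candidate count are illustrative assumptions.

    import numpy as np

    def relocate_feature(points, i, median, color_prob, mindist, maxdist, n_candidates=100):
        # Draw candidate positions within maxdist of the (trimmed) feature median.
        rng = np.random.default_rng()
        candidates = median + rng.uniform(-maxdist, maxdist, size=(n_candidates, 2))

        def ok(c):
            # Flocking conditions: keep a minimum distance to every other feature
            # and stay within maxdist of the median.
            if np.linalg.norm(c - median) > maxdist:
                return False
            others = np.delete(points, i, axis=0)
            return bool(np.all(np.linalg.norm(others - c, axis=1) >= mindist))

        valid = [c for c in candidates if ok(c)]
        if not valid:
            return median + rng.uniform(-maxdist, maxdist, size=2)   # random fall-back
        h, w = color_prob.shape

        def prob(c):
            x, y = int(np.clip(c[0], 0, w - 1)), int(np.clip(c[1], 0, h - 1))
            return color_prob[y, x]

        return max(valid, key=prob)                                   # best skin-color spot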

6.3 Experiments

The main objective of the experiments was to assess the tracker’s performance in comparison to a frequently used, state-of-the-art tracker. The CamShift tracking method (see Bradski [14]) was chosen because it is widely available and because it is representative of single-cue approaches. The contribution of both the flocking behavior and of the multi-cue integration was also of interest. Five tracking styles were therefore compared:

• CamShift: The OpenCV implementation of CamShift [14] was supplied with the learned color distribution. A pilot study using a fixed HSV histogram yielded inferior results.

• KLT features only: The KLT features were initialized on the detected hand and subject to no restrictions during subsequent frames. If their match quality from one to the next frame was below a threshold, they were reinitialized randomly within proximity of the feature median.


• KLT features with flocking behavior: As above, but the constraints on minimum pairwise feature distance and maximum distance from the median were enforced at every frame (see Section 6.2.2).

• KLT features with color: As plain KLT features, but resurrected features were placed onto pixels with high skin-color probabilities (see Section 6.2.3).

• Combined flocking and color cue: This tracking style combines the above two methods into the actual Flock of Features tracker.

All styles used color information that was obtained in identical ways. All KLT-based styles used the same feature initialization technique, based on a combination of known hand area locations and learned hand color. This guarantees equal starting conditions to all styles.

Feature tracking was performed with three-level pyramids in 720x480 video, which arrived at our DirectShow filter at approximately 13 frames per second. The tracking results were available after 2-18ms processing time, depending on search window size and the number of features tracked.

Aside from comparing different tracking styles, we also experimented with different parameterizations of the Flock of Features. The following independent variables were varied: the number of features tracked, the minimum pairwise feature distance, and the feature search window size.


6.3.1 Video sequences

A total of 518 seconds of video footage was recorded in seven sequences. Each sequence follows the motions of the right hand of one of two people, some filmed from the performer’s point of view, some from an observer’s point of view. For 387 seconds (or 4979 frames) at least one of the styles successfully tracked the hand. Table 6.1 details the sequences’ main characteristics. The videos were shot in our lab and at various outdoor locations, the backgrounds including walkways, random vegetation, bike racks, building walls, etc. The video was recorded with a hand-held DV camcorder, then streamed with FireWire to a 3GHz desktop computer and processed in real-time. The hand was detected automatically in the initialization posture as described in the previous chapter.

6.4 Results

We define tracking to be lost when the mean location is not on the hand anymore, with extremely concave postures being an exception. The tracking for the sequence was stopped then, even though the hand might later have coincidentally “caught” the tracker again due to the hand’s path intersecting the erroneously tracked location. Since the average feature location cannot be guaranteed to be on the center of the hand or any other particular part, merely measuring the distance between the tracked location and some ground truth data cannot be an accurate measure for determining tracking loss. Thus, the results were visually inspected and manually annotated.

Table 6.1: The video sequences and their characteristics:
three sequences were taken indoors, four in the outdoors. In the first one, the hand was held in a mostly rigid posture (fixed finger flexion and orientation); all other sequences contained posture changes. The videos had varying amounts of skin-colored background within the hand’s proximity. Their full length is given in seconds, counting from the frame in which the hand was detected and tracking began. The maximum time and number of frames that the best method tracked a given sequence are stated in the last column.

    id   outdoors   posture changes   skin backgrnd   total length   max tracked
    1    no         no                yes             95s            79.3s  1032f
    2    no         yes               yes             76s            75.9s   996f
    3    no         lots              little          32s            18.5s   226f
    4    yes        yes               little          72s            71.8s   923f
    5    yes        yes               yes             70s            69.9s   907f
    6    yes        yes               yes             74s            31.4s   382f
    7    yes        yes               yes             99s            40.1s   513f

6.4.1 Comparison to CamShift

Figure 6.5 illustrates our method’s performance in comparison to a CamShift tracker that is purely based on color. The leftmost bar for each of the seven sequences shows that CamShift performs well on sequences three and four due to the limited amount of other skin-colored objects nearby the tracked hand. In all other sequences, however, the search region and area tracked quickly expand too far and lose the hand in the process.

[Figure 6.5 plot: fraction of sequence tracked (0 to 1) for each of the seven sequences and their normalized sum (group 8); bars for CamShift, the worst flock per sequence, the mean flock per sequence, and the overall best flock.]

Figure 6.5: Results of tracking with Flocks of Features:
this graph shows the time until tracking was lost for each of the different tracking styles, normalized to the best style’s performance for each video sequence. Groups 1-7 are the seven video sequences, group 8 is the normalized sum of all sequences. The Flocks of Features track the hand much longer than CamShift, the comparison tracker.

The other bars are from twelve Flock of Features trackers with 20-100 features and search window sizes between 5 and 17 pixels squared. Out of these twelve trackers, the worst and mean tracker for the respective sequence is shown. In all but two sequences, even the worst tracker outperforms CamShift, while the best tracker frequently achieves an order of magnitude better performance (each sequence’s best tracker is normalized to 1 on the y-axis and not explicitly shown). The rightmost bar in each group represents a single tracker’s performance: the overall best tracker, which had 15x15 search windows, 50 features, and a minimum pairwise feature distance of 3 pixels.

Next, we investigated the relative contributions of the flocking behavior and the color cue integration to the combined tracker’s performance. Figure 6.6 indicates that adding color as an additional image cue contributes more to the combined tracker’s good performance than the flocking behavior in isolation. The combination of both techniques achieves vast improvements over the CamShift tracker.

6.4.2 Parameter optimizations

Figure 6.7 presents the tracking results after varying the target number of features that the flocking method maintains. The mean fraction’s plateau suggests that 50 features cover the hand area as well as 100 features do. The search window size of 11x11 pixels allows for overlap of the individual feature areas, making this a plausible explanation for no further performance gains after 50 features.


[Figure 6.6 plot: normalized number of frames tracked (0 to 1) for CamShift, KLT only, flock only, color only, and the combined Flock of Features.]

Figure 6.6: Contributors towards the Flock of Features’ performance:
both the flocking behavior and the color cue add to the combined tracker’s performance. Shown is the normalized sum of the number of frames tracked with each tracker style, similar to the eighth group in Figure 6.5. The combination into the actual Flock of Features tracker shows significant synergy effects over the other trackers’ performances.


[Figure 6.7 plot: fraction of sequence tracked (0 to 1) versus target number of features (20-100), one bar group per sequence plus the mean.]

Figure 6.7: Tracking with different numbers of features:
varying the number of features influences the performance for each of the video sequences. The KLT features were updated within an 11x11 search window and a pairwise distance of 2.0 pixels was enforced. (The bars are normalized for each sequence’s best tracker, which might not be shown here.)


In a related result (not shown), no significant effect was found for the minimum pairwise feature distance in the range between two and four pixels. However, smaller threshold values (especially the degenerate case of zero) allow very dense feature clouds that retract to a confined part of the tracked hand, decreasing robustness significantly.

The number of features, the minimum feature distance, and the search window size should ideally depend on the size of the hand and possibly the size of its articulations. These parameters were not dynamically adjusted since our experiments were conducted exclusively on hands that were also within a size factor of about two of each other (an example of scale change are the first and fifth images in Figure 6.4). The window size has two related implications. First, a larger size should be better at tracking global motions (position changes), while a smaller size should perform advantageously at following finger movements (hand articulations). Second, larger areas are more likely to cross the boundary between hand and background, making it more difficult to pronounce a feature lost based on its match correlation. However, Figure 6.8 does not explicitly show these effects. Other factors could play a role in how well the sequences come off, which warrants further investigation. On the other hand, the general trend is very pronounced.


[Figure 6.8 plot: fraction of sequence tracked (0 to 1) versus feature search window side length (3-17 pixels, squared windows), one bar group per sequence plus the mean.]

Figure 6.8: Tracking with different search window sizes:
this graph shows how tracker performance is affected by search window size (square; side length given on x-axis). Larger window sizes improve tracking dramatically for sequences with very rapid hand location changes (sequences 3, 4, 5), but tracking of fast or complicated configuration variations suffers with too large windows (sequences 3, 7).


6.5 Discussion

The experiments show that the performance improvement must be attributed to two factors. First, the purely texture-based and thus within-modality technique of flocking behavior contributes positively, as witnessed by comparing KLT features with and without flocking. Second, the cross-modality integration adds further to the performance, visible in improvements from flocking-only and color-only to the combined approach.

A perfect integration technique for multiple image cues would reduce the failure modes to simultaneous violations of all modalities’ assumptions. To achieve this for the Flock of Features and its on-demand consultation of the color cue, a failure in the KLT/flocking modality would have to be detectable autonomously (without help from the color cue). To the best of our knowledge, this cannot be achieved theoretically. In practice, however, each feature’s match quality between frames is a good indicator for when the modality might not be reliable. This was confirmed by the above experiments, as the features could be observed to flock towards the center of the hand (and its fairly stable appearance there) as opposed to the borders to the background where rapid appearance changes are frequent.

The presented method’s limitations can thus be attributed to two causes: undetected failure of the KLT tracking, and simultaneous violation of both modalities’ assumptions. The first case occurs when features gradually drift off to background areas without being considered lost or violating flocking constraints. The second case occurs if the background has a high skin-color probability, has high grey-level gradients to attract and capture features, and the tracked hand undergoes transformations that require many features to reinitialize.

There is a performance correlation between the target number of features, the minimum distance between features, and the search window size. The optimal parameters also depend on the size of the hand, which is currently assumed to vary after initialization by no more than approximately a factor of two in each dimension.

The Flock of Features method was designed for coarse hand tracking over a time span on the order of ten seconds to one minute. Its purpose is to provide 2D position estimates to an appearance-based posture recognition method that does not require an overly precise bounding box on the hand area. Thus, it was sufficient to obtain the location of some hand area, versus that of a particular spot such as the index finger’s tip. In the complete vision system (see Chapter 4), every successful posture classification re-initializes tracking and thus extends the tracking period into the long-term range.

Depending on parameterization, processing one frame took between 2ms and 18ms. Thus, the achieved frame rate of 13 frames per second was limited by the image acquisition and transmission hardware and not by the tracking algorithm. Higher frame rates will allow vastly better performance because KLT feature tracking becomes much faster and even less error prone with shorter between-frame latencies.

We have not found a good solution for automatic detection of tracking loss. Heuristics based on KLT feature locations provide some clues, but on a few occasions the system would track some non-hand object.

The frequency of posture recognitions that re-initialize tracking depends entirely on the user task and can thus not be evaluated without that context. However, research (at least for communicative gestures) has shown that fingers usually move while the hand is in a fixed pose [137]. This means that mostly a rigid appearance is to be tracked through large position changes, which is more reliable, and that articulations presumably result in the formation of one of the key postures, causing frequent re-initializations.

The accuracy of the average KLT feature location (the pointer’s location) with respect to some fixed point on the hand cannot be guaranteed because of the entirely object-independent tracking method. However, this is only of concern for registered manipulation tasks, as other interaction techniques involve pointer location transformations or are location independent.


Flocks of Features frequently track the hand successfully despite partial occlusions. Full object occlusions are impossible to handle reliably at the image level. They are better dealt with at a higher level, such as with physical and probabilistic object models (used, for example, by Jojic et al. [70] and Wren and Pentland [186]). A Flock of Features improves the input to these models, providing them with better image observations that will in turn result in better model parameter estimates.

A brief evaluation of hand tracking precision with an object-following task yielded no significant differences from the performance of a handheld trackball (in terms of mean and median distance of the pointer from the object). Empirically, the tracking precision is excellent; even minute hand movements are tracked. Illustrating the naturalness of the interface, people frequently employed hand movements at first for the trackball task before remembering that the hand was no longer being tracked.

6.6 Two-handed tracking and temporal filters

HandVu is designed to detect and track multiple objects within view, and its output interfaces can report multiple objects’ states. This allows two-handed interfaces and even tracking of non-hand objects. The current computer vision methods allow for rudimentary detection and tracking of a second hand in view to provide that input modality in specific instances. This functionality can be turned on with a library call and is implemented as follows.

A blob of color similar to the learned probability distribution is sought, at a fixed minimum distance from the first hand, to its diagonal lower-left, and with a fixed minimum cross section perpendicular to that diagonal. The motivation behind searching only within the lower-left quadrant with respect to the right hand’s position comes from the application interface that was to be realized with two-handed interaction: image area selection to take a snapshot of the hand-enclosed “frame,” as shown in Figure 8.4.
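The following Python sketch illustrates the kind of constrained blob search just described. It is not the HandVu source; the probability threshold, the distances, and the simple pixel-based cross-section test are illustrative assumptions.

```python
# Sketch of a second-hand search (illustrative, not the HandVu source):
# look for skin-colored pixels to the diagonal lower-left of the tracked
# (right) hand, at a minimum distance and with a minimum cross section
# perpendicular to that diagonal.
import numpy as np

def find_left_hand(skin_prob, right_hand_xy, min_dist=80.0,
                   min_cross_section=30.0, prob_thresh=0.5):
    """skin_prob: HxW array of skin-color probabilities in [0, 1].
    right_hand_xy: (x, y) of the tracked right hand.
    Returns the centroid (x, y) of a candidate left-hand blob, or None."""
    ys, xs = np.nonzero(skin_prob > prob_thresh)
    if xs.size == 0:
        return None
    rx, ry = right_hand_xy

    # Keep only pixels in the lower-left quadrant relative to the right hand
    # (image y grows downward).
    keep = (xs < rx) & (ys > ry)
    xs, ys = xs[keep], ys[keep]
    if xs.size == 0:
        return None

    # Project candidate pixels onto the lower-left diagonal and onto the
    # direction perpendicular to it.
    diag = np.array([-1.0, 1.0]) / np.sqrt(2.0)   # toward the lower-left
    perp = np.array([1.0, 1.0]) / np.sqrt(2.0)
    rel = np.stack([xs - rx, ys - ry], axis=1).astype(float)
    along = rel @ diag
    across = rel @ perp

    # Enforce the minimum distance from the first hand along the diagonal.
    far_enough = along > min_dist
    if not np.any(far_enough):
        return None

    # Require a minimum spread perpendicular to the diagonal (cross section).
    spread = across[far_enough].max() - across[far_enough].min()
    if spread < min_cross_section:
        return None

    return float(xs[far_enough].mean()), float(ys[far_enough].mean())
```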

A more sophisticated second-hand detection could employ a mirrored detector as described in the previous chapter. Since the two hands can usually be expected to have very similar colors, the color-based verification will help achieve a very high confidence in matches. It is straightforward to extend the vision module to track a once-detected object with a second Flock of Features. Mutual hand occlusions and similar high-level artifact interactions would, however, be a bigger issue than when only single-handed input styles are allowed, and appropriate high-level measures, for example Kalman filtering, would have to be taken.


Kalman filtering and/or a Condensation filter can be added at the application level to keep track of the temporal aspects and to improve tracking results. The input to and output of such a filter are the extracted features, such as the hand location and derivatives thereof, but no image-level features.
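As a concrete illustration of such an application-level filter, the sketch below runs a constant-velocity Kalman filter over the 2D hand location reported by the tracker. The state model and noise magnitudes are illustrative assumptions; a Condensation (particle) filter could be substituted in the same place.

```python
# Illustrative constant-velocity Kalman filter over the reported 2D hand
# location (not part of HandVu itself). Noise magnitudes are assumptions.
import numpy as np

class HandKalman:
    def __init__(self, dt=1.0 / 15.0, process_noise=5.0, meas_noise=4.0):
        # State: [x, y, vx, vy]; measurement: [x, y].
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_noise
        self.R = np.eye(2) * meas_noise
        self.P = np.eye(4) * 100.0
        self.x = np.zeros(4)

    def update(self, measured_xy):
        # Predict one frame ahead.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct with the tracker's hand location, if one was reported.
        if measured_xy is not None:
            z = np.asarray(measured_xy, dtype=float)
            y = z - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]   # smoothed hand location
```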



Chapter 7

Posture Recognition

The third computer vision component of the hand gesture interface HandVu attempts posture classification at and near the image location of the tracked hand. The terms posture classification and recognition are used in this dissertation to mean view-dependent hand configuration classification, that is, determining the configuration formed by the fingers. A posture in this sense is in fact a combination of a posture and a view direction, allowing for the possibility to distinguish two different views of the same finger configuration.

The classification method does not require highly accurate output of the hand tracking module, for two reasons. First, an area larger than the exact tracked location is scanned for the key postures. Tracking imprecision, and in particular feature drift along the hand, are thus countered. Second, the method has explicit knowledge of a “no known hand posture” class and can therefore produce correct results without requiring knowledge about the presence of a hand in the image area.

The focus of the recognition method is on reliability, not expressiveness. That is, it distinguishes a few postures reliably and does not attempt less consistent recognition of a larger number of postures. Also, a large number of postures would, at least in initial user interfaces, put a high cognitive load on the user, who has to memorize all of them. Thus, a vocabulary of six postures was chosen.

Classification based on shape or contour information has not proven to be a very reliable method because of its sensitivity to background clutter. While it provides good results for easily segmentable images, the general case with lots of background noise does not produce sufficiently stable classifications. Our recognition method uses a texture-based approach to fairly reliably classify image areas into seven classes: six postures and “no known hand posture.” A two-stage hierarchy achieves both accuracy and good speed performance.

The following section describes the exact method we used for posture recognition. The classifier was evaluated with multiple users; the data collection method is described in Section 7.2, and its results are presented in Section 7.3. The chapter concludes with a discussion section.


7.1 Fanned detection for classification

Sequential execution of six traditional posture classifiers on commodity hardware cannot meet the real-time requirements of a user interface. We thus modified the Viola-Jones method for this multi-class classification problem. In a first step, a detector looks for any of the six hand postures without distinguishing between them. Eliminating a given number of hand candidates is faster with this approach than when executing six separate detectors because different postures’ appearances share common features and thus need not be evaluated multiple times. In the second step, only those areas that passed the first step successfully, that is, those that look like some hand appearance, are investigated further. Each of the detectors for the individual postures was trained on the result of the combined classifier, which had already eliminated 99.999879% of the image areas in the validation set. Cross-training, that is, using the positive examples for one classifier as the negative examples for all others, ensures that the classes are sufficiently distinct.

Figure 7.1 shows a schematic view of this hierarchical recognition method. One could call this organization of classifiers a “fan” due to its single stem and the unique branch point into multiple leaves.¹

¹ An extension of this organization might yield still better speed performance: a tree structure has multiple branch points and the number of branches varies. Partial classifiers can share the same weak classifiers as long as the classification accuracy does not suffer. The fanned organization takes an “all-or-nothing” approach to this, while a tree structure allows some classifiers to retain the same root while others have branched out already.


Figure 7.1: Fanned arrangement of partial detectors: a common “hand?” root classifier branches into the six posture-specific detectors (closed, open, sidepoint, victory, Lback, Lpalm); if none of the individual detectors is successful, the class “no known hand posture” is chosen.

For the fan to work as expected, the templates have to have identical resolutions and the image areas they are compared to must have identical size ratios. For detection purposes as described in Chapter 5, those two parameters could be optimized per posture, but for two detectors to share the same root, this cannot be done. To accommodate all postures as well as possible without cutting off parts of the hand or including too much background, we chose an image size ratio (width/height) of 0.8. The open hand posture is occasionally a bit too wide and a substantial amount of background is present in sidepoint images, but overall it is a good compromise. Templates were trained at a uniform size of 25x25 pixels.
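The run-time evaluation of the fan can be summarized in a few lines. The sketch below assumes hypothetical detector objects with accepts() and confidence() methods standing in for the trained cascades; it illustrates the control flow only, not HandVu’s actual classes.

```python
# Run-time sketch of the fanned classifier (hypothetical detector interface):
# a shared root rejects most windows cheaply, and only surviving windows are
# passed on to the six posture-specific branches.
POSTURES = ["closed", "open", "Lback", "Lpalm", "victory", "sidepoint"]

def classify_window(window, root_detector, branch_detectors):
    """Return one of the six posture names or 'no known hand posture'."""
    # Stage 1: the common fan root ("some hand posture?").
    if not root_detector.accepts(window):
        return "no known hand posture"

    # Stage 2: evaluate only the posture-specific extensions; among the
    # branches that accept, pick the one with the highest confidence.
    best_name, best_score = "no known hand posture", float("-inf")
    for name in POSTURES:
        det = branch_detectors[name]
        if not det.accepts(window):
            continue
        score = det.confidence(window)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```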

Training a fanned detector is straightforward: first, example images of all six postures constitute the positive sample set and non-hand pictures the negative sample set. Training with this configuration is stopped when further improvements to the detector are too costly, that is, when too many weak classifiers need to be added in order to reduce the false positive rate for a given detection rate. Thereafter, this partial classifier is extended in six independent training sessions, one for each of the six postures. The formerly positive training images of the respective five other postures are now added to the set of negative examples. This allows for the inter-posture classification.
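The following sketch spells out how the sample sets for the root stage and the six branch stages could be assembled under this cross-training scheme; the data layout (dictionaries of example images) is an assumption made for illustration.

```python
# Sketch of assembling training sets for the fan (data layout is assumed;
# "images" may be file paths or arrays).
def build_training_sets(posture_images, non_hand_images):
    """posture_images: dict mapping each posture name to its example images.
    non_hand_images: list of background (negative) examples."""
    # Root stage: all postures together versus non-hand images.
    root_pos = [img for imgs in posture_images.values() for img in imgs]
    root_neg = list(non_hand_images)

    # Branch stages (cross-training): for each posture, the other postures'
    # positives are added to the negatives so the classes stay distinct.
    branches = {}
    for name, positives in posture_images.items():
        negatives = list(non_hand_images)
        for other, other_imgs in posture_images.items():
            if other != name:
                negatives.extend(other_imgs)
        branches[name] = (positives, negatives)
    return (root_pos, root_neg), branches
```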

The results from Chapter 5 with respect to rotational invariance and training were used to achieve the same degree of tolerance towards in-plane rotations in the fanned classifier. Gestures can therefore be recognized when performed with about 0-15 degrees of counter-clockwise in-plane rotation. The robustness towards out-of-plane rotations was not investigated.

In addition to the thousands of images that the detectors were trained on, the last strong classifier’s threshold of every detector was calibrated on 5153 image frames. They were collected with a head-worn camera from three people (different from the participants in the larger data collection, see Section 7.2).

False positives, that is, recognitions of hands when there are none, are less likely to occur in this stage than in the initial detection stage because of context knowledge: the presence of the hand was established recently during detection and presumably the hand was tracked thereafter. Posture misclassifications occurred occasionally (see below), but no background artifacts were erroneously classified as postures. For these reasons, no second image cue is consulted to verify “some posture” versus “no posture.” To further improve the performance of posture classification, a verification of the classification into one of the posture classes is desirable. However, color is only of limited use to this end, especially since the postures’ comparatively small appearance differences require skin/no-skin color classification to a degree of accuracy that cannot easily be achieved for such small structures as fingers (especially due to specular light components). One might, for example, investigate feature vectors built from the spatial color distribution within the hand area to get a second opinion about postures.

7.2 Data collection for evaluation

HandVu’s recognition component was validated during training with the same number of hand images as the test set contained. However, we wanted to test the posture recognition against a larger number of test images, obtained in more realistic scenarios and with an actual head-worn camera. To this end, we built an automated data collection application that had dual roles: first, it was to give the user live feedback on her performed hand postures by running the recognition module on approximately every other video frame and displaying its results. Second, it was to record the video to disk at frame rate for more detailed offline processing. Note that the application’s primary purpose was system evaluation and not user evaluation.

Participants wore the HMD and camera and carried additional hardware in a backpack. An experimenter saw a copy of the screen that was displayed in the HMD on a laptop that he carried. He would guide the participants to different locations for the sessions. Overlaid over the video see-through images from the camera, participants were presented with a textual identifier and a picture for one posture at a time. A rectangular area of about twice the width and height of a hand was shown approximately ten inches from the body in front of the right arm. A visually displayed countdown of three seconds started immediately as a new posture was shown. After completion of the countdown, recognition started and was active for five seconds. Participants were given three pieces of feedback: first, a bar graph showed the time progress, starting high at five seconds and ending low after five seconds. Second, the rectangular area turned from white to green during a successful posture recognition. Third, a second bar graph increased according to the time fraction of successful recognition for the current posture. Two screenshots of the display as it was visible to the participants are shown in Figure 7.2.
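A compressed sketch of one such trial follows, with the timing constants taken from the description above; the frame grabbing, recognition, recording, and feedback calls are placeholder hooks rather than the application’s actual functions.

```python
# Sketch of one data-collection trial (timing constants from the text;
# grab_frame, recognize, show_feedback, and record are placeholder hooks).
import time

COUNTDOWN_S = 3.0   # posture is shown, participant gets ready
ACTIVE_S = 5.0      # recognition runs and feedback is given

def run_trial(posture_name, grab_frame, recognize, show_feedback, record):
    t0 = time.time()
    while time.time() - t0 < COUNTDOWN_S:       # visual countdown
        record(grab_frame())                     # video is saved throughout

    recognized_time = 0.0
    start = prev = time.time()
    while True:
        now = time.time()
        if now - start >= ACTIVE_S:
            break
        frame = grab_frame()
        record(frame)
        # In the real application, recognition ran on roughly every other frame.
        hit = recognize(frame) == posture_name
        if hit:
            recognized_time += now - prev        # accumulate "correct" time
        prev = now
        show_feedback(remaining=ACTIVE_S - (now - start),
                      success=hit,
                      fraction=recognized_time / ACTIVE_S)
    return recognized_time / ACTIVE_S
```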

Participants were instructed to perform the posture immediately after it was shown, already during the countdown.
195


Chapter 7. Posture Recognition<br />

Figure 7.2: The data collection for evaluating the posture recognition: two screenshots of the entire display of the application that we used to collect data, one taken during the countdown and one while the requested posture was being performed.

If they did not receive positive feedback, they were told to make slight changes to the posture and/or to move their entire bodies slightly. These instructions were followed to varying degrees by the different participants.

One female and two male participants were recruited from our campus community and performed each posture three times in each session. The first session was a practice run and the recorded data was not included in the results. The second session was situated in a lab environment with backgrounds including tables, chairs, carpeted floor, etc. The next session was on a patio area next to our lab. This was an extremely bright area, even though direct sunlight was avoided both on the hand and on the background. The background included a light grey, stone-like floor, aluminum-colored railings, wooden panels, etc. The last session had natural vegetation, leaves, and soil as primary background objects. Note that two thirds of all data collected is from outdoor environments with natural lighting; in fact, one participant was recorded in the morning, one in the afternoon, and one in the early evening.

7.3 Results

7.3.1 Accuracy

A total of 19,134 frames were recorded at 15 Hz; this is approximately equivalent to 1276 seconds. A few frames with different postures can be seen in Figure 7.3. Note the different backgrounds and lighting conditions. Table 7.1 shows the results in their entirety. Each row contains data for a particular posture that the user was supposed to perform. The number of times that a certain posture was recognized is stated in the columns. Note that errors can be due to the participant performing the wrong posture or the system misclassifying a correctly executed posture.

A total of 9,137 postures were recognized, sometimes two or more different postures in the same frame. While this amounts to an average of one recognition roughly every other frame, the fast processing speed in the application virtually guarantees a sufficiently fast response time of the system.


Figure 7.3: Sample images for evaluation of the posture recognition: shown is a random selection of images of three different people that were recorded during the posture recognition data collection.

In 8,567 frames (44.77%) the correct posture was recognized; in 570 frames (2.98%) a wrong posture was recognized. Of all recognitions, 93.76% were correct, as shown in the row titled “ratio” in Table 7.1.

We note that a majority of the misclassified frames are due to a single requested posture, open. It appears to share many elements with other postures, especially the closed and Lback postures. As mentioned, this could be because of incorrectly performed postures rather than a failure of the computer vision classification. If this posture is disregarded, the numbers are as follows.

In a total of 16,021 frames, 7,852 postures were recognized. In 7,752 frames (48.39%) this was the correct posture; in 100 frames (0.62%) it was an incorrect posture. A recognized posture was correct 98.71% of the time.


Table 7.1: Summary of the recognition results:
the “total” column states the number of frames that were recorded and tested for each of the six postures. The “ratio” row shows, for all frames that were recognized as a certain posture, the fraction that were identified correctly.

should \ is    closed    open   Lback   Lpalm  victory  sidepoint   total
closed           1735       0       0       8        0          0    3272
open              100     815     353       0        1          0    3113
Lback              19      16    1211       4        6         10    3135
Lpalm               0       0       1    1343        0          1    3208
victory             0       0       0       7     1837          0    3234
sidepoint           2       0       5      36        1       1626    3172
ratio          0.9348  0.9807  0.7713  0.9607   0.9957     0.9933  0.9376
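The summary figures quoted in the text can be reproduced directly from Table 7.1; the short script below recomputes the overall 93.76% figure and the per-column “ratio” row.

```python
# Recomputing the summary figures of Section 7.3.1 from Table 7.1.
postures = ["closed", "open", "Lback", "Lpalm", "victory", "sidepoint"]
# Rows: requested ("should") posture; columns: recognized ("is") posture.
confusion = {
    "closed":    [1735,   0,    0,    8,    0,    0],
    "open":      [ 100, 815,  353,    0,    1,    0],
    "Lback":     [  19,  16, 1211,    4,    6,   10],
    "Lpalm":     [   0,   0,    1, 1343,    0,    1],
    "victory":   [   0,   0,    0,    7, 1837,    0],
    "sidepoint": [   2,   0,    5,   36,    1, 1626],
}

correct = sum(confusion[p][i] for i, p in enumerate(postures))
total_recognitions = sum(sum(row) for row in confusion.values())
print(correct, total_recognitions)             # 8567 of 9137 recognitions
print(round(correct / total_recognitions, 4))  # 0.9376, the overall ratio

# Per-column ratios (the "ratio" row of the table).
for i, p in enumerate(postures):
    col = sum(confusion[q][i] for q in postures)
    print(p, round(confusion[p][i] / col, 4))
```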

7.3.2 Speed

The speed performance of the fanned recognition method allowed us to provide instantaneous feedback to the participants at interactive frame rates, while simultaneously recording a 192x215-pixel area to disk at 15 Hz, all on a 1.1 GHz laptop. The scan time per area on a 3 GHz desktop computer was between 15 and 80 milliseconds, with a mean of around 50 milliseconds. The 25x25 template detector was scanned at scales from 3.0 to 8.0 with a factor-1.2 scale increase, initially stepping 2 pixels in the horizontal and 3 pixels in the vertical direction.
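For concreteness, the small loop below enumerates the scan scales these parameters imply; taking the window height as 25 times the scale and applying the 0.8 width/height ratio from Section 7.1 to the scanned areas is our assumption for the printed sizes.

```python
# Enumerating the scan scales implied by the parameters above.
scale, scales = 3.0, []
while scale <= 8.0:
    scales.append(scale)
    scale *= 1.2      # factor-1.2 scale increase up to the 8.0 limit

for s in scales:
    h = round(25 * s)           # window height for the 25x25 template
    w = round(0.8 * h)          # assumed 0.8 width/height ratio (Section 7.1)
    print(f"scale {s:.2f}: scan window about {w}x{h} px")
# Six scales result (3.00, 3.60, 4.32, 5.18, 6.22, 7.46), each scanned with a
# small pixel step (2 px horizontally, 3 px vertically at the initial scale).
```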


7.3.3 Questionnaire

Participants were asked to complete an exit survey with a few personal and ten technical questions. While no statistically significant information was obtained, a quick report on trends follows.

All participants felt that postures were recognized fairly accurately. The open, Lback, victory, and sidepoint postures were consistently not experienced as difficult or uncomfortable to perform. For the closed and Lpalm postures, opinions differed.

We asked which aspects the participants would like to see improved in the future. They pointed to the encumbrance of the head-worn display/camera unit. They also mentioned the desire for user-specific, customized gestures, especially for cases when a general gesture is inconvenient for a particular user.

7.4 Discussion and conclusions

While we would have preferred to compare our results to those of other researchers, a lack of testing standards and sample data made this infeasible. Furthermore, the task of distinguishing exactly these six postures is very specific, and comparison with results from smaller, larger, or different sets of postures and their recognition rates would not be meaningful. The evaluation of our computer vision module thus has to rely on the recognition rates as detailed above.

The two main results are as follows. First, the fanned recognition method allows for high frame rates. This is due to the elimination of all but 1.21 × 10⁻⁶ (0.000121%) of the candidate areas (on the validation set during training) during evaluation of the common fan root. Thus, only a small fraction of areas has to be tested for every hand posture. Second, cross-training the posture detectors for that second stage is an effective means to build a multi-class classifier as is required for a recognition method.

Recognized postures are correct in 93.76% of the cases. In general, this is not reliable enough for high-end user interfaces. However, considering that this result was achieved for different people than the recognizer was trained on, that it was evaluated in indoor and outdoor environments, and that it relied on a single image cue, the method achieves surprisingly good performance. Furthermore, the rate of incorrect recognitions can be reduced to 1.29% (98.71% correct) when the open posture is not permitted in the interaction.

Again, misclassifications can be due to participants performing wrong postures or the system recognizing a wrong posture. The number of incorrectly performed postures in the test set has not been determined. On the other hand, users got feedback if their posture was not recognized at all, and some participants made small changes to improve recognition.

We have not yet studied the mean time to recognition after the moment the user performs the correct posture. This is an issue similar to the one discussed in Section 5.10: if the conditions are very unfavorable, for example, due to shadows cast onto the hand, then the posture will not be recognized for as long as no apparent image change occurs. That is to say, there is a correlation between the probabilities of recognition in two successive video frames. Over time, the user is expected to react to such a condition and actively change his hand’s position, the camera location, or both. This issue will naturally become less important as recognition methods improve beyond today’s performance.

It remains to be mentioned that the validity of the general approach has been confirmed through independent research. Ong and Bowden [124] use a technique similar to our fanned recognition method. However, their focus is on the detection of arbitrary hand postures. A single detector for all postures must eliminate all false positives, and a subsequent classification sorts the results into posture classes that were obtained with unsupervised training. Their method has two disadvantages for our purpose. First, the stages of detection and classification are not combined; thus more weak classifiers need to be evaluated to obtain the same result. This can lead to processing times too long for interactivity. Second, their posture clusters are based on the contours of skin-color regions and not on verified hand postures as for our training examples. Thus, the classification results that we obtain carry semantically higher meaning than those of purely appearance-based clusters.



Chapter 8

Hand Gestures in Application

The previous three chapters were concerned with the methods to implement vision-based gesture interfaces, and Chapter 3 investigated an aspect of the human factors of hand gestures. This chapter describes the culmination that the gained insights and technology developments provide: input to augmented reality (AR) and other non-traditional applications. After a brief overview of the applications and their contributions, we make the case for device-external interfaces. Then, we schematize ways in which applications can interpret hand location and posture to achieve their interaction needs. Next, we set forth general approaches to providing feedback for the user’s actions. All three applications are explained in detail in Sections 8.5 through 8.7, and concluding remarks wrap up this chapter.


8.1 Application overview and contributions

Each of the applications contributes an important aspect to the dissertation thesis. These aspects are introduced below. Overall, they show that wearable computers and non-traditional environments such as augmented reality benefit from the enriched interaction modalities that computer vision offers.

The first demonstration of HandVu’s functionality was an alternative input modality (besides mouse and keyboard input) for a 3D map display to allow translating, scaling, rotating, and zooming. The display was an alternative output interface for Battuta, a wearable GIS (Geographic Information System). It showed the ease of replacing or complementing traditional input methods with hand gestures. This in turn benefits the user with more flexibility in his choice of interaction modalities.

The second, the “Maintenance” application, gives a building facilities manager tools at hand to receive, inspect, and record work orders on the go, such as investigating a broken pipe, videotaping the scene, and leaving voice instructions for the plumbers. The interface’s benefits lie in its “deviceless” manifestation, which leaves the user’s hands unoccupied so he can perform manual tasks. All functional input to the application is realized with computer vision means, demonstrating the feasibility of a stand-alone vision-based user interface. This is important as it opens another path for wearable computers and their applications through the elimination of constraints imposed by traditional interfaces.

The third application uses hand gestures to complement speech and trackball input in order to control a complex application interface. It gives the user a tool for visualizing information that is otherwise hard or impossible to see by providing him with “virtual x-ray vision” in the shape of a wearable AR system. The application shows the gained flexibility in designing convenient input methods and the gained versatility for concurrent manipulation of many input parameters.

8.2 The case for external interfaces

Wearable computers have evolved into powerful devices: PDAs¹, cell phones with integrated video recorders, smart phones, wrist-watches with heart monitor, compass, altimeter, and GPS². Most of these devices run full-featured operating systems and can house arbitrary applications. Unfortunately, their human interface capabilities did not evolve as rapidly and are in fact severely limited by the devices’ continuously shrinking form factors. Traditional interfaces, such as keyboards and LCD screens, can only be as big as the device’s surface. When plotting the interaction area over the device size, as in Figure 8.1, this limitation manifests itself in data points that are on or below the identity line. If the interaction devices can be folded to a smaller form factor for transportation and stowing, the wearable computer loses one of its important properties: interactional constancy, the property of being always accessible [113].

¹ PDA: Personal Digital Assistant.
² GPS: Global Positioning System; a satellite-based localization system with passive terrestrial receivers.

Figure 8.1: Interaction area size versus interface device size:
traditional interface devices are larger than the interface area they provide; their data points are below the identity line. External interfaces provide for interaction outside the physical dimensions of the implementing devices.
1) considering the microphone only; 2) HandVu as pars pro toto of vision-based interfaces; 3) VR trackers such as ultrasound or electromagnetic trackers that require mounted infrastructure; 4) virtual keyboards such as Canesta’s [174].

Fortunately, the device size problem can be overcome by expanding the interaction area beyond the physical device’s dimensions, yielding data points above the identity line in Figure 8.1. For example, the output can be extended beyond the physical size of the display by augmenting reality through head-worn displays, allowing for information visualization in the entire field of view. The input area can be enlarged through, for example, hand gestures performed in free space and recognized with a head-worn camera, since these are not constrained to the hardware unit’s dimensions. Industry has recently embraced this concept, as exemplified by the Canesta Virtual Keyboard [174], which is projected onto any flat surface in front of the device. Speech recognition is another technology that allows interaction “outside” the actual device used for recognition.

8.3 Hand gesture interaction techniques

In [93], we distinguished three styles of interpreting hand gestures for user interface purposes. By hand gestures we mean hand translation in a two-dimensional input image plane and a set of key hand configurations, that is, HandVu’s output. The styles do not include temporal gestures and their meanings, for example, of hand waving. The styles’ characteristics and the manipulation techniques that they support are described in the following paragraphs.

Registered manipulation means that the pointer is co-located with the hand in the video see-through display. The hand can therefore virtually touch objects that it is interacting with. This style is especially suitable for interaction with virtual objects in mixed reality scenarios and for interaction with the view of the real world. However, it is hard to perform this kind of manipulation while walking. Also, care must be taken not to require too much interaction outside the user’s comfort zone, as discussed in Chapter 3.

Pointer-based manipulation, such as in [49, 138], describes gestures and their interpretation in the style of a computer mouse: movements in an input plane control a pointer on a distinct manipulation plane. The input plane is fixed relative to the camera coordinate system, while the (“direct”) manipulation plane is fixed relative to the screen coordinate system. The transformation between the two planes requires some attention:

1) The initial offset should be chosen so that all interaction can be performed with hand motions that do not exit the comfort zone (see Chapter 3). This suggests, for example, that the initial pointer location is chosen centrally among the interaction items, such as buttons.

2) A method for “clutching” [112] must be provided because with hand tracking the user cannot “pick up and reposition the mouse.” Instead, clutching could happen automatically when the pointer reaches the confines of the screen or, better still, of the comfort zone. Further hand movements will then dynamically modify the translation offset between the two planes.

3) We found that constraining pointer movements to one dimension (for example, to a horizontal line) is very convenient as it reduces the required precision of hand movements. This in turn appears to reduce fatigue that is sometimes caused by unnecessarily strict gesture requirements.

4) Larger-than-identity scaling factors avoid overly extensive hand movements while at the same time allowing for big, easily visible buttons. On the other hand, too-large scaling factors again introduce unnecessarily strict requirements and might even subject the input to involuntary jitter during general body motion. Another positive effect of scaling is that the size of the comfort zone does not restrict the size of the manipulation plane. That is, hand motions can remain within comfortable bounds while the pointer makes much larger movements.

5) Snapping the pointer to the default button can ensure that in most cases no hand movement but only the selection gesture has to be performed. While this behavior might be disruptive in a desktop environment, it is more convenient within the mobile user interface context.

6) Humans do not notice slight changes in the mapping of input speed to control speed (see, for example, “redirected walking” in virtual environments by Razzaque et al. [140]). This can be exploited to artificially increase the control area while maintaining precision. For example, a nonlinear speed translation is frequently employed in mouse interaction, called mouse pointer acceleration. As another example, imagine that the initial offset has been chosen unfavorably and the hand is not within the comfort zone. Imbalanced mappings for hand movements in opposite directions can then be leveraged to gradually allow the interacting hand to return to its comfort zone without requiring clutching. A sketch combining several of these points follows this list.
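The sketch referred to above combines a central starting position, a larger-than-identity gain, an optional one-dimensional constraint, a mild nonlinear speed mapping, and clamping that acts as clutching. It is a minimal illustration with made-up constants, not the mapping used in our applications.

```python
# Minimal sketch (all constants illustrative) of an input-plane to
# manipulation-plane mapping in the spirit of points 1)-6) above.
class PointerMapping:
    def __init__(self, gain=2.5, accel=0.3, screen=(800, 600), constrain=None):
        self.gain = gain              # point 4: scaling factor > 1
        self.accel = accel            # point 6: nonlinear speed mapping
        self.screen = screen
        self.constrain = constrain    # point 3: None, "horizontal", "vertical"
        self.pointer = [screen[0] / 2.0, screen[1] / 2.0]   # point 1
        self.prev_hand = None

    def update(self, hand_xy):
        if self.prev_hand is None:
            self.prev_hand = hand_xy
            return tuple(self.pointer)
        dx = hand_xy[0] - self.prev_hand[0]
        dy = hand_xy[1] - self.prev_hand[1]
        self.prev_hand = hand_xy

        speed = (dx * dx + dy * dy) ** 0.5
        g = self.gain * (1.0 + self.accel * speed)   # faster hand, more gain

        if self.constrain != "vertical":
            self.pointer[0] += g * dx
        if self.constrain != "horizontal":
            self.pointer[1] += g * dy

        # Clamping at the screen bounds re-anchors the offset between the
        # two planes, which is one simple way to realize clutching (point 2).
        self.pointer[0] = min(max(self.pointer[0], 0.0), self.screen[0])
        self.pointer[1] = min(max(self.pointer[1], 0.0), self.screen[1])
        return tuple(self.pointer)
```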

Location-independent interaction refers to hand postures that can be performed anywhere within the camera field of view (FOV) and produce a single event. Hauptmann [55] pointed out that pointer-based manipulation should not be the only mode of interaction for hand gesture interfaces. Location-independent gestures are thus an important mode of interaction, especially for people “on the move.”

A selection gesture (a “mouse click”) is a necessary concept for many pointer-based and registered interfaces. It can be implemented with two techniques. The first is selection by action, which involves a distinct posture to signal the desire to select. If the same hand is employed for both pointing and selection, some movement during the selection action must be expected and should not interfere with pointing precision. For high precision demands, a selection by suspension technique might be more appropriate, in which the desire to select is conveyed by not moving the pointer for a threshold period of time. Requiring the user to be idle for a few seconds, or to constantly move her hand to avoid selection, is usually unwise, particularly in mobile contexts.
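A selection-by-suspension trigger reduces to a small dwell timer, sketched below; the dwell time and the movement tolerance are illustrative assumptions.

```python
# Sketch of "selection by suspension": a selection fires when the pointer
# has not moved more than a small tolerance for a dwell period.
import time

class DwellSelector:
    def __init__(self, dwell_s=1.0, tolerance_px=8.0):
        self.dwell_s = dwell_s
        self.tolerance = tolerance_px
        self.anchor = None
        self.anchor_time = None

    def update(self, pointer_xy, now=None):
        """Feed the current pointer position; returns True when a selection
        should be triggered."""
        now = time.time() if now is None else now
        if self.anchor is None:
            self.anchor, self.anchor_time = pointer_xy, now
            return False
        dx = pointer_xy[0] - self.anchor[0]
        dy = pointer_xy[1] - self.anchor[1]
        if (dx * dx + dy * dy) ** 0.5 > self.tolerance:
            # Pointer moved: restart the dwell timer from this position.
            self.anchor, self.anchor_time = pointer_xy, now
            return False
        if now - self.anchor_time >= self.dwell_s:
            self.anchor, self.anchor_time = pointer_xy, now  # avoid repeats
            return True
        return False
```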


8.4 Feedback

One of the most important aspects of user interfaces is the immediate feedback to the user once a command, or even a slight change in the input vector, is recognized (see, for example, Hauptmann [55]). A lack thereof decreases usability, in particular the speed with which the interface can be used. HandVu itself can give feedback by overlaying information on the video stream. The amount and verbosity of the overlay can be selected based on the application programmer’s and the application user’s needs. Section 4.2.6 on page 102 explains HandVu’s different verbosity levels in detail.

In addition to HandVu’s feedback mechanism, applications can implement their own ways to signal event recognition to the user, both through visual means and through other channels such as audio. In general, the gesturing user should be notified of the state of the recognition system, that is, whether the hand has been detected and is being tracked, or whether the system is waiting for the user to perform that first gesture. Some sort of location feedback should be given during hand tracking, for example, with a small icon overlaid over the hand. Lastly, recognition of one of the key postures should also be signaled, for example, with a button-click visualization.


All additional feedback depends on the gesture interpretation in the application space. For example, a red border could be drawn around buttons that the user hovers over, signaling that executing a selection gesture will “click” that button. An iconic hand could be drawn as the cursor for the pointer-based manipulation techniques.

If the location of the hand is used as an input parameter but has no a priori visual representation, the first stage should probably also show a pointer at the location of the hand. For unregistered pointing tasks, the location feedback can also be given in the control coordinate system only, not co-located with the hand in the video. This is sufficient feedback, since humans are easily capable of mapping between two spatial planes (as demonstrated by the example of the computer mouse).

A third level of feedback is entirely in the domain of the common user again. It is given depending on the recognized gesture. If the task controlled with a particular gesture has an intrinsic visual outcome, no additional feedback has to be provided. An example is a map translation task in which the map directly follows the hand movements. On the other hand, if the task has no visual representation per se, such as turning up the volume or switching modes in a moded interface, some feedback has to be artificially created. It can be a specifically designed overlay such as a volume slide bar, an iconic representation of the recognized gesture in a fixed location on the screen, or a symbol displayed at a hand-stabilized location, such as a pen if a drawing hand posture is recognized.

8.5 Battuta: a wearable GIS

We first demonstrated the feasibility and the ease of replacing conventional interfaces with HandVu’s vision-based interface. To this end, we built a stand-alone module that replicated the basic interface functionality of a wearable, mobile GIS³ application [27]. That application had been built in the context of the Battuta project, and different input modalities were being explored for their suitability for controlling the application on a wearable platform. Our main reason for replicating parts of the interface component was the need for a much more efficient implementation, namely for hardware support to render the 3D scenery.

The previous interface devices had one aspect in common: input in a continuous domain was only possible by converting a temporal duration into the desired one-dimensional signal. For example, depressing a key on a handheld keyboard would translate a displayed map with constant speed in one direction. The opposite direction would be achieved with a different key. Unfortunately, this mode of operation is less natural and less efficient than, for example, dragging the map with a mouse. That, however, is not available to a wearable user.

³ GIS: Geographic Information System.


One particular device, a so-called ring mouse, distinguished different forces of input, and thus the signal could be varied in strength, for example, to set the velocity of the translation. This can alleviate some of the disadvantages of such a time-to-space conversion, but it also introduces more complexity since now the derivative of the actual variable is being controlled.

8.5.1 The gesture interface

Hand gesture interaction is thus particularly promising for application parameters in continuous domains, such as the already mentioned map translation, and also map scaling, map rotation, and possibly generic slide bar controls. The functions that we chose to support with gestures are:

• Selection of one of four menu entries. The menu consists of a selectable interaction surface in each corner of the screen, as in the original Battuta application (see the screenshot in Figure 8.2). Menu selection starts with a location-independent “menu” hand gesture that brings up the menu corners in the display. A brief motion of the hand (in any posture) in the direction of the desired corner selects the respective menu entry. This also demonstrates how HandVu can facilitate the interpretation of dynamic gestures as a single event: by supplying posture and location data to the application domain, which analyzes it over time and produces one event after observation of a particular trajectory (see the sketch after this list).

Figure 8.2: The map display of the Battuta wearable GIS.

• Translating the map in the two dimensions of its plane. A dedicated, location-independent “translation” gesture picks a reference point in 2D hand location space. Hand movements in arbitrary postures thereafter move the map along with the hand until the “translation” gesture is performed again.

• Zooming the map towards and away from a fixed point on the map. Again, a reference point in hand space is chosen by performing a dedicated “zoom” gesture. Moving the hand further away from the body zooms out, moving it closer zooms in. Performing the “zoom” gesture a second time ends the zooming mode.
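As an illustration of the first bullet, the snippet below turns the brief hand motion observed after the “menu” gesture into one of the four corner selections; the travel threshold and the corner naming are assumptions made for this sketch.

```python
# Sketch (thresholds illustrative) of interpreting the brief hand motion
# after the "menu" gesture as a selection of one of the four screen corners.
def corner_from_motion(start_xy, end_xy, min_travel=40.0):
    """Return 'top-left', 'top-right', 'bottom-left', 'bottom-right',
    or None if the hand did not travel far enough."""
    dx = end_xy[0] - start_xy[0]
    dy = end_xy[1] - start_xy[1]          # image y grows downward
    if (dx * dx + dy * dy) ** 0.5 < min_travel:
        return None
    vertical = "top" if dy < 0 else "bottom"
    horizontal = "left" if dx < 0 else "right"
    return f"{vertical}-{horizontal}"

# Example: a quick move up and to the right selects the top-right entry.
assert corner_from_motion((160, 120), (220, 70)) == "top-right"
```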

8.5.2 Benefits of HandVu for Battuta

As explained in the literature review (Section 2.4.1), this type of gestural interaction is not ideal in terms of user performance and preference. Users prefer not to have semantically fixed gesture interpretations but instead intuitive and spontaneous mappings. In particular, replacing the mouse with gestures is not a favored interaction modality. However, for a wearable scenario where there is no mouse, no keyboard, and no table available, even these gesture mappings can provide the essential means of interaction.

The most intuitive mixed-reality map imaginable would afford the same properties as its physical counterpart, especially with regard to picking it up and moving the sheet around. While HandVu has not yet achieved this goal in its entirety, it requires no intermediary hand-held or hand-worn devices such as a mouse or data glove. It is thus a step closer to that most natural way of interaction.

On the technical side, we used HandVu’s event server to interface between the gesture recognition module and the Battuta display. The TCP/IP-based communication mode made HandVu’s implementation language (C++) and other server specifics entirely transparent to the map display, which was written in Java using the Java3D rendering library. This shows the wide availability of HandVu’s output.
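To give a flavor of such a client, the sketch below connects to a gesture event server over TCP and parses simple line-based messages. The host, port, and message format shown here are hypothetical placeholders and not HandVu’s documented protocol; the point is only that any language with sockets can consume the events independently of the server’s implementation language.

```python
# Illustration only: a small TCP client in the spirit of connecting an
# application to a gesture event server. The host, port, and the line-based
# message format used here are hypothetical placeholders.
import socket

def read_gesture_events(host="127.0.0.1", port=7045):
    """Connect to a (hypothetical) event server and yield parsed events."""
    with socket.create_connection((host, port)) as sock:
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                # Assumed example line: "posture=Lpalm x=0.42 y=0.61"
                fields = dict(p.split("=", 1) for p in line.decode().split())
                yield fields

# An application (for example, a map display) would iterate over these events
# and translate them into its own commands.
```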

8.6 Vision-only interface for mobility

We also tested the functionality of our VBI with a custom-built user interface component for a facilities maintenance application that runs on our wearable computer. This was a much later project than the previous one, and the vision system had matured significantly. The main contribution was its actual deployment outdoors.

The hardware setup of our system is described in Section 4.1. This section describes the application interface.

8.6.1 Functionality of the Maintenance Application

The “Maintenance Application” consists of a set of application panes with a number of tools to aid facilities personnel. It was designed to demonstrate the suitability of VBIs for mobile use. Its suggested functionality supports building facilities managers in their daily tasks of performing maintenance operations and immediate-attention work requests, for example, investigating a water leak or a power failure in a particular room. The wearer of our mobile system can utilize three main panes: an audio recorder, a digital still and video camera, and a work order and communication pane. The active pane is selected by performing a location-independent task-switch gesture for a short period of time, which cycles through the application panes and a “blank screen” pane, one by one.

Voice recorder: A small microphone clipped to the goggles allows auditory recordings, activated by gesture commands that start, pause, resume, and stop a sound recording. This interface utilizes the pointer-based manipulation technique in combination with a location-independent “select” posture. Buttons⁴ are horizontally aligned and the pointer is restricted to moving along this dimension. A red border gives visual feedback whenever the hand pointer is in the area of a button (hovering above it).

Image and video capture: The image capture pane has three modes of operation, which are selected via buttons. The interaction technique with the image/video capture menu is very similar to that of the voice recorder, except that the buttons are arranged in a vertical fashion and the pointer movement is constrained to that dimension. The first mode allows the user to take a picture of the entire visible area. A count-down timer is overlaid after activating this mode. A picture is taken and stored at the end of the count-down.

⁴ Thanks to James Chainey for drawing the icons and interaction elements for this application.

219


Chapter 8. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in Application<br />

Figure 8.3: Image of pointer-based interaction:<br />

shown is the image <strong>and</strong> video capture pane. The pointer movement is constrained<br />

to the vertical dimension. Note that the interface images shown in the various<br />

figures were taken in different environments, illustrating the ability of our system<br />

to adjust to varying backgrounds <strong>and</strong> lighting conditions.<br />

records a video stream instead, stopping as soon as the h<strong>and</strong> is detected within<br />

the interaction initiation area.<br />

The third mode allows taking snapshots of selective areas. HandVu searches for the left hand as the nearest skin-colored blob to the lower-left of the right hand (see Section 6.6). The rectangular area enclosed by both hands is highlighted in the display, shown in Figure 8.4. When the positions of both hands have stabilized with respect to the camera, the snapshot is taken. Implementations of the same functionality that use only one pointer are conceivable but less convenient to use. This is the only task where hand suspension was the selection method of choice, because the user will most likely have assumed a stationary body position and performing a selection-by-action gesture would interfere with the pointing precision.

Figure 8.4: Image of two-handed interaction: the user has selected an area, and the snapshot will be taken when the hands have settled for five seconds.
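The following sketch illustrates the two-handed trigger just described: the rectangle spanned by the two hand positions is tracked, and the snapshot fires once both hands have stayed nearly still for five seconds. The per-frame motion tolerance is an assumed value; the actual system’s threshold is not specified here.

```cpp
// Snapshot trigger: fire once both hand positions have been stable for 5 s.
#include <cmath>
#include <cstdio>

struct Point { double x, y; };
struct Rect { double left, top, right, bottom; };

class TwoHandSnapshot {
 public:
  // Feed the current hand positions once per frame; returns true when the
  // snapshot should be taken.
  bool update(Point rightHand, Point leftHand, double frameSeconds) {
    const double motion =
        std::hypot(rightHand.x - lastRight_.x, rightHand.y - lastRight_.y) +
        std::hypot(leftHand.x - lastLeft_.x, leftHand.y - lastLeft_.y);
    lastRight_ = rightHand;
    lastLeft_ = leftHand;
    stableSeconds_ = (motion < kTolerance) ? stableSeconds_ + frameSeconds : 0.0;
    return stableSeconds_ >= kHoldSeconds;
  }

  // Rectangle currently enclosed by the two hands (used for highlighting).
  Rect selection() const {
    return { std::fmin(lastLeft_.x, lastRight_.x), std::fmin(lastLeft_.y, lastRight_.y),
             std::fmax(lastLeft_.x, lastRight_.x), std::fmax(lastLeft_.y, lastRight_.y) };
  }

 private:
  static constexpr double kTolerance = 0.01;   // assumed per-frame motion budget
  static constexpr double kHoldSeconds = 5.0;  // "settled for five seconds"
  Point lastRight_{0, 0}, lastLeft_{0, 0};
  double stableSeconds_ = 0.0;
};

int main() {
  TwoHandSnapshot snap;
  for (int f = 0; f < 200; ++f)  // hands held perfectly still at 30 fps
    if (snap.update({0.7, 0.6}, {0.3, 0.4}, 1.0 / 30.0)) {
      Rect r = snap.selection();
      std::printf("snapshot of (%.2f,%.2f)-(%.2f,%.2f)\n", r.left, r.top, r.right, r.bottom);
      break;
    }
}
```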

Work order scheduler: With the aid of this pane, the person in the field can retrieve, view, and reply to work requests. Up to three work orders with title and status (open, closed, follow-up) are shown concurrently; automatic scrolling brings hidden orders into view (see Figure 8.5). Three dedicated, static hand gestures allow for selection and manipulation of work requests: one gesture selects the work order above the current one, another gesture selects the one below the current one. We chose the discrete posture technique over pointer-based manipulation because scrolling with a pointer and “scrollbars” is an unnatural, awkward operation, especially for mobile user interfaces. The third gesture facilitates activation of the currently selected work order. “Attachments” to a report can be selected from the previously recorded media clips (voice recording, still picture, or video) with “registered” hand movements. We decided this based upon the possibly large number of clips and the convenience of random access compared to sequential access. The selection gesture picks the currently highlighted number.

Figure 8.5: Image of location-independent interaction: location-independent postures (up, down) change the highlighted work order. The gesture being performed in the picture selects the highlighted item.

222


Chapter 8. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in Application<br />

Figure 8.6: Selecting from many items with registered manipulation.<br />

8.6.2 Benefits of HandVu for mobility

Over-exposure of the hand area was a significant problem during outdoor operation at first. The reason is that the hand is often brighter than the background, yet it occupies only a small area in the video frame. Most digital video cameras can only optimize the exposure for the entire frame, not for a selective and dynamically changing area. This lack of functionality prompted the conception and implementation of the automatic exposure control in our software environment that is explained in Section 4.2.2. HandVu now handles difficult illumination both during initial hand detection and during hand tracking and posture recognition. This in turn extends the range of conditions in which HandVu can be deployed as a user interface, specifically to the dynamic conditions experienced with mobile applications.
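The sketch below shows the basic idea of region-of-interest exposure control as motivated above: the mean brightness inside the tracked hand’s bounding box drives a simple proportional adjustment of an abstract exposure setting. The camera interface, the gain constant, and the target value are placeholders; the actual controller is the one described in Section 4.2.2.

```cpp
// Region-of-interest exposure control: keep the hand region near mid-gray.
#include <algorithm>
#include <cstdint>
#include <vector>

// Mean gray value of a rectangular region in an 8-bit grayscale frame.
double meanBrightness(const std::vector<std::uint8_t>& gray, int width,
                      int x0, int y0, int x1, int y1) {
  long long sum = 0, count = 0;
  for (int y = y0; y < y1; ++y)
    for (int x = x0; x < x1; ++x) { sum += gray[y * width + x]; ++count; }
  return count ? static_cast<double>(sum) / count : 0.0;
}

// One control step: nudge the exposure so the hand region approaches mid-gray.
// 'exposure' is an abstract value in [0,1]; mapping it to shutter and gain
// settings is camera-specific and not shown here.
double adjustExposure(double exposure, double handMean) {
  const double target = 128.0;  // aim for mid-gray inside the hand region
  const double gain = 0.0005;   // assumed proportional gain, tuned empirically
  exposure += gain * (target - handMean);
  return std::max(0.0, std::min(1.0, exposure));
}

int main() {
  // Synthetic 64x64 frame that is over-exposed inside the "hand" box.
  const int w = 64, h = 64;
  std::vector<std::uint8_t> frame(w * h, 90);
  for (int y = 20; y < 40; ++y)
    for (int x = 20; x < 40; ++x) frame[y * w + x] = 250;
  double exposure = adjustExposure(0.5, meanBrightness(frame, w, 20, 20, 40, 40));
  return exposure < 0.5 ? 0 : 1;  // exposure is reduced for the bright hand
}
```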

Two-handed interaction offers a particularly attractive way to control certain tools, such as snapping a picture of a selective area. The registered manipulation is no doubt the best interaction technique for this task: even in real life, professional photographers and videographers use their hands to form an artificial picture frame around the scene they intend to immortalize. The correlation (with respect to the hands and the FOV of the world) between the content on the input plane and the manipulation/output plane is a strong indicator of the benefits of this technique.

Most of the interaction techniques that we demonstrated with the Maintenance Application were mouse-based. Again, this generally does not make the best use of hands as an interface modality [55]. However, it realizes generic interface capabilities for mobile deployment of almost any WIMP-style application⁵. Beyond mouse functionality, HandVu recognizes key postures to offer discrete “keys,” another important input capability that is difficult to provide in the wearable computer context. We employed this to switch between the different application panes with a dedicated posture. Furthermore, we showed techniques (restricted pointer movements, scaling, and snapping, see Section 8.3) that improve usability, mostly by decreasing the required extent and precision of a user’s input actions.

⁵ WIMP is “Short for Windows, Icons, Menus and Pointing device, the type of user interface made famous by the Macintosh computer and later imitated by the Windows operating systems.” (Webopedia.com)

8.7 A multimodal augmented reality interface

In this section we describe the third application for which we demonstrated the usefulness of vision interfaces. It uses hand gesture input in combination with voice commands and a handheld trackball to control virtual objects, overlaid over and registered with actual outdoor building structures and indoor room environments. The main contribution with regard to hand gestures is the demonstration of the vision interface in concert with two other modalities and the resulting gain in input expressiveness.

The mobile augmented reality system, developed by Ryan Bane, visualizes otherwise “invisible” information encountered in urban environments. A versatile filtering tool allows for interactive display of occluded infrastructure and of dense data distributions such as room temperature or wireless network strength, with applications for building maintenance, emergency response, and reconnaissance missions. Bane and Höllerer recently extended this interface to a very comprehensive visualization toolkit [5]. Operation of the complex application functionality demands more context-specific interaction techniques than traditional desktop paradigms can offer.

The motivation behind the system is to show how multimodal interfaces can stretch the boundaries of what tasks can be performed on a wearable platform. While mobile computers are constrained in their interaction possibilities by their form factors, they do allow for interaction in and with the real world, in places and situations in which desktop-style computer support would be hard to come by. Through a blend of multimodal integration styles we are able to achieve good overall usability, avoiding the use of any one input modality for purposes to which it is not suited.

8.7.1 System description

The wearable platform is simulated by a bulky prototype backpack system based on two medium-performance laptops. Figure 8.7 shows a diagram of the structure of our system implementation, overlaid over pictures of the actual devices. We need two laptops for performance reasons: the gesture recognizer’s performance drops dramatically if it has to compete with other compute- or bus-communication-intensive applications on the same machine, a 1.1GHz Mobile Pentium 3 running Windows XP. The other laptop runs a custom-built OpenGL-based visualization engine which is described in [5].

In addition to the hardware setup as shown in Section 4.1, we also mounted an InterSense InertiaCube2 orientation tracker atop the glasses. We assumed position tracking to be provided by auxiliary means and manually set our location. A newscaster-style microphone is clipped to the side of the glasses. A tv-one CS-450 Eclipse scan converter overlays the luminance-keyed output from the rendering computer over the mostly unmodified video feed from the first computer. The combined signal provides the input to the head-worn display – for video see-through augmented reality – and to a DV camera that we used to record the images shown in this section.

Figure 8.7: An overview of the hardware components: together with the HandVu software they make up the multimodal AR application. (The diagram’s components are the voice recognizer, gesture recognizer, tracker manager, input handler, tool manager, tools – tunnel tool, path tool, picking tool – renderer, tracker, trackball, mike, video combiner, glasses, and camera.)

In a second configuration, both major software components – recognition and rendering – can be run on the same machine, communicating the image frames between the two processes through a shared memory segment. This significantly reduces setup time and the amount of required gear, as no analog video overlay is necessary. However, the achievable frame rates and the interactivity suffer from sharing the processing resources on commodity hardware.
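For illustration, the minimal POSIX sketch below shows the general pattern of passing frames through shared memory. The segment name and frame size are assumed values, the actual system ran on Windows and would use the equivalent Win32 file-mapping calls, and the synchronization (for example, a semaphore per frame) needed in practice is omitted.

```cpp
// Producer side of a shared-memory frame channel (POSIX sketch).
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  const char* kName = "/handvu_frames";      // assumed segment name
  const size_t kFrameBytes = 640 * 480 * 3;  // one RGB video frame

  // Create and map the shared segment (gesture recognizer side).
  int fd = shm_open(kName, O_CREAT | O_RDWR, 0600);
  if (fd < 0 || ftruncate(fd, kFrameBytes) != 0) return 1;
  void* mem = mmap(nullptr, kFrameBytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (mem == MAP_FAILED) return 1;

  unsigned char frame[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // stand-in pixel data
  std::memcpy(mem, frame, sizeof(frame));             // "publish" the frame

  // The renderer process would shm_open() the same name, mmap() it, and read
  // the pixels without any copy over a network or analog video path.
  munmap(mem, kFrameBytes);
  close(fd);
  shm_unlink(kName);
  return 0;
}
```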

HandVu gets its hands on the video frames first and performs its gesture recognition tasks. The camera image is also corrected for lens distortion. This is important for subsequent registration with the virtual world: graphics hardware can only display geometric projections and is not able to correct for nonlinear distortions. Thus, the image to align with must not be distorted either.
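To illustrate why a nonlinear warp (rather than a projective transform) is required, the sketch below applies a standard two-coefficient radial distortion model. The intrinsics and coefficients are placeholder values; the dissertation’s system relies on its own camera calibration, which is not reproduced here.

```cpp
// Inverse-mapping step used when building an undistortion remap table.
#include <cstdio>

struct Intrinsics {
  double fx, fy;  // focal lengths in pixels
  double cx, cy;  // principal point
  double k1, k2;  // radial distortion coefficients
};

// Given a pixel (u,v) in the *undistorted* output image, compute where to
// sample in the *distorted* camera image.
void distortPoint(const Intrinsics& c, double u, double v, double* ud, double* vd) {
  const double x = (u - c.cx) / c.fx;  // normalized image coordinates
  const double y = (v - c.cy) / c.fy;
  const double r2 = x * x + y * y;
  const double scale = 1.0 + c.k1 * r2 + c.k2 * r2 * r2;
  *ud = c.fx * (x * scale) + c.cx;
  *vd = c.fy * (y * scale) + c.cy;
}

int main() {
  const Intrinsics cam{500.0, 500.0, 320.0, 240.0, -0.25, 0.07};  // assumed values
  double ud, vd;
  distortPoint(cam, 600.0, 60.0, &ud, &vd);  // a point near the image corner
  std::printf("sample distorted image at (%.1f, %.1f)\n", ud, vd);
  // A full undistortion pass evaluates this for every output pixel and
  // bilinearly interpolates the source image at (ud, vd).
}
```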

The tracking result and the undistorted picture are passed to the AR rendering module. This module generates 3D graphics that are registered with the current camera view. It also generates a screen-stabilized user interface that provides feedback and conventional mouse interaction functionality to the user as a fallback solution and for development.

8.7.2 The Tunnel Tool and other visualizations

The Tunnel Tool is a sophisticated technique to visualize complex information while in the field and immersed in augmented reality, conceived and implemented by Ryan Bane in collaboration with Tobias Höllerer and myself.⁶ In essence, the Tunnel Tool selects a volume from the entire field of view and displays it in a different manner than its surroundings. The tool is apparent to the viewer as a bluish plane that occludes the real world and occupies a part of the screen. It can “see through walls” and visualize occluded objects in front of the bluish plane. It can also filter dense and overlapping data sets and present them in the form of visually more informative data slices. We fixed the cutout volume’s position with respect to the user’s viewpoint because initial experiments showed no gain from relocating it off the center of the view axis. The three-dimensional layout of the Tunnel Tool that creates this impression is shown in Figure 8.8.

⁶ A related publication [5] describes the tool and an extension in more detail.
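Figure 8.8, whose caption follows below, distinguishes a Focus Region (rendered in full), a Context Region (rendered as wireframes), and normally rendered surroundings. The sketch below illustrates that classification step only; the region boundaries, the screen-footprint test, and all numeric values are assumptions for illustration, not taken from the implementation described in [5].

```cpp
// Decide how an object should be rendered relative to the Tunnel Tool.
#include <cstdio>

enum class RenderStyle { Normal, Wireframe, Full };

struct Tunnel {
  double halfWidth;                // half-extent of the screen-fixed footprint
  double focusNear, focusFar;      // Focus Region along the view axis (meters)
  double contextNear, contextFar;  // Context Region enclosing the focus
};

RenderStyle classify(const Tunnel& t, double lateralOffset, double depth) {
  if (lateralOffset > t.halfWidth) return RenderStyle::Normal;  // outside tunnel
  if (depth >= t.focusNear && depth <= t.focusFar) return RenderStyle::Full;
  if (depth >= t.contextNear && depth <= t.contextFar) return RenderStyle::Wireframe;
  return RenderStyle::Normal;
}

int main() {
  const Tunnel t{2.0, 10.0, 20.0, 5.0, 40.0};  // assumed example extents
  std::printf("%d %d %d\n",
              static_cast<int>(classify(t, 0.5, 15.0)),   // inside focus: Full
              static_cast<int>(classify(t, 0.5, 30.0)),   // context: Wireframe
              static_cast<int>(classify(t, 5.0, 15.0)));  // off-tunnel: Normal
}
```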

Figure 8.8: Schematic view of the Tunnel Tool: only objects inside the Focus Region are rendered in full. Objects that fall within the Context Region are rendered as wireframes. Objects to either side of the apparent tunnel are rendered in their normal fashion. The drawing is courtesy of Ryan Bane.

Users can take 3D snapshots of the virtual environment and the current view of the real world. Virtual objects that can be placed in the virtual environment improve expressiveness by allowing for realistic visual annotations. This helps with explaining the location and relation of real and virtual objects to other people, for example, to someone who has to mount a piece of hardware to a wall in a certain way.

8.7.3 Speech recognition

We use a prototype automatic speech recognition library (ASRlib), provided to us by the Panasonic Speech Technology Laboratory. ASRlib is targeted towards computationally very efficient, speaker-independent recognition of simple grammars. We use the English dataset, about 70 keywords, and a grammar that allows around 300 distinct phrases with these keywords. Command phrases must be preceded and followed by a brief pause in the speech, but the words can be concatenated naturally. The recognizer performed well in our tests, not producing any false positives despite listening in on all of our conversations, but it sometimes required the repetition of a command. It consumed few enough resources to run alongside the power-hungry computer vision application. It lives in its own process and sends recognized phrases as text strings to our rendering application, where they are parsed again and interpreted as AR commands.
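As an illustration of that last step, the sketch below maps recognized phrase strings to application actions. The handful of phrases shown is a small, assumed subset of the roughly 300 phrases the grammar allows; the actual parser is part of the rendering application and may differ.

```cpp
// Turn phrase strings delivered by the recognizer into AR commands.
#include <functional>
#include <iostream>
#include <map>
#include <string>

int main() {
  bool tunnelOpen = false;
  bool snapshotPending = false;

  // Map exact phrases (as delivered over the text channel) to actions.
  const std::map<std::string, std::function<void()>> commands = {
      {"open tunnel",   [&] { tunnelOpen = true; }},
      {"close tunnel",  [&] { tunnelOpen = false; }},
      {"take snapshot", [&] { snapshotPending = true; }},
      {"discard",       [&] { snapshotPending = false; }},
  };

  for (const std::string phrase : {"open tunnel", "take snapshot"}) {
    auto it = commands.find(phrase);
    if (it != commands.end()) it->second();  // execute the AR command
    else std::cout << "unrecognized phrase: " << phrase << "\n";
  }
  std::cout << "tunnel open: " << tunnelOpen
            << ", snapshot pending: " << snapshotPending << "\n";
}
```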

8.7.4 Interacting with the visualized invisible

Not every input interface is well-suited to every task. We will briefly describe the motivations behind choosing a particular interface for the variety of tasks in our multimodal application.

Keyboards and mice are unsuited for immersion in the HMD-created mixed reality world. We felt that discrete, “binary” parameters (previously mapped to one key each) are best accessed by equally discrete and binary speech commands, some of which are shown in Table 8.1.

Positioning, sizing, and orienting objects could be done with multiple sequential 1-dimensional input steps, but at least for repositioning this is very awkward and differs starkly from the direct interaction employed for positioning a physical object. We therefore combined the 2-dimensional input from hand tracking with the 1-dimensional input from a “ring trackball” (see the picture in Figure 8.9) to achieve concurrent input of 3-dimensional control data. The trackball is convenient because it is easy to retrieve and to stow away, especially after we provided for attachment to the user’s pants with a Velcro strip. In addition, it allows for less encumbered hand gesture input: if the user chooses to use the same hand for gesturing and trackball operation, she can leave the trackball dangling from the index finger during gesturing. Only one dimension of the input was needed for our system, but the device has the full functionality of a 3-button mouse and thus allows for UI extensibility. In addition, since trackballs and similar concepts deliver information about unbound relative movement, their output domain is infinite. This is important if the interaction range is also unlimited – as in the case of moving the regions in our tunnel tool.

Figure 8.9: A hand-worn trackball: this was our favorite device to provide one-dimensional input to our system: a trackball that can be worn similar to a ring. We also provided an attachment to the user’s pants with a Velcro strip.

Table 8.1: The mapping of input to effect: this table details the modalities and commands that controlled the various application parameters.

control parameter                         input modality
non-dimensional:                          voice commands:
  ⊲ take/save/discard snapshot              ⊲ “take snapshot,” “save,” “discard”
  ⊲ tunnel mode (float, open, close)        ⊲ “open/float/close tunnel”
  ⊲ add/remove viz to/from env.             ⊲ “add bundle networking to tunnel”
  ⊲ etc.                                    ⊲ etc.
one-dimensional:                          ring trackball:
  ⊲ adjust Focus Region distance            ⊲ roll forwards or backwards
two-dimensional:                          gesture with speech-driven modes:
  ⊲ pencil tool for finger                  ⊲ point + “save,” “discard”
three-dimensional:                        gesture+trackball+speech-driven mode:
  ⊲ position, size, orient objects          ⊲ point + roll + “finished”

Virtual objects in our system can be manipulated with hand gestures, trackball input, and voice commands. The user either selects an object with his finger with the voice command “select picking tool for finger,” or inserts a new virtual object into the world with a voice command. The system then enters the “relocate” mode: the object can be moved in all three dimensions with a combination of hand gesture (x, y) and trackball input (z). The gesture commands work as follows: the user first makes the closed hand gesture, which sets the system to track the motions of his hand and apply them to the object’s position. The user moves his hand left and right to move the object on the x axis, and up and down to move the object along the y axis. When he is satisfied with the object’s position, he again makes the closed hand gesture, which stops the system from applying his hand motions to the object. Alternatively, the voice command “finished” has the same effect. Next, the “resize” mode is automatically entered and the same input modalities allow 3-dimensional resizing of the object. Again, closed or “finished” exits this mode and enters the next, in which the object can be rotated around each axis, again with the same input modalities. The voice commands “move object,” “resize object,” and “orient object” enter the respective manipulation modes, which are exited with the “finished” command.
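The moded manipulation loop just described can be summarized by a small state machine, sketched below: the closed-hand gesture (or the spoken “finished”) advances through relocate, resize, and orient, and while a mode is active the hand’s (x, y) motion and the trackball’s rolls (z) are applied to the selected object. Event names and magnitudes are assumptions for illustration.

```cpp
// State machine for gesture + trackball object manipulation.
#include <iostream>

enum class Mode { Idle, Relocate, Resize, Orient };

struct Object3D {
  double pos[3] = {0, 0, 0};
  double scale[3] = {1, 1, 1};
  double rot[3] = {0, 0, 0};
};

class Manipulator {
 public:
  // The "closed" hand posture or the spoken "finished" both end the current mode.
  void onModeToggle() {
    switch (mode_) {
      case Mode::Idle:     mode_ = Mode::Relocate; break;
      case Mode::Relocate: mode_ = Mode::Resize;   break;
      case Mode::Resize:   mode_ = Mode::Orient;   break;
      case Mode::Orient:   mode_ = Mode::Idle;     break;
    }
  }

  // dx, dy come from hand tracking; dz from the ring trackball.
  void onMotion(Object3D& obj, double dx, double dy, double dz) {
    double* target = nullptr;
    switch (mode_) {
      case Mode::Relocate: target = obj.pos;   break;
      case Mode::Resize:   target = obj.scale; break;
      case Mode::Orient:   target = obj.rot;   break;
      case Mode::Idle:     return;             // motions ignored outside a mode
    }
    target[0] += dx; target[1] += dy; target[2] += dz;
  }

  Mode mode() const { return mode_; }

 private:
  Mode mode_ = Mode::Idle;
};

int main() {
  Object3D box;
  Manipulator m;
  m.onModeToggle();                 // closed hand: enter "relocate"
  m.onMotion(box, 0.2, -0.1, 0.5);  // hand moves the object; trackball sets depth
  m.onModeToggle();                 // "finished": advance to "resize"
  std::cout << "pos=(" << box.pos[0] << "," << box.pos[1] << "," << box.pos[2] << ")\n";
}
```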

8.7.5 Multimodal integration

The integration of four modalities is capable of controlling the entire AR system, including in particular the Tunnel Tool: hand gesture input, voice commands, unidirectional trackball motions, and head orientation. The most frequent interaction, in fact, happens almost without the user being aware of it: head motions are tracked and immediately reflected in the rendered virtual objects. Features are extracted and interpreted independently on every channel. That is, the modalities are combined with late integration, after grammatically correct sentences have been extracted and the location and posture of the hand have been determined. The style of high-level interpretation differs according to input commands and system state. Three styles of late integration are blended in a way that maximizes the overall usability while choosing input from the best-suited modality for a task.

Independent, concurrent interpretation:

Input of this style is interpreted immediately and as atomic commands; think mouse movements over the same window while typing. In our system, most speech commands can be given at any time and have the same effect at any time. For example, the speech directive “add networking to surroundings” can occur simultaneously with gesture or trackball commands and it is interpreted independently of their state.

Singular interpretation of redundant commands:

Redundant commands, that is, commands from one channel that can substitute for commands from a different input channel, are useful for giving the user a choice to pick the momentarily most convenient way to give an instruction. They are interpreted in exactly the same way, and in the case that multiple, mutually redundant commands are given, they are treated as a single instruction. We currently have two cases of this style: “select picking tool for finger” achieves the same result as performing a dedicated hand posture while the hand is being tracked, and the ‘release’ gesture during object manipulation is equivalent to the “finished” speech command. For the case of concurrent commands, we chose two seconds as an appropriate interval in which the mutually redundant commands are to be considered as one. In the first case we avoid such an arbitrary threshold implicitly by entering the picking mode, in which the two commands are not associated with a meaning.
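The two-second window can be implemented as a simple debouncing filter, sketched below. Commands from different channels are first mapped to a shared canonical name (an assumption for illustration); the filter then swallows a duplicate that arrives within the window mentioned above.

```cpp
// Treat mutually redundant commands within a two-second window as one.
#include <iostream>
#include <string>

class RedundantCommandFilter {
 public:
  // Returns true if the command should be executed, false if it duplicates an
  // equivalent command that already fired within the last two seconds.
  bool accept(const std::string& canonicalCommand, double nowSeconds) {
    if (canonicalCommand == lastCommand_ && nowSeconds - lastTime_ < kWindow)
      return false;  // redundant: swallow the duplicate
    lastCommand_ = canonicalCommand;
    lastTime_ = nowSeconds;
    return true;
  }

 private:
  static constexpr double kWindow = 2.0;  // seconds, as chosen in the text
  std::string lastCommand_;
  double lastTime_ = -1e9;
};

int main() {
  RedundantCommandFilter filter;
  // The 'release' gesture and the spoken "finished" map to the same canonical
  // command, so only the first of the two takes effect.
  std::cout << filter.accept("finished", 10.0) << "\n";  // 1: executed
  std::cout << filter.accept("finished", 10.8) << "\n";  // 0: redundant
  std::cout << filter.accept("finished", 13.5) << "\n";  // 1: outside the window
}
```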

Sequential, moded interpretation:

This style does the opposite of redundant commands: it requires users to provide input first in one modality, then in another. This is a common style within the desktop metaphor – first a mouse click to give focus to a window, then keyboard interaction with that window – and, there, it has the drawback of an associated switching time between mouse and keyboard. In our system, however, there is no such switching time since the two involved modalities do not both involve the hands: the drawing and virtual object manipulation modes use gestures for spatial input and voice commands for mode selection. In fact, we chose this style because it makes the best use of each modality without creating a conflict.

Overall, the modalities work together seamlessly, allowing for interaction that has an almost conversational character. Voice commands allow the user to easily switch features or tools on and off and to enter non-spatial input such as adding items to visualization environments. Hand gestures make for a very natural input interface for spatial data, and a few key hand postures make it possible to perform simple action sequences entirely with gestures. Finally, the trackball provides exact, continuous 1-dimensional input for situations where hand gestures are less convenient or 3-dimensional input is desired.

8.7.6 Benefits of HandVu for powerful interfaces

The benefits of using HandVu and our multimodal interface to control the powerful and complex application functionality became most apparent during the early stages of outdoor system deployment. All application parameters were initially controlled through a keyboard and mouse interface that required access to the two laptop computers. We could only achieve this by spreading out the hardware on a bench (aside from the head-worn components), prohibiting almost all user mobility. Furthermore, immersion in the head-worn display made the interaction virtually impossible, and a second person had to operate keyboard and mouse on command of the HMD wearer. The ring trackball (see Figure 8.9), along with a keyboard held by the main user, improved matters, but this was obviously still significantly flawed. It was not until we had most input functionality from the speech and gesture recognition available that the system became less cumbersome to operate. Finally, we were able to completely stow away all computational components in a backpack and regained mobility. The user now interfaces with the application only through the head-worn camera, the ring trackball, the microphone, and the head-worn orientation tracker.

HandVu’s gesture recognition also contributes significantly to the multimodal interface as such. For example, spatial input capabilities are much more easily achieved with pointing actions than with spoken commands. By employing the most favorable modality for each task we have avoided awkward input procedures such as pointer- and menu-based command selection. But we have also given the user the possibility to select from two or more modalities to perform the same task, allowing for situational flexibility and personal preferences.

Both the availability of the one best-suited modality and the flexibility are steps towards more natural interaction, increasingly approximating human-human communication. This is not meant as a value statement about human-human interaction and whether human-computer interfaces should attempt its complete replication. However, speech recognition and the ability to interpret gestures build on long-evolved human skills that should not be ignored when searching for alternative input capabilities for wearable computers.

Wearable and mobile computers bring new applications to new fields of deployment. HandVu, together with the multimodal user interface, increases the expressiveness available to wearable computer users, which the devices can leverage, and it achieves respectable usability even for demanding application interfaces.

8.8 Conclusions

This chapter showed how vision-based hand gesture recognition, facilitated with the HandVu library and the WinTk Windows toolkit, can provide user input to applications with different characteristics and thereby achieve a number of objectives. Hand gestures as a replacement interface were shown in the first application; vision providing input in the absence of handheld devices in a wearable computer context was the contribution of the second demonstration; and the benefits of vision-based interfaces in cooperation with three other modalities were exemplified in the last application.

Logical extensions of the current interface include the following. First, hand gestures’ inherent 3D capabilities can supply additional input parameters, especially for manipulation of virtual 3D objects in small-scale virtual or augmented environments. Second, more robust two-handed interaction is equally promising, again due to its naturalness in manipulation, exploration, and communication contexts. Third, recognition of dynamic gestures such as clapping is also an interesting system extension.

239


Chapter 8. <strong>H<strong>and</strong></strong> <strong>Gesture</strong>s in Application<br />

We showed the potential <strong>and</strong> benefits of vision-based interfaces in various<br />

contexts, hoping to have stimulated interest in further exploration of VBIs as<br />

interaction modality.<br />

240


Chapter 9

The Future in Your Hands

9.1 Recapitulation

We have developed HandVu¹, a computer vision system for the recognition of hand gestures in real time. Novel and improved vision methods had to be devised to meet the strict demands of user interfaces. Tailoring the system and applications for hand motions within a comfort zone that we have established improves user satisfaction and helps optimize the vision methods. Multiple applications demonstrated HandVu in action and showed that it adds to the options for interaction with non-traditional computing environments.

¹ HandVu is pronounced “hand-view.”


9.2 Limitations

The definition of comfort and the extent of the comfort zone are valuable tools to assess the convenience of postures and motions. Comfort does not predicate risk-free postures – injury-prone biomechanics must be determined independently.

HandVu performs well on the tasks that it is designed for. However, it is a research tool that can be fooled easily; that is, HandVu does not currently provide consumer-grade reliability. It recognizes syntactic gestures only; hand waving and other, more semantically laden gestures have to be recognized in subsequent processing stages to which HandVu supplies its results.

Keyboards and mice are probably going to be the preferred user input modalities for many applications. Gesture recognition should not be expected to replace traditional interfaces. The limitations of gesture interfaces must be kept in mind when looking for applications of the technology.

9.2.1 Limits of hand gesture interfaces

Hand gesture interfaces are not the silver bullet that solves all human-computer interaction problems. Rather, they are one of many interaction technologies that need to coexist and cooperate to jointly enhance our communication abilities with computers. There are and will be situations when the hands are occupied with other tasks, when it is socially unacceptable to gesture, or when disabilities prevent their use. Speech recognition is often a well-suited complementary input means that can be conveniently employed in situations when hand gestures fail, and vice versa.

Tactile feedback is lacking when haptically interacting with virtual objects or performing free-hand gestures for commands. The actual disadvantage this inflicts must still be determined, but it is foreseeable that proprioception and human vision cannot fully compensate for this lack of fidelity.

Too large a gesture vocabulary could inhibit adoption of otherwise advantageous interfaces. The evolution of the computer mouse – from a single-button device to multiple buttons and scroll wheels – gave a generation enough time to slowly accept new additions to a by-then-familiar device.

Social acceptance must not be overlooked, and it is advisable to incrementally introduce unusual concepts such as gesturing wildly in the presence of other people.

9.2.2 Limits of vision

Computer vision processing still commands high computing power and high data rates. While smart cameras and vision chips promise to mitigate these factors, commodity hardware will dominate the market for vision platforms at least for the next five years. However, only with smaller, lighter, and less power-hungry devices will the most auspicious environment for vision interfaces be exploitable: computing in mobile and wearable scenarios.

At least monocular vision is theoretically incapable of recovering the full finger configuration of a hand due to occlusions and other circumstances that introduce ambiguity.

Social factors pertaining to computer vision must also not be overlooked. The camera that must be worn on the body might raise privacy concerns for people in the vicinity of the user. The camera could be recording at any time, which might not be in the interest of the filmed party. It has also recently come under discussion whether filming in public places should be generally forbidden if there is reason to believe that the recordings could be used to plan or help execute malicious activities. Social norms develop slowly over time, and a sudden novelty like a permanently worn camera might take considerable adjustment time. Steve Mann pioneered permanently worn computers and gained many insights about being a “photographic cyborg” over more than twenty years, reported, for example, in [114].

244


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

9.3 Next-generation computer interfaces<br />

Provided all or most of these difficulties can be overcome, the potentials <strong>for</strong><br />

vision-based interfaces are great. Not only will we finally be able to communicate<br />

with a computer like a peer, but with someone who underst<strong>and</strong>s subtle notions<br />

<strong>and</strong> speech-accompanying gestures, we will also be one step closer to virtual worlds<br />

that af<strong>for</strong>d all properties of the real world, with benefits <strong>for</strong> design, travel, meet-<br />

ings, <strong>and</strong> the health sector to name just a few. For example, mute people will be<br />

able to have their gesturing in a sign language (such as the American or the Chi-<br />

nese Sign Language) translated instantaneously into computer-generated speech.<br />

A more distant hope is to recreate some of the human retina’s functionality <strong>and</strong><br />

to build artificial vision systems that would allow blind people to gain at least a<br />

rudimentary sense of vision.<br />

Closer to actual deployment are h<strong>and</strong> gesture interfaces <strong>for</strong> the surgeon. Anti-<br />

septic environments prohibit use of a conventional keyboard <strong>and</strong> mouse interface<br />

to a computer. Inefficient mediation through an operation assistant is currently<br />

the only way <strong>for</strong> the surgeon to access a computer – frequently an essential tool to<br />

modern medicine. Free-h<strong>and</strong> gesture recognition immediately alleviates problems<br />

of sepsis.<br />

245


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

Once a camera is worn on the body, tremendous opportunities present them-<br />

selves, especially in combination with a head-worn display unit: recognition of<br />

familiar faces can augment the human name memory, scene recognition can aid<br />

in navigation, personal video albums can be created with ease, <strong>and</strong> so on. More<br />

technically, registration <strong>for</strong> augmented reality has been shown to be very accurate<br />

with vision sensors – a head-worn camera is in the ideal place to facilitate this<br />

function.<br />

As Turk notes in [177], the technical challenges that presently constitute the<br />

main hurdles are robustness of the vision methods, their speed, automatic ini-<br />

tialization, interface usability, <strong>and</strong> contextual integration of the interface into the<br />

application.<br />

Very immediate challenges are, <strong>for</strong> example, robust real-time detection of<br />

h<strong>and</strong>s in arbitrary postures <strong>and</strong> 3D h<strong>and</strong> posture estimation (the recovery of each<br />

finger’s joint angles). Other applications such as marker-less full body tracking<br />

would benefit from advances in this topic as well, with applications, <strong>for</strong> example,<br />

in motion capture in unprepared environments.<br />

A few developments will increase the speed with which these dreams can be<br />

realized.<br />

• Vision chips are image acquisition chips with integrated, programmable circuitry. Highly parallel processing at the data source allows for very fast image processing, avoiding off-chip bandwidth limitations. However, no standard programming models are established for vision chips; only custom and often hardware-based solutions allow their programming. A low-level language similar to OpenGL is needed to bridge the gap between hardware developers and computer vision researchers.

• Z-cameras, or depth cameras, report each pixel’s distance from the focal plane. Object segmentation, in particular of proximal objects such as the hands, is almost trivial given these data. The challenges are high-resolution chips and models that can incorporate the large amount of information to achieve superior precision and accuracy.

• Standardization at various levels is mandatory for fast progress, yet it requires concerted efforts, oftentimes of business competitors. Standards at the intersection between hardware and software promise equally accelerated progress, as OpenGL did for graphics. Models of gestural interaction must be developed that are independent of specific implementation methods (such as computer vision or data gloves) and their momentary shortcomings. This is important to free application developers from the burden of mastering the gesture recognition implementation technology.

• Hardware miniaturization and performance improvements, especially of the vision and display components, would help to lower the threshold for user acceptance of the gear required to implement pioneering applications in non-traditional computing environments.

• Virtual reality and augmented reality research into user interaction paradigms has in the past brought about the most compelling applications for hand gesture interfaces. Continuing this trend is important to bring these inextricable technologies forward in close synchronization.

• Wearable and mobile computer systems are equally important because they allow computers to penetrate more areas of our lives and because of their particularly high demands on the user interface. The challenge is to offer versatile, highly efficient interaction methods without a large form factor and without impeding the wearer’s ability to interact normally with the environment.

• The famed killer application would tremendously accelerate progress, for example, a computer game for the embedded-devices market that makes compelling use of computer vision. This would help break the chicken-and-egg problem of mutually required hardware capabilities and consumer market size.

248


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

9.4 Conclusions<br />

We have broken the ground <strong>for</strong> easy deployment of vision-based h<strong>and</strong> gesture<br />

interfaces in many application domains. Our integrated approach constitutes an<br />

example <strong>for</strong> how novel user interface technologies should be introduced: careful<br />

technology selection <strong>and</strong> evaluation at every level, from theory <strong>and</strong> human factors<br />

considerations to the practical issues of balancing latency with accuracy.<br />

With this dissertation, we have shown that computer vision is on the brink of<br />

becoming a viable user interface technology <strong>for</strong> consumer-grade applications. The<br />

research conducted produced contributions in reliable detection <strong>and</strong> fast tracking<br />

of h<strong>and</strong>s in video images, <strong>and</strong> robust posture recognition; it made possible the<br />

definition of postural com<strong>for</strong>t so that it can be measured with entirely objective<br />

means; <strong>and</strong> lastly, it demonstrated the enhanced interaction capabilities that the<br />

newly available modality enables.<br />

Computers are immensely powerful <strong>and</strong> we have only begun to explore their<br />

far-reaching capabilities. By enabling new ways to interact with computers, <strong>and</strong><br />

by building a toolkit of available interaction modalities, we open the door <strong>for</strong> new<br />

functionalities, new devices, <strong>and</strong> new ways to think about <strong>and</strong> to think with com-<br />

puters. Interaction with h<strong>and</strong> gestures is an important step in that direction since<br />

h<strong>and</strong> motions <strong>and</strong> actions assume such crucial <strong>and</strong> diverse roles in our daily lifes.<br />

249


Chapter 9. The Future in Your <strong>H<strong>and</strong></strong>s<br />

Computer vision in particular offers unencumbered data acquisition capabilities,<br />

<strong>and</strong> our work has shown that it is ready to be taken out of the lab into real appli-<br />

cations. We are in anticipation of further progress on topics that bring together<br />

the fields of computer vision, human-computer interaction, <strong>and</strong> graphics.<br />

250


Bibliography<br />

[1] V. Athitsos <strong>and</strong> S. Sclaroff. Estimating 3D <strong>H<strong>and</strong></strong> Pose from a Cluttered<br />

Image. In Proc. IEEE Conference on Computer <strong>Vision</strong> <strong>and</strong> Pattern Recognition,<br />

volume 2, pages 432–439, 2003.<br />

[2] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, <strong>and</strong> B. MacIntyre.<br />

Recent Advances in Augmented Reality. IEEE Computer Graphics <strong>and</strong><br />

Applications, 21(6):34–47, Nov/Dec 2001.<br />

[3] R. Azuma, J. W. Lee, B. Jiang, J. Park, S. You, <strong>and</strong> U. Neumann. Tracking<br />

in Unprepared Environments <strong>for</strong> Augmented Reality Systems. ACM<br />

Computers & Graphics, 23(6):787–793, December 1999.<br />

[4] R. T. Azuma. A Survey of Augmented Reality. Presence: Teleoperators <strong>and</strong><br />

Virtual Environments, 6(4):355 – 385, August 1997.<br />

[5] R. Bane <strong>and</strong> T. Höllerer. Interactive Tools <strong>for</strong> Virtual X-Ray <strong>Vision</strong> in<br />

Mobile Augmented Reality. In Proc. IEEE <strong>and</strong> ACM Intl. Symposium on<br />

Mixed <strong>and</strong> Augmented Reality, November 2004.<br />

[6] J. L. Barron, D. J. Fleet, <strong>and</strong> S. S. Beauchemin. Per<strong>for</strong>mance of Optical<br />

Flow Techniques. Int. Journal of Computer <strong>Vision</strong>, 12(1):43–77, 1994.<br />

[7] H. S. J. Bell <strong>and</strong> F. Wu. Very Fast Template Matching. In European<br />

Conference on Computer <strong>Vision</strong>, pages 358–372, May 2002.<br />

[8] S. Belongie, J. Malik, <strong>and</strong> J. Puzicha. Shape Matching <strong>and</strong> Object Recognition<br />

Using Shape Contexts. In IEEE Trans. Pattern Analysis <strong>and</strong> Machine<br />

Intelligence, volume 24, pages 509–522, April 2002.<br />

[9] V. Bhatmager, C. Drury, <strong>and</strong> S. Schiro. Posture, Postural Discom<strong>for</strong>t <strong>and</strong><br />

Per<strong>for</strong>mance. Human Factors, 27:189–199, 1985.<br />

251


Bibliography<br />

[10] S. Birchfield. Elliptical head tracking using intensity gradients <strong>and</strong> color<br />

histograms. In Proceedings of the IEEE Conference on Computer <strong>Vision</strong><br />

<strong>and</strong> Pattern Recognition, pages 232–237, June 1998.<br />

[11] R. A. Bolt. Put-That-There: Voice <strong>and</strong> <strong>Gesture</strong> in the Graphics Interface.<br />

Computer Graphics, ACM SIGGRAPH, 14(3):262–270, 1980.<br />

[12] G. Borg. Psychophysical bases of perceived exertion. Medicine <strong>and</strong> Science<br />

in Sports <strong>and</strong> Exercise, 14(5):377–381, 1982.<br />

[13] D. Bowman. Interactive Techniques <strong>for</strong> Common Tasks in Immersive Virtual<br />

Environments: Design, Evaluation, <strong>and</strong> Application. PhD thesis, Georgia<br />

Tech, 1999.<br />

[14] G. R. Bradski. Real-time face <strong>and</strong> object tracking as a component of a perceptual<br />

user interface. In Proc. IEEE Workshop on Applications of Computer<br />

<strong>Vision</strong>, pages 214–219, 1998.<br />

[15] A. Braf<strong>for</strong>t, C. Collet, <strong>and</strong> D. Teil. Anthropomorphic model <strong>for</strong> h<strong>and</strong> gesture<br />

interface. In Proceedings of the CHI ’94 conference companion on Human<br />

factors in computing systems, April 1994.<br />

[16] M. Br<strong>and</strong>. Shadow Puppetry. In Proc. Intl. Conference on Computer <strong>Vision</strong>,<br />

1999.<br />

[17] M. Bray, E. Koller-Meier, <strong>and</strong> L. V. Gool. Smart Particle Filtering <strong>for</strong> 3D<br />

<strong>H<strong>and</strong></strong> Tracking. In Proc. IEEE Intl. Conference on Automatic Face <strong>and</strong><br />

<strong>Gesture</strong> Recognition, 2004.<br />

[18] M. Bray, E. Koller-Meier, L. V. Gool, <strong>and</strong> N. M. Schraudolph. 3D <strong>H<strong>and</strong></strong><br />

Tracking by Rapid Stochastic Gradient Descent Using a Skinning Model. In<br />

European Conference on Visual Media Production (CVMP), March 2004.<br />

[19] L. Bretzner, I. Laptev, <strong>and</strong> T. Lindeberg. <strong>H<strong>and</strong></strong> <strong>Gesture</strong> Recognition using<br />

Multi-Scale Colour Features, Hierarchical Models <strong>and</strong> Particle Filtering. In<br />

Proc. IEEE Intl. Conference on Automatic Face <strong>and</strong> <strong>Gesture</strong> Recognition,<br />

pages 423–428, Washington D.C., 2002.<br />

[20] W. Broll, L. Schäfer, T. Höllerer, <strong>and</strong> D. Bowman. Interface with Angels:<br />

The Future of VR <strong>and</strong> AR <strong>Interfaces</strong>. IEEE Computer Graphics <strong>and</strong> Applications,<br />

21(6):14–17, Nov./Dec. 2001.<br />

252


Bibliography<br />

[21] H. Bunke and T. Caelli, editors. Hidden Markov Models in Vision, volume 15(1) of International Journal of Pattern Recognition and Artificial Intelligence. World Scientific Publishing Company, 2001.

[22] W. Buxton, E. Fiume, R. Hill, A. Lee, and C. Woo. Continuous hand-gesture driven input. In Proceedings of Graphics Interface '83, 9th Conference of the Canadian Man-Computer Communications Society, pages 191–195, May 1983.

[23] C. Cadoz. Les réalités virtuelles, 1994.

[24] D. B. Chaffin. Localized Muscle Fatigue – Definition and Measurement. Journal of Occupational Medicine, 15(4):346–354, 1973.

[25] D. B. Chaffin and G. B. J. Andersson. Occupational Biomechanics. Wiley-Interscience, 1984.

[26] M. K. Chung, I. Lee, D. Kee, and S. H. Kim. A Postural Workload Evaluation System Based on a Macro-postural Classification. Human Factors and Ergonomics in Manufacturing, 12(3):267–277, 2002.

[27] K. C. Clarke, A. Nuernberger, T. Pingel, and D. Qingyun. User Interface Design for a Wearable Field Computer. In Proc. of National Conference on Digital Government Research, 2002.

[28] D. Comaniciu, V. Ramesh, and P. Meer. Real-Time Tracking of Non-Rigid Objects Using Mean Shift. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 142–149, 2000.

[29] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active Appearance Models. In Proc. European Conference on Computer Vision, pages 484–498, 1998.

[30] T. F. Cootes and C. J. Taylor. Active Shape Models: Smart Snakes. In Proceedings of the British Machine Vision Conference, pages 9–18. Springer-Verlag, 1992.

[31] E. N. Corlett and R. P. Bishop. A Technique for Assessing Postural Discomfort. Ergonomics, 19(1):175–182, 1976.

[32] J. L. Crowley, F. Berard, and J. Coutaz. Finger Tracking as an Input Device for Augmented Reality. In Intl. Workshop on Automatic Face and Gesture Recognition, 1995.

[33] Y. Cui and J. Weng. A Learning-Based Prediction and Verification Segmentation Scheme for Hand Sign Image Sequence. IEEE Trans. Pattern Analysis and Machine Intelligence, pages 798–804, 1999.

[34] R. Cutler and M. Turk. View-based Interpretation of Real-time Optical Flow for Gesture Recognition. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 416–421, April 1998.

[35] R. Desimone, T. D. Albright, C. G. Gross, and C. Bruce. Stimulus-Selective Properties of Inferior Temporal Neurons in the Macaque. Journal of Neuroscience, 4(8):2051–2062, August 1984.

[36] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 126–133, 2000.

[37] M. Dias, J. Jorge, J. Carvalho, P. Santos, and J. Luzio. Usability Evaluation of Tangible User Interfaces for Augmented Reality. In IEEE Intl. Augmented Reality Toolkit Workshop, 2003.

[38] D. E. DiFranco, T.-J. Cham, and J. M. Rehg. Reconstruction of 3-D Figure Motion from 2-D Correspondences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, November 2001.

[39] S. M. Dominguez, T. Keaton, and A. H. Sayed. Robust Finger Tracking for Wearable Computer Interfacing. In ACM PUI 2001, Orlando, FL, 2001.

[40] K. Dorfmüller-Ulhaas and D. Schmalstieg. Finger Tracking for Interaction in Augmented Environments. In Proc. ACM/IEEE Intl. Symposium on Augmented Reality, 2001.

[41] C. G. Drury and B. G. Coury. A methodology for chair evaluation. Applied Ergonomics, 13(3):195–202, 1982.

[42] S. Feiner, B. MacIntyre, T. Höllerer, and T. Webster. A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring the Urban Environment. In Proc. First Intl. Symp. on Wearable Computers, October 1997.

[43] T. G. Fikes. System Architecture Analysis for Reaching and Grasping. PhD thesis, University of California at Santa Barbara, 1993.

[44] P. M. Fitts. The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47:381–391, 1954.

[45] E. Foxlin and M. Harrington. WearTrack: A Self-Referenced Head and Hand Tracker for Wearable Computers and Portable VR. In 4th Intl. Symp. on Wearable Computers, pages 155–162, October 2000.

[46] E. Foxlin and L. Naimark. VIS-Tracker: A Wearable Vision-Inertial Self-Tracker. In Proc. of the IEEE Virtual Reality Conference, 2003.

[47] W. T. Freeman, D. B. Anderson, P. A. Beardsley, C. N. Dodge, M. Roth, C. D. Weissman, and W. S. Yerazunis. Computer Vision for Interactive Computer Graphics. IEEE Computer Graphics and Applications, pages 42–53, May-June 1998.

[48] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: EuroCOLT, pages 23–37. Springer-Verlag, 1995.

[49] M. Fukumoto, Y. Suenaga, and K. Mase. Finger-Pointer: Pointing Interface by Image Processing. Computers & Graphics, 18(5):633–642, 1994.

[50] Y. Gdalyahu and D. Weinshall. Flexible Syntactic Matching of Curves and Its Application to Automatic Hierarchical Classification of Silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12), December 1999.

[51] E. Grandjean. Fitting the Task to the Man – An Ergonomic Approach. Taylor & Francis Ltd, London, 1969.

[52] S. Grange, E. Casanova, T. Fong, and C. Baur. Vision-based Sensor Fusion for Human-Computer Interaction. In Intl. Conference on Intelligent Robots and Systems, October 2002.

[53] Y. Hamada, N. Shimada, and Y. Shirai. Hand Shape Estimation Using Sequence of Multi-Ocular Images Based on Transition Network. In VI 2002, 2002.

[54] C. Hand. A Survey of 3D Interaction Techniques. Computer Graphics Forum, 16(5):269–281, 1997.

[55] A. G. Hauptmann. Speech and Gesture for Graphic Image Manipulation. In ACM CHI, pages 241–245, May 1989.

[56] T. Heap and D. Hogg. Towards 3D Hand Tracking Using a Deformable Model. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 1996.

[57] N. Hedley, M. Billinghurst, L. Postner, R. May, and H. Kato. Explorations in the Use of Augmented Reality for Geographic Visualization. Presence, 11(2):119–133, 2002.

[58] K. Hinckley, R. Pausch, J. C. Goble, and N. F. Kassell. A survey of design issues in spatial input. In Proceedings of the 7th Annual ACM Symposium on User Interface Software and Technology, pages 213–222. ACM Press, 1994.

[59] K. Hinckley, R. Pausch, D. Proffitt, and N. F. Kassell. Two-handed virtual manipulation. ACM Transactions on Computer-Human Interaction (TOCHI), 5(3):260–302, 1998.

[60] E. Hjelmås and B. K. Low. Face Detection: A Survey. Computer Vision and Image Understanding, 83(3):236–274, September 2001.

[61] T. Höllerer, S. Feiner, D. Hallaway, B. Bell, M. Lanzagorta, D. Brown, S. Julier, Y. Baillot, and L. Rosenblum. User Interface Management Techniques for Collaborative Mobile Augmented Reality. Computers and Graphics, 25(5):799–810, October 2001.

[62] T. Höllerer, S. Feiner, T. Terauchi, G. Rashid, and D. Hallaway. Exploring MARS: Developing Indoor and Outdoor User Interfaces to a Mobile Augmented Reality System. Computers and Graphics, 23(6):779–785, December 1999.

[63] P. Hong, M. Turk, and T. S. Huang. Gesture Modeling and Recognition Using Finite State Machines. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 410–415. IEEE Computer Society, March 2000.

[64] X. Hou, S. Z. Li, H. Zhang, and Q. Cheng. Direct Appearance Models. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[65] C. Hummels and P. J. Stappers. Meaningful Gestures for Human Computer Interaction: Beyond Hand Postures. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, April 1998.

[66] M. Isard and A. Blake. A mixed-state CONDENSATION tracker with automatic model-switching. In ICCV, pages 107–112, 1998.

[67] M. Isard and A. Blake. Condensation – Conditional Density Propagation for Visual Tracking. Int. Journal of Computer Vision, 1998.

[68] J. Isdale. What Is Virtual Reality? A Web-Based Introduction, September 1998. http://vr.isdale.com/WhatIsVR.html.

[69] T. Jebara, B. Schiele, N. Oliver, and A. Pentland. DyPERS: Dynamic Personal Enhanced Reality System. In Image Understanding Workshop, November 1998.

[70] N. Jojic, M. Turk, and T. Huang. Tracking Self-Occluding Articulated Objects in Dense Disparity Maps. In Proc. Intl. Conference on Computer Vision, September 1999.

[71] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

[72] M. Jones and P. Viola. Fast Multi-view Face Detection. Technical Report TR2003-96, MERL, July 2003.

[73] M. J. Jones and J. M. Rehg. Statistical Color Models with Application to Skin Detection. Int. Journal of Computer Vision, 46(1):81–96, January 2002.

[74] S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte. A new approach for filtering nonlinear systems. In Proc. American Control Conference, pages 1628–1632, June 1995.

[75] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME, Journal of Basic Engineering, pages 34–45, 1960.

[76] Y. Kameda, M. Minoh, and K. Ikeda. Three dimensional pose estimation of an articulated object from its silhouette image. In Proceedings of Asian Conference on Computer Vision, pages 612–615, 1993.

[77] K. Karhunen. Über Lineare Methoden in der Wahrscheinlichkeitsrechnung. Annales Academiae Scientiarum Fennicae, 37:3–79, 1946.

[78] W. Karwowski, R. Eberts, G. Salvendy, and S. Noland. The effects of computer interface design on human postural dynamics. Ergonomics, 37(4):703–724, 1994.

[79] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. In Proc. Intl. Conference on Computer Vision, pages 259–268, 1987.

[80] H. Kato and M. Billinghurst. Marker Tracking and HMD Calibration for a Video-Based Augmented Reality Conferencing System. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pages 85–94, October 1999.

[81] H. Kato, M. Billinghurst, I. Poupyrev, K. Imamoto, and K. Tachibana. Virtual Object Manipulation on a Table-Top AR Environment. In Proc. Intl. Symp. Augmented Reality, pages 111–119. IEEE CS Press, 2000.

[82] D. Kee. A method for analytically generating three-dimensional isocomfort workspace based on perceived discomfort. Applied Ergonomics, 33(1):51–62, 2002.

[83] A. Kendon. How gestures can become like words. In Cross-Cultural Perspectives in Nonverbal Communication, pages 131–141, 1988.

[84] M. Kirby and L. Sirovich. Application of the Karhunen-Loève Procedure for the Characterization of Human Faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):103–108, January 1990.

[85] R. Kjeldsen and J. Kender. Finding Skin in Color Images. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, pages 312–317, October 1996.

[86] N. Kohtake, J. Rekimoto, and Y. Anzai. InfoPoint: A Device that Provides a Uniform User Interface to Allow Appliances to Work Together over a Network. Personal and Ubiquitous Computing, 5(4):264–274, 2001.

[87] D. Koller, P. Lindstrom, W. Ribarsky, L. F. Hodges, N. Faust, and G. Turner. Virtual GIS: A Real-Time 3D Geographic Information System. In Proceedings of Visualization '95, pages 94–100, October 1995.

[88] M. Kölsch, A. C. Beall, and M. Turk. An Objective Measure for Postural Comfort. In HFES Annual Meeting Notes, October 2003.

[89] M. Kölsch, A. C. Beall, and M. Turk. The Postural Comfort Zone for Reaching Gestures. In HFES Annual Meeting Notes, October 2003.

[90] M. Kölsch and M. Turk. Analysis of Rotational Robustness of Hand Detection with a Viola-Jones Detector. In IAPR International Conference on Pattern Recognition, 2004.

[91] M. Kölsch and M. Turk. Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration. In IEEE Workshop on Real-Time Vision for Human-Computer Interaction (at CVPR), 2004.

[92] M. Kölsch and M. Turk. Robust Hand Detection. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, May 2004.

[93] M. Kölsch, M. Turk, and T. Höllerer. Vision-Based Interfaces for Mobility. In Intl. Conference on Mobile and Ubiquitous Systems (MobiQuitous), August 2004.

[94] T. Koskela and I. Vilpola. Usability of MobiVR Concept: Towards Large Virtual Touch Screen for Mobile Devices. In Proc. Intl. Conference on Mobile HCI, 2004.

[95] T. Koskela, I. Vilpola, and I. Rakkolainen. User Requirements for Large Virtual Display and Finger Pointing Input for Mobile Devices. In Proc. Intl. Conference on Mobile and Ubiquitous Multimedia, December 2003.

[96] M. Kourogi and T. Kurata. A method of personal positioning based on sensor data fusion of wearable camera and self-contained sensors. In Proc. IEEE Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 287–292, 2003.

[97] D. M. Krum, O. Omoteso, W. Ribarsky, T. Starner, and L. F. Hodges. Evaluation of a Multimodal Interface for 3D Terrain Visualization. In IEEE Visualization, pages 411–418, October 27–November 1, 2002.

[98] D. M. Krum, O. Omoteso, W. Ribarsky, T. Starner, and L. F. Hodges. Speech and Gesture Multimodal Control of a Whole Earth 3D Visualization Environment. In Proc. Joint Eurographics and IEEE TCVG Symposium on Visualization (VisSym), pages 195–200, May 2002.

[99] T. Kurata, T. Kato, M. Kourogi, J. Keechul, and K. Endo. A Functionally-Distributed Hand Tracking Method for Wearable Visual Interfaces and Its Applications. In Proc. IAPR Workshop on Machine Vision Applications, pages 84–89, 2002.

[100] T. Kurata, T. Okuma, M. Kourogi, and K. Sakaue. The Hand Mouse: GMM Hand-color Classification and Mean Shift Tracking. In Second Intl. Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time Systems, July 2001.

[101] I. Laptev and T. Lindeberg. Tracking of multi-state hand models using particle filtering and a hierarchy of multi-scale image features. Technical Report ISRN KTH/NA/P-00/12-SE, Department of Numerical Analysis and Computer Science, KTH (Royal Institute of Technology), September 2000.

[102] J. Lee and T. L. Kunii. Model-Based Analysis of Hand Posture. IEEE Computer Graphics and Applications, 15(5):77–86, 1995.

[103] A. Leganchuk, S. Zhai, and W. Buxton. Manual and cognitive benefits of two-handed input: an experimental study. ACM Transactions on Computer-Human Interaction (TOCHI), 5(4):326–359, 1998.

[104] M.-H. Liao and C. G. Drury. Posture, Discomfort and Performance in a VDT Task. Ergonomics, 43(3):345–359, 2000.

[105] R. Lienhart and J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. In Proc. IEEE Intl. Conference on Image Processing, volume 1, pages 900–903, September 2002.

[106] J. Lin, Y. Wu, and T. S. Huang. Modeling the Constraints of Human Hand Motion. In Proceedings of the 5th Annual Federated Laboratory Symposium, 2001.

[107] P. Lindstrom, D. Koller, W. Ribarsky, L. F. Hodges, A. O. den Bosch, and N. Faust. An Integrated Global GIS and Visual Simulation System. Technical Report GIT-GVU-97-07, Georgia Tech, 1997.

[108] M. M. Loève. Probability Theory. Van Nostrand, 1955.

[109] S. Lu, D. Metaxas, D. Samaras, and J. Oliensis. Using Multiple Cues for Hand Tracking and Model Refinement. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2003.

[110] B. D. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proc. Imaging Understanding Workshop, pages 121–130, 1981.

[111] J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface-quality hand tracking. In Proc. European Conf. Computer Vision, 2000.

[112] I. S. MacKenzie. Input devices and interaction techniques for advanced computing. In W. Barfield and T. A. Furness III, editors, Virtual Environments and Advanced Interface Design, pages 437–470. Oxford University Press, 1995.

[113] S. Mann. Smart clothing: The wearable computer and wearcam. Personal Technologies, 1(1), March 1997.

[114] S. Mann. Wearable Computing: A First Step Toward Personal Imaging. IEEE Computer, 30(2), February 1997.

[115] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, 1992.

[116] D. McNeill, editor. Language and Gesture. Cambridge University Press, 2000.

[117] M. Mine, F. Brooks, and C. Sequin. Moving Objects in Space: Exploiting Proprioception in Virtual Environment Interaction. In Proc. ACM SIGGRAPH, 1997.

[118] D. D. Morris and J. M. Rehg. Singularity Analysis for Articulated Object Tracking. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998.

[119] T. A. Mysliwiec. FingerMouse: A Freehand Computer Pointing Interface. Technical Report VISLab-94-001, Vision Interfaces and Systems Lab, The University of Illinois at Chicago, October 1994.

[120] C. Nölker and H. Ritter. GREFIT: Visual recognition of hand postures. In Gesture-Based Communication in HCI, pages 61–72, 1999.

[121] S. Nusser, L. Miller, K. Clarke, and M. Goodchild. Future views of field data collection in statistical surveys. In Proc. of National Conference on Digital Government Research, 2001.

[122] T. Oberg, L. Sandsjo, and R. Kadefors. Subjective and Objective Evaluation of Shoulder Muscle Fatigue. Ergonomics, 37(8):1323–1333, 1994.

[123] T. Ohshima, K. Satoh, H. Yamamoto, and H. Tamura. RV-Border Guards: A Multi-Player Mixed Reality Entertainment. Trans. Virtual Reality Soc. Japan, 4(4):699–705, 1999.

[124] E. J. Ong and R. Bowden. A Boosted Classifier Tree for Hand Shape Detection. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 889–894, 2004.

[125] V. Paelke, J. Stöcklein, C. Reimann, and W. Rosenbach. Supporting User Interface Evaluation of AR Presentation and Interaction Techniques with ARToolkit. In IEEE Intl. Augmented Reality Toolkit Workshop, 2003.

[126] J. Park, B. Jiang, and U. Neumann. Vision-based pose computation: Robust and accurate augmented reality tracking. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pages 3–12, October 1999.

[127] R. Pausch and R. D. Williams. Tailor: creating custom user interfaces based on gesture. In Proceedings of the Third Annual ACM SIGGRAPH Symposium on User Interface Software and Technology, 1990.

[128] V. Pavlovic, R. Sharma, and T. S. Huang. Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677–695, July 1997.

[129] V. I. Pavlovic and A. Garg. Boosted Detection of Objects and Attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[130] W. Piekarski and B. Thomas. ARQuake: The Outdoor Augmented Reality Gaming System. Communications of the ACM, 45(1):36–38, January 2002.

[131] W. Piekarski and B. Thomas. Tinmith-Hand: Unified User Interface Technology for Mobile Outdoor Augmented Reality and Indoor Virtual Reality. In IEEE VR, pages 287–288, March 2002.

[132] W. Piekarski and B. Thomas. Using AR Toolkit for 3D Hand Position Tracking in Mobile Outdoor Environments. In The First IEEE Workshop on the Augmented Reality Toolkit, September 2002.

[133] W. Piekarski and B. H. Thomas. Developing Interactive Augmented Reality Modelling Applications. In International Workshop on Software Technology for Augmented Reality Systems, 2003.

[134] J. S. Pierce, B. Stearns, and R. Pausch. Two Handed Manipulation of Voodoo Dolls in Virtual Environments. In Symposium on Interactive 3D Graphics, pages 141–145, 1999.

[135] M. Porta. Vision-based user interfaces: methods and applications. Int. Journal of Human-Computer Studies, 57:27–73, 2002.

[136] F. K. H. Quek. Eyes in the Interface. Image and Vision Computing, 13, August 1995.

[137] F. K. H. Quek. Unencumbered Gestural Interaction. IEEE Multimedia, 4(3):36–47, 1996.

[138] F. K. H. Quek, T. Mysliwiec, and M. Zhao. FingerMouse: A Freehand Pointing Interface. In Proc. Int'l Workshop on Automatic Face and Gesture Recognition, pages 372–377, June 1995.

[139] I. Rauschert, P. Agrawal, R. Sharma, S. Fuhrmann, I. Brewer, A. MacEachren, H. Wang, and G. Cai. Designing a Human-Centered, Multimodal GIS Interface to Support Emergency Management. In GIS, November 2002.

[140] S. Razzaque, Z. Kohn, and M. C. Whitton. Redirected Walking. In EUROGRAPHICS, 2001.

[141] J. M. Rehg and T. Kanade. Visual Tracking of High DOF Articulated Structures: an Application to Human Hand Tracking. In Third European Conf. on Computer Vision, pages 35–46, May 1994.

[142] J. M. Rehg and T. Kanade. Model-Based Tracking of Self-Occluding Articulated Objects. In Proc. Intl. Conference on Computer Vision, pages 612–617, June 1995.

[143] J. Rekimoto. Matrix: A Realtime Object Identification and Registration Method for Augmented Reality. In Proc. Asia Pacific Computer Human Interaction (APCHI), 1998.

[144] J. Rekimoto and K. Nagao. The World through the Computer: Computer Augmented Interaction with Real World Environments. In Proceedings of the Eighth Annual Symposium on User Interface Software and Technology (UIST '95), pages 29–36, 1995.

[145] C. W. Reynolds. Flocks, Herds, and Schools: A Distributed Behavioral Model. Computer Graphics, 21(4):25–34, 1987. SIGGRAPH '87 Conference Proceedings.

[146] B. J. Rhodes. The wearable remembrance agent: a system for augmented memory. Personal Technologies Journal; Special Issue on Wearable Computing, pages 218–224, 1997.

[147] D. A. Rosenbaum, R. J. Meulenbroek, J. Vaughan, and C. Jansen. Posture-Based Motion Planning: Applications to Grasping. Psychological Review, 108(4):709–734, 2001.

[148] G. Salvendy, editor. Handbook of Human Factors and Ergonomics. John Wiley & Sons, Inc., 2nd edition, 1997.

[149] Y. Sato, Y. Kobayashi, and H. Koike. Fast Tracking of Hands and Fingertips in Infrared Images for Augmented Desk Interface. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, March 2000.

[150] D. Saxe and R. Foulds. Toward robust skin identification in video images. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 379–384, September 1996.

[151] B. Schiele and A. Waibel. Gaze tracking based on face-color. In Proceedings of the International Workshop on Automatic Face- and Gesture-Recognition, pages 344–349, June 1995.

[152] J. Segen and S. Kumar. GestureVR: Vision-Based 3D Hand Interface for Spatial Interaction. In The Sixth ACM Intl. Multimedia Conference, September 1998.

[153] J. Segen and S. Kumar. Shadow Gestures: 3D Hand Pose Estimation Using a Single Camera. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1479–1486, 1999.

[154] J. Segen and S. Kumar. Look Ma, No Mouse! Communications of the ACM, 43(7):102–109, July 2000.

[155] C. Shan, Y. Wei, T. Tan, and F. Ojardias. Real Time Hand Tracking by Combining Particle Filtering and Mean Shift. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 2004.

[156] T. Sheridan and W. Ferrell. Remote Manipulative Control with Transmission Delay. IEEE Transactions on Human Factors in Electronics, 4:25–29, 1963.

[157] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proc. Intl. Conference on Computer Vision, pages 1154–1160, 1998.

[158] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Seattle, June 1994.

[159] N. Shimada, Y. Shirai, Y. Kuno, and J. Miura. Hand Gesture Estimation and Model Refinement Using Monocular Camera – Ambiguity Limitation by Inequality Constraints. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 268–273, April 1998.

[160] B. Shneiderman. Direct Manipulation and Virtual Environments. In Designing the User Interface: Strategies for Effective Human-Computer Interaction, chapter 6. Addison-Wesley, 3rd edition, March 1998.

[161] T. Starner, J. Auxier, D. Ashbrook, and M. Gandy. The Gesture Pendant: A Self-illuminating, Wearable, Infrared Computer Vision System for Home Automation Control and Medical Monitoring. In International Symposium on Wearable Computers, 2000.

[162] T. Starner, S. Mann, B. Rhodes, J. Healey, K. B. Russell, J. Levine, and A. Pentland. Wearable Computing and Augmented Reality. Technical report, MIT Media Lab, Vision and Modeling Group, November 1995.

[163] T. E. Starner, J. Weaver, and A. Pentland. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375, December 1998.

[164] A. State, G. Hirota, D. Chen, W. Garrett, and M. Livingston. Superior augmented reality registration by integrating landmark tracking and magnetic tracking. In Proceedings of SIGGRAPH, pages 439–446, August 1996.

[165] B. Stenger, P. R. S. Mendonça, and R. Cipolla. Model-Based 3D Tracking of an Articulated Hand. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 310–315, December 2001.

[166] J. Ström, T. Jebara, S. Basu, and A. Pentland. Real Time Tracking and Modeling of Faces: An EKF-based Analysis by Synthesis Approach. In ICCV, 1999.

[167] D. J. Sturman and D. Zeltzer. A Design Method for "Whole-Hand" Human-Computer Interaction. ACM Transactions on Information Systems, 11(3):219–238, July 1993.

[168] Z. Szalavári and M. Gervautz. The personal interaction panel – a two-handed interface for augmented reality. In Proc. 18th Eurographics, Eurographics Assoc., pages 335–346, 1997.

[169] R. M. Taylor II, T. C. Hudson, A. Seeger, H. Weber, J. Juliano, and A. T. Helser. VRPN: A Device-Independent, Network-Transparent VR Peripheral System. In VRST, 2001.

[170] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla. Shape Context and Chamfer Matching in Cluttered Scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 127–133, Madison, USA, June 2003.

[171] B. Thomas, B. Close, J. Donoghue, J. Squires, P. De Bondi, M. Morris, and W. Piekarski. ARQuake: An Outdoor/Indoor Augmented Reality First-Person Application. In Proc. of the Fourth International Symposium on Wearable Computers, pages 139–146, October 2000.

[172] B. H. Thomas and W. Piekarski. Glove Based User Interaction Techniques for Augmented Reality in an Outdoor Environment. Virtual Reality: Research, Development, and Applications, 6(3), 2002.

[173] M. Toews and T. Arbel. Entropy-of-likelihood Feature Selection for Image Correspondence. In Proc. Intl. Conference on Computer Vision, October 2003.

[174] C. Tomasi, A. Rafii, and I. Torunoglu. Full-Size Projection Keyboard for Handheld Devices. Communications of the ACM, 46(7):70–75, July 2003.

[175] H. L. Van Trees. Detection, Estimation, and Modulation Theory, volume 1. Wiley, 1968.

[176] M. Turk. Gesture recognition. In K. Stanney, editor, Handbook of Virtual Environments: Design, Implementation and Applications. Lawrence Erlbaum Associates Inc., December 2001.

[177] M. Turk. Computer Vision in the Interface. Communications of the ACM, 47(1):60–67, 2004.

[178] M. Turk and A. Pentland. Eigenfaces for Recognition. J. Cognitive Neuroscience, 3(1):71–86, 1991.

[179] P. Viola and M. Jones. Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade. In Neural Information Processing Systems, December 2001.

[180] P. Viola and M. Jones. Robust Real-time Object Detection. In Intl. Workshop on Statistical and Computational Theories of Vision, July 2001.

[181] C. von Hardenberg and F. Bérard. Bare-hand human-computer interaction. In Perceptual User Interfaces, 2001.

[182] G. Welch, G. Bishop, L. Vicci, S. Brumback, K. Keller, and D. Colucci. The HiBall Tracker: High-Performance Wide-Area Tracking for Virtual and Augmented Environments. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST), December 1999.

[183] S. F. Wiker, G. D. Langolf, and D. B. Chaffin. Arm Posture and Human Movement Capability. Human Factors, 31(4):421–441, 1989.

[184] A. Wilson and S. Shafer. XWand: UI for Intelligent Spaces. In ACM CHI, 2003.

[185] W. E. Woodson, B. Tillman, and P. Tillman. Human Factors Design Handbook. McGraw-Hill Professional, 2nd edition, 1992.

[186] C. R. Wren and A. P. Pentland. Dynamic Models of Human Motion. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, pages 22–27. IEEE Computer Society, April 1998.

[187] Y. Wu and T. S. Huang. Vision-based gesture recognition: A review. In A. Braffort, R. Gherbi, S. Gibet, J. Richardson, and D. Teil, editors, Gesture-Based Communication in Human-Computer Interaction, volume 1739 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin Heidelberg, 1999.

[188] Y. Wu and T. S. Huang. View-independent Recognition of Hand Postures. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 84–94, 2000.

[189] Y. Wu and T. S. Huang. Hand Modeling, Analysis, and Recognition. IEEE Signal Processing Magazine, May 2001.

[190] M.-H. Yang, D. J. Kriegman, and N. Ahuja. Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34–58, January 2002.

[191] S. You, U. Neumann, and R. Azuma. Orientation Tracking for Outdoor Augmented Reality Registration. IEEE Computer Graphics and Applications, 19(6):36–42, November/December 1999.

[192] S. J. Young. HTK: Hidden Markov Model Toolkit V1.5, December 1993. Entropic Research Laboratories Inc.

[193] B. D. Zarit, B. J. Super, and F. K. H. Quek. Comparison of Five Color Models in Skin Pixel Classification. In Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pages 58–63, September 1999.

[194] Z. Zhang, M. Li, S. Li, and H. Zhang. Multi-View Face Detection with FloatBoost. In Proc. IEEE Workshop on Applications of Computer Vision, 2002.

[195] Q. Zhu, K.-T. Cheng, C.-T. Wu, and Y.-L. Wu. Adaptive Learning of an Accurate Skin-Color Model. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 2004.

[196] X. Zhu, J. Yang, and A. Waibel. Segmenting Hands of Arbitrary Color. In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition, 2000.

[197] Y. Zhu, H. Ren, G. Xu, and X. Lin. Toward Real-Time Human-Computer Interaction with Continuous Dynamic Hand Gestures. In Proceedings of the Conference on Automatic Face and Gesture Recognition, pages 544–549, 2000.
