
Human-Computer Collaboration
in Video-Augmented Environment
for 3D Input

Lijiang Li

Doctor of Philosophy Dissertation

University of York

Department of Electronics

May 2008


Declaration

Except where otherwise stated in the text, this dissertation is the result of my own independent work and investigation, and is not the outcome of work done in collaboration.

Other sources are acknowledged by footnotes giving explicit references. A bibliography is appended at the end of this thesis.

This dissertation is not substantially the same as any I have submitted for a degree or diploma or any other qualification at any other university. No part of this dissertation has already been, or is currently being, submitted for any such degree, diploma or other qualification.


Abstract

The role of the computer has gradually changed from merely a tool to an assistant to the human. Equipping computers with I/O devices and sensors makes them aware of the surrounding world and capable of interacting with humans. Video cameras and data projectors are ideally suited as such devices, especially as dramatic drops in their manufacturing costs have made them more and more popular. A new type of user interface has emerged in which video signals are used as an augmentation to enhance the physical world; hence the name Video-Augmented Environment.

This thesis presents a design for human-computer interaction in a VAE for 3D input. It begins by introducing an automated and efficient method for fully calibrating the projector-camera system. Shape acquisition techniques are discussed, and one particular technique based on structured light systems is adapted for capturing depth information. A user-guided approach for registering depth information scanned from different parts of the target object is then introduced. Finally, a practical realisation of a Video-Augmented Environment is presented, combining the techniques discussed earlier.

Overall, the VAE designed in this thesis demonstrates the feasibility of completing computer vision tasks in a human-computer collaborative environment, and shows the potential and viability of being deployed not only in the laboratory but also in office and home environments.


Acknowledgements

Completing a PhD is a marathon event, and I would not have been able to complete this journey without the support and encouragement of countless people over the last four years.

First and foremost, I would like to express my deep and sincere gratitude to my supervisor, Professor John Robinson, Head of the Department of Electronics, University of York. His wide knowledge and expertise have been invaluable to me, while his personal guidance and constructive criticism have provided a good basis for my research and this thesis.

Many thanks to Justen Hyde and Daniel Parnham for providing the OpenIllusionist framework, from which the frame grabber originated, and for their help with a variety of other implementation issues. I wish to express my thanks to my lab partners Matthew Day and Eddie Munday for many inspiring talks and for their participation in the user experiments. My warm thanks are due to Owen Francis and the other CSG group members for their assistance.

During my placement with the FCG team at British Telecommunications in Ipswich, I collaborated with many colleagues, and I wish to extend my warmest thanks to Dr Li-Qun Xu, Ian Kegel and all those who helped me with my work. Their insights and comments were of great value during my placement, and I look forward to a continuing collaboration with the FCG team in the near future.

Finally, I owe my most loving thanks to my mum, for single-handedly raising me over the last twenty years. I would not be where I am without her support, her constant instilling of confidence in me, and most importantly her love.

Lijiang Li
York, UK, May 2008


Contents

Abstract 3
Acknowledgement 4

1 Introduction 19
1.1 Problem Statement 19
1.2 Terminologies 21
1.2.1 Augmented Reality and Virtual Reality 21
1.2.2 Video-Augmented Environments 21
1.3 Goals 22
1.4 Thesis Organisation 24
1.5 Contributions 25

2 Background and Prior Art 28
2.1 Image based 3D capture methods for depth estimation 29
2.1.1 Feature Based Methods 30
2.1.2 Optical Flow Based Methods 31
2.2 Active Shape Acquisition Methods 33
2.2.1 The Use of Structured Light System 35
2.3 Video-Augmented Environments (VAEs) 35
2.3.1 Related example VAEs in the past 36
2.3.2 Previous work at York 41
2.4 Conclusions 50

3 Calibration 51
3.1 Introduction 51
3.2 Background 55
3.3 Calibration Parameters 57
3.3.1 Intrinsic Parameters 57
3.3.2 The Reduced Camera Model 61
3.3.3 Extrinsic Parameters 62
3.3.4 Full Model 64
3.4 Calibrate Camera-Projector Pair 65
3.4.1 World Coordinate System 65
3.4.2 Methodology 66
3.4.3 Data Collection 67
3.4.4 Choice of colour 70
3.4.5 Camera Calibration 73
3.4.6 Projector Calibration 74
3.5 Plane to Plane Calibration 78
3.6 Conclusions 82
3.6.1 Future Work 83

4 Shape Acquisition 87
4.1 Introduction 87
4.2 Background 89
4.3 Gray Codification 93
4.3.1 Gray Code Patterns 93
4.3.2 Pattern Generation 95
4.3.3 Codification Mechanism 98
4.4 Practical Issues 101
4.4.1 Image Levels 101
4.4.2 Limited Camera Resolution 102
4.4.3 Inverse subtraction 104
4.4.4 Adaptive thresholding 107
4.5 Depth from Triangulation 109
4.5.1 Final Captured Data 112
4.6 Conclusions 116
4.6.1 Future Work 118

5 Registration of Point Sets 121
5.1 Introduction 123
5.2 Background 125
5.2.1 Rotations and Translations in 3D 125
5.2.2 A Singular Value Decomposition (SVD) Based Least Square Fitting Method 126
5.3 Image Registration 127
5.3.1 Corner Detector 127
5.3.2 Normalised Cross Correlation 129
5.3.3 Outlier Removals 131
5.4 Fusion 136
5.4.1 Data structure of a point set 136
5.4.2 Point set fusion with voxel quantisation 137
5.4.3 User Assisted Tuning 141
5.5 Rendering A Rotating Object 143
5.6 Conclusions 145
5.6.1 Future Work 146

6 System Design 148
6.1 Introduction 148
6.2 Widgets Provided for Interaction 151
6.2.1 Introduction 151
6.2.2 Background 154
6.2.3 Practical Issues 156
6.2.4 Implementation of Pushbutton 157
6.2.5 Implementation of Touchpad 164
6.3 User interface 166
6.4 Main Utilities 169
6.4.1 Overview 170
6.4.2 Mode 1: Inspect 172
6.4.3 Mode 2: Touchup 175
6.4.4 Mode 3: Correspondence 179
6.4.5 Mode 4: Visualisation 186
6.5 Conclusions 193
6.5.1 Future Work 194

7 System Evaluation 195
7.1 Test Objects 196
7.1.1 An Overview 196
7.1.2 Object Descriptions 196
7.2 Shape Acquisition 198
7.2.1 The Owl Experiment 200
7.2.2 The Football and Stand Experiment 200
7.2.3 The Cushion and Human Body Experiment 206
7.3 Correspondences Finding 208
7.4 Conclusions 212

8 Conclusions 213
8.1 Summary 213
8.2 Discussions 215
8.3 Future Work 216

A Declarations for class CButton 230
B Declarations for class CPointSet 236
C Declarations for class CView 240


List of Figures

1.1 Mixed Reality. 22
2.1 Optical flow of approaching objects. 31
2.2 The DigitalDesk. (image courtesy of the Computer Laboratory, University of Cambridge) 37
2.3 An image of the BrightBoard. (image courtesy of the Computer Laboratory, University of Cambridge) 39
2.4 User interacts with the ALIVE system. (image courtesy of the MIT Media Lab) 41
2.5 The LivePaper system in use. (image courtesy of the Visual Systems Lab, University of York) 42
2.6 The LivePaper applications. (image courtesy of the Visual Systems Lab, University of York) 43
2.7 Snapshots of Penpets in action. (image courtesy of the Visual Systems Lab, University of York) 45
2.8 Audio d-touch interface (the augmented musical stave). (image courtesy of the Computer Laboratory, University of Cambridge) 47
2.9 Snapshots of Robot Ships in action. (image courtesy of the Visual Systems Lab, University of York) 49
3.1 Calibration objects. (image courtesy of [109]) 53
3.2 Principal points. Bottom right subimage is the imaging plane. 58
3.3 The distortion effects. 60
3.4 Transformation from world to camera coordinate system. 62
3.5 Flow chart of the camera-projector pair calibration. (diagram of image processing after the projections and captures are done) 68
3.6 Extraction of the projected pattern from the mixed one. 71
3.7 Extraction of the projected pattern from the mixed one (a closer look). 72
3.8 Extraction of the projected pattern from the mixed one (a closer look). 74
3.9 Pixel values of an image captured from a plain desktop. (bottom two showing the red channel only) 85
4.1 A 9-level Gray-coded image. (only a slice from each image is shown here, to illustrate the change between adjacent codewords) 94
4.2 Comparison: minimum level of Gray-coded and binary-coded images needed to encode 16 columns. 95
4.3 Point-line triangulation. 98
4.4 Binary encoded pattern divides the surface into many sub-regions. 99
4.5 Stripes being projected onto a fluffy doll. (10 level Gray coded stripes) 100
4.6 The alias effect causing errors in depth map. 103
4.7 3D plots of figure 4.6. 105
4.8 Inverse subtraction of original image and its flipped version. 106
4.9 The inverse subtraction: the football experiment. 108
4.10 Depth map. 113
4.11 Colour texture. 114
4.12 Scattered point set in 3D. (re-sampled at every 2 millimetre) 115
4.13 Scattered point set in 3D, attached with colour information. (re-sampled at every 2 millimetre) 116
4.14 Illustration of camera limited resolution. 118
5.1 A routine of point set registration. 125
5.2 Corner detection. 130
5.3 NCC results. 132
5.4 NCC results (periodic pattern). 133
5.5 Robust estimation. (inliers shown by red connecting lines) 135
5.6 Robust estimation. (inliers shown by index numbers) 136
5.7 Data structure of a point set. 137
5.8 Voxel quantisation of the large data set. 138
5.9 Different quantisation level by choosing different voxel size. 139
5.10 The captured objects of figure 5.12. 140
5.11 The captured objects of figure 5.12. 141
5.12 The quantisation effect of choosing different voxel size on the total point set size. 142
5.13 Manual tuning of point sets registration. 143
5.14 Different rendered views. (top: rendered range images; bottom: rendered object attached with colour texture) 144
6.1 A snapshot with touchpad and buttons. 153
6.2 A captured image showing an object is being scanned. 154
6.3 Finger detection. 157
6.4 Button calibration. 159
6.5 The True Positive Rate (TPR) and False Positive Rate (FPR) of button push detection. 160
6.6 The projected buttons and their observations in camera image. (The red blocks only indicate the area to be monitored). 163
6.7 Fingertip detection using background segmentation algorithm. 165
6.8 A screen shot of the working environment. 167
6.9 Screen shot of the system start-up state. 171
6.10 Owl experiment, 3 views captured, current on view 1. 174
6.11 Owl experiment, 3 views captured, current on view 0, model rotated. 174
6.12 The row index picture of the first view (the brighter pixel values correspond to higher rows in the projection image). 177
6.13 The touchup result of 6.10. 178
6.14 Correspondence Mode: two images are selected as 'from' and 'to'. 181
6.15 Correspondence Mode: Regions of Interest (ROIs) are selected. 181
6.16 Correspondence Mode: extracted corners. 183
6.17 Correspondence Mode: correlated and improved point correspondences. 183
6.18 Correspondence Mode: visualised point sets tuning, with controllable rotation and translation. 185
6.19 Correspondence Mode: two point sets are fused. 186
6.20 The Visualisation Mode. 187
6.21 View 2 and 3 fused together. View completes the left wing of the owl. 191
6.22 View 2 and 3 fused together. 192
6.23 Fusion of view 2, 3, and 4. 192
7.1 Shape acquisition test: Owl. Top two: before touchup; bottom two: after touchup. 201
7.2 The projector-camera pair setup. The shaded part is the 'dead' area that can not be illuminated by the projector but is in the viewing range of the camera. 202
7.3 Shape acquisition test: Stand. Left column: depth maps; right column: the corresponding textures. 204
7.4 Shape acquisition test: Football. Left column: depth maps; right column: the corresponding textures. 205
7.5 Shape acquisition test: Cushion. Left column: depth maps; right column: the corresponding textures. 207
7.6 Shape acquisition test: Human Body. Left column: depth maps; right column: the corresponding textures. 208
7.7 Number of extracted corner points and matched correspondence. 211


List of Tables

4.1 10 level Gray code look-up table. 97
6.1 Grouping status of the point sets at different stages. 190
7.1 An overview of the objects used for the tests. 197
7.2 Evaluation: depth capture error, and their corrections. 199
7.3 Evaluation: building correspondences. 210


List of Acronyms

AD Absolute Intensity Differences
ALIVE Artificial Life Interactive Video Environment
AR Augmented Reality
DOF Degree of Freedom
FOE Focus of Expansion
FOV Field of View
FPR False Positive Rate
FRF Fast Rejection Filter
GUI Graphical User Interface
HCI Human-Computer Interface
HMD Head Mounted Display
MAD Mean Absolute Difference
MR Mixed Reality
MSE Mean Squared Error
NCC Normalised Cross-Correlation
PCA Principal Component Analysis
PTZ Pan-Tilt-Zoom
RANSAC RANdom SAmple Consensus
ROI Region of Interest
SD Squared Intensity Differences
TPR True Positive Rate
TUI Tangible User Interface
SVD Singular Value Decomposition
UI User Interface
VAE Video-Augmented Environment
VR Virtual Reality
WCS World Coordinate System
WTA Winner-Takes-All
WWW World Wide Web
XML Extensible Markup Language


Chapter 1

Introduction

1.1 Problem Statement

The goal of computer vision is to make useful decisions about physical objects and scenes based on sensed images [89]. Therefore, it is almost always necessary to describe or model these objects in some way from images. It is safe to say there is no better way than reconstructing 3D models from 2D images, because 3D vision is natural to humans and can therefore provide structural information in perhaps the most obvious way perceived by humans.

Over recent years, researchers and scientists have been fascinated by the possibility of building intelligent machines or vision systems which are capable of understanding the physical world and representing it in 3D space. They are also keen to bring these vision systems into people's day-to-day lives, and to use them to bridge the gap between the physical world where humans live and the virtual world the computer generates.

This research is inspired by that context. We aim to develop a vision system for efficient 3D shape input. From data input to finally building a complete 3D model of the target object, the process may take several captures of the object positioned in different orientations, and tasks such as error removal and data fusion are carried out in a human-computer collaborative way, in an environment that mixes real-world objects with augmented video signals.

It is also important that all hardware used in the system is day-to-day equipment that is easy to obtain in an office environment. Inexpensive peripherals and easy-to-use software are used so that the system can be applied in various environments, especially targeting museum exhibitions and home gamers.

In conclusion, the system should not only accomplish the 3D shape input task but also efficiently and collectively utilise the skills of the human and the power of the computer, in a visual environment subject to illumination changes, where physical objects and virtual elements co-exist.

1.2 Terminologies

1.2.1 Augmented Reality and Virtual Reality

Virtual Reality (VR) is a synthetic world in which we interact with virtual objects generated by computers or other equipment, instead of the real objects surrounding us in the real world. Augmented Reality (AR), sometimes known as Mixed Reality (MR), mixes the real physical world and the world of VR by enhancing the real world with augmented virtual information.

1.2.2 Video-Augmented Environments

In this thesis, a VAE is a kind of projector-camera system in which a user's interactions with objects and projections are interpreted by a vision system, leading to changes in the augmented signals. It is a specific type of AR, where the augmentation could be anything from overlaying instructions to generating a virtual object that appears to exist in the physical world and responds to the environment according to the human's instructions. In this way, objects can appear to be augmented [52, 73], or the user can manipulate graphical data by gesture [63, 13, 30]. A significant property of such systems is that 3D objects and projected images are combined in a single mixed environment.

Figure 1.1: Mixed Reality.

1.3 Goals

We contend that in many non-interactive vision problems, a valid and sometimes superior solution can be attained through a human user or users collaborating with automated analysis. Previous work at York has reported applications in fast panorama construction [80], AudioPhotoDesk [41], d-touch [29] and movie footage logging [70]; this thesis considers 3D object acquisition in a human-computer collaborative way. The importance of the user in the VAE presented in this thesis is highlighted, as it enables 3D modelling without expensive presentation systems (e.g. servos etc.).

Any such system must combine vision and interaction techniques with the design goal of higher efficiency than a purely automated system that requires passive human operator time. The one presented in this thesis uses the projector-camera pair of a VAE to acquire range images, then user interaction in the augmented environment to identify corresponding points for building a full 3D model. The automation can then take over again to suggest prototype registrations for further adjustment by the user. There are also simple facilities for touching up range images. The result is an efficient 3D acquisition system that can be deployed without conventional input devices such as keyboards, mice, or laser pointers.

In short, this work aims:

• To analyse and extend the use of video as an input device.

• To devise and implement different image and video processing components that make an augmented reality 3D input device possible.

• To design a human-computer collaborative system for inputting 3D shapes, and evaluate its performance.


1.4 Thesis Organisation

The rest of the thesis is organised as follows.

A literature review with a history of the major image based 3D capture methods and the prior art of VAEs is given in chapter 2.

Chapter 3 introduces the calibration of the projector-camera pair as the system configuration stage. The calibration serves two purposes: it gives the internal and external geometry of the projector-camera system, and it provides the bi-directional transform between the projection signals and their observations in the camera image.

Chapter 4 is concerned with the technique used for shape acquisition as the first stage of data input from real objects.

In chapter 5 we introduce the method for extracting 3D information from the scanned range images and for fusing the 2.5D data into a complete 3D model.

Chapter 6 gives the workflow built from the aforementioned key components. The interactive user interface is also presented in this chapter.

Experimental results and performance evaluation from the user tests are given in chapter 7.

Chapter 8 draws conclusions and outlines possible future work.

1.5 Contributions

The work described in this thesis is the development of a tabletop-based VAE for fast 3D input via collaborative work between the human and the computer.

In chapter 3, a fully automatic method for calibration of the camera-projector pair is proposed and implemented. The work is inspired by a widely used Matlab-based camera calibration toolbox, which is extended and converted to C++ to make it capable of calibrating the projector-camera system in a fully automatic manner. The Matlab toolbox is also used off-line to manually evaluate and validate the calibration results from our own automatic method. Initial testing of the calibration data suggests it is not only suitable for tabletop-based monitoring, but also capable of supporting full 3D applications.

In chapter 4, Gray-coded structured light projection is implemented for the acquisition of 2.5D depth maps. Despite the method itself being well established and widely used, efforts are made to incorporate it into the interactive VAE system and to overcome the issues raised in practice, such as ever-changing lighting conditions and the varied surface reflections caused by different object materials. Problems such as the large distance between the camera-projector pair and the projection surface, and the aliasing effect caused by the limited camera capture resolution, are tackled as well.

A framework for 3D point set registration is developed in chapter 5. It begins with the conventional image registration method for planar objects, then extends it to work for arbitrary objects in the VAE, with no a priori ground truth information available.

This thesis proposes a new system design for inputting 3D shapes using an interactive VAE. This is a major contribution and it is detailed in chapter 6. The proposed system is cheap to maintain with off-the-shelf hardware, and easy to deploy with minimal configuration of the projector-camera pair. It is also not restricted to controlled laboratory environments.

The designed system also allows multi-user collaboration, and a user can walk up to the VAE and interact with bare hands, without the Head Mounted Display (HMD), gloves or markers that most current VAE systems rely on. The top-down projection mechanism is also user-friendly, as it dramatically reduces the chance of the user's eyes being hurt by the bright projection light.

Although the system proposed here contains techniques that are already widely used in the field, it brings them together in a new, practical and efficient way. Very few such systems can be deployed outside restricted laboratory environments, and at a very low cost, by avoiding expensive hardware such as touch screens and HMDs. Initial test results show that it provides a solid foundation for future research in this field, and it opens up many possibilities for promising future work.


Chapter 2

Background and Prior Art

This research combines 3D shape acquisition with video augmented reality. Shape acquisition is a tool for capturing 2.5D depth information, and it can be used repeatedly to build a complete 3D model from a set of different views of the object being measured. The VAE is an augmented reality in which projected visual signals are used to augment the real world. The background to both areas is reviewed here.


2.1 Image based 3D capture methods for depth estimation

Humans visually perceive depth using both of their eyes. A simple experiment illustrates this: if one tries to touch the tips of two pens together with one eye closed, it is almost impossible to succeed. The same thing happens when moving a finger towards a wall with one eye closed; it is very hard to visually judge the distance between the fingertip and the wall. The reason is not hard to explain: humans rely on binocular stereopsis for visual depth perception.

The root of the word stereopsis, stereo, comes from the Greek word stereos, meaning firm or solid [100]. With stereo vision a solid object is perceived in three spatial dimensions, width, height and depth, which are geometrically represented as the X, Y and Z axes. During the perception process, each human eye captures its own view and the two separate images are sent on to the brain for processing. When the two images arrive simultaneously at the back of the brain, they are united into one 2.5D representation based on their similarities, giving the human an observation in three dimensions. In the field of computer vision, this human ability for depth perception using binocular stereopsis has been modelled by two displaced cameras that obtain 3D information about the investigated scene.
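As a reminder of why two displaced views are enough (a textbook relation, not a result of this thesis): for a rectified pair of cameras with focal length $f$ and baseline $B$, a scene point imaged at horizontal positions $x_l$ and $x_r$ has disparity $d = x_l - x_r$ and lies at depth

\[ Z = \frac{fB}{d}, \]

so once corresponding points in the two images are identified, depth follows directly from the disparity. Establishing those correspondences is the subject of the methods reviewed next.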


2.1.1 Feature Based Methods

A feature based stereo matching algorithm produces a depth map that best describes the shape of the surfaces in the scene via a set of matching features used as correspondences. The correspondences are typically found from points, lines, corners, contours or other distinguishing features extracted from both of the observed images [37].

During matching, the most commonly used matching criteria are the pixel-based Squared Intensity Differences (SD) [1, 55] and Absolute Intensity Differences (AD) [55], which are sometimes averaged to give the Mean Squared Error (MSE) and Mean Absolute Difference (MAD). Other widely used traditional matching costs include Normalised Cross-Correlation (NCC), which is similar to the MSE, and binary matching costs such as edges [20, 43] or the sign of the Laplacian [71]. More recently, various robust measures [10, 11, 86] have been proposed to limit the influence of mismatches. Once the matching costs are computed, local and window-based methods are used to aggregate the cost by summing or averaging over a support region. In local methods, the final disparity at each pixel is chosen as the one associated with the minimum cost value; this is often known as the Winner-Takes-All (WTA) approach. This brings a limitation: uniqueness of matches is enforced for only one of the two images, because pixels in the second image might be pointed to by multiple points from the first image, or vice versa. Efficient global methods such as max-flow [82] and graph-cut [91, 17] have been proposed to solve this optimisation problem and have produced promising results.

Comprehensive reviews of the aforementioned techniques are provided in [57, 87, 47].
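To make the local, window-based WTA aggregation described above concrete, the following sketch (illustrative only, not code from this thesis; the image layout, window radius and disparity search range are assumed) computes a disparity map for a rectified grey-scale pair by minimising the summed squared intensity difference (SD) over a square support window:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Minimal winner-takes-all block matching on a rectified grey-scale pair.
// 'left' and 'right' are row-major images of size width x height.
// Window radius and disparity range are illustrative choices.
std::vector<int> wtaDisparity(const std::vector<uint8_t>& left,
                              const std::vector<uint8_t>& right,
                              int width, int height,
                              int maxDisparity = 64, int radius = 3)
{
    std::vector<int> disparity(width * height, 0);
    for (int y = radius; y < height - radius; ++y) {
        for (int x = radius; x < width - radius; ++x) {
            long bestCost = std::numeric_limits<long>::max();
            int bestD = 0;
            for (int d = 0; d <= maxDisparity && x - d >= radius; ++d) {
                long cost = 0; // summed squared intensity differences (SD)
                for (int dy = -radius; dy <= radius; ++dy)
                    for (int dx = -radius; dx <= radius; ++dx) {
                        int a = left [(y + dy) * width + (x + dx)];
                        int b = right[(y + dy) * width + (x + dx - d)];
                        cost += long(a - b) * (a - b);
                    }
                if (cost < bestCost) { bestCost = cost; bestD = d; } // winner takes all
            }
            disparity[y * width + x] = bestD;
        }
    }
    return disparity;
}
```

Global methods such as max-flow and graph-cut replace the independent per-pixel minimisation in the inner loop with an optimisation over the whole disparity field.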

2.1.2 Optical Flow Based Methods

Optical flow based methods recover structure information from the optical flow observed in two images of a moving rigid object, or in two images taken from different points of view of a stationary object.

Optical flow is the distribution of velocities of movement of brightness patterns in an image, where the brightness patterns can be objects but normally refer to pixels for further processing [49]. It can arise from relative motion between objects and the observer, which means it can come from either a moving camera imaging a static scene or objects moving in front of the camera. In either setting, more than one image is taken and the optical flow is computed to estimate the 3D locations of the features of interest.

Figure 2.1: Optical flow of approaching objects. (a) first image; (b) second image; (c) observed flow.

Figure 2.1 simulates the ground and an object approaching the observer at different relative speeds and in different directions. Points on the ground may have the same instantaneous velocity, but when they are perceived by the human eye their images cross the retina with different velocities and directions. The velocities are represented as rays that share the same vanishing point, called the Focus of Expansion (FOE). The FOE of the ground lies within the image and is easy to find, but the FOE of the moving object is located outside the image.

In recent approaches, optical flow [49, 65, 7, 1, 40, 10, 11] is widely used to estimate dense correspondences between consecutive frames. In [49] a gradient-based method is presented to compute the optical flow, while there are also feature-based [35, 12, 92] and correlation-based methods [8, 90, 61]. Once the correspondences are established, the 3D locations of the corresponding features can be computed if information about the camera is known.
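For reference, the standard brightness-constancy formulation underlying gradient-based methods such as [49] (stated here in its usual textbook form, not as a derivation from this thesis) assumes the brightness $I(x, y, t)$ of a moving pattern is conserved, so a first-order expansion gives the optical flow constraint equation

\[ I_x u + I_y v + I_t = 0, \]

where $(u, v)$ is the flow vector and $I_x$, $I_y$, $I_t$ are the spatial and temporal image derivatives. Since this is a single equation in two unknowns, gradient-based methods add a smoothness or local-constancy assumption to recover the flow field.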

One of the most comprehensive discussions and evaluations of existing optical flow computation methods is given in [5].


2.2 Active Shape Acquisition Methods

The techniques discussed in section 2.1 cover most of the scenarios in computer vision for depth estimation, but under certain circumstances things can still be done in a proactive way to enhance performance. For example, when measuring a white object with no texture at all, it is often hard to extract distinctive features or to compute the optical flow. In this case it would be helpful to manually put some marks on the object, such as squares or triangles, to help locate the interest points. Similarly, in an augmented environment, controlled lighting can replace those squares and triangles as an aid to identifying more interesting features.

Imagine waving a pen over the inspected object under a constant light to cast shadows across the scene. The shadow is expected to be a thin line, but is deformed by the shape of the surface underneath. Can structural information be retrieved from the deformed shadows? Another example is turning on different lights in the room and using a camera to monitor an object under the different lighting conditions; again, is there structural information induced in the observed images?

One of the active methods is photometric stereo [106], which can estimate the surface orientation of an object by using several images taken from the same viewpoint but under distinct illumination from different directions. Under most circumstances, the surfaces being measured are assumed to obey Lambert's cosine law, which states that the irradiance (i.e. light emitted or perceived) is proportional to the cosine of the angle between the surface normal and the light source direction, and this relationship can be represented by the reflectance map [107, 108]. A big advantage of the photometric stereo method is that it can be used as a texture classifier. For instance, suppose a surface with many protuberant horizontal curved stripes is imaged under a single constant illumination. If the surface is rotated by 90 degrees while the lighting remains the same (i.e. in strength and orientation), conventional texture based correspondence matching fails because of the large change in the appearance of the surface. For these types of applications, photometric stereo is the answer, provided the rotation is known.
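To make the Lambertian model concrete (a standard formulation, not taken verbatim from this thesis): if a surface point with albedo $\rho$ and unit normal $\mathbf{n}$ is lit in turn by $k \geq 3$ distant light sources with known directions $\mathbf{l}_1, \ldots, \mathbf{l}_k$, the measured intensities are $I_i = \rho\,\mathbf{l}_i^\top \mathbf{n}$. Stacking the light directions into a matrix $L$ and the intensities into a vector $\mathbf{I}$ gives

\[ \mathbf{I} = L\mathbf{g}, \qquad \mathbf{g} = \rho\mathbf{n}, \qquad \mathbf{g} = (L^\top L)^{-1} L^\top \mathbf{I}, \]

from which $\rho = \lVert\mathbf{g}\rVert$ and $\mathbf{n} = \mathbf{g}/\lVert\mathbf{g}\rVert$ are recovered at every pixel.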

Another active technique for shape acquisition is structured light [81, 45, 6, 15, 84]. In structured light systems, a projector completely replaces one of the cameras of a stereo vision system. With the projector projecting light patterns such as dots, lines, grids or stripes onto the object surface, the illumination sources of these projected signals are all known in the projector space. At the same time, a camera captures the illuminated scene as the observer. By projecting one or a set of known image patterns, it is possible to uniquely label each pixel in the image observed by the camera.

Unlike the stereo vision methods introduced in section 2.1, which rely on the accuracy of matching algorithms, structured light establishes the geometric relationship automatically, by a direct mapping from the codewords assigned to each pixel to their corresponding coordinates in the source pattern.
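Chapter 4 details the codification actually used in this work; purely as a generic sketch (the binary-reflected Gray code below is standard, and the helper names and pattern layout are illustrative assumptions rather than this thesis's implementation), a projector column index can be converted to a Gray codeword whose bit planes define the stripe patterns to project:

```cpp
#include <cstdint>
#include <vector>

// Standard binary-reflected Gray code: adjacent columns differ in exactly one bit.
inline uint32_t toGray(uint32_t n)   { return n ^ (n >> 1); }

// Recover the column index from a decoded Gray codeword.
inline uint32_t fromGray(uint32_t g) {
    for (uint32_t mask = g >> 1; mask != 0; mask >>= 1) g ^= mask;
    return g;
}

// Build one stripe pattern per bit plane: pattern[b][x] is 1 (bright) or 0 (dark)
// for projector column x in the b-th projected image.
std::vector<std::vector<int>> grayStripePatterns(int columns, int bits) {
    std::vector<std::vector<int>> pattern(bits, std::vector<int>(columns));
    for (int x = 0; x < columns; ++x) {
        uint32_t code = toGray(static_cast<uint32_t>(x));
        for (int b = 0; b < bits; ++b)
            pattern[b][x] = (code >> (bits - 1 - b)) & 1u; // most significant bit first
    }
    return pattern;
}
```

A camera pixel that decodes to codeword g then maps straight back to projector column fromGray(g), which is the direct codeword-to-coordinate mapping referred to above.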

A detailed discussion of structured light techniques is presented in chapter 4.

2.2.1 The Use of Structured Light System

This research presents a projector-camera based VAE system. Although some controlled lighting is available, such as turning the lights on and off or adjusting the blinds, it is not controllable to the extent required for photometric stereo. With the projector-camera pair available, structured light fits well in terms of hardware requirements, and it is used for the initial capture of depth information in this research. Following this, stereo-matching correspondence methods are applied to fuse the depth data captured from different views of the object being measured.

2.3 Video-Augmented Environments (VAEs)

A VAE is a visual environment where physical objects from the real world and virtual elements co-exist coherently. Data projectors are normally used in VAEs to augment the real objects by projecting video signals onto the scene. The visual environment is also monitored by the camera, so that the VAE system can detect changes and respond by changing the projections.

Over the last decade various VAE systems have been developed for different purposes. We list some related example VAEs to show their range and diversity.

2.3.1 Related example VAEs in the past

DigitalDesk

In the early 90s, one of the earliest projects in the history of VAEs emerged as Wellner's DigitalDesk [102, 104, 103] (figure 2.2(a)). A major feature of the project is the blurring of the boundary between physical paper and electronic documents. DigitalDesk also tackles the problem of calibrating the multiple input (cameras) and output (projector) devices, so as to enable a planar mapping between their individual coordinate systems.

Figure 2.2: The DigitalDesk. (a) Hardware setup. (b) The DigitalDesk. (image courtesy of the Computer Laboratory, University of Cambridge)

In DigitalDesk, a projector and one or more cameras are mounted above the desk, sharing a common view area. On the desk, a user can place normal day-to-day objects such as papers, books and mugs. The desk also has the characteristics of a workstation, in that the projector and camera(s) are connected to a PC and the system can (1) read the documents placed on the desktop; (2) monitor a user's activity at the desk; and (3) project video signals such as images and annotations down onto the desk surface.

Inspired by DigitalDesk, a number of prototype applications have been built. For example, the PaperPaint application [104] allows copying and pasting of images and text from paper documents laid on the desk into electronic versions. The DigitalDesk Calculator [102] (figure 2.2(b)) enables mathematical operations on numeric data contained in paper documents, providing the user with a virtual calculator by projecting a set of buttons alongside the paper documents. Another application is Marcel [72], where the user can point a finger at words in a French document and the pointed words are translated into English, with the translation projected alongside the original French word.

BrightBoard

BrightBoard [93] explores the use of an ordinary whiteboard as a computer interface. A vision system is developed to monitor what is happening on the board. A major difference of BrightBoard from other VAE systems is that it is not designed to continuously respond to the captured images. Instead, events are only triggered whenever the system detects a significant change, such as the user obstructing the whiteboard.

A few commands are provided (previously written with marker pens) on the board, with a square check box alongside each command (figure 2.3). For each check box, the system monitors the square as the active area and detects when the zone becomes significantly darker or lighter, which corresponds to the conclusion that a mark has been made on the board or erased, respectively.

Figure 2.3: An image of the BrightBoard. (image courtesy of the Computer Laboratory, University of Cambridge)

After the initial success of the prototype, instead of expanding the system into a monolithic application with more and more features, the developers decided to simplify BrightBoard into a whiteboard-based control panel from which scripts and external programs can be activated, such as printing and saving what is written on the whiteboard, emailing images of the board, or passing them on to other programs for further processing.

However, one of the limitations of the system is that calibration is not involved at any stage. The system relies on the fact that the active areas in the camera image are crudely fixed. Once the camera or the board itself is moved, the system needs to be reconfigured.


Artificial Life interactive Video Environment (ALIVE)

The ALIVE [67, 68] system was developed at the MIT Media Lab, inspired by the ideas behind Myron Krueger’s VideoPlace [59, 60]. A large projection screen, roughly the same height as a human, is placed vertically on the ground. A camera is fixed on the top edge of the screen and monitors the user, who stands right in front of the screen and is free to move about. In the observed image, the background is cut off so that only the foreground image of the user is conserved. It is then incorporated into a different scene (e.g. a different room), mixed with some animated creatures (figure 2.4). Users can interact with the computer-generated creatures either by their movement or by instructions expressed through their gestures.

To enable this type of interaction, the user’s 3D position in the physical world has to be known. With a couple of assumptions, this can be achieved even with a single camera. First, the relative position and orientation of the camera with respect to the floor need to be known. Also, the user is assumed to be standing on the floor at all times, so that simply locating the user’s lowest point in the observed image gives an approximate estimate of his/her position in the room.

Figure 2.4: User interacts with the ALIVE system. (image courtesy of the MIT Media Lab)

2.3.2 Previous work at York

At the Visual System Group, University of York, the main research interest in VAEs concerns image input and analysis technologies that are resilient to lighting changes and shadowing. Sufficiently fast VAE implementations are aimed at supporting richly interactive applications.

A number of practical VAE applications have been designed and implemented within the group [52, 73, 29, 74]. Prior to this research, one of the most recent is Robot Ships, developed for the National Museum of Scotland’s Connect gallery.


LivePaper

A recent system [52] by Robinson and Robertson provides a VAE in which individual sheets of pages, cards and books are placed on an instrumented tabletop to activate their enhancement. It appears to the user as if the paper has additional properties with new visual and auditory features.

Figure 2.5: The LivePaper system in use. (image courtesy of the Visual Systems Lab, University of York)

A sheet of paper is detected through boundary extraction in an observed image, and then the projector displays the associated augmentations according to the contents of the current page recognised by the system. The augmented video signals remain projected onto the page. An interactive menu is provided beside the page to offer finger-triggered functionality.

A number of sample applications have been developed to illustrate the feasibility of the LivePaper system. These applications include an architectural visualisation tool (figure 2.6(a)) which projects a 3D hidden-line rendering of walls onto a page, page sharing, remote collaboration (figure 2.6(b)), and World Wide Web (WWW) page viewing. From the user’s perspective, all of these applications are attributes of the particular page, not features of the tabletop.

Figure 2.6: The LivePaper applications. (a) The architectural visualisation application. (b) The collaborative drawing application. (image courtesy of the Visual Systems Lab, University of York)

Another application of LivePaper is an audio player. When a page such as a business card is laid on the desk, the player begins playing an audio clip, whose playback can be controlled by the user by pressing the projected buttons.

PenPets

The PenPets application, developed by O’Mahony and Robinson [73], runs on a VAE called SketchTop which supports rich interaction through sketching, augmented physical objects and mobile virtual objects.

SketchTop is a whiteboard mounted horizontally at desk height, together with other physical objects that can be augmented. Problems are encountered in some of the other whiteboard-based VAE systems. First, the whiteboard is placed horizontally because a vertically mounted whiteboard cannot support augmented objects other than the video signal itself. Second, markings are static once written on the whiteboard, so the literality of interaction that comes through registering augmented signals to moving objects is lost. SketchTop was designed to solve both these problems and thereby provide rich interactions via static-but-erasable writings.

The focus of the SketchTop demonstration is Penpets, an artificial life application in which virtual animals roam the augmented surface, running into objects and triggering events subject to their various behavioural models.

Figure 2.7: Snapshots of Penpets in action. (a) A maze-solving agent tries to find its way out while the user modifies the structure of the maze. (b) Moving an agent with a fishnet-like tool. (image courtesy of the Visual Systems Lab, University of York)

Figure 2.7 shows two snapshots of Penpets in action. The agent demonstrated in figure 2.7(a) has hazard detection and maze-solving abilities. The tunnels and walls on the whiteboard are drawn by users, therefore users can easily hinder the agents by opening up new exits, closing old ones, or tapering the current lane in which the agent is travelling. Figure 2.7(b) shows an agent being carried to another part of the environment by a fishnet-like tool.

Based on different behaviour models, further SketchTop applications such as a circuit simulator, a traffic simulator, and sketchable pinball (using agents as balls) have been implemented. Another interesting implementation simulates the agents’ culinary interests by providing a means of recognising different objects such as an apple, cheese, or a teapot.

Audio d-touch

Audio d-touch [29, 28] uses a consumer-grade web camera and customisable block objects with markers attached to provide an interactive Tangible User Interface (TUI) for a variety of time-based musical tasks such as sequencing, drum editing and collaborative composition. Three musical applications have been reported by previous research in the group: the augmented musical stave (figure 2.8), the tangible drum machine, and the physical sequencer. Although there is no data projector in this system, Audio d-touch is very similar to other standard VAEs, the only difference being that the video signals projected by the projector are replaced by the audio signals from the speakers.

TUIs are a recent research field in Human-Computer Interfaces (HCIs). Compared to a Graphical User Interface (GUI), where users interact with virtual objects represented on a screen through mouse and keyboard to control and represent digital information, in a TUI physical objects are used in real space to achieve the same goals. Grasping a physical object is equivalent to grasping a piece of digital information, and normally different objects represent different pieces of information of the virtual model. As feedback, the computer output is usually presented in the same physical environment to sustain the perceptual link between the physical and virtual objects.

In Audio d-touch the user can create patterns and beats. This is realised by mapping physical quantities to musical parameters such as timbre and frequency. The visual part of the system tracks the position of the control objects with a web-cam by means of a robust image fiducial recognition algorithm. Technical details of the fiducial algorithms can be found in [27, 75].

Figure 2.8: Audio d-touch interface (the augmented musical stave). (image courtesy of the Computer Laboratory, University of Cambridge)

Figure 2.8 shows one of the Audio d-touch applications: the augmented musical stave. Only the interactive surface is shown in the figure; the web camera is mounted vertically above the surface and a pair of speakers are placed to the side – all connected to a PC. In the augmented stave, physical representations of musical notes can be placed on a stave drawn on an A4 sheet of paper, for either teaching score notation or composing melodies. The interactive objects are rectangular blocks, each of which is labelled with a fiducial symbol correlated to a variety of musical notes. Once the notes are placed on the stave, the corresponding sounds are played by the computer. Various musical parameters such as the pitch, the duration (quavers, crotchets, minims, etc.) and the playing sequence are decided by the position of the object on the musical stave.

Prototypes of the designed instruments have been tested by a group of people with different musical backgrounds, ranging from music academics to amateurs with little experience in music composition. Each enjoyed interacting with the instruments and managed to make interesting compositions.

Robot Ships

Robot Ships is a commercial application developed as a featured exhibition for the Connect Gallery [101] at the National Museums of Scotland in Edinburgh.


Designed with VAE technology, Robot Ships turns a tabletop into a stretch of ocean, upon which robotic boats work together to clean up oil spills. An audience walks up to the tabletop, reaches onto it, and becomes part of the interactive environment to create various events (figure 2.9(a)).

Figure 2.9: Snapshots of Robot Ships in action. (a) The user sinking an oil tanker for the workers to start the clean-up work. (b) A screen shot showing the workers starting to clean up the toxic spill that has been located by a scout. (image courtesy of the Visual Systems Lab, University of York)

On the biological scale, the idea behind Robot Ships is inspired by combining user assistance and a work force to solve environmental tasks. In this case, the scout ship is first sent out to search for toxic spills, and upon finding one it returns to the central control rig. On its way back, it navigates around the obstructions and leaves a series of trail points. Cleanup worker ships are then dispatched. Without knowing the location of the spill, the workers rely only on the trail points left by the scout. Because the workers do not know where they are heading, and instead use only their limited viewing cone, they are more easily manipulated by the audience. As the entire interface is on a round table which is used by reaching over it, it is open to all ages and to multi-user collaboration.

Robot Ships is a VAE that runs on top of the OpenIllusionist framework [50], independently developed by previous members of the Visual Systems Lab, Justen Hyde and Dan Parnham. More details of Robot Ships and OpenIllusionist are given in [74].

2.4 Conclusions

There are many other good VAEs apart from those aforementioned. Here we have only introduced some of the pioneering and well-known VAEs, and the related previous work carried out in our Visual System Lab.

At present many research groups continue their work on VAEs, and some of the related individual contributions will be reviewed in more detail at the appropriate stages later in this thesis.


Chapter 3

Calibration

3.1 Introduction

In a camera-projector based VAE system, the different components have their own coordinate systems: the camera coordinate system, the projector coordinate system, and the World Coordinate System (WCS) within which the real objects are placed. For accurately measuring the objects’ placement on the tabletop using the structured light scanning method, it is vital to have a reliable calibration process so that the internal and external geometry of the camera and the projector are known. When a user interacts with the augmented signals projected onto the desktop, there is a need to sustain the coherent spatial relationship between the physical objects and the virtual elements in a continuously changing visual environment.

For example, if a light dot is projected onto the centre of the desktop, it will not necessarily appear at the centre of the observed image. Therefore the original location of the light dot in the projector image and its observed position in the captured image need to be correlated, so that the system knows where to look for it in the captured image. Furthermore, if the light dot is projected onto an object, the 3D position of the illuminated point on the object in the real world might need to be measured. In this case, the internal geometry of the camera and the projector needs to be known, to establish the mapping between pixels and real-world measurements and to determine to what extent the image is distorted due to lens imperfections. The recovery of all the necessary information is called the calibration process.

This chapter addresses this calibration problem.

Calibration task

The objective of the camera calibration process is to find the internal parameters (a series of parameters that a camera has inherently) and the external parameters (the position of the camera and its orientation relative to the World Coordinate System (WCS)).


Calibration principle

Calibrating the camera requires measurements of a set of 3D points and their image correspondences [37]. The most common way to do this is to have the camera observe a 2D planar pattern consisting of multiple coplanar points, with the pattern shown to the camera in different views. Alternatively, a 3D rig marked with ground truth points can also be used as the calibration object. The same principle applies to the projector calibration, although it is implemented in a slightly different way.

Camera calibration

In practice, a black and white checkerboard plane is usually chosen as the calibration object because it offers a set of known points as ground truth points straightaway, although there are other types of calibration objects that can be used [109]. In this research a 20 × 20 checkerboard is used as the calibration object.

Figure 3.1: Calibration objects. (a) 3D rig. (b) 2D planar object. (c) 1D object with marked points. (image courtesy of [109])

Projector calibration

When calibrating the projector we aim for the same set of parameters, the internal and external parameters of the projector. Unlike the camera, the projector already has a set of 2D points as ground truth, since the pattern to be projected is a known image, but their 3D correspondences (the 3D positions of their projections) are unknown. Finding these 3D locations is essential so that there are two sets of points available to complete the projector calibration. Therefore it is a prerequisite that the camera is calibrated first, to provide the transform of these unknown 3D points from the camera image space to the real-world coordinate system.

2D plane to plane calibration

The user interface of the collaborative system designed in this research for inputting 3D is based on a plane (i.e. the table top). Therefore a precise registration between the image space of the camera and the rendered space of the projector is desired, so that the spatial relationship between the projected signals and their observed images is sustained. To work out this plane-to-plane geometry it is not necessary that the internal parameters of the camera and the projector are known. The method of this plane-to-plane calibration is introduced in a later part of this chapter.

The rest of this chapter is structured as follows. In section 3.2 we review other related work. In section 3.3 we explain the calibration parameters and give the formalised full calibration model. In section 3.4 we introduce the implementation of calibrating the camera and the projector, respectively. A method of 2D plane-to-plane calibration is presented in section 3.5. Conclusions are given in section 3.6.

3.2 Background

During the past decade camera calibration has received a lot of attention because it is strongly related to many computer vision applications such as stereo vision, motion detection, structure from motion, and robotics [99, 37, 39, 48, 111].

One of the most used methods is Tsai’s camera calibration method [99], which is suitable for a wide range of applications because it deals with both planar and non-planar calibration objects and makes it possible to calibrate the internal and external parameters separately. This is important because in some cases the internal parameters are known (provided by the manufacturer), so that one can fix the internal parameters of the camera and carry out iterative non-linear optimisation only on the external parameters.

The conventional calibration process can be costly in terms of time and effort, and calibration objects might not always be available. This inspires self-calibration methods which use the horizon line and vanishing points estimated from structural information such as landscapes or buildings [26, 79]. These methods are often used in computer vision tasks based on single view geometry or in video surveillance applications [31]. Lv et al. [66] approach the camera self-calibration problem using positions extracted from a single walking man via PCA analysis, to estimate the vanishing points indirectly. No rigid calibration target is needed for the aforementioned approaches; however, they are more online-oriented and not very practical for our table-top VAE applications.

In this research, we first carry out the camera calibration process using the Matlab toolbox developed by Bouguet [14]. This Matlab toolbox was developed by Jean-Yves Bouguet at the California Institute of Technology and its C implementation is also available in the Open Source Computer Vision Library [51]. The toolbox is then extended and converted to C++ to make it capable of calibrating the projector-camera system. In the off-line process using the Matlab toolbox, the projections and captures are done in a first stage and the captured images are processed on a local PC in a separate second stage. An online calibration program was then developed in C++, which takes about two minutes to calibrate the camera-projector pair in a fully automatic manner using 20 different poses of the calibration board.


3.3 Calibration Parameters

3.3.1 Intrinsic Parameters

The internal camera model is described by a set of parameters known as intrinsic parameters. These parameters represent the internal geometry of the camera.

A matrix formed by the camera intrinsic parameters is known as the camera matrix, or K matrix, which relates a 3D scene point (X, Y, Z)^T and its projection (x, y, 1)^T in the 2D image plane:

\[
w' \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{3.1}
\]

where the camera matrix K is

\[
K = \begin{pmatrix} f_{c1} & \alpha \times f_{c1} & c_1 \\ 0 & f_{c2} & c_2 \\ 0 & 0 & 1 \end{pmatrix} \tag{3.2}
\]

All related parameters that compose K are explained as follows.

fc is the focal length, represented as a 2 × 1 vector. It is in units of horizontal and vertical pixels. Both components are normally equal to each other. However, when the camera CCD array is not square, fc1 is slightly different from fc2. Therefore the camera model handles non-square pixels, and fc1/fc2 is called the aspect ratio.

cc is the principal point, represented as a 2 × 1 vector (c1, c2); it describes how the projection centre is positioned in the image. As shown in figure 3.2, a 3D point (X, Y, Z, 1)^T is projected onto the imaging plane, its projection being (x, y, 1)^T. When this is represented in UV space (the 2D image coordinates), the following relationship holds:

\[
\begin{cases} u = x + c_1 \\ v = y + c_2 \end{cases} \tag{3.3}
\]

Figure 3.2: Principal points. The bottom right subimage is the imaging plane.

Generally the principal point cc is considered to be at the centre of projection, but not precisely so, because there is always a slight decentring effect in camera design. This defect can be taken care of by accurate camera calibration.


α is the skew coefficient, a scalar which encodes the angle between the X and Y axes of the imaging plane. It equals zero when the X and Y axes are perpendicular; just as the aspect ratio fc1/fc2 handles non-square pixels, the skew coefficient α handles non-rectangular pixels.

kc is a 5 × 1 distortion vector. Although kc is not directly included in the intrinsic matrix for perspectively transforming points between different coordinate systems, it still plays a part in the camera internal geometry. The lens distortion model was first introduced by Brown in 1966 [18] and is called the “Plumb Bob” model. There are three types of lens distortion: radial, tangential and decentring distortion, with radial distortion being the most commonly known and most distinguished. The full distortion is modelled as follows.

For an image point (x, y),

\[
\begin{pmatrix} x_d \\ y_d \end{pmatrix} = \left(1 + k_{c1} r^2 + k_{c2} r^4 + k_{c5} r^6\right) \begin{pmatrix} x \\ y \end{pmatrix} + dx \tag{3.4}
\]

where

\[
r^2 = x^2 + y^2 \tag{3.5}
\]

and

\[
dx = \begin{pmatrix} 2 k_{c3} x y + k_{c4} (r^2 + 2x^2) \\ k_{c3} (r^2 + 2y^2) + 2 k_{c4} x y \end{pmatrix} \tag{3.6}
\]

The term dx is the tangential distortion. It is due to the imperfect centring of lens components and other manufacturing defects; therefore tangential distortion is also known as decentring distortion. The radial distortion is more visible, being affected by three entries of the distortion vector, kc1, kc2 and kc5. Because of the concavity of the lens, pixels further away from the image centre suffer more severe distortion, and the amount of distortion is monotonically increasing with the factor x^2 + y^2. This effect is illustrated in figure 3.3.

Figure 3.3: The distortion effects. (a), (b) Distorted images. (c), (d) Original images.
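To make the distortion model concrete, the following is a minimal C++ sketch (an illustration, not the implementation used in this work) that applies equations 3.4–3.6 to a normalised image point. The coefficient values and the test point in main() are invented for demonstration only.

```cpp
#include <array>
#include <cstdio>

// Apply the "Plumb Bob" distortion of equations 3.4-3.6 to a normalised
// image point (x, y).  kc = {kc1, kc2, kc3, kc4, kc5}.
std::array<double, 2> distort(double x, double y, const std::array<double, 5>& kc)
{
    const double r2 = x * x + y * y;                          // equation 3.5
    const double radial = 1.0 + kc[0] * r2 + kc[1] * r2 * r2
                              + kc[4] * r2 * r2 * r2;         // radial factor of 3.4
    // Tangential (decentring) term dx of equation 3.6.
    const double dx0 = 2.0 * kc[2] * x * y + kc[3] * (r2 + 2.0 * x * x);
    const double dx1 = kc[2] * (r2 + 2.0 * y * y) + 2.0 * kc[3] * x * y;
    return { radial * x + dx0, radial * y + dx1 };            // equation 3.4
}

int main()
{
    const std::array<double, 5> kc = { -0.25, 0.08, 0.001, -0.0005, 0.0 };  // invented values
    const auto d = distort(0.3, -0.2, kc);
    std::printf("distorted point: (%f, %f)\n", d[0], d[1]);
    return 0;
}
```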


3.3.2 The Reduced Camera Model

The above optical model is not always required for currently manufactured cameras. In practice, the 6th order radial + tangential distortion model is often not considered in full. A few reductions are possible.

• Nowadays most cameras on the market have pretty good optical systems, and it is hard to find lenses with imperfect centring. Therefore the tangential distortion can be discarded. The skew coefficient α is often assumed to be zero for the same reason.

• For cameras with good optical systems or standard Field of View (FOV) lenses (non wide-angle lenses), it is not necessary to push the lens distortion model to high orders. Commonly a second order radial distortion is used.

• In some instances, such as when the calibration data is not sufficient (e.g. using only two or three images for calibration), it is an option to set the principal point cc at the centre of the image, ((nx − 1)/2, (ny − 1)/2), and reject the aspect ratio fc1/fc2 (set it to 1). However, when sufficient images are used for calibration, this reduction is not necessary.

Therefore, the reduced camera model can be defined as:

\[
K = \begin{pmatrix} f_{c1} & 0 & c_1 \\ 0 & f_{c2} & c_2 \\ 0 & 0 & 1 \end{pmatrix} \tag{3.7}
\]


with distortion modelled as:

\[
\begin{pmatrix} x_d \\ y_d \end{pmatrix} = \left(1 + k_{c1} r^2\right) \begin{pmatrix} x \\ y \end{pmatrix} \tag{3.8}
\]

where r^2 = x^2 + y^2.

3.3.3 Extrinsic Parameters

Figure 3.4: Transformation from world to camera coordinate system.

Figure 3.4 is an example of how a triangle in world coordinate space is imaged. Let (Xw, Yw, Zw)^T be an object point (the blue point in the picture) whose 3D position in the camera coordinate system is (Xc, Yc, Zc)^T. Let the point (x, y, f)^T be its projection (the red point in the picture) on the imaging plane, where f is the focal length.

The rotation matrix R and the translation vector T characterise the 3D transformation of a scene point from world coordinates to camera coordinates,

\[
\begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = R \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} + T \tag{3.9}
\]

where R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector between the two system origins in 3D space.

After the scene point is transferred from world into camera coordinates, its 2D image point is given by

\[
x = f \, \frac{X_c}{Z_c}, \qquad y = f \, \frac{Y_c}{Z_c} \tag{3.10}
\]

Rotation matrix

The three main rotation parameters Rx, Ry, Rz, also known as the pan, tilt and yaw angles, are the Euler angles of the rotation from the world to the camera coordinate system around the three major axes. They are represented by a 3 × 3 rotation matrix R,

\[
R = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \tag{3.11}
\]


where

\[
\begin{aligned}
r_{11} &= \cos(R_y)\cos(R_z) && (3.12) \\
r_{12} &= \cos(R_z)\sin(R_x)\sin(R_y) - \cos(R_x)\sin(R_z) && (3.13) \\
r_{13} &= \sin(R_x)\sin(R_z) + \cos(R_x)\cos(R_z)\sin(R_y) && (3.14) \\
r_{21} &= \cos(R_y)\sin(R_z) && (3.15) \\
r_{22} &= \sin(R_x)\sin(R_y)\sin(R_z) + \cos(R_x)\cos(R_z) && (3.16) \\
r_{23} &= \cos(R_x)\sin(R_y)\sin(R_z) - \cos(R_z)\sin(R_x) && (3.17) \\
r_{31} &= -\sin(R_y) && (3.18) \\
r_{32} &= \cos(R_y)\sin(R_x) && (3.19) \\
r_{33} &= \cos(R_x)\cos(R_y) && (3.20)
\end{aligned}
\]
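Equations 3.12–3.20 correspond to composing the three elementary rotations as Rz · Ry · Rx. As a cross-check, the matrix can be built directly from the angles; the short C++ sketch below is an illustration only (the angles in main() are invented), not the calibration code used in this work.

```cpp
#include <cmath>
#include <cstdio>

// Build the 3x3 rotation matrix of equations 3.12-3.20 from the Euler
// angles Rx, Ry, Rz (pan, tilt, yaw), i.e. the composition Rz * Ry * Rx.
void eulerToRotation(double Rx, double Ry, double Rz, double R[3][3])
{
    const double cx = std::cos(Rx), sx = std::sin(Rx);
    const double cy = std::cos(Ry), sy = std::sin(Ry);
    const double cz = std::cos(Rz), sz = std::sin(Rz);

    R[0][0] = cy * cz;  R[0][1] = cz * sx * sy - cx * sz;  R[0][2] = sx * sz + cx * cz * sy;
    R[1][0] = cy * sz;  R[1][1] = sx * sy * sz + cx * cz;  R[1][2] = cx * sy * sz - cz * sx;
    R[2][0] = -sy;      R[2][1] = cy * sx;                 R[2][2] = cx * cy;
}

int main()
{
    double R[3][3];
    eulerToRotation(0.1, -0.2, 0.3, R);   // illustrative angles in radians
    for (int i = 0; i < 3; ++i)
        std::printf("%8.4f %8.4f %8.4f\n", R[i][0], R[i][1], R[i][2]);
    return 0;
}
```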

Translation vector

\[
T = \begin{pmatrix} T_x \\ T_y \\ T_z \end{pmatrix} \tag{3.21}
\]

3.3.4 Full Model

Combining the camera intrinsic and extrinsic parameters gives the full projection model, which performs the transform of a scene point (Xw, Yw, Zw)^T from the World Coordinate System (WCS) to the camera coordinate system (Xc, Yc, Zc)^T, and then to the 2D imaging space (x, y)^T, as shown in equation 3.22. By representing all the points in their homogeneous form, the above transform relationships can be formalised as

\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K \begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = K(R\,|\,T) \begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix} \tag{3.22}
\]

where K(R|T) is a 3 × 4 projection matrix.

So to calibrate the camera, it is necessary to estimate both the intrinsic and extrinsic parameters, and the distortion model. This can be done by matching a set of ground truth points from the calibration object with their correspondences in the observed image.
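To illustrate the full model of equation 3.22, the following C++ sketch projects a world point into pixel coordinates using K, R and T. It is illustrative only: the pose and intrinsic values are invented, and lens distortion (equations 3.4–3.6) is omitted for brevity.

```cpp
#include <cstdio>

// Project a world point (Xw, Yw, Zw) to a pixel (u, v) using the model of
// equations 3.9 and 3.22: camera point = R * world point + T, followed by
// perspective division and application of the K matrix of equation 3.2.
struct Intrinsics { double fc1, fc2, alpha, c1, c2; };

void projectPoint(const double R[3][3], const double T[3], const Intrinsics& K,
                  const double Pw[3], double& u, double& v)
{
    double Pc[3];
    for (int i = 0; i < 3; ++i)                      // equation 3.9
        Pc[i] = R[i][0] * Pw[0] + R[i][1] * Pw[1] + R[i][2] * Pw[2] + T[i];

    const double x = Pc[0] / Pc[2];                  // normalised coordinates
    const double y = Pc[1] / Pc[2];                  // (equation 3.10 with f folded into K)

    u = K.fc1 * x + K.alpha * K.fc1 * y + K.c1;      // apply K (equation 3.2)
    v = K.fc2 * y + K.c2;
}

int main()
{
    const double R[3][3] = { {1, 0, 0}, {0, 1, 0}, {0, 0, 1} };   // illustrative pose
    const double T[3]    = { 0.0, 0.0, 500.0 };
    const Intrinsics K   = { 800.0, 800.0, 0.0, 320.0, 240.0 };   // invented values
    const double Pw[3]   = { 100.0, 50.0, 0.0 };

    double u, v;
    projectPoint(R, T, K, Pw, u, v);
    std::printf("pixel: (%f, %f)\n", u, v);
    return 0;
}
```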

3.4 Calibrate Camera-Projector Pair

3.4.1 World Coordinate System

The camera extrinsic parameters are not inherent parameters of the camera. The rotation and translation only represent the current camera pose in reference to the world coordinate system chosen by the user. Without a world coordinate system, or a reference coordinate system, the extrinsic parameters are meaningless. Therefore, a world coordinate system needs to be chosen first as a reference to describe the relative camera position. In our system the whiteboard is chosen to be the world reference frame.

More specifically, by laying a checkerboard flat on the table plane, the checkerboard plane is chosen as the XOY plane of the world coordinate system, with its bottom and leftmost edges taken as the X and Y axes. The surface normal vector pointing from the bottom-left corner of the checkerboard is chosen as the Z axis. Thus the origin of the WCS is arbitrary in X and Y, depending on how and where the checkerboard was laid.

3.4.2 Methodology

Before we can calibrate the camera and projector pair, a set of calibration images is needed. In this research we use 20 images for camera calibration and 20 images for projector calibration, each pair being captured from a different angle of the whiteboard.

The main methodology is to take an image of a known 3D pattern as ground truth. Then, in the captured image, one selects a set of points of that pattern as interest points, and uses the 2D coordinate information of those interest points along with their 3D matching points as correspondences to calibrate the camera. Normally this process is iterated by orienting the calibration pattern at different angles to increase accuracy.

The projector is calibrated in a similar way. A pre-designed pattern with ground truth information is projected onto a surface (which is regarded as lying in world coordinate space), and the projection is monitored by the calibrated camera. Since at this point the camera is already calibrated, with the captured image and the full camera model we can recover the 3D information of the projected pattern. This 3D information, together with prior knowledge of the pre-designed 2D pattern, forms a correspondence, and hence the projector can be calibrated from these two sets of points in a “reversed camera” way.

Figure 3.5 shows the flow chart of the whole calibration process. The diagram shows the whole process after the data collection stage is done, during which the black patterns are projected onto the cyan checkerboard and images are taken at the same time.

3.4.3 Data Collection

We use a printed checkerboard as the camera calibration target, and we let the projector project another checkerboard as the projector calibration target. As mentioned in section 3.3, the camera calibration results – particularly the camera extrinsic parameters – are needed to perform the transformation of the observed projected pattern from camera coordinate space to world coordinate space. Therefore, when the printed pattern is being captured, we have to make sure a projected pattern is captured as well, with the base plane staying in exactly the same pose, to maintain accuracy.

Figure 3.5: Flow chart of the camera-projector pair calibration. (diagram of the image processing after the projections and captures are done)

However, this is not easy if the user has to slide the printed checkerboard in and out every time the checkerboard changes orientation, and manually it is very hard to hold the base plane firmly stationary while performing these activities. It might require one tester to hold the board still while another handles the sheet. For this reason, a mechanism that allows us to take a picture of two superimposed checkerboards and extract one from the other is desired, to prevent any slight movement of the base plane. This is possible by choosing appropriate colours for the checkerboards.

We use a cyan-white checkerboard for the printed pattern, and a blue-black checkerboard for the projected pattern. Cyan and white have very similar blue components under white ambient light. Therefore, in a captured image with both checkerboards present, by inspecting the blue channel the cyan checkerboard is barely seen and the blue checkerboard can be extracted.

On the other hand, blue and black grids have near-zero red components. This means that by superimposing a blue-black checkerboard onto a cyan-white one, no components are added in the red channel. This property allows us to extract the cyan-white checkerboard out of the superimposed version easily. In figure 3.6, the top image shows the captured image of the superimposed checkerboards. The bottom two images are the extracted checkerboards.
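A minimal sketch of this channel separation is given below. It assumes an interleaved 8-bit RGB buffer and a fixed threshold of 128, both simplifications chosen for illustration rather than details of the actual filtering used in the system.

```cpp
#include <vector>
#include <cstdint>

// Separate the two superimposed checkerboards from an RGB image buffer
// (8 bits per channel, interleaved R,G,B).  The blue channel isolates the
// projected blue-black pattern (cyan and white print contribute similarly
// in blue), while the red channel isolates the printed cyan-white pattern
// (blue and black projection add almost nothing in red).
void separatePatterns(const std::vector<std::uint8_t>& rgb, int width, int height,
                      std::vector<std::uint8_t>& projectedMask,
                      std::vector<std::uint8_t>& printedMask)
{
    const std::size_t n = static_cast<std::size_t>(width) * height;
    projectedMask.assign(n, 0);
    printedMask.assign(n, 0);

    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t r = rgb[3 * i + 0];
        const std::uint8_t b = rgb[3 * i + 2];
        projectedMask[i] = (b > 128) ? 255 : 0;  // strong blue => projected blue square
        printedMask[i]   = (r > 128) ? 255 : 0;  // strong red  => white square of the print
    }
}

int main()
{
    // Two illustrative pixels: cyan print under blue projection, and plain white print.
    const std::vector<std::uint8_t> rgb = { 30, 200, 220,   230, 230, 210 };
    std::vector<std::uint8_t> projected, printed;
    separatePatterns(rgb, 2, 1, projected, printed);
    return 0;
}
```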


3.4.4 Choice of colour

Getting the printed pattern from the mixed image is simple, because when it is captured the projected pattern is switched off. More effort is required to extract the projected pattern from the mixed pattern, and the key is to find the difference between the blue projected area and the black projected area under the interference of the pattern printed on the whiteboard.

Zhang [109] chooses red and blue for the printed and projected patterns respectively, because of their distinctively different RGB values. In practice, other factors such as the surface reflection and the room lighting conditions need to be considered. After evaluating colour combinations we choose cyan as the colour for the printed pattern instead of red, and figure 3.6 shows its performance.

Figure 3.7 gives a closer look at the mixed area. In figure 3.7(a), areas A and C are the non-projected areas (the projection is zero), but A appears yellowish because the surface absorbs part of the ambient light, and C appears darker as it is affected by the blue grid on the printed sheet. D and B are the blue projection areas, but B is affected by the printed pattern in the same way. The task is to differentiate areas A and C from D and B by exploiting their colour channels. The immediate finding is that the printed cyan colour has very little effect on the blue channel of the captured image – A and C have very little blue component, and B and D have heavy blue channels, despite B and C being the areas where the surface is printed as cyan. The extraction result is shown in figure 3.7(b). The same cannot be applied to the red-blue method (figure 3.7(c),(d)), where the printed red area appears fully red in the observed image regardless of whether it is mixed with the blue projection or not.

Figure 3.6: Extraction of the projected pattern from the mixed one. (a) Blue and cyan mixed pattern. (b) Extracted blue pattern. (c) Blue and red mixed pattern. (d) Extracted blue pattern.

This method was also tested under different ambient illuminations. In general, experiments conducted when sufficient daylight is available outperform those conducted during the night, and this is mostly reflected in the failure to extract all the corners successfully because of less satisfactory results from the cyan and blue colour filtering. This is because during the night the room lighting needs to be turned on to illuminate the physical checkerboard while the projection is off, and this contributes negatively to the colour filtering at a later stage, as the fluorescent lamps disturb the colour channels more than sunlight does. When the points are extracted automatically, any captured image without enough corner points will be rejected (e.g. precisely 81 inner corner points are expected from a 10 × 10 checkerboard). Disqualifying more images leads to degradation of the accuracy of the calibration.

Figure 3.7: Extraction of the projected pattern from the mixed one (a closer look). (a) Blue and cyan mixed pattern. (b) Extracted blue pattern. (c) Blue and red mixed pattern. (d) Extracted blue pattern.


3.4.5 Camera Calibration

An automated process is implemented. All the user needs to do is to hold the whiteboard, which has a physical checkerboard pattern attached, at one pose for a short period (around 2 seconds) so that the camera can take two pictures with the projection turned on and off, and then re-position the board into a different orientation, as long as the whole printed checkerboard pattern stays within the common FOV of the camera and the projector.

1. After the image capture stage, the colour-filtered images as shown in the bottom left image of figure 3.6, captured from ten different orientations of the whiteboard, are used as the camera calibration images.

2. For each image, the user manually clicks the four top corners of the checkerboard. The user is also prompted to input the physical grid size of the checkerboard to set up the units of the world coordinate system. The grid numbers and the inner cross points are located automatically after the four top corners are given.

3. Normally the lens distortion can be tolerated at this stage, as the distortion model will be estimated later using the camera intrinsic parameters. In case of severe lens distortion, the user is advised to give an initial guess for the first order distortion factor kc1. The system will then take the guess and locate the corners more precisely, as shown in figure 3.8.

4. After corner points are extracted for all input images, the user can deploy the camera calibration. By defining the checkerboard plane as the world coordinate XOY plane and the first point the user clicked (the bottom-left corner) as the world coordinate origin, the 3D points of all corners are known (a sketch of generating such ground-truth points follows figure 3.8). The calibration parameters are first initialised, and then optimised by redoing the calibration using the improved reprojected corners based on the estimated camera parameters.

Figure 3.8: Extraction of the projected pattern from the mixed one (a closer look). (a) Blue and cyan mixed pattern. (b) Extracted blue pattern.
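As referenced in step 4 above, once the physical grid size is known the ground-truth world coordinates of the inner corners follow immediately, with Z = 0 on the board plane. The C++ sketch below is an illustration of generating such points; the corner counts and the 25 mm grid spacing are assumed values, not those of the actual board.

```cpp
#include <vector>
#include <cstdio>

struct Point3 { double X, Y, Z; };

// Generate the ground-truth world coordinates of the inner checkerboard
// corners.  The board plane is the world XOY plane, so Z = 0 for every
// corner; squareSize is the physical grid size entered by the user.
std::vector<Point3> checkerboardCorners(int cornersX, int cornersY, double squareSize)
{
    std::vector<Point3> pts;
    pts.reserve(static_cast<std::size_t>(cornersX) * cornersY);
    for (int j = 0; j < cornersY; ++j)
        for (int i = 0; i < cornersX; ++i)
            pts.push_back({ i * squareSize, j * squareSize, 0.0 });
    return pts;
}

int main()
{
    // 9 x 9 inner corners (a 10 x 10 checkerboard) with an assumed 25 mm grid.
    const auto pts = checkerboardCorners(9, 9, 25.0);
    std::printf("%zu corner points, first = (%.1f, %.1f, %.1f)\n",
                pts.size(), pts[0].X, pts[0].Y, pts[0].Z);
    return 0;
}
```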

3.4.6 Projector Calibration

By the time the projector is calibrated, calibration of the camera is already done. Therefore, the calibration images used for projector calibration (in our case, 10 blue checkerboard images) first go through an ‘Undistort’ stage before being used as input images for corner extraction. The two-dimensional distortion vector is used to remove distortion from the images.

The first few steps of projector calibration are the same as for the camera: read images, extract corners.

The extracted corners here cannot be used directly for calibration. They are the corner points in the captured image of the projected checkerboard. The information we need is the 3D coordinates of the corners of the projected pattern. Now the camera model can be used to perform these transformations.

In theory it is impossible to recover a 3D scene point merely from its 2D projection in the image plane, because given the projection in the image, its original 3D point could be anywhere along the projection ray if the scene structure is unknown. However, in our case all the points we are trying to recover lie on the checkerboard plane, which is chosen as the XOY plane of the WCS; that means Z = 0 for all of them. This relationship holds for all the different poses of the checkerboard, as the instantaneous plane in which the printed checkerboard lies is assumed to be the XOY plane of the WCS.

Technically, there is a different WCS for each tilt of the plane. This does not affect the final calibration result, because for N tilts there are N sets of different rotation and translation vectors. Geometrically, each of them only represents the relative geometry with respect to its temporary WCS, and only one set of rotation and translation vectors is used to estimate the final extrinsic parameters – the one from the view where the whiteboard is laid flat on the tabletop, as that is the configuration on which the VAE runs.

Let (x, y) be the image point whose 3D coordinate in the world coordinate system we are trying to recover, given the camera calibration parameters and the constraint Z = 0.

$$ \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K(R|T) \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} \qquad (3.23) $$

Here ≈ means equal up to a scale, so we replace it with a non-zero factor w:

$$ w \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = K(R|T) \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} \qquad (3.24) $$

Replacing K(R|T) with the 3 × 4 projection matrix P,

$$ P = K(R|T) = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix} \qquad (3.25) $$

From Equ. 3.24 and 3.25, we have

$$ w \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} \qquad (3.26) $$


Cancelling out the scale factor w by dividing the first and second rows by the third row of Equ. 3.26,

$$ x = \frac{p_{11}X + p_{12}Y + p_{14}}{p_{31}X + p_{32}Y + p_{34}} \qquad (3.27) $$

$$ y = \frac{p_{21}X + p_{22}Y + p_{24}}{p_{31}X + p_{32}Y + p_{34}} \qquad (3.28) $$

From Equ. 3.27 and 3.28, X and Y in Equ. 3.23 can be solved:

$$ X = \frac{(xp_{34} - p_{14})(p_{22} - yp_{32}) - (yp_{34} - p_{24})(p_{12} - xp_{32})}{(p_{11} - xp_{31})(p_{22} - yp_{32}) - (p_{21} - yp_{31})(p_{12} - xp_{32})} \qquad (3.29) $$

$$ Y = \frac{(xp_{34} - p_{14})(p_{21} - yp_{31}) - (yp_{34} - p_{24})(p_{11} - xp_{31})}{(p_{12} - xp_{32})(p_{21} - yp_{31}) - (p_{22} - yp_{32})(p_{11} - xp_{31})} \qquad (3.30) $$

A program was written by the author to implement all the calculations above. Given an extracted point (x, y) from a corner point in the observed blue pattern, with the camera already calibrated, its position (X, Y) in the world coordinate space is located from Equ. 3.29 and 3.30.
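The author's program is not listed in this thesis; the following is a minimal NumPy sketch of the same calculation, implementing Equ. 3.29 and 3.30 directly. The projection matrix in the example is hypothetical and serves only to exercise the function.

```python
import numpy as np

def backproject_to_plane(x, y, P):
    """Recover the (X, Y) world coordinates of an image point (x, y),
    assuming the point lies on the Z = 0 plane (Equ. 3.29 and 3.30)."""
    p11, p12, p13, p14 = P[0]
    p21, p22, p23, p24 = P[1]
    p31, p32, p33, p34 = P[2]
    X = ((x*p34 - p14)*(p22 - y*p32) - (y*p34 - p24)*(p12 - x*p32)) / \
        ((p11 - x*p31)*(p22 - y*p32) - (p21 - y*p31)*(p12 - x*p32))
    Y = ((x*p34 - p14)*(p21 - y*p31) - (y*p34 - p24)*(p11 - x*p31)) / \
        ((p12 - x*p32)*(p21 - y*p31) - (p22 - y*p32)*(p11 - x*p31))
    return X, Y

# Hypothetical projection matrix P = K(R|T), for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
RT = np.hstack([np.eye(3), np.array([[10.0], [20.0], [500.0]])])
P = K @ RT
print(backproject_to_plane(352.0, 272.0, P))   # prints (10.0, 0.0)
```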

Since the projection pattern (the blue checkerboard) is pre-designed, its corner points are all known. Together with the calculated 3D corners of the projected pattern, the projector can be calibrated in the same way as the camera. The estimated distortion vector kc for the projector is very close to zero, therefore the projector is assumed to have zero distortion.
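To make the data flow concrete, the sketch below shows how the projector could then be calibrated as a reverse camera: the recovered world corners (X, Y, 0) of each view act as object points, and the known corner positions in the projector's own pattern image act as image points. Using OpenCV's calibrateCamera here is an assumption made for illustration, not the thesis implementation, and all names are hypothetical.

```python
import cv2
import numpy as np

def calibrate_projector(world_corners_per_view, pattern_corners, projector_size):
    """Treat the projector as a reverse camera: the recovered 3D corners of
    the projected checkerboard (Z = 0 in each view's WCS) are the object
    points, and the known corner coordinates in the projector's own pattern
    image are the observed image points (identical for every view)."""
    object_points = [np.asarray(w, np.float32) for w in world_corners_per_view]
    image_points = [np.asarray(pattern_corners, np.float32)] * len(object_points)
    rms, Kp, kc_p, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, projector_size, None, None)
    return Kp, kc_p, rvecs, tvecs, rms
```

Note that calibrateCamera accepts a different set of object points for every view, which matches the per-tilt world coordinate systems discussed in section 3.4.6.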



3.5 Plane to Plane Calibration

The whole user interface of our collaborative system for 3D input is based on a plane (i.e. the tabletop). Therefore a precise estimation of the projective transform between the projector and the camera for this plane is desired, because we need constant, real-time monitoring of the augmented signals in the captured frames and responses to them. Although the calibration data previously worked out could be used, a more straightforward and accurate matching is preferred.

A homography matrix is modelled to represent this matching. A homography is a 3 × 3 non-singular matrix which defines a homogeneous linear transformation from one plane to another. Although there is no direct projective transform between the projector plane and the camera imaging plane in general, a homography still exists between these two planes because it is induced by a reference plane, which is the whiteboard in our case. Estimating the homography can be regarded as a 2D calibration process between the projector plane and the camera plane. A homography has 9 entries but only 8 degrees of freedom, being constrained by ||H|| = 1 so that it only carries out an up-to-scale matching.

Let the model plane (i.e. the whiteboard) coincide with the XOY plane of the world coordinate system. A 3D point on the model plane is then Pw = (Xw, Yw, 0, 1)^T, with its observed point in the camera plane Pc = (xc, yc, 1)^T and its projection source point in the projector plane Pp = (xp, yp, 1)^T. Similar to Equ. 3.23 and 3.24, we have

$$ \begin{pmatrix} x_c \\ y_c \\ 1 \end{pmatrix} \approx K_c(R_c|T_c) \begin{pmatrix} X_w \\ Y_w \\ 0 \\ 1 \end{pmatrix} = K_c \begin{pmatrix} r_{c1} & r_{c2} & t_c \end{pmatrix} \begin{pmatrix} X_w \\ Y_w \\ 1 \end{pmatrix} \qquad (3.31) $$

where r_{ci} denotes the i-th column of the camera rotation matrix R_c and t_c denotes the translation vector T_c.

The homography H_{wc} from the world plane to the camera plane can be expressed as

$$ H_{wc} \approx K_c \begin{pmatrix} r_{c1} & r_{c2} & t_c \end{pmatrix} \qquad (3.32) $$

Likewise, the homography H_{wp} from the world plane to the projector plane is

$$ H_{wp} \approx K_p \begin{pmatrix} r_{p1} & r_{p2} & t_p \end{pmatrix} \qquad (3.33) $$

Substitution of Equ. 3.32 and 3.33 into 3.31 yields

$$ P_c \approx H_{wc} P_w \qquad (3.34) $$

$$ P_p \approx H_{wp} P_w \qquad (3.35) $$

From Equ. 3.34 and 3.35, it is not hard to see that the two points Pc and Pp are still related by a projective transform, albeit one induced by a third plane:

$$ P_c \approx H_{pc} P_p \qquad (3.36) $$

where the homography from the projector plane to the camera plane is

$$ H_{pc} = H_{wc} H_{wp}^{-1} \qquad (3.37) $$


However, it can also be seen from Equ. 3.37 that this homography H_{pc} holds the current camera-projector relationship if and only if the reference plane is not changed. This is known as the plane-to-plane homography induced by a third plane. During our calibration, tilting the whiteboard 20 times yields 20 different homographies between the camera space and the projector space. Similar to the discussion in section 3.4.6, only the homography induced by the flat-placed whiteboard is of interest, because once the VAE is up and running the whiteboard is fixed onto the tabletop.

To solve the homography, all participating frames first go through the distortion removal stage using the calibrated camera internal model and distortion parameters. Keeping the same notation as Equ. 3.36, and introducing the scale factor w, Equ. 3.36 can be rewritten as

$$ \begin{pmatrix} w x_c \\ w y_c \\ w \end{pmatrix} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{pmatrix} \begin{pmatrix} x_p \\ y_p \\ 1 \end{pmatrix} \qquad (3.38) $$

Using the same method as in section 3.4.6 (Equ. 3.27 and 3.28) to cancel out w,

$$ x_c = \frac{h_1 x_p + h_2 y_p + h_3}{h_7 x_p + h_8 y_p + h_9} \qquad (3.39) $$

$$ y_c = \frac{h_4 x_p + h_5 y_p + h_6}{h_7 x_p + h_8 y_p + h_9} \qquad (3.40) $$

Each point gives two equations; thus, to solve H, which has 8 degrees of freedom (DOF), a minimum of 4 points is needed. With N ≥ 4 points,

$$ \begin{pmatrix}
x_{p1} & y_{p1} & 1 & 0 & 0 & 0 & -x_{p1}x_{c1} & -y_{p1}x_{c1} & -x_{c1} \\
0 & 0 & 0 & x_{p1} & y_{p1} & 1 & -x_{p1}y_{c1} & -y_{p1}y_{c1} & -y_{c1} \\
x_{p2} & y_{p2} & 1 & 0 & 0 & 0 & -x_{p2}x_{c2} & -y_{p2}x_{c2} & -x_{c2} \\
0 & 0 & 0 & x_{p2} & y_{p2} & 1 & -x_{p2}y_{c2} & -y_{p2}y_{c2} & -y_{c2} \\
\vdots & & & & & & & & \vdots \\
x_{pn} & y_{pn} & 1 & 0 & 0 & 0 & -x_{pn}x_{cn} & -y_{pn}x_{cn} & -x_{cn} \\
0 & 0 & 0 & x_{pn} & y_{pn} & 1 & -x_{pn}y_{cn} & -y_{pn}y_{cn} & -y_{cn}
\end{pmatrix}
\begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_9 \end{pmatrix} = 0 \qquad (3.41) $$

Let the 2N × 9 matrix in Equ. 3.41 be A. This becomes a typical problem of finding the least-squares solution of an over-determined system, minimising ||AH|| subject to ||H|| = 1. H could be solved by expanding the measurement matrix A into a square matrix and inverting it; we used an alternative solution, which obtains H by finding the eigenvector corresponding to the least eigenvalue of A^T A [4].
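As an illustration, the following NumPy sketch assembles the matrix of Equ. 3.41 and recovers H as the eigenvector of A^T A with the least eigenvalue; the helper names are illustrative and are not taken from the thesis code.

```python
import numpy as np

def estimate_homography(proj_pts, cam_pts):
    """Estimate H such that cam ~ H * proj (Equ. 3.36) from N >= 4
    correspondences, by stacking Equ. 3.41 and taking the eigenvector of
    A^T A with the smallest eigenvalue."""
    rows = []
    for (xp, yp), (xc, yc) in zip(proj_pts, cam_pts):
        rows.append([xp, yp, 1, 0, 0, 0, -xp * xc, -yp * xc, -xc])
        rows.append([0, 0, 0, xp, yp, 1, -xp * yc, -yp * yc, -yc])
    A = np.asarray(rows, float)
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    H = eigvecs[:, 0].reshape(3, 3)              # eigenvector of the least eigenvalue
    return H / np.linalg.norm(H)                 # enforce ||H|| = 1

def apply_homography(H, x, y):
    """Map a point through H, dividing by the scale factor w (Equ. 3.39, 3.40)."""
    w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
    return ((H[0, 0] * x + H[0, 1] * y + H[0, 2]) / w,
            (H[1, 0] * x + H[1, 1] * y + H[1, 2]) / w)
```

The returned matrix maps projector points to camera points as in Equ. 3.36; numpy.linalg.inv of it gives the reverse mapping used for the two-way transform described below.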

The solution to equation 3.41 is the homography between the camera and projector planes. It holds the transform in equation 3.38 from a point (xp, yp, 1)^T in the projection image to its observation (xc, yc, 1)^T in the camera image. The transform the other way round, from (xc, yc, 1)^T to (xp, yp, 1)^T, is given by the inverse of this homography. In this way, a two-way transform between the projection source and the camera observation of any augmentation in the VAE is available at any time.


3.6 Conclusions

This chapter began with an introduction to the fundamentals of the conventional camera calibration technique, followed by a detailed implementation of the camera calibration process using the Matlab toolbox designed by previous researchers. The method was then extended to calibrate the projector as a reverse camera. Finally, a fully automated method was implemented to calibrate the projector-camera system, and it is used by the VAE framework in this research.

The proposed method provides a means of estimating the internal and external parameters of the camera and the projector in an automated way. It is fast, efficient, and requires little intervention in the scene from the tester. A colour filtering technique is also proposed so that the physical printed pattern and the projected pattern can be extracted from the mixed observation while both remain firmly attached to the same surface plane. This effectively relieves the user of the duty of manually manipulating the calibration objects, such as sliding the physical pattern in and out to avoid its superimposition with the projected pattern.

A method of plane-to-plane calibration is presented in section 3.5. The result of this calibration is used once the VAE is up and running, to sustain the spatial relationship between the virtual augmentations and their observation in the camera image. This ensures a quick and reliable mapping for the VAE to monitor changes in the interactive environment, and to respond to them by augmenting the scene with corresponding video signals.


Although a comprehensive test has not been carried out, the proposed methods have been used reliably by the VAE system designed in this research over the past two years. Results from section 6.2.4 in chapter 6 suggest that accurate button locating is achieved, estimated only from the calibration results of the projector-camera pair, without doing any local image processing in the observed image to detect the button positions. The results were therefore positive and warrant further research into the use of this method.

3.6.1 Future Work

This chapter is concerned with the calibration process, which estimates the intrinsic and extrinsic parameters of the projector-camera pair and provides an accurate registration between the camera image space and the projector rendering space, but only at the geometric level.

To deal with the lighting situation, photometric camera settings such as brightness, contrast, exposure, and white balance are manually tuned and evaluated before the calibration. The photometric parameters of the projector are also pre-set. For example, to project a blue-black checkerboard pattern, the blue channel of the rendered image is set to full illumination (i.e. 255). One might wonder: is 255 the optimal value for the brightness in all scenarios?


A similar problem is also encountered in chapter 4, where a plain white image is projected onto the interface to illuminate the object being measured, so that the camera can take an image to serve as the colour map. In daytime, when sufficient ambient light is available, the image can be taken without any illumination from the projector. However, in the evening with the lights off, projector illumination is essential while capturing an image of the object, because it is the only light source. Furthermore, ambient light that is too strong also affects the projection, because it can over-illuminate the scene and weaken the projection signals. Therefore, choosing a universal brightness level of projector illumination for all the aforementioned scenarios can be problematic.


Figure 3.9: Pixel values of an image captured from a plain desktop. (a) projector brightness = 0; (b) projector brightness = 128; (c) red pixel values of (a); (d) red pixel values of (b). (The bottom two show the red channel only.)

Figure 3.9 shows an example of different projector illuminations. The top two images were captured with the projection brightness at 0 and 128 respectively. The bottom two are the corresponding distributions of pixel values across the planar surface (only the red channels are shown; the green and blue channels have similar distributions). The average pixel value in (d) is higher than in (c), as expected. In both images a slope is noticed, because the top of the desktop is closer to the window and hence receives more ambient light. When the projection brightness is set to 128 in (b), a reflection is caused, which shows up as a spike in the bottom-centre part of (d).

In this research, the photometric settings of the camera and the projector are both manually tuned until the camera can see the projections reasonably well. Future development of the calibration framework could include automatic photometric calibration which adjusts the camera and the projector lighting. Having a projector-camera pair is a big advantage for photometric calibration, because it makes it feasible to self-adjust the projector brightness by analysing the observed image, and to self-adjust the camera by evaluating the image quality captured under different projector illuminations.

Previous researchers at York [74] have proposed a means of photometric calibration, as a preliminary framework on which future research can build.


Chapter 4

Shape Acquisition

4.1 Introduction

Shape acquisition is one of the key topics in computer vision. The human visual ability to perceive depth using binocular stereopsis has been modelled with two displaced cameras to obtain range information of the scene, as described earlier in chapter 2. The principle of this computer vision task is to establish correspondences, in other words the matching points, between two or more images. In this thesis structured light is utilised as an active method to obtain range information with the help of a camera-projector pair. In VAE applications, it is always required that the structure information is extracted quickly and efficiently, so that collaborative work between the user, the PC and the video sensors is feasible. This can be fulfilled by structured light because of its flexibility, rapidity, and efficiency.

This chapter aims to provide an overview of structured light solutions, and then explain one particular method that is used in the later parts of this thesis. New contributions have been made to tackle issues arising in practice, such as the aliasing effect caused by limited camera resolution, and dealing with challenging surface materials on some of the objects. The chapter begins by considering different scenarios for the investigated method; a specification is then defined with the most practical subset of parameters given the hardware currently available in the lab.

It is acknowledged that a full 3D description is not achieved by a single structured light projection, nor with a single camera, which can only see part of the object. By changing the pose or position of the target object it is possible to build the 3D model (see chapter 5), but each structured light projection only gives depth information, which is often referred to as a 2.5D model. However, this aspect is beyond the scope of this chapter and will be introduced in later chapters.

The rest of the chapter is organised as follows. A review of the existing methods and recent research on structured light systems is presented in section 4.2. Section 4.3 introduces the codification scheme chosen for our application and the generation of the projection image stack with the associated look-up table. This is followed by section 4.3.3, where we discuss how the correspondence is established. Practical issues in the real world and hardware limitations are considered in section 4.4, where experimental results are also presented to validate the solutions proposed to tackle the problems. Section 4.5 explains depth calculation via triangulation. We then draw conclusions in section 4.6.

4.2 Background

Structured light projection systems use a projector which can project a light pattern such as dots, lines, grids or stripes onto the object surface, and a camera which captures the illuminated scene. By projecting one or a set of image patterns, it is possible to uniquely label each pixel in the image observed by the camera. Unlike stereo vision methods, which rely on the accuracy of matching algorithms, structured light establishes the geometric relationship automatically, by directly mapping the codewords assigned to each pixel to their corresponding coordinates in the source pattern. Comprehensive literature reviews and taxonomies of structured light systems can be found in [81, 45, 6, 15, 84].

The simplest way to label each pixel is to project a 2D grey ramp and a solid white pattern onto the measured surface, as tried by Carrihill et al. and Chazan et al. [21, 23]. By taking the ratio of the two observed images, the brightness at each pixel determines the pixel's corresponding coordinate in the original grey ramp image. However, this method is too sensitive to noise: slight variations in surface reflection and lighting cause brightness mismeasurement, which results in substantial triangulation errors. Therefore, more sophisticated codification schemes need to be considered.

One of the most commonly used strategies is temporal coding, where a set of images is successively projected onto the surface to be measured. In 1982, Posdamer and Altschuler [76] were the first to propose a projection of n images to encode 2^n stripes with a plain binary code. The resultant codewords are n-bit binary codes formed of 0s and 1s, with the more significant bits associated with earlier pattern images and the less significant bits with later ones. The symbol 0 corresponds to black intensity for a pixel in the observed image and 1 corresponds to full illumination. In this way the number of stripes doubles from each pattern image to the next.

Sato et al. [84] used Gray codes instead of plain binary. The Gray code has the advantage that successive codewords have unit Hamming distance, which makes the codification more robust. Trobina [97] presented a binary threshold model to improve the scheme: a Gray code is used, but the binary threshold between black and white in the observed image is determined for every pixel independently. This is achieved by taking a pair of full white and full black images at the beginning; the per-pixel threshold is the mean of the grey levels observed in the full white and full black images. More recently, Rocchini [81] proposed a method to address the problem of localising the stripe transitions in Gray code images. They encode the stripes with blue and red instead of black and white, with a green slit of pixels between every two stripes to help find the zero-crossings of the transitions at stripe boundaries.
aries.<br />

The aforementioned schemes often employ binary codes and use a coarse-to-fine paradigm. This eases the segmentation of the image patterns, and the codewords can normally be generated by thresholding the observed image stack. However, a number of patterns need to be projected, and problems are caused by the top-level patterns with very narrow stripes – too narrow for the camera to perceive.

Combining Gray code methods with phase shift methods answers this problem [9, 83, 105, 45, 98]. This is achieved by reducing the range resolution of the source patterns (i.e. using fewer levels of Gray code patterns to avoid narrow stripes) and compensating by exploiting spatial neighbourhood information, which is done by periodically shifting the pattern in every projection to distinguish the codewords of pixels falling into the same stripe. The limitation of these methods is that, by using shifted versions of the patterns, more images need to be projected and the total projection time increases considerably.

In the direction of using fewer images, to make it feasible to measure moving scenes, Boyer and Kak [16] employ colour patterns to encode more information into the codewords. They propose a colour stripe pattern where each group of consecutive stripes has a unique colour intensity configuration. Caspi et al. [22] use a colour generalisation of Gray codes. Davies and Nixon [33] use a colour dot pattern, but with a spatial window configuration similar to Boyer and Kak's [16]. Chen et al. [24] and Zhang et al. [109, 110] propose a stereo vision based method that only requires one image; the underlying idea of their methods is to use more than one camera and to solve the correspondences between stripe edges through dynamic programming.

These colour based methods have the capability of measuring quasi-stationary or moving scenes, since fewer images are used; however, there are restraints as well. Some of them use more than one camera, which requires extra work to calibrate the camera pair with the projector. Others require the measured surface to have uniform reflectance over all three RGB channels in order to extract the colour information accurately, so they are more suitable for certain applications such as monitoring hand gestures.

The Gray coded structured light codification scheme is adopted in this thesis because of its simplicity and robustness. Colour or phase based methods have their own strengths; however, we aim to develop a VAE system which can be deployed in various environments such as offices, museums, libraries, or other open environments. The system considered here is not designed only for laboratory use, where the projector-camera system is normally set up close to the interactive surface. We consider a top-down setup in which the vision sensor is relatively far away (high up) from the projection surface, and low-end cameras such as ordinary web cameras have difficulty picking up colour details at such a distance. In this context, with a few adaptations made to enhance the performance of the Gray coded structured light method, it yields reasonable results.

4.3 Gray Codification

4.3.1 Gray Code Patterns

Images with Gray coded stripes are used in this work. All images are stacked sequentially in the time domain. In figure 4.1 one slice from each image level is taken out and aligned spatially from the bottom up, purely to illustrate the codeword changes between adjacent image levels.

Figure 4.1: A 9-level Gray-coded image. (Only a slice from each image is shown here, to illustrate the change between adjacent codewords.)

Some of the advantages have already been mentioned in section 4.2, and here are a few other reasons for using this scheme. First, compared to dot and line patterns, stripe patterns offer high-resolution range information by labelling a dense and even distribution of 3D points over the scene. Second, the black and white coded pattern is more resilient to variation in surface reflectance than colour based methods, and with proper adaptations it handles objects with challenging materials (as will be discussed later in section 4.4). Finally, Gray-coded images have advantages over plain binary coded images, being less sensitive to errors and using wider stripes at higher levels (see figure 4.2). This is a desirable property, as it causes less interference between neighbouring stripes.

Figure 4.2: Comparison: minimum number of levels of Gray-coded and binary-coded images needed to encode 16 columns. (a) 4-bit plain binary code, top level stripes are 1 pixel wide. (b) 4-bit Gray code, top level stripes are two pixels wide.

4.3.2 Pattern Generation

The pattern generation stage is off-line and serves two purposes: to generate a Gray coded image stack and then to create a look-up table for later codification use. This is only carried out once, and both are held locally.

The stack of Gray-coded images is prepared in a temporal paradigm. All images are coded only with a one-dimensional Gray code, as the point-line correspondence is sufficient to solve for depth information; the reason for this will be explained later in section 4.5. Because of the binary nature of the Gray code, the pattern generation is straightforward. It can be considered as recreating a square wave, doubling the frequency and halving the wavelength at each image level along the time axis. For a data projector projecting images with a resolution of 1024 × 768, a 10-level Gray code is needed to make sure that:

1. all neighbouring rows or columns have different code words,

2. all rows or columns have unique code words.

Consider a 10-level horizontally Gray-coded image stack. During the look-up table generation, instead of assigning a 10-bit code value to each row number, all possible decimal code values are listed and then attached to the row numbers. By doing this, during the table look-up stage later on, each incoming pixel with a 10-bit code word can be matched to its corresponding row number more quickly by using the code's decimal value. In horizontal coding (row-wise coding) for a 1024 × 768 image, some code words do not exist after the whole image stack is coded, and these are attached with -1. A section of the look-up table for a 10-level Gray code will look like table 4.1 (a sketch of this generation step is given after the table). In vertical coding (column-wise), all 1024 columns are assigned a valid positive decimal code value.


Row    Decimal (Binary)
0      767 (1011111111)
1      766 (1011111110)
2      764 (1011111100)
...    ...
510    427 (0110101011)
511    426 (0110101010)
512    -1
513    -1
...    ...
1022   84 (0001010100)
1023   85 (0001010101)

Table 4.1: 10 level Gray code look-up table.
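A minimal sketch of this off-line stage is shown below: it builds a 10-level horizontally Gray-coded stripe stack for a 1024 × 768 projector together with a decimal-code-to-row look-up table. The bit assignment used here is the standard reflected binary Gray code of the row index, which is illustrative only and does not reproduce the exact codeword values listed in table 4.1.

```python
import numpy as np

LEVELS, WIDTH, HEIGHT = 10, 1024, 768

# Gray code of each row index: g = r XOR (r >> 1).
rows = np.arange(HEIGHT)
gray = rows ^ (rows >> 1)

# One stripe image per level: bit k of the Gray code decides black or white,
# with the more significant bits assigned to the earlier images.
patterns = []
for level in range(LEVELS):
    bit = (gray >> (LEVELS - 1 - level)) & 1
    stripe = np.where(bit[:, None] == 1, 255, 0).astype(np.uint8)   # (768, 1)
    patterns.append(np.repeat(stripe, WIDTH, axis=1))               # (768, 1024)

# Look-up table: decimal Gray code value -> row number,
# -1 where no projector row carries that codeword.
lookup = np.full(2 ** LEVELS, -1, dtype=np.int32)
lookup[gray] = rows
```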


For implementation, only a one-dimensional Gray code image set needs to be generated. As can be seen from figure 4.3, once the correspondence between the 2D point p in the camera plane and the stripe l in the projector plane is established via the Gray code, the 3D object point P is found as the intersection of a ray and a plane. The mathematical justification of the 1D Gray code is presented in section 4.5.

Figure 4.3: Point-line triangulation.

4.3.3 Codification Mechanism

The projection procedure consists of projecting a series of light patterns so that every encoded point in the observed image is identified by its sequence of intensities, which can be coded as a string of binary values (figure 4.4).

Figure 4.4: Binary encoded pattern divides the surface into many sub-regions.

The capture process starts with taking a snapshot of the scene with no projection. In severe lighting conditions, such as a dark room, uniform lighting from the projector can be used to help illuminate the scene. The level of projection brightness can vary depending on the current lighting condition, ranging from zero brightness to full white illumination. This first captured image serves as the colour texture map in the final representation of the current pose.

After the first shot, the whole image stack is projected sequentially, and images of the illuminated scene are captured in the same order (figure 4.5). Coding the binary image stack is similar to coding the Gray-coded pattern images. For a pixel with 2D image coordinates (x, y) in a 10-level image stack, a binary code word is formed from the binary values observed at the same position along the time axis, and its decimal representation is used to look up the corresponding row number in the projector space (table 4.1).

Figure 4.5: Stripes being projected onto a fluffy doll (10-level Gray coded stripes). (a) level = 4; (b) level = 5; (c) level = 6; (d) level = 7.


By iterating this approach across the whole observed image, each pixel is first labelled with a 10-bit binary code word, and then attached with a row number – representing its original position in the projector image, as if the projection ray were reversed. A dense point-line correspondence map is thus established. Using an appropriate triangulation method, the scene point (X, Y, Z, 1)^T can be recovered, as discussed in section 4.5.
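A sketch of the decoding step is shown below, assuming the thresholded binary image stack (see sections 4.4.3 and 4.4.4) and the look-up table from the pattern generation stage are already available; the array layout is an assumption made for illustration.

```python
import numpy as np

def decode_row_numbers(binary_stack, lookup):
    """binary_stack: (levels, H, W) array of 0/1 values, with level 0 holding
    the most significant bit of each pixel's codeword. Returns an (H, W) map
    of projector row numbers (-1 where the codeword matches no row)."""
    levels = binary_stack.shape[0]
    codes = np.zeros(binary_stack.shape[1:], dtype=np.int32)
    for level in range(levels):
        codes = (codes << 1) | binary_stack[level].astype(np.int32)
    return lookup[codes]
```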

4.4 Practical Issues

4.4.1 Image Levels

To eliminate ambiguities in the table look-up, it is important that no two rows (or columns) share the same codeword, so that for every single pixel in the observed image there can be only one row (column) in the projector image matching that pixel. Therefore, to explicitly code the images projected by a data projector with resolution set at 1024 × 768, a log2(1024) = 10 level Gray code is used to encode the pattern images, ensuring that each row or column is assigned a unique codeword. By doing this, it is possible to perform the table look-up for the observed image solely from the binary output image stack.

An alternative is to use fewer patterns so that thin stripes are avoided. However, there are a few drawbacks to this. First, because not enough bits are used, there will be groups of pixels sharing the same codeword. Locating either the stripe centres or the edges between neighbouring stripes then involves finding zero-crossings to determine the flip position between black and white stripes, which is not easy because of the blooming effect of the white stripes when observed in the camera. Secondly, stripe centres are not always perceivable, depending on the convexity of the measured surface and the presence of depth discontinuities. Furthermore, even if the stripe centres and edges are successfully located, interpolation needs to be done to estimate the points in between; otherwise the density of the range information is compromised.

Therefore, the maximum level of Gray code is found to be essential. Since thin stripes are then inevitable, further adaptations are considered to maintain robustness.

4.4.2 Limited Camera Resolution

A good example of the problems caused by the camera is the aliasing effect. As illustrated in the experiment of measuring a brick shown in figure 4.6a, after distortion recovery the stripe image at level 5 is clean. However, at image level 10 the thin stripes are almost invisible; instead, curly wave effects appear in the observed image (figure 4.6b), and the resultant depth map is affected too (figure 4.6c).

Figure 4.6: The aliasing effect causing errors in the depth map. (a) level = 5; (b) level = 10 (aliasing appears); (c) depth map without plane subtraction; (d) depth map with plane subtraction.

To alleviate this problem, we simply run a scan on the plain desktop with no object placed on it. The depth map of the plain surface is used as a base surface, which is subtracted from every depth map estimated later on to compensate for this defect (figure 4.6(d)). Although the resultant depth map of the object surface is slightly affected, the background noise (mostly from the tabletop) is removed. This is the simplest and quickest way to alleviate the aliasing problem without replacing the capture device with a more expensive one or changing the system setup.
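A minimal sketch of the base plane subtraction is given below, assuming both depth maps are stored as 2D arrays in millimetres; the small noise-floor clamp is an added assumption for illustration and is not part of the method described above.

```python
import numpy as np

def subtract_base_plane(depth_map, base_plane, noise_floor=1.0):
    """Subtract the depth map of the empty desktop (scanned once) from a new
    scan. Residuals smaller than noise_floor (assumed millimetres) are treated
    as background ripple from aliasing and clamped to zero."""
    relief = depth_map - base_plane
    relief[np.abs(relief) < noise_floor] = 0.0
    return relief
```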


Figure 4.7 gives a better visualisation by plotting the surface in 3D. The graph was generated by sampling the data every 20 pixels in both the x and y dimensions. It is clear that after the base plane subtraction, the uneven background is flattened.

Figure 4.7: 3D plots of figure 4.6. (a) Before the subtraction; (b) After the subtraction.

4.4.3 Inverse subtraction

For various reasons, the captured image stack cannot be used straightaway to determine whether the investigated pixels are on (illuminated) or off (not illuminated) at each level: there are different texture and reflectance properties across the scene, the ambient light is inconsistent, and the varying projection light adds further variation to the lighting condition.

For example, the theoretical threshold between white (255) and black (0) is 128, but in reality this is never the case. A pixel from a dark object can still appear close to 0 brightness even when it is illuminated by a full white projection. However, if the image taken under full white projection has the image taken under full black projection subtracted from it, every pixel has a positive value in the difference image, whether it belongs to a black or a white object.

To address this issue in our system, for each level of projection one original pattern and its inverted version (the black-white flipped image) are projected, and the observed image is subtracted from its inverse image to yield an image with positive and negative values. This brings the optimal black-white threshold value close to zero (figure 4.8). As a result, thresholding is done on the subtracted images instead of the original versions.

Figure 4.8: Inverse subtraction of the original image and its flipped version.

In figure 4.9, a football with black stripes is being scanned. As can be seen from the picture, there are glares (white spots) caused by the projector light and the reflective surface of the football itself. Figures 4.9(b) and (c) show the images captured when the level-4 stripe image and its flipped version are projected onto the surface, respectively. The thresholded output of (b), shown in figure 4.9(d), has obvious errors, because the black pattern belonging to the football itself stays black whether it is illuminated by white or black projection light. An optimal threshold is also very hard to choose, because it is object dependent and can be affected by the lighting conditions. Figure 4.9(e) is the subtraction of (b) and (c), with white pixels standing for positive values of the subtraction image, black pixels for negative values and grey pixels for values close to zero. Figure 4.9(f) is the binary output of (e), with white for ones and black for zeros, which is a better version of (d).

Figure 4.9: The inverse subtraction: the football experiment. (a) texture map; (b) stripe image (positive, level 4); (c) stripe image (negative, level 4); (d) threshold of (b), t = 100; (e) subtraction of (b) and (c); (f) binary image of (e).
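The subtraction itself is a one-liner once both frames are available; the sketch below only fixes a signed integer type so that negative differences are preserved.

```python
import numpy as np

def inverse_subtraction(img_positive, img_negative):
    """Difference of the frame captured under the original pattern and the
    frame captured under its inverted pattern. Illuminated pixels come out
    positive, dark pixels negative, uncertain pixels near zero."""
    return img_positive.astype(np.int16) - img_negative.astype(np.int16)
```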

4.4.4 Adaptive thresholding

Trobina [97] (see section 4.2) tries to improve thresholding accuracy by fixing a different threshold value for each pixel, based on the white-to-black reflectance ratio calculated from a solid white projection and a full black projection. There are a few concerns when this is carried out in practice. Unlike with a laser scanner, the observed brightness of a given point on the measured surface depends on the neighbouring projection rays around it. In the high frequency stripe images especially, where each black or white stripe occupies only two rows or columns, it is never guaranteed that a point falling into a black stripe will appear the same as when the scene is illuminated by full black.

To cope with this uncertainty, a three-level adaptive thresholding is used instead of binary thresholding. A dead zone around zero is introduced to deal with uncertainties; its size is set empirically. For any pixel with brightness outside the dead zone, the normal binary threshold is applied. Otherwise, the pixel is further inspected at the next image level. Pixels falling into the dead zone at two successive levels are rejected as background points, and they are not processed further in the remaining levels.


This is inspired by one of the properties of Gray coded images: no pixel is located at a stripe transition at two consecutive levels (see figures 4.1 and 4.2). This means any uncertain pixel encountered at one level can be verified by its appearance at the same position in the previous image, and it can be classified as a background or shadowed point if no clean-cut decision (either black or white) can be made in two consecutive image levels.
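A simplified sketch of the three-level decision for one image level is given below. The dead-zone width shown is a hypothetical value; in the real system it is set empirically, and pixels inside the dead zone are re-examined at the next level rather than assigned a bit immediately.

```python
import numpy as np

def classify_level(diff, prev_uncertain, dead_zone=20):
    """Classify one level of the subtracted stack (see inverse_subtraction).
    Returns (bits, uncertain, rejected):
      bits      - 1 where the pixel is confidently white, 0 where black
      uncertain - pixels inside the dead zone at this level
      rejected  - pixels inside the dead zone at this level AND the previous one"""
    uncertain = np.abs(diff) < dead_zone
    bits = (diff > 0).astype(np.uint8)
    rejected = uncertain & prev_uncertain
    return bits, uncertain, rejected
```

When iterating over the levels, prev_uncertain starts as an all-False mask, and the rejected masks are accumulated so that rejected pixels are excluded from the final codeword map.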

4.5 Depth from Triangulation

In some cases [6], where the camera and projector have the same orientation (strictly facing the same direction) and their displacement is known (a controlled displacement, for example with both mounted onto a fixed rail), the coordinates of a 3D point can be estimated through simplified triangulation without the external geometry of the projector and camera. However, this is not considered in our application, since it requires highly customised hardware.

A general purpose triangulation method for structured light systems is considered here, where the camera and the projector can be turned to arbitrary angles and both are properly calibrated in an earlier stage. For details of the calibration of a projector-camera system, please refer to chapter 3.

Let (x, y) be the 2D point currently being investigated. To recover its 3D coordinate (X, Y, Z), we build the full projection model (cf. equation 3.25) using homogeneous coordinates [42],

$$ w \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (4.1) $$

where the matrix C = K_c(R_c|T_c) is the 3 × 4 camera projection matrix and w is a scale factor.

To cancel out the scale factor w, divide the first row of eq 4.1 by the third,

$$ (c_{11} - xc_{31})X + (c_{12} - xc_{32})Y + (c_{13} - xc_{33})Z + (c_{14} - xc_{34}) = 0 \qquad (4.2) $$

Dividing the second row by the third in eq 4.1,

$$ (c_{21} - yc_{31})X + (c_{22} - yc_{32})Y + (c_{23} - yc_{33})Z + (c_{24} - yc_{34}) = 0 \qquad (4.3) $$

If the point (x, y) corresponds to the point (m, n) in the projector plane, then similarly


$$ w' \begin{pmatrix} m \\ n \\ 1 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (4.4) $$

Using the same method to cancel out the scale factor w',

$$ (p_{11} - mp_{31})X + (p_{12} - mp_{32})Y + (p_{13} - mp_{33})Z + (p_{14} - mp_{34}) = 0 \qquad (4.5) $$

$$ (p_{21} - np_{31})X + (p_{22} - np_{32})Y + (p_{23} - np_{33})Z + (p_{24} - np_{34}) = 0 \qquad (4.6) $$

From equations 4.2, 4.3 and 4.6, we have

$$ \begin{pmatrix}
c_{11} - xc_{31} & c_{12} - xc_{32} & c_{13} - xc_{33} & c_{14} - xc_{34} \\
c_{21} - yc_{31} & c_{22} - yc_{32} & c_{23} - yc_{33} & c_{24} - yc_{34} \\
p_{21} - np_{31} & p_{22} - np_{32} & p_{23} - np_{33} & p_{24} - np_{34}
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = 0 \qquad (4.7) $$

This now becomes a problem of solving a set of linear equations. The first matrix in eq 4.7 is often referred to as the measurement matrix A. The vector (X, Y, Z, 1)^T is solved by finding the eigenvector with the least eigenvalue of the matrix A^T A [4].

Equivalently, equation 4.7 can also be constructed from equations 4.2, 4.3 and 4.5. Choosing either of the two equations 4.5 and 4.6 yields the same result, which shows that the structured light projection only needs to be done in one dimension, either horizontally or vertically. Using both of them is not recommended, as it doubles the capture time while only providing an over-determined linear equation system. The final system therefore only uses horizontal stripes.
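As an illustration, the following NumPy sketch implements the triangulation of Equ. 4.7, assuming the 3 × 4 camera and projector projection matrices from chapter 3 are available as arrays C and P.

```python
import numpy as np

def triangulate_point(x, y, n, C, P):
    """Intersect the camera ray through pixel (x, y) with the projector plane
    of row n (Equ. 4.2, 4.3 and 4.6). C and P are the 3x4 camera and projector
    projection matrices."""
    A = np.array([
        C[0] - x * C[2],          # Equ. 4.2
        C[1] - y * C[2],          # Equ. 4.3
        P[1] - n * P[2],          # Equ. 4.6 (horizontal stripes encode rows)
    ])
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    X = eigvecs[:, 0]             # eigenvector of the least eigenvalue
    return X[:3] / X[3]           # back to inhomogeneous (X, Y, Z)
```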

4.5.1 Final Captured Data

After each successful structured light scan, the following data are captured and saved into memory for further processing. Figures 4.10 to 4.13 show the rendered data in the form of images. The scattered 3D point set can be rendered at any arbitrary pose; figures 4.12 and 4.13 show it rendered at one pose.

112


Figure 4.10: Depth map.<br />

113


Figure 4.11: Colour texture.<br />

114


Figure 4.12: Scattered po<strong>in</strong>t set <strong>in</strong> 3D. (re-sampled at every 2 millimetre)<br />

115


Figure 4.13: Scattered po<strong>in</strong>t set <strong>in</strong> 3D, attached with colour <strong>in</strong>formation.<br />

(re-sampled at every 2 millimetre)<br />

4.6 Conclusions

This chapter introduces a method for acquiring depth information using a structured light system. After studying the existing codification schemes, Gray coded structured light is used in this research for its simplicity and robustness. A variety of problems were encountered during implementation, and solutions are provided to tackle them. Preliminary experimental results suggest that the proposed techniques positively enhance the system performance.

First, we justify it is essential to use maximum level <strong>of</strong> Gray coded im-<br />

ages both theoretically and experimentally. This is at the risk <strong>of</strong> mak<strong>in</strong>g<br />

the stripes too th<strong>in</strong> to detect by the camera, which has limited resolution<br />

and mounted high above the desktop surface.<br />

Secondly, because of the large distance between the ceiling-mounted camera and the tabletop, and the limited capture resolution of the camera, stripes that become too thin cause an aliasing effect in the observed images (figure 4.6). When a single-pixel-wide line is projected, it can be observed in the camera image as a blend of three or four neighbouring lines. When multiple lines that are close to each other are projected, the observed lines are likely to mix with each other (figure 4.14). This not only causes a visible aliasing effect but also assigns multiple lines to a single 2D image pixel. A base plane subtraction method is proposed to deal with this challenge caused by the aliasing effect.

Thirdly, inverse subtraction and adaptive thresholding are combined to perform robust codeword generation. This is a significant boost to the codification: the object surface colour is no longer a concern, while these techniques keep the optimal threshold between 0s and 1s close to zero.
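A minimal sketch of the inverse-subtraction idea is given below. It is an illustration only, assuming NumPy and greyscale camera frames; the adaptive thresholding and confidence masking used in the actual system are not reproduced here.

```python
import numpy as np

def decode_bit(pattern_img, inverse_img, threshold=0.0):
    """Classify each camera pixel as bit 1 or 0 for one Gray-code level.

    The scene is captured once under the stripe pattern and once under its
    inverse; their difference is largely independent of the surface colour,
    so the decision threshold can stay close to zero. Pixels whose
    difference is too small to be trusted could additionally be marked
    invalid by a separate confidence mask (not shown).
    """
    diff = pattern_img.astype(np.float32) - inverse_img.astype(np.float32)
    return (diff > threshold).astype(np.uint8)
```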

(a) The projector image: lines are single-pixel-wide and three pixels apart from each other. (b) The observed image: one thicker line is observed instead of three clean-cut lines.

Figure 4.14: Illustration of the camera's limited resolution.

Finally, it is geometrically and mathematically justified (figure 4.3 and section 4.5) that the structured light projection is only required to run in one dimension, either horizontal or vertical, provided a proper triangulation method is used.

4.6.1 Future Work

With the current projector-camera setup, the system performance is mostly hindered by the limited capture resolution of the camera and the distance between the ceiling-mounted camera and the tabletop. However, once the VAE is set up and running, it is not possible to change these factors. Therefore, efforts need to be made in other areas to compensate for these negative contributions.

In section 4.4.2 a base plane subtraction method is proposed to compensate for the aliasing effect caused by the aforementioned limitations. This method is still preliminary and has its own limitations, the most significant being that the subtraction is restricted to a planar surface (in this case, the tabletop). It is incapable of modelling the artifacts caused by aliasing on arbitrary object surfaces. Future research could investigate this area further to properly model this distortion.

It is claimed in this chapter (section 4.4.1) that the maximum possible level of Gray coded images should be used to uniquely label every row or column in the rendered projection image and to avoid codeword sharing. This is based on the fact that a dense depth map is required in this shape acquisition process. In certain applications where only a sparse depth map is needed, it is possible to use fewer levels of stripe images. Sparse depth information can be recovered at the stripe transitions or at located stripe centres, and a significant advantage is that the camera is not forced to capture a scene illuminated by thin stripes that are beyond its resolution.

Future work on photometric calibration, discussed in the previous chapter (section 3.6), also relates to the development of structured light systems. A successful calibration of the photometric properties of the camera could lead to the use of colour-based structured light systems. Since colour-based methods normally use fewer images (sometimes just one), this opens up the possibility of turning shape acquisition into a real-time process. This would be an attractive feature for the VAE, and with real-time depth scanning capability many applications could be built within the VAE framework.



Chapter 5

Registration of Point Sets

Creating a 3D model of a real object is a multi-stage process, because cameras only deliver data from one view of the target object at a time. Obtaining a complete model requires either the scanner to shoot from different views to cover the whole object, or, equivalently, moving the object relative to a stationary scanner. Whichever scenario is chosen, registration of the scanned data from different views is required. This chapter is focused on this subject.

After each structured light scan, a cloud of point samples from the surface of an object is obtained. Placing the object in different positions on the tabletop, or in different orientations towards the camera, yields several point sets which together are expected to cover the whole surface of the object to be measured. The objective of registration is to fuse these clouds together by estimating the transformations between them, placing all the data into the same reference frame for visualisation or further processing.

The process of point set fusion begins with 2D image registration on the colour textures of the two participating views, where interest points are first extracted by corner detectors and then correlated. Once the 2D correspondences are established, the 3D coordinates of the matched points are used as control points to estimate the rotation and translation in 3D space between the two sets of points. The estimated rotation and translation are used as an initial guess to perform a trial merge, by warping one point set onto the other in 3D space based on the estimated transform. The user has the final decision of whether to accept this trial given by the computer, or to manually improve the fusion of the point sets by tuning them into different poses in a virtual environment using the augmented tools.

The whole process combines automated image processing and human interaction. For example, tasks such as 2D image registration or exhaustive searching for transformation vectors are executed by automated processes, while the final tuning and merging is handed over to human interaction. This is not only because humans are chosen to be the decision makers, but also because this is what humans are good at – spotting where things go wrong and responding in an effective way. In the rest of this chapter, this is explained in detail.

5.1 Introduction

Assume there exist two point sets {mi} and {di}, i = 1, 2, ..., N, and the correspondences between them are already established, either from ground truth or by matching the point sets in 3D space. We name {mi} the model points and {di} the data points. If they are both from the same model, the objective is to find the relative rotation and translation from the data points to the model points, so that in 3D space they are related by

di = Rmi + T + ei   (5.1)

where R is the 3 × 3 rotation matrix, T is the 3 × 1 translation vector and ei is a noise vector. Solving for the optimal solutions R̂ and T̂ that map the two point sets is a least-squares minimisation problem:

$$ \sum_{i=1}^{N} \left\| d_i - \hat{R} m_i - \hat{T} \right\|^2 \qquad (5.2) $$

Because the correspondences between the point sets are unknown a priori, the most straightforward method to register two point sets is exhaustive search in 3D space. However, this method faces challenges in processing time, convergence speed, and falling into local minima. It is not complex to implement, but it consumes a lot of processing power and is therefore not suitable for VAE systems.

Using calibrated motion is another routine to solve the registration, but this brings new problems too. To control either the movement of the scanners or the object to be measured, additional hardware such as rails and turntables is inevitable. The scanner may require extra calibration as well. More importantly, in the context of a VAE it is desired that the restriction to controlled motion be lifted, so that the object to be measured can be freely moved into any pose in 3D space under the guidance of the user.

In this research, the routine chosen to fuse two point sets incorporates three stages: 2D planar image registration (section 5.3), point set registration using corresponded features with a voxel based quantisation process (section 5.4), and rendering (section 5.5). They are discussed separately in the rest of this chapter. When more than two views are presented, the problem is reduced to a chain of pairwise registrations.

5.2 Background

Figure 5.1: A routine of point set registration.

5.2.1 Rotations and Translations in 3D

There are several common ways to build a rotation matrix. The most frequently documented representation is to rotate a point around one of the three coordinate axes. The advantage of this representation is that the generated 3 × 3 rotation matrix can be applied directly to 3D points through matrix manipulation. To rotate a point around the X, Y, and Z axes, we have:

$$ R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{bmatrix} \qquad (5.3) $$

$$ R_y = \begin{bmatrix} \cos(\phi) & 0 & \sin(\phi) \\ 0 & 1 & 0 \\ -\sin(\phi) & 0 & \cos(\phi) \end{bmatrix} \qquad (5.4) $$

$$ R_z = \begin{bmatrix} \cos(\psi) & -\sin(\psi) & 0 \\ \sin(\psi) & \cos(\psi) & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (5.5) $$

where θ, φ, and ψ are the rotations around the X, Y, and Z axes respectively. More detailed discussions of rotation in 3D are given in [44], [47].

5.2.2 An SVD Based Least Squares Fitting Method

SVD is one of the most significant topics in linear algebra, and it has considerable theoretical and practical value [54, 62, 95]. A very important feature of SVD is that it can be performed on any real matrix. The result of the decomposition is to factor a matrix A into three matrices U, S, V such that A = USV^T, where U and V are orthogonal matrices and S is a diagonal matrix. SVD is also a common tool used to solve least-squares problems (section 3.5, section 4.5).

Arun, Huang and Blostein [3] proposed a method of computing the 3D rotation matrix and translation vector by performing the Singular Value Decomposition (SVD) of the 3 × 3 correlation matrix, which is built as follows,

$$ H = \sum_{i=1}^{N} m_{c,i} \, d_{c,i}^T \qquad (5.6) $$

where m_{c,i} and d_{c,i} are obtained by translating the original data sets mi and di (equation 5.2) to the origin.

The SVD of the correlation matrix is H = USV^T, and the optimal rotation matrix is first computed from

R̂ = VU^T   (5.7)

The computation of R̂ is also known as the Orthogonal Procrustes Problem [88].

The optimal translation is the vector that aligns the centroids of the point sets di and mi, which is

T̂ = d̄ − m̄   (5.8)

Of course T̂ = d̄ − R̂m̄ holds too, because rotating a point set about the origin does not change the centroid of the point set itself.
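The following Python sketch illustrates this fitting method under the assumption of NumPy arrays of corresponded points; it is not the thesis implementation, and the reflection check at the end is a standard refinement not discussed in the text above.

```python
import numpy as np

def fit_rigid_transform(m, d):
    """Estimate the rotation R_hat and translation T_hat mapping the model
    points m onto the data points d (both N x 3 arrays with known one-to-one
    correspondences), following the SVD method of Arun et al. sketched above.
    """
    m_bar, d_bar = m.mean(axis=0), d.mean(axis=0)
    m_c, d_c = m - m_bar, d - d_bar               # translate both sets to the origin
    H = m_c.T @ d_c                               # 3 x 3 correlation matrix (eq. 5.6)
    U, S, Vt = np.linalg.svd(H)
    R_hat = Vt.T @ U.T                            # eq. 5.7
    if np.linalg.det(R_hat) < 0:                  # guard against a reflection
        Vt[-1] *= -1
        R_hat = Vt.T @ U.T
    T_hat = d_bar - R_hat @ m_bar                 # aligns the two centroids
    return R_hat, T_hat
```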

5.3 Image Registration

5.3.1 Corner Detector

Building a dense correspondence map from a pair of input images is not practical considering the amount of computation involved. Therefore the first step is to choose a set of distinctive points as interest points from both input images. To find these interest points, a Harris corner detector [46] is applied to the textures. The corner detector uses the following structure matrix to evaluate whether a given pixel is a corner or not.

$$ G = \begin{bmatrix} \sum_{w} f_x^2 & \sum_{w} f_x f_y \\ \sum_{w} f_x f_y & \sum_{w} f_y^2 \end{bmatrix} = Q \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} Q^T \qquad (5.9) $$

The second part of equation 5.9 is the eigen-decomposition of the matrix to its left; fx and fy are the first derivatives in the horizontal and vertical directions respectively, and w is the aggregation window. For the two output eigenvalues, λ1 ≥ λ2, and the Harris corner detector defines that when λ2 ≫ 0 the pixel can be interpreted as a corner within a certain region. Even though the sign of the eigenvalues contains information about the local gradient, we are not interested in it here, as our purpose is simply to find the points of interest.

To implement the corner detection algorithm, fx and fy are first computed from the convolution of the original image with two derivative kernels

$$ D_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \qquad (5.10) $$

$$ D_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \qquad (5.11) $$

The G matrix is constructed for each pixel from the derivatives, and it is aggregated over the neighbouring pixels. The two eigenvalues of the G matrix are then computed and the smaller of the two is stored. A pixel is considered a corner if it has the largest stored eigenvalue in the given area and the value is greater than a threshold. This step is repeated for all pixels in both input images.

The whole process is shown in figure 5.2. A pair of images of a corridor is used to better illustrate the extraction of corner points. w1 is the aggregation window for the structure matrix G (equ. 5.9). w2 is the local evaluation window within which the pixel with the biggest λ2 is considered a corner candidate. λ is the threshold for λ2: the pixel is considered a corner if λ2 > λ.
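The following sketch gathers these steps into a single routine. It is illustrative only, assuming NumPy and SciPy and a greyscale image; the window sizes and threshold are placeholder defaults rather than the values used in the system.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter, maximum_filter

def harris_corners(img, w1=5, w2=9, lam=1e-3):
    """Corner detection as described above: build the structure matrix G
    per pixel, keep the smaller eigenvalue, and accept local maxima above
    the threshold lam."""
    Dx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], float)  # eq. 5.10
    Dy = Dx.T                                                   # eq. 5.11
    fx, fy = convolve(img, Dx), convolve(img, Dy)
    # Aggregate the entries of G over the w1 x w1 window.
    Sxx = uniform_filter(fx * fx, w1)
    Syy = uniform_filter(fy * fy, w1)
    Sxy = uniform_filter(fx * fy, w1)
    # Smaller eigenvalue of the symmetric 2x2 matrix [[Sxx, Sxy], [Sxy, Syy]].
    trace_half = (Sxx + Syy) / 2.0
    root = np.sqrt(((Sxx - Syy) / 2.0) ** 2 + Sxy ** 2)
    lam2 = trace_half - root
    # Keep pixels that are the local maximum of lam2 within w2 and above lam.
    corners = (lam2 == maximum_filter(lam2, w2)) & (lam2 > lam)
    return np.argwhere(corners)
```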

5.3.2 Normalised Cross Correlation

After the interest points are detected in the input image pair, correspondences are found using Normalised Cross Correlation (NCC) [77]. For each interest point in the left image, we look for its maximum correlation in the right image using the NCC cost function below,

$$ NCC = \frac{\sum_{(x,y)\in W} (f_1(x, y) - \bar{f_1})(f_2(x, y) - \bar{f_2})}{\sqrt{\sum_{(x,y)\in W} (f_1(x, y) - \bar{f_1})^2 \sum_{(x,y)\in W} (f_2(x, y) - \bar{f_2})^2}} \qquad (5.12) $$

where fk(x, y) is the k-th image block, f̄k is the average value of the block, and W is the size of the search window.

(a) Original image. (b) Gaussian smoothed image. (c) First x derivatives. (d) First y derivatives. (e) Eigenvalue image. (f) Detected corners.

Figure 5.2: Corner detection.

To implement the algorithm, we first take an interest point from the left image and construct an N × N block centred on it. We then calculate the NCC between the current block and all the corner points encountered in the right image, within the search range W. The corner point with the maximum NCC value is assigned as the corresponding point. This process is repeated for all the interest points in the left image.
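A minimal sketch of this matching loop is shown below, assuming NumPy, greyscale images, and corner lists produced by the detector above; block size N and search range W are illustrative values.

```python
import numpy as np

def ncc(block1, block2):
    """Normalised cross correlation (eq. 5.12) between two equal-sized blocks."""
    a = block1 - block1.mean()
    b = block2 - block2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_corners(left, right, corners_left, corners_right, N=11, W=64):
    """For each left-image corner, return the right-image corner inside the
    W x W search window with the highest NCC score. The resulting putative
    matches are not yet one-to-one (see section 5.3.3)."""
    r = N // 2
    matches = []
    for (y1, x1) in corners_left:
        block1 = left[y1 - r:y1 + r + 1, x1 - r:x1 + r + 1]
        best, best_score = None, -1.0
        for (y2, x2) in corners_right:
            if abs(y2 - y1) > W // 2 or abs(x2 - x1) > W // 2:
                continue  # outside the search window
            block2 = right[y2 - r:y2 + r + 1, x2 - r:x2 + r + 1]
            if block1.shape != block2.shape:
                continue  # skip blocks clipped by the image border
            score = ncc(block1, block2)
            if score > best_score:
                best, best_score = (y2, x2), score
        if best is not None:
            matches.append(((y1, x1), best))
    return matches
```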

Results are shown in figures 5.3 and 5.4. Choosing different sizes of the search window yields different results. Especially when periodic patterns are involved, the checkerboard for example, the result is far less accurate if an inappropriate search window size is chosen. Furthermore, at this stage the correspondences are not one-to-one: for a corner point in the right image, it is likely that more than one point from the left image has found it as the best match. The details of mismatch removal are discussed in section 5.3.3.

5.3.3 Outlier Removals

(a) Search window: W = 64. (b) Search window: W = 256.

Figure 5.3: NCC results.

With the correspondences given, we feed them into the correlation matrix (equ. 5.6) so that the rotation matrix and translation vector can be estimated. To do this reliably, outliers need to be removed. The RANdom SAmple Consensus (RANSAC) algorithm [38] is widely used for robust fitting of models in the presence of data outliers. The algorithm keeps randomly selecting data items and using them to estimate the data model until a good fit is found or the maximum number of iterations is reached. Only the data that satisfy certain criteria are considered meaningful. The choice of criteria depends on the data to be measured; for example, it can be the Euclidean distance of a point to the centroid of a cloud of points, the disparity in brightness of a group of windowed pixels, or another cost function.

(a) W = 64. (b) W = 256.

Figure 5.4: NCC results (periodic pattern).

In this work, since the transform between the two observed images can be encapsulated in a 3 × 3 homography matrix, the RANSAC algorithm is implemented with the following adaptations:

1. Start with the putative correspondences computed from NCC (section 5.3.2).

2. Repeat steps 3-7 N times, with N being updated using algorithm 4.5 from [47].

3. Select a random sample of 4 correspondences and check the data for colinearity. If the sample is degenerate, reselect.

4. Compute the homography H using the method presented in section 3.5.

5. Calculate the distance for each of the putative correspondences, d = d(mi, m̂i)² + d(di, d̂i)², where m̂i and d̂i are the transformed points based on the estimated homography H.

6. Count the putative correspondences consistent with the current H, using the criterion that the distance calculated in step 5 is no greater than an empirical threshold. The qualifying correspondences are the inliers.

7. If the number of inliers for the current H is the largest so far, update H and the set of inliers consistent with H.

8. On completion, choose the group of inliers associated with the best H found.

9. Re-calculate H using all the remaining inliers.

In general, because the homography H is estimated from 4 randomly selected correspondences in each loop, even the best such estimate still needs to be refined by computing the homography once again with all the qualified inliers from the putative correspondences. However, in this work we are only focused on choosing reliable correspondences rather than recovering the 2D projective transform between the images; as a result, step 9 is not necessary and can be omitted.
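A simplified sketch of this adapted loop is given below. It assumes NumPy arrays of matched 2D points, uses a generic DLT in place of the homography method of section 3.5, replaces the adaptive iteration count and the symmetric transfer error with a fixed count and a one-way reprojection error, and omits the colinearity check; it is meant only to illustrate the structure of steps 1–8.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform for a 3x3 homography from >= 4 point pairs
    (a stand-in for the method of section 3.5)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)

def apply_h(H, pts):
    """Apply a homography to an N x 2 array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_inliers(left_pts, right_pts, n_iters=1000, threshold=3.0):
    """Keep the largest set of correspondences consistent with one homography."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(left_pts), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(left_pts), 4, replace=False)
        H = estimate_homography(left_pts[sample], right_pts[sample])
        err = np.linalg.norm(apply_h(H, left_pts) - right_pts, axis=1)
        inliers = err < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```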

(a) T = 100: 235 putative correspondences after NCC, 142 inliers. (b) Rectified image pair.

Figure 5.5: Robust estimation (inliers shown by red connecting lines).

(a) T = 50: 80 putative correspondences after NCC, 80 inliers. (b) Rectified image pair.

Figure 5.6: Robust estimation (inliers shown by index numbers).

5.4 Fusion

5.4.1 Data structure of a point set

The data structure of a point set is depicted in figure 5.7. For each point, the following information is stored: its index in the data array, its 3D world coordinates (X, Y, Z), its 2D image coordinates (x, y), and its colour information in RGB channels.
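A possible rendering of that per-point record is sketched below; the field names are illustrative and are not prescribed by the text or by figure 5.7.

```python
from dataclasses import dataclass

@dataclass
class ScanPoint:
    """One entry of the point-set structure depicted in figure 5.7."""
    index: int                            # position in the data array
    world: tuple[float, float, float]     # 3D world coordinates (X, Y, Z)
    pixel: tuple[int, int]                # 2D image coordinates (x, y)
    colour: tuple[int, int, int]          # RGB colour of the pixel
```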

Figure 5.7: Data structure of a point set.

5.4.2 Point set fusion with voxel quantisation

For each single view, a point set is obtained by estimating the 3D positions of the foreground pixels in the captured image. All background parts, such as the tabletop and non-projected areas, have non-positive depth and are rejected. The size of the point set is therefore the total number of pixels that have positive depth in the corresponding depth image.

The data size can sometimes be huge. Figure 5.8(a) shows the point set of a fluffy doll, which measures roughly 600mm in height, width and depth. The resulting point set contains 34056 points, many of which are very close to their neighbouring points in 3D space. This causes redundancy and increases the burden of rendering the point set or transforming it in 3D. A voxel quantisation method is presented here to deal with this problem.

(a) The point set. (b) Voxel quantisation.

Figure 5.8: Voxel quantisation of the large data set.

For each point set, we keep two copies in memory. One copy is the original data set, where all the points are saved as a backup so that no information is lost. The other copy is a slimmed version for display and other front-end purposes. The slimming begins with an estimation of the space the point set occupies in 3D, by computing the centroid of the point set and the furthest points along the X, Y and Z axes. A cube of the estimated required size is then constructed to contain the whole point set, and it is divided into voxels, which are smaller cubes (figure 5.8(b)). All 3D points falling into the same voxel are averaged into one point, and voxels with no points falling into them are not considered.
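The following is a minimal sketch of this quantisation step, assuming NumPy and an N × 3 array of world coordinates; the averaging of the associated colours and the explicit bounding-cube construction are left out.

```python
import numpy as np

def voxel_quantise(points, voxel_size):
    """Slim a point set by averaging all points that fall into the same
    cubic voxel of side 'voxel_size' (in the same units as the points)."""
    origin = points.min(axis=0)                          # corner of the bounding volume
    keys = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Group the points by voxel index and average each group.
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(uniq), 3))
    counts = np.zeros(len(uniq))
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]                        # one averaged point per voxel
```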


Bigger and fewer voxels give a coarser quantisation and less detail (figure 5.9). A point set with an original size of 34056 points is slimmed using voxel sizes of s = 1mm and s = 10mm respectively. As the voxel size increases, the point set becomes more and more sparse.

(a) s = 1mm, 32097 points. (b) s = 10mm, 4373 points.

Figure 5.9: Different quantisation levels obtained by choosing different voxel sizes.

Figure 5.12 shows that the choice of voxel size can be object independent. The football, the fluffy owl, and the vase placed in different orientations all have different sizes and surface structures (figures 5.10 and 5.11). In figure 5.12(a), the total-points curve for the owl starts very high but drops dramatically; this is because the physical size of this object is much bigger than that of the other three objects tested. Comparing figures 5.12(a) and (b), it is not hard to see that the total size of a point set has very little impact on the proportion of data lost through voxel quantisation. In the graph with the percentage curves, it can be seen that all four objects drop in a similar manner as the voxel size increases.


(a) Football. (b) Point set of (a). (c) Owl. (d) Point set of (c).

Figure 5.10: The captured objects of figure 5.12.

Looking at figure 5.12(b), a universal voxel size of 2mm can be chosen to conserve over 80% of the original data, while choosing a voxel size of 5mm throws half of the information away. This is particularly useful because the voxel size can be decided by how much the data from different views overlap. The redundancy can be reduced to a minimum if an appropriate voxel size is chosen.

(a) Vase (horizontal shot). (b) Point set of (a). (c) Vase (vertical shot). (d) Point set of (c).

Figure 5.11: The captured objects of figure 5.12.

(a) Total points. (b) Percentage of the original data size.

Figure 5.12: The quantisation effect of choosing different voxel sizes on the total point set size.

5.4.3 User Assisted Tuning

As discussed earlier, the transform between the two point sets in 3D space can be estimated using the SVD based fitting algorithm (section 5.2.2) from the set of matching points computed in section 5.3. Before committing to saving the estimated transform, the user is given the chance to manually tune the point sets. This process is also visualised, and the tuning result is instantly reflected on the desktop, as shown in figure 5.13.

Further discussion of this interactive tuning and of the multiple point set registration scenario is presented in more detail in sections 6.4.4 and 6.4.5.

Figure 5.13: Manual tuning of point set registration.

5.5 Rendering A Rotating Object

Rotating an object about the WCS origin carries the risk of moving the object out of the camera's field of view, so the common way to visualise the 3D data is to rotate it about its centroid, as if the object were placed on a turntable. For each object point, the instantaneously changing world coordinates X′, Y′, Z′ are projected onto the 2D camera space using the calibrated camera pose,


$$ K(R|T) \begin{pmatrix} X' \\ Y' \\ Z' \\ 1 \end{pmatrix} \simeq \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} \qquad (5.13) $$

Here x′, y′ are the moving 2D coordinates in the camera space. We then attach the colour information associated with the current point (figure 5.7).
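A possible realisation of this rendering step is sketched below, assuming NumPy, a calibrated intrinsic matrix K and pose (R, T), and a turntable spin built with Rodrigues' formula, which is not part of the text above and is used here only for illustration.

```python
import numpy as np

def render_positions(points, K, R, T, angle, axis=np.array([0.0, 0.0, 1.0])):
    """Project a point set, rotated about its own centroid, into camera pixel
    coordinates using the calibrated pose (eq. 5.13)."""
    c = points.mean(axis=0)
    # Rotation about 'axis' by 'angle' (Rodrigues' formula).
    k = axis / np.linalg.norm(axis)
    Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    R_spin = np.eye(3) + np.sin(angle) * Kx + (1 - np.cos(angle)) * (Kx @ Kx)
    rotated = (points - c) @ R_spin.T + c        # spin about the centroid
    cam = rotated @ R.T + T                      # world -> camera frame
    pix = cam @ K.T                              # apply the intrinsics
    return pix[:, :2] / pix[:, 2:3]              # dehomogenise to (x', y')
```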

Figure 5.14: Different rendered views (top: rendered range images; bottom: rendered object attached with colour texture).


5.6 Conclusions

In this chapter a framework is presented for the fusion of two 3D point sets, in other words, the registration between two views. This is a combination of conventional automated 2D image registration, 3D point set registration, and user-guided human-computer collaborative work in a VAE. The proposed framework correlates two sets of 3D data captured from different views of the same object, ideally with an overlapping part shared between the two views. The registration framework can be iterated to perform the fusion of multiple views.

The process begins with 2D image registration on the colour textures of the two participating views, where interest points are first extracted by corner detectors and then correlated using Normalised Cross-Correlation (NCC). Once the 2D correspondences are built, the 3D coordinates of the matched points are used to estimate the transform in 3D space between these two sets of points using Singular Value Decomposition (SVD) and the Orthogonal Procrustes solution [88]. The estimated rotation and translation are used as an initial guess to perform a trial merge, by warping one point set onto the other in 3D space based on the estimated rotation and translation. The user has the final decision of whether to accept this trial given by the computer, or to manually improve the fusion of the point sets by tuning them into different poses in a virtual environment using the augmented tools.

In addition to the registration itself, a voxel quantisation mechanism is proposed and implemented to reduce data redundancy and speed up rendering. This quantisation is particularly desirable in the multiple point set fusion scenario, where the data redundancy is relatively large because of the overlapping areas between a number of point sets. Preliminary results also show that the optimal quantisation level is only affected by the choice of voxel size, and is object independent.

5.6.1 Future Work

Although reasonable results can be achieved using automated registration followed by the user's manual tuning, the two participating views should have a fair amount of overlapping area, otherwise the registration results can become very poor. This is the main cause of the extra data storage, and performance can be affected when measuring objects with clean-cut surfaces such as a rectangular box. A feature based image registration also means it is hard to work with objects that have very little texture.

Future work includes possible improvements in several areas:

• First, during the process of image registration, we deliberately aim to hide as many technical details as possible from the user, while still providing a means of working towards optimal results by adjusting the parameter settings randomly within a closed interval. However, the interface could be elaborated to give the user more targeted initiative over the parameter settings. For example, offering the user a choice of 'fewer corner points' or 'more tolerant cross-correlation' would be a more presentable approach than a simple randomised repetition.

• Second, the visualisation during tuning can be improved (figure 5.13). The user could be provided with a means of inspecting the point sets being merged from a variety of angles, to help with the merge. This is particularly helpful when fusing two pieces which share little overlapping area, for example two halves of a sphere.

• Last but not least, there is a possibility of depth information being used to establish corresponding points when there is a lack of texture across the surface. This can be regarded as using the depth map as an alternative feature to the texture. Although the prospect of using depth information for image registration faces the challenge of depth inaccuracies (e.g. those caused by depth discontinuities), it is expected that an appropriately combined use of the depth information and the texture information would yield positive results.


Chapter 6

System Design

6.1 Introduction

In chapters 4 and 5, we discussed the shape acquisition stage and the post-processing of the scanned data. They are both computer vision tasks performed separately. In this chapter we address the design of a system that incorporates these two components into a complete and interactive system. The system provides the following:

1. An automatically generated and maintained platform on which the data are visualised.

2. A planar surface with real objects and video augmented signals.

3. Widget tools for enabling user-computer interactions, without the need for traditional input devices such as a mouse, keyboard or laser pointer.

4. Accurate automated facilities, with ease of use and correctability, where the user decides when, where and how to utilise them.

The most important feature of the system presented is that the user plays an active role in the interactions. The user makes the final call on what is to be done next, by giving various instructions using the tools provided. Typical functionality includes range map touch-up, rejection of a scan, capturing a snapshot, and more. Apart from triggering various computer vision tasks, the user also decides what part of the collected data is to be displayed. The central display area is limited and not all the scanned data will be used. More detailed discussions of the user interface are presented in section 6.4.

On the other hand, the computer itself offers the user help information, either in a visualised way or in the form of text messages. The help information can be a brief summary of the current data, offering the user different options about what might be the next move or how to trigger these events. However, this is a user guided, user centralised system, so the user still has the final call under all circumstances.


The calibration stage introduced in chapter 3, however, has to be a stand-alone step and cannot be carried out in this augmented environment, because (a) it is normally performed prior to everything else if the camera-projector system is uncalibrated; (b) the interpretation of human gestures requires an accurate mapping between the augmented projections and the observed images; and (c) once the calibration is done, there is no need to perform it again unless the positioning of the projector-camera system or the table setup has been changed.

The rest of this chapter is organised as follows. In section 6.2, two widgets are introduced. They are implemented to simulate two of the most frequently used gestures in user-machine interaction: the button push and the touchpad slide. The background and some practical issues arising during implementation are discussed as well. In section 6.3 the main user interface of the system is introduced. Some of the main utilities and functionality are presented in section 6.4. Section 6.5 presents the conclusions.

6.2 Widgets Provided for Interaction

6.2.1 Introduction

Where a vision system is used as the interactive device in a man-machine collaboration, it is desirable to have an efficient way for the user to give orders without having to turn to traditional input devices. In this research, tabletop interaction is normally concerned with hands rather than other parts of the human body or other pointing devices. Therefore the hand gesture is the most frequently used behaviour for the user to give instructions.

The most common gesture is the button push, used to trigger an event. In a vision system, a button push does not necessarily require physical contact with the desktop surface. Without the presence of a touch screen or other contact sensors, it is hard to visually detect whether the user's hand has touched the interface or not. The method discussed here is to monitor the area of interest over consecutive frames to analyse whether the button has been pushed, kept pressed, or released.

Pointing is also realised as another widget in this system, equivalent to a touchpad on a laptop. When the pointing device is engaged, a rectangle in the control area is assigned as a touchpad, while a cursor is rendered in the data area. The user can slide their finger across the touchpad as if they were working on a laptop. The fingertip movement in the observed images is analysed, and the system responds to it by changing the display location of the augmented cursor.


Figure 6.1(a) shows an image to be projected. The green rectangle in the middle bottom section of the interface is the touchpad. The bottom image shows the user using the touchpad with the left hand and pointing at a button with the right hand.


(a) A projected image. (b) The observed image.

Figure 6.1: A snapshot with touchpad and buttons.

Figure 6.2 shows an object being scanned to obtain the 2.5D depth map. While the scan is being performed, the projection image space (shown in figure 6.1(a)) is replaced with a 1024 × 768 Gray coded stripe image. After the scan is finished, the menus and control buttons reappear in the interactive interface.

Figure 6.2: A captured image showing an object being scanned.

6.2.2 Background

Most current finger detection techniques can be classified into three main categories.

The majority of these techniques rely on background differencing [69, 63, 72] for the initial stage of image processing. In [69], Malik and Laszlo develop a vision-based input device which allows for hand interactions with desktop PCs. They use a pair of cameras to provide the 3D positions of a user's fingertips, and locate the fingertip and its orientation by segmenting the foreground hand regions from the background. Parnham [74] proposes a technique involving a combination of plane calibration and shadow removal via the analysis of the invariance image. Letessier and Bérard [63] present a technique that combines a method for image differencing with a fingertip detection algorithm named the Fast Rejection Filter (FRF). The FRF is a set of rules for classifying hand pixels and non-hand pixels; however, it is only concerned with detecting fingertips and not the hand shape, and is therefore unable to detect fingers that are pressed together.

unable to detect f<strong>in</strong>gers that are pressed together.<br />

Some other techniques make use of skin colour detection. In [2], a colour detection method is presented using a Bayesian classifier [36] plus a small set of training data; a curvature analysis algorithm is then applied to the detected contours to determine peaks which could correspond to fingertips. Quek et al. [78] develop a system named FingerMouse which allows finger pointing to replace the mouse in controlling a desktop PC. Their method involves segmentation via a probabilistic colour table look-up that requires training, and a Principal Component Analysis (PCA) based fingertip detection algorithm.

Using a mask to perform template matching is another way to detect fingertips. There are techniques where researchers use markers [34, 32] and gloves [96, 19]. Some researchers use fiducials [56] as the pointing device, which also falls into this category.


Apart from the aforementioned main categories, an alternative is to use more expensive hardware such as a thermoscopic or infra-red camera to provide a clean binary image for further processing [58, 85].

6.2.3 Practical Issues

A few practical issues have to be addressed before background differencing based finger detection techniques can be used in this system. Finger detection for use in a VAE application is different from that used in a conventional vision system. First, it must be resilient to the effects of various lighting conditions, especially the projections. Second, it has to be efficient, so as to be responsive without adversely affecting the performance of the rest of the system. Third, a user should be able to walk up to the tabletop and begin interacting without the need for extra equipment such as markers or gloves. Last, it must provide interactions without conventional input devices such as a mouse and keyboard, and without the need for more expensive tabletop touch-screens, which means that the move and click behaviours usually provided by a mouse need to be addressed.

Given the factors stated above, template matching based methods, which might require extra training, are not suitable for this application. Moreover, although both move and click can be detected in a single paradigm of fingertip detection by responding to the instantaneous fingertip location, processing the whole image for each frame is not efficient. Robust background segmentation techniques usually involve analysing the pixel classifications by modelling them as a Mixture of Gaussians [25, 94, 53] over a few adjacent frames. This inevitably causes processing overhead and affects the overall system performance.

In this research, click is the dominant interactive gesture, and we therefore model it as a button-push action, with a number of virtual buttons provided within the interface (figure 6.1). Move is realised by designating an area as a touchpad and switching it on and off depending on whether the locating device is required for the current function; only the designated touchpad area is processed instead of the whole frame.

6.2.4 Implementation of Pushbutton

Figure 6.3: Finger detection.

Our approach to realising the pushbutton widget is to divide the button into two areas (figure 6.3). Area A is the inner area where fingers are most likely to be placed, and it is roughly the same size as a human fingertip. Area B is the outer area.

Let At0 be the average luminance over area A at time t0, and At1 the average at time t1; then the average luminance change over this time period is

∆A = At0 − At1   (6.1)

Similarly, for area B we have

∆B = Bt0 − Bt1   (6.2)

We can define a button as being touched if |∆A| > w1 and |∆B| < w2, where w1 and w2 are both positive thresholds.

To detect button press and release events, the sign of ∆A needs to be considered. Because human skin absorbs a larger proportion of the incident light than the desktop surface (in this case a more reflective whiteboard), the finger appears significantly darker than the background in the image observed from the camera. By taking into account the sign of ∆A rather than only its absolute value, we can distinguish the button press event from the button release event. The advantage of this appearance-based finger detection is that it is robust to changes in lighting conditions and to accidental occlusions.
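As a minimal sketch of the dual-region test described above, the following Python fragment assumes the camera frames arrive as greyscale NumPy arrays and that the inner and outer button regions have already been mapped into camera coordinates; the function and parameter names are illustrative rather than taken from the thesis implementation.

    import numpy as np

    def region_mean(frame, region):
        """Average luminance inside a rectangular region (x, y, w, h)."""
        x, y, w, h = region
        return float(frame[y:y + h, x:x + w].mean())

    def button_event(prev_frame, curr_frame, inner, outer, thresh_inner, thresh_outer):
        """Return 'press', 'release' or None for one pair of consecutive frames.

        inner/outer are the camera-space rectangles of areas A and B;
        thresh_inner and thresh_outer play the role of the two positive
        thresholds used with equations (6.1) and (6.2).
        """
        dA = region_mean(prev_frame, inner) - region_mean(curr_frame, inner)
        dB = region_mean(prev_frame, outer) - region_mean(curr_frame, outer)

        # The outer region must stay stable, otherwise the change is more likely
        # a hand sweeping across the button than a deliberate push.
        if abs(dB) >= thresh_outer or abs(dA) <= thresh_inner:
            return None
        # The finger is darker than the board, so luminance drops on a press
        # (dA = A_t0 - A_t1 > 0) and rises again on a release.
        return 'press' if dA > 0 else 'release'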

In an early version [64] of our finger detection system the button area was monitored as a single region; the dual-region approach is more reliable. We tested the new approach over a continuous period of more than 24 hours, during which it survived extreme changes in lighting conditions such as sunrise, sunset, the blinds being pulled up and down, and lights being switched on and off. The buttons were never mistriggered.

Button calibration

The two thresholds w1 and w2 introduced above are set in different ways. w1, which controls the outer region, is set empirically to a small value so that the outer region of the button is intolerant to noise, which makes the button less likely to be triggered accidentally. The inner region is where the finger is normally pressed.

Figure 6.4: Button calibration. (a) The projected button. (b) The observed button (no finger). (c) The observed button (finger pressed).

To determine the threshold w2 for the inner region, a quick calibration process is run at start-up. First, a button is projected onto the surface (figure 6.4). The system takes an image of the projected button and works out the average pixel value of the inner region, say v1; in practice, v1 can be averaged over a small time period ∆t. A help message is then displayed advising the user to press the button, and v2 is taken as the average pixel value of the inner region over a similar small time period. Then w2 = v1 − v2.

Although v1 and v2 are averaged over a period of time, that period is still short compared with how long the system will be up and running. Therefore, in practice a tolerance factor t is applied, and w′1 = w1·t and w′2 = w2·t are used as the final threshold values.
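A minimal sketch of this calibration step, under the same assumptions as the earlier fragment (the averaging burst length and the tolerance factor are free parameters here, not values prescribed by the thesis):

    import numpy as np

    def average_inner_value(frames, inner):
        """Mean pixel value of the inner button region over a short burst of frames."""
        x, y, w, h = inner
        return float(np.mean([f[y:y + h, x:x + w].mean() for f in frames]))

    def calibrate_button(frames_empty, frames_pressed, inner, tolerance=0.6):
        """Return the scaled inner-region threshold w2' = w2 * t.

        frames_empty   : frames of the projected button with no finger (gives v1)
        frames_pressed : frames captured while the user presses it     (gives v2)
        """
        v1 = average_inner_value(frames_empty, inner)
        v2 = average_inner_value(frames_pressed, inner)
        w2 = v1 - v2              # expected luminance drop caused by the finger
        return w2 * tolerance     # tolerance factor t as described above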

Figure 6.5: The TPR and FPR of button push detection.


Figure 6.5 shows the effect of different tolerance factors t on the button detection performance. We also study the improvement of the dual-region method over the previous implementation, in which the average pixel value across the whole button region was used.

The test framework is designed as follows. For each method, we first evaluate its TPR by repeatedly pressing the button and recording the rate of successful detection. A hand is then waved randomly over the button, using various types of gesture, and the rate of mis-triggering is recorded as the FPR. For both experiments, 100 repetitions of the same action are used.

The top graph shows that with the old method, although increasing the tolerance factor decreases the FPR, it does so at the expense of the TPR. Even when the tolerance factor is increased until the TPR drops to near 60%, the FPR is still far too high at 40%. The new method shows promising results, thanks to its dual-region design (figure 6.3 on page 157), which effectively reduces the chance of the button being hit accidentally. The FPR of the new method is kept below 10% in the bottom graph, while the TPR stays above 80% with the tolerance factor set below 0.6.

All curves in both the top and bottom graphs show a similar downward trend as the tolerance factor increases. This is expected, because a smaller tolerance factor lowers the threshold values for both the inner and the outer region, which ultimately makes both positive detections and mis-detections more likely.


Button observation

For each button, the position and size are fixed in the projection image: once a button is defined, it is assigned a constant 2D position and size (length and width). The position and size of the button in the observed image depend on the camera and projector setup. Since there is a plane-to-plane projective transform between the camera space and the projector space, induced by the desktop as a third plane (section 3.5), once a button is placed in the source image its appearance (position and size) in the observed image is known. Figure 6.6 illustrates a projection image and its observed camera image.
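The mapping can be applied as in the following sketch, which assumes the 3 × 3 projector-to-camera homography H induced by the desktop plane is already available from the calibration of chapter 3 (the function name and the axis-aligned simplification are illustrative):

    import numpy as np

    def observed_button_region(H, x, y, w, h):
        """Map a button rectangle from projector coordinates into the camera
        image using the plane-induced homography H, and return an enclosing
        axis-aligned rectangle (x, y, w, h) to be monitored."""
        corners = np.array([[x, y, 1.0], [x + w, y, 1.0],
                            [x + w, y + h, 1.0], [x, y + h, 1.0]])
        mapped = (H @ corners.T).T
        mapped = mapped[:, :2] / mapped[:, 2:3]      # perspective division
        x0, y0 = mapped.min(axis=0)
        x1, y1 = mapped.max(axis=0)
        return int(x0), int(y0), int(x1 - x0), int(y1 - y0)

The red blocks in figure 6.6 indicate exactly such monitored areas in the camera image.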

Figure 6.6: The projected buttons and their observations in the camera image. (a) The projected buttons. (b) The observed buttons. (The red blocks only indicate the areas to be monitored.)


6.2.5 Implementation of Touchpad

Real-time segmentation of moving regions in image sequences is done by background subtraction. The simplest approach is to threshold the difference between the current image and an image taken earlier without any moving objects. However, dealing with lighting conditions that change over time requires more sophisticated processing.

As discussed in section 6.2.3, a separate rectangular area is assigned and a constant pattern is projected onto it as the touchpad. This area is monitored, and the background subtraction algorithm is applied only to that area in the observed frames.

The Mixture of Gaussians based adaptive background modelling method [25] is used to generate a foreground mask for each frame. In this application the detected foreground regions are fingers, sometimes with part of the palm included. Unlike most vision systems, we do not explicitly segment the foreground blobs, because the only information needed from the foreground region is the fingertip, and the finger is assumed to always point upwards.

Figure 6.7 shows the result of the background segmentation algorithm on four different occasions. From left to right, column-wise, the images are captured when: 1. only one finger is present; 2. two fingers are present; 3. part of the palm is included; 4. the whole upper hand is included. The fingertip is finally located at the top-middle position of the most dominant blob in the resultant foreground region.
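A minimal sketch of this touchpad step is given below, using OpenCV's MOG2 background subtractor as a stand-in for the adaptive Mixture of Gaussians model of [25]; the ROI handling and the top-middle fingertip rule follow the description above, while the specific parameter values are only illustrative.

    import cv2
    import numpy as np

    subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

    def fingertip_in_touchpad(frame, pad):
        """Return the fingertip position (x, y) in frame coordinates, or None.

        pad is the touchpad rectangle (x, y, w, h); only this region is processed.
        """
        x, y, w, h = pad
        roi = frame[y:y + h, x:x + w]
        mask = subtractor.apply(roi)                       # foreground mask for the ROI only
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        blob = max(contours, key=cv2.contourArea)          # most dominant foreground blob
        pts = blob.reshape(-1, 2)
        top_row = pts[:, 1].min()                          # finger assumed to point up
        tip_x = int(pts[pts[:, 1] == top_row][:, 0].mean())  # top-middle of the blob
        return x + tip_x, y + int(top_row)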

Figure 6.7: Fingertip detection using the background segmentation algorithm. (a) Original image. (b) Background region. (c) Foreground region. (d) Detected fingertip.


6.3 User interface

Since the whole system is based on interactions, it is important to have a well designed interface through which the user can give instructions and receive feedback from the computer. It must therefore be understandable, streamlined, and easy to use. Two principles are followed closely in the design of the user interface. First, the data area is maximised so that all relevant information and data can be presented. Second, the various controls are grouped efficiently into different sections while taking up as little space as possible. We are also aware that not all control units need to be revealed at the same time, which saves the limited desktop space.

The user interface itself is a 1024 × 768 image projected onto the desktop surface. Figure 6.8 shows a screen shot of the working environment. It is divided into five areas.

Left column

The left column is the preview area where the thumbnails are listed. Only the thumbnails of views that have already been scanned are displayed here. The user can switch between views by pressing the corresponding thumbnails; the currently investigated view is highlighted with a red frame.

Figure 6.8: A screen shot of the working environment.

Right column

The right column is the area for system controls. These are the most important system-wide controls, so they stay on display throughout the whole process. From the bottom up, they are Lock, Snapshot, Scan, Re-Scan, and Exit. The user might want to Lock the current desktop when the target object needs to be re-positioned manually, or when the tabletop is going to be unattended for some time, so that the buttons are not triggered accidentally. When the desktop is locked, all buttons except the Lock button are unresponsive until it is unlocked by the user. Pressing the Scan button starts a new structured light projection, which takes over the system. When it is done, all relevant information such as the texture map and depth map is displayed in the central area and the system returns to idle; a thumbnail of the scan is also displayed in the left column. Re-Scan is similar to Scan, the only difference being that pressing Re-Scan erases the data from the previous shape input. This is useful when a structured light process has been disturbed, which can result in unexpectedly large errors in the scanned data; the data is deleted prior to the next scan to save memory. At the top of this column is an Exit button to quit the whole system.

Bottom left panel

The bottom left area contains four mode buttons: Inspect, Touchup, Correspondence, and Visualise. Once a mode button is pressed, it stays highlighted and the system engages the corresponding mode. Relevant guide messages appear above the control panel to briefly introduce what can be done in this mode, or sometimes to advise the user of the next possible steps. The user can hit the same mode button again to quit the current mode, or simply press another mode button to switch directly to a different mode. A detailed discussion of the individual modes is given in section 6.4.

Bottom right panel

The content displayed in the bottom right section depends on the mode currently engaged.

Central display area

The central area holds the main display. Normally, all data displayed in the central area is from the same view. This area is composed of four sub-pictures: the depth map, the texture, the colour texture, and a rendered model with the texture map attached to the depth map.


6.4 Main Utilities

In this section the main utilities of the system are introduced. They not only function individually but also work collectively as a whole to perform the 3D input task under the user's instructions. Although some of the utilities require certain steps to be completed first, there is no fixed order in which they must be used; the user can switch between these modes at any time, based on what needs to be done next. If an illegal operation is invoked, a warning message appears to advise the user of the correct options.

We now briefly describe how the system works as an overview, then discuss the individual utilities via a scenario example to illustrate how they perform their individual tasks.


6.4.1 Overview

Figure 6.9 shows a screen shot of the system start-up projection. On the left hand side, a few place holders are attached, each representing one view. This is where the thumbnails of the captured views will be placed after the user runs a structured light scan. On the right hand side are the attached system buttons, which can be hit at any time during the process. The Lock button is placed at the bottom for the user's convenience, to lock the screen so that it is temporarily unresponsive to the user's instructions. Four mode buttons are also shown at the bottom left; at this stage, however, they do not invoke any applications because there is no captured data to be processed yet.

At the bottom centre, a button with a small red area is attached and flashes. A help message is displayed above the button to inform the user of the button calibration, with a five second count-down. After the count-down, the user is expected to put a finger in the designated area to perform the button calibration, and the system chooses an optimal value for the button push detection threshold based on the current room lighting, the projection illumination level, and this specific person's skin colour. A detailed discussion of this calibration process is given in section 6.2.4.

A quick structured light scan is done right after the button calibration, as a plane calibration step (section 4.4.2). The scan button (the third button from the bottom in the right column, the one with the black and white stripes) flashes to remind the user to capture data before any processing can be carried out.

Figure 6.9: Screen shot of the system start-up state.

Once a scanned view is captured, some contents of the screen are updated. A thumbnail of the current view is attached to the appropriate place in the left column; it serves as an identification of the view it represents. The user can switch between different views to perform processing tasks by pressing the corresponding thumbnails. The captured data is visualised in the central display area in different forms: the depth map, a rendered 3D partial model, the texture map, and the colour map.

Various tasks can be performed right after a view is captured. In general, there are four main modes the user can switch into:

• The Inspect Mode, for checking the captured data without changing the data itself. The user can inspect the data not only on the depth map itself but also through a manipulable rendered 3D model.

• The Touchup Mode, for touching up the depth map if an obvious error is believed to have occurred.

• The Correspondence Mode, for finding matching points, estimating the transform between two views, and fusing the two views together. At least two captured views are required for this mode.

• The Visualisation Mode, for visualising the built 3D model. The user can visualise the final 3D model that has been built, check which view contributes to a certain part of the object, and see how well the views are fused together by switching any of the views on and off.

From section 6.4.2 to 6.4.5, an owl object is used in an example scenario to show the usage of these utilities, both individually and collectively.

6.4.2 Mode 1: Inspect


In Inspect Mode, the user adjusts the orientation of the selected rendered model for viewing or checking purposes. The first four arrow buttons rotate the rendered model in 3D space (pan and tilt), while the two rightmost buttons adjust the magnitude gain of the rendered model to further inspect the surface.

Normally the very first move after a scan is to switch to this mode, to examine the accuracy of the estimated depth map and see whether there are any outstanding errors, which can be caused by surface discontinuities, shadows, reflectance artifacts or other disturbances occurring during the scan. The Inspect Mode does not involve any processing of the collected data, but works closely with the other modes; one can switch to this mode at any time for inspection purposes. It is sometimes helpful to switch to a different view, if available, to double check an identified error and gain more confidence.

Figure 6.10: Owl experiment, 3 views captured, currently on view 1.

Figure 6.11: Owl experiment, 3 views captured, currently on view 0, model rotated.


Figure 6.10 shows the projected display after three views are captured, with view 1 currently selected. In the depth map, two white spots are observed and initially identified as an obvious error. The error is more obvious in the top right picture, where it is rendered in 3D with the colour map attached. The two spikes seen in that picture correspond to the two bright spots found in the depth map, and this can be further confirmed by rotating the rendered model to a more suitable angle (figure 6.11), where it can be clearly seen that the two spikes come from the side of the owl's left foot. These spikes come from two tiny spots on the owl's right leg (the one underneath), where the projector fails to illuminate that little area although it is within the view of the camera.

Once the error is identified and confirmed, the user can move on to Touchup Mode to correct it, after which they can switch back to inspect the results again, although this is entirely the user's choice.

6.4.3 Mode 2: Touchup

Touchup Mode gives the user the opportunity to manually touch up the depth map and improve the view, without having to adjust the system parameters or run the shape acquisition stage again. Although this mode does not provide a sophisticated, detailed correction mechanism for the depth map, it does offer a tool for the user to alleviate or erase the most obvious errors based on their own judgement. Once a capture error has been clearly visualised in the Inspect Mode, this correction tool is simple to use, fast, and effective.

In this mode, different functional buttons are provided: a touch pad for locating the cursor and a push button to commit the change. A speed control button is also provided to adjust the cursor speed. The cursor can be positioned quickly near the error point with fast cursor movement; once it is close, slower cursor movement can be used to pinpoint the error spot. The cursor is restricted to the depth map sub-window.

The same owl object is used as an example to illustrate the touchup process. First an error point in the depth map is identified in the Inspect Mode, as shown in figures 6.10 and 6.11. The error actually occurs in the codification stage, where the codewords of a group of pixels are wrongly built and hence the table look-up result for those pixels is incorrect. Figure 6.12 shows a row index image, which is the result of the codification table look-up. In the row index image, the value of a pixel corresponds to the row of the projection image by which it is illuminated, and brighter pixels correspond to higher rows. This image is an off-line inspection used during debugging and is not shown to the user.

Figure 6.12: The row index picture of the first view (brighter pixel values correspond to higher rows in the projection image).

The touch-up process executes a median filter on the area located by the cursor once the commit button is hit. The median filter is very effective for the type of salt-and-pepper noise in this example. The result of the touchup is not only shown on the depth map; it is also reflected instantly on the rendered model in the image to its right (figure 6.13), as the two are synchronised throughout the process. It can be clearly seen that the spikes in the rendered image caused by the depth error are no longer present, compared to figure 6.11. (Note that the big increase in the brightness level of the depth maps between figures 6.13 and 6.11 is caused by scaling: all displayed depth maps are re-scaled to 0-255, otherwise all pixels exceeding 255 would appear as full white.)
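The touch-up operation amounts to replacing a small window of the depth map around the cursor with its median-filtered version; a minimal sketch is shown below (the window radius and kernel size are illustrative parameters, not values from the thesis).

    import numpy as np

    def touch_up(depth, cx, cy, radius=8, ksize=3):
        """Median-filter a small window of the depth map centred on the cursor (cx, cy)."""
        h, w = depth.shape
        x0, x1 = max(cx - radius, 0), min(cx + radius, w)
        y0, y1 = max(cy - radius, 0), min(cy + radius, h)
        patch = depth[y0:y1, x0:x1]
        filtered = patch.copy()
        k = ksize // 2
        for py in range(patch.shape[0]):
            for px in range(patch.shape[1]):
                ys, ye = max(py - k, 0), min(py + k + 1, patch.shape[0])
                xs, xe = max(px - k, 0), min(px + k + 1, patch.shape[1])
                filtered[py, px] = np.median(patch[ys:ye, xs:xe])  # suppress salt-and-pepper spikes
        repaired = depth.copy()                 # keep the original untouched as a backup
        repaired[y0:y1, x0:x1] = filtered
        return repaired

Keeping the unmodified depth map as a backup matches the Yes-or-No choice described below: rejecting the change simply restores the backup data.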

Once the touchup is done, the user is advised to switch back into the Inspect Mode, tune the 3D model into a better pose to double check the questioned part, and see whether any other part of the object needs to be corrected.

The changes made by the median filter to the depth map are also applied to the corresponding 3D point set of the current view. On exiting the touchup process, the user has a final Yes-or-No choice of whether to accept the change permanently. If No is selected, the modified part is restored from the backup data; otherwise, the updated data replaces the old version and participates in further processing.

Figure 6.13: The touchup result of figure 6.10.


6.4.4 Mode 3: Correspondence

Correspondence Mode follows the work flow introduced in sections 5.3 and 5.4. It is named Correspondence Mode because it starts by finding the matching points between an image pair, and the correspondences hold the key to the initial guess of the transform between the two views. This initial guess provides the user with a trial fuse, which can be further adjusted. A minimum of two views is required to perform this task.

While all the back-end image processing tasks were discussed earlier in chapter 5, here we are concerned with the interface and with how to incorporate the back-end process into a collaborative environment. The main principle sustained here is to perform the whole point set fusion process with the user as the decision maker and the computer merely as a work force and a source of guidance.

During the process of image registration and point set fusion, a set of parameters is used at each step. Although a default set of trial parameters works for most scenarios, different objects have different properties (e.g. size, texture, surface reflection) and it is difficult to find the best set of parameters for an individual object. For example, to register a pair of images of a periodic pattern such as a checkerboard (figure 5.4), choosing too big a search window confounds the NCC with mismatches; on the other hand, if the search window is not big enough, the right correspondence might not be found in a largely displaced image pair. Therefore we provide a randomised mechanism that lets the user find those optimal parameters without being exposed to too many technical details. The underlying idea is to keep it simple, and keep it visualised.

The process begins by listing the views that have been scanned. The user is advised to choose two views as the 'from' image and the 'to' image for image registration, in order to transfer the 'from' point set towards the 'to' point set (figure 6.14). If any two views have already been registered previously, a red connection line underneath indicates so. The colour texture maps of the two selected views participate in the registration.

Instead of taking the whole of the two selected images, the system crops the images with a ROI (figure 6.15) based on the expected position and size, both estimated from the object size derived from the point set in 3D space and from the camera imaging geometry (these are all available because the camera-projector pair is calibrated, and the centroid of the object and its minimum and maximum extents along the X, Y, Z axes can all be worked out from the 3D point set). Giving the user the option of choosing the ROI has another purpose: non-rigid objects can be partially deformed while being positioned into different poses, so these deformed parts are ideally excluded from participating in the correspondence matching.
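As an illustration of how such a ROI can be derived from the calibrated geometry, the sketch below projects the 3D bounding box of the point set into the image with an assumed 3 × 3 intrinsic matrix K, with the points already expressed in the camera frame; this is a simplified stand-in for the estimate described above, not the thesis implementation.

    import numpy as np

    def roi_from_pointset(points_3d, K, margin=20):
        """Project the axis-aligned 3D bounding box of an N x 3 point set into
        the image and return an enclosing ROI (x, y, w, h) with a small margin."""
        mins, maxs = points_3d.min(axis=0), points_3d.max(axis=0)
        corners = np.array([[x, y, z] for x in (mins[0], maxs[0])
                                      for y in (mins[1], maxs[1])
                                      for z in (mins[2], maxs[2])])
        uv = (K @ corners.T).T
        uv = uv[:, :2] / uv[:, 2:3]               # perspective division
        x0, y0 = uv.min(axis=0) - margin
        x1, y1 = uv.max(axis=0) + margin
        return int(x0), int(y0), int(x1 - x0), int(y1 - y0)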

Figure 6.14: Correspondence Mode: two images are selected as 'from' and 'to'.

Figure 6.15: Correspondence Mode: ROIs are selected.


After the image pair with ROIs is chosen, the images are enlarged and displayed at the centre of the desktop to show better detail. Three image processing tasks are then performed: corner detection, cross-correlation, and outlier exclusion. The implementation details were discussed earlier in sections 5.3.1 - 5.3.3. While these image processing tasks are performed (figures 6.16 - 6.17), all system parameters are hidden from the user, but the user still has the option to adjust the parameters and re-do the current step with a new set of parameters. At each step of the aforementioned image processing tasks, a set of default parameters pre-set with empirical values is loaded, with the resulting output instantly reflected on the desktop. Every parameter also comes with an allowed range from which it can be randomly selected. If the user is satisfied with the result yielded by the current parameter set, he or she can hit the Proceed button (the one with a tick) and move on to the next step. Otherwise, the user can use the Adjust button (the one with two gears) to select a new combination of parameters, randomly drawn from the allowed closed intervals. This process is repeated until a satisfying result is shown on the desktop before moving on to the next step.
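The Adjust button therefore only needs to re-draw each parameter from its allowed interval; a minimal sketch of this randomised mechanism is given below (the parameter names and ranges are purely illustrative, not the values used by the system).

    import random

    # Allowed closed intervals for the hidden parameters (illustrative values only).
    PARAMETER_RANGES = {
        "corner_quality": (0.01, 0.10),
        "search_window":  (11, 41),       # pixels, odd sizes preferred
        "ncc_threshold":  (0.6, 0.95),
    }

    def adjust_parameters():
        """Return a fresh random parameter combination, as the Adjust button does."""
        params = {}
        for name, (low, high) in PARAMETER_RANGES.items():
            if isinstance(low, int):
                value = random.randint(low, high)
                if name == "search_window" and value % 2 == 0:
                    value += 1             # keep the correlation window odd
            else:
                value = random.uniform(low, high)
            params[name] = value
        return params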

Reasonable outcomes are often achieved at the first attempt. The user is advised to repeat the process a few times using different settings to compare the results, or sometimes to work towards the possibility of an even better result. However, all the parameters are restricted to being randomised only, not directly controllable, to comply with our principle of keeping it simple by leaving all the technical details hidden.

Figure 6.16: Correspondence Mode: extracted corners.

Figure 6.17: Correspondence Mode: correlated and improved point correspondences.


The established correspondences may still not be good enough. This is to be expected when the two participating images are highlighted: there are parts of the left image that appear perspectively deformed in the other image, or that sometimes do not exist at all because of the viewpoint change. Other challenges include lack of texture on the measured object, surface reflections caused by the bright projection light, and deformed parts of objects such as stuffed animals. Further discussion of how to tackle these problems is given in chapter 7.

In figure 6.17, where the correspondences are shown, pressing the Proceed button commits to using the current point correspondences as control points for estimating the rotation and translation vectors. The estimation is a quick process, taking less than a second, after which the second point set is transformed towards the other using the estimated rotation and translation. This is a trial registration of the two point sets, suggested by the system as an initial guess.
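The estimation itself is part of the chapter 5 back-end; as an illustration of how a rotation and translation can be obtained from matched 3D control points, the sketch below uses the standard SVD-based least-squares alignment (a generic example, not necessarily the estimator used in chapter 5).

    import numpy as np

    def estimate_rigid_transform(src, dst):
        """Least-squares R and t such that dst ≈ (R @ src.T).T + t,
        given two N x 3 arrays of corresponding 3D control points."""
        src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
        H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance matrix
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T                             # guard against reflections
        t = dst_c - R @ src_c
        return R, t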

The user can accept this registration by pressing the Proceed button again, or further adjust the positions manually. By switching between the R and T buttons, each of which comes with a set of six buttons for rotating a point set about its centroid around the X, Y, Z axes or translating it along them, the engaged point set can be manipulated rotation-wise and translation-wise respectively (figure 6.18).

During the course of tuning, the first point set (on the left) is used as a reference while the second one is transformed towards it. A final solution is considered to be reached (figure 6.19) once the overlapping areas of the two point sets coincide.
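The six R buttons and six T buttons correspond to small incremental rotations of the point set about its own centroid and translations along the axes; a minimal sketch of these two operations is shown below (the angle and step sizes would be chosen by the interface, not fixed here).

    import numpy as np

    def rotate_about_centroid(points, rx, ry, rz):
        """Rotate an N x 3 point set about its own centroid by Euler angles
        (radians) around the X, Y and Z axes, as the R buttons do."""
        cx, sx = np.cos(rx), np.sin(rx)
        cy, sy = np.cos(ry), np.sin(ry)
        cz, sz = np.cos(rz), np.sin(rz)
        Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
        Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
        Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
        centroid = points.mean(axis=0)
        return (points - centroid) @ (Rz @ Ry @ Rx).T + centroid

    def translate(points, tx, ty, tz):
        """Translate the point set along the X, Y and Z axes, as the T buttons do."""
        return points + np.array([tx, ty, tz])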

Figure 6.18: Correspondence Mode: visualised point set tuning, with controllable rotation and translation.

Figure 6.19: Correspondence Mode: two point sets are fused.

6.4.5 Mode 4: Visualisation

Although the captured data can be visualised by different means in any of the three modes introduced earlier, Visualisation Mode offers the facility to visualise, through 360 degrees, the complete 3D model built through the previous work. In this mode, the controls are not as sophisticated as in the other modes – all the scanned views are listed in the bottom centre control panel area, represented by resized mini versions of their colour textures (figure 6.20). The rendered object is displayed at the centre of the display area, slowly rotating about its centroid as if placed on a turntable.

Note that in Correspondence Mode two point sets are only registered (i.e. the rotation and translation vectors between them are worked out); no point set data is changed. In this mode, all selected point sets are merged together (i.e. one point set is transformed towards the other so that they are in the same coordinate space and share the same centroid).

Figure 6.20: The Visualisation Mode.

Apart from viewing, the only other operation the user can perform in the Visualisation Mode is turning different views on or off, by pressing the corresponding buttons, in order to inspect the 3D model of the measured object. All the views that are turned on are first fused using the estimated transforms between them, worked out previously in the Correspondence Mode. More than one view can be turned on at the same time, or even all of the views (if all the necessary transform information is available) – and this is possible only in this mode. If no view is selected, nothing is displayed.

However, not all of the views can be selected arbitrarily and fused together. A few ground rules apply when choosing views to be fused:

• If two views are to be selected, they must either be registered in the Correspondence Mode (i.e. the transform vectors between them are available), or both be registered with the same third view.

• Registration relay is also allowed (e.g. if views 1 and 2, 2 and 3, and 3 and 4 are all registered, then views 1 and 4 are registered too).

• All inter-registered views are categorised into the same group, and only views from the same group can be visualised at the same time.

The reason behind the above rules is that any two registered views can be regarded as having a path between them – the rotation and translation vectors. Suppose the rotation vector from view A to view B is R_AB = (θ, φ, ψ) and its translation vector is T_AB = (Tx, Ty, Tz); then the rotation and translation vectors from view B to view A are R_BA = (−θ, −φ, −ψ) and T_BA = (−Tx, −Ty, −Tz). This relationship propagates across multiple views: as long as there is no stand-alone view that is not registered to any of the others, there is always a path of transforms by which a view can be transformed to any other view's orientation and position.
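These grouping rules, including registration relay, amount to tracking connected components over the registered view pairs; a minimal union-find sketch (illustrative, not the thesis data structure) is:

    class ViewGroups:
        """Track which scanned views can be fused together: registering two
        views joins their groups, and registration relay (1-2 and 2-3
        registered implies 1 and 3 are in the same group) follows for free."""

        def __init__(self, n_views):
            self.parent = list(range(n_views))

        def find(self, v):
            while self.parent[v] != v:
                self.parent[v] = self.parent[self.parent[v]]   # path compression
                v = self.parent[v]
            return v

        def register(self, a, b):
            """Call after views a and b have been registered in Correspondence Mode."""
            self.parent[self.find(a)] = self.find(b)

        def can_fuse(self, views):
            """True only if every selected view belongs to the same group."""
            return len({self.find(v) for v in views}) <= 1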

Table 6.1 gives an example of the propagation of this relationship. We again consider the scenario used earlier in this chapter, in which five different views of the owl are scanned while the sixth view has not yet been captured. It starts from stage 0, where none of the five views is registered to any other and the views are not grouped. At stage 1, view 1 and view 2 are registered, so they are labelled as group 1; a red connection line is drawn between them to indicate this relationship. At stage 2, view 3 and view 4 are registered as a new group, group 2, indicated by a green connection line underneath. Up to this point there are two separate groups among the five scanned views, each indicated by a different colour, advising the user that a view from the red group and a view from the green group, or the stand-alone view 5, cannot be displayed together, because there is no way to fuse them. After stage 3, a new registration is completed between views 1 and 5, and the same grouping process is carried out. The situation changes completely after stage 4, in which views 2 and 3 are registered: this registration brings the two groups into one. In other words, a registration between any two views, one from each of two different groups, results in the same merged grouping.


Stage   View (from)   View (to)   Number of groups
0       n/a           n/a         0
1       1             2           1
2       3             4           2
3       1             5           2
4       2             3           1

(The 'Relationship lines' column of the original table contained the coloured connection-line diagrams and is not reproduced here.)

Table 6.1: Grouping status of the point sets at different stages.

Figures 6.21, 6.22, and 6.23 show the process of a model of the owl being built from three central views. By fusing views 2 and 3 together and visualising the fused model, it can be seen from figure 6.21 that the right-facing object in view 2 completes the left wing, which is partially invisible in view 3 where the object faces straight up. However, when the same model is rotated around its centroid until its right side is exposed, it is clear that the right wing of the current model is missing data.

We notice that the object in view 4 faces left and its right side is visible, while still sharing a fair amount of overlapping area with view 3. By fusing view 4 into the model previously built from views 2 and 3, another part of the object is filled in, as shown in figure 6.23.

Figure 6.21: Views 2 and 3 fused together. View 2 completes the left wing of the owl.

Figure 6.22: Views 2 and 3 fused together.

Figure 6.23: Fusion of views 2, 3, and 4.


6.5 Conclusions

In this chapter we presented a working and user friendly interface for the VAE system designed in this research. This interactive interface is a mixed environment of real objects and projected signals, in which the user's interactions with these objects and projections are captured and interpreted, and responded to through adjusted projections. The techniques introduced in chapters 4 and 5 are both integral parts of the designed system, while efficient monitoring of the interactive surface and accurate response to it rely on the explicit calibration presented in chapter 3.

Two widgets were introduced and implemented to simulate two of the most frequently used gestures in human-computer interaction: the button push for triggering events and the touchpad slide for positioning.

Four major facilities are provided to accomplish the task of 3D input, with which the user can inspect the captured data from different view angles, point out and correct errors, manipulate the projection signals, and finally build and visualise the complete 3D model. Other tools, such as a desktop lock-down and a snapshot tool, are also provided for practical use during the process.


6.5.1 Future Work

In an interactive user interface, an easy-to-use and efficient interaction tool is always desired. Future work on fingertip detection could benefit the system: provided robust finger detection is implemented across the whole projection area, touch-up would become much easier, as the user could point a finger directly at the questionable area.

Drag and drop of virtual elements on the desktop is another possible extension of the finger detection. Previous work at York [74] yields promising results and lays the foundation for future work in this area.

As a final inspection of the built 3D model, the visualisation mode (section 6.4.5) could be further elaborated. A possible implementation of touch-up in 3D space would be a big plus, as this is the stage where errors are likely to be rediscovered. Efficient and quick responses are needed to correct those errors on the rendered model straightaway, in a visualised way, rather than by repeatedly going back to the 2D models.


Chapter 7

System Evaluation

Most of the techniques used in this research have already been evaluated and justified at the appropriate stages earlier in the thesis. In this chapter, we present informal user tests to evaluate the system performance. In particular, the system performance on different test objects is evaluated, to provide insight into how best to achieve good results in the presence of technical challenges and practical issues.


7.1 Test Objects

7.1.1 An Overview

An overview of the objects used for the experiments is given in table 7.1. Each object is represented by a thumbnail, an object name, and a brief description.

7.1.2 Object Descriptions

The objects chosen for the user tests cover a variety of sizes, colours, and surface materials. For example, the owl appeared in previous chapters as the example object because it presents various challenges to the techniques presented in the early part of this thesis. It has both convex and concave regions across its surface, which easily cause shadows when illuminated from certain angles. The owl itself does not lack texture, but its fluffy surface complicates the texture mapping, because the same texture can appear totally different due to the inter-reflections caused by the uneven surface. Furthermore, the back of the owl completely lacks texture.

Other test objects present different technical challenges. The football is an example of high specular reflectance. Although the system is not designed for human body measurement, because of the top-down projector-camera setup, we still ran a test to evaluate how well the system performs on such an object and to see where it could be improved. During the human body test, the table top is lowered; this is not a computer vision driven decision, but purely to comply with health and safety regulations.

Object       Description
Cushion      A small soft cushion with bright colour texture. A small turtle is attached to the right side, but the tropical fish is just a 2D pattern.
Football     A small spherical object, slightly deflated so that it stands on the table by itself. The surface has high specular reflection.
Stand        A mid-sized object made of cardboard and wrapped with brown packing paper, hardly reflecting any light.
Owl          A fairly big stuffed animal. It has a soft and fluffy surface, and part of its body deforms when its pose is changed.
Human Body   A user lying on the desktop. Rigidity is not guaranteed, as the relative position between the head and the upper body can change from one pose to another.

Table 7.1: An overview of the objects used for the tests. (The thumbnail column of the original table is not reproduced here.)


In the rest of this chapter, test frameworks are designed to test the individual main techniques proposed and to evaluate their performance on various types of objects. The system is then evaluated as a whole.

7.2 Shape Acquisition

In this section, the performance of shape acquisition using structured light is evaluated on the different objects. Most of the techniques involved in a structured light scan were either discussed or experimentally tested in chapter 4, but it is still unclear how these separate pieces work as a whole. This section addresses that issue.


Object       No. of views   Initial error (per view)   Error after touchup (per view)   Initial diagnosis
Cushion      2              5 (2.5)                    0 (0)                            black part of the object surface
Football     5              16 (3.2)                   0 (0)                            common field of view problem (regions that can only be seen from the camera)
Stand        5              36 (7.2)                   5 (1)                            surface reflection
Owl          5              6 (1.2)                    0 (0)                            concave parts of the surface fail to be illuminated by the projector because of occlusion
Human Body   3              4 (1.3)                    0 (0)                            distance from the object to the projector-camera pair

Table 7.2: Evaluation: depth capture errors and their corrections.

Table 7.2 lists the performance of the shape acquisition process on objects of different size, shape, and surface. It also shows the amount of effort required to touch up the most obvious errors until all captured depth information is reasonably accurate upon visual inspection. The numbers shown in the table are the numbers of parts (e.g. spikes, jumps, holes, etc.) believed to be errors; the numbers in brackets are the average number of errors per view. The third column is the initial error in the captured depth maps, and the fourth column is the number of unerasable errors remaining after the user touch-up. Initial diagnoses of the possible reasons for the errors are listed in the last column, to be justified further.
justified.<br />

7.2.1 The Owl Experiment

Generally speaking, the best depth capture result comes from the Owl experiment. Despite the owl being the second biggest of the five objects tested, it has a more continuous surface, and the camera and the projector share a close common viewing area of the surface (i.e. where the projector can reach is where the camera can see, and vice versa). The only obvious inaccurate measurement is at the concave part at the bottom of the owl's feet. The erroneous part, seen as a bright dot in figure 7.1(a), is tiny and can easily be erased by a single touch-up.

7.2.2 The Football and Stand Experiments

In this section, two objects are tested together so that they can be compared. There are a few dissimilarities between the objects in the Football and Stand experiments. The capture results of three views are listed for each of these two experiments, in figures 7.4 and 7.3.

Figure 7.1: Shape acquisition test: Owl. (a) Depth map and (b) rendered model before touchup; (c) depth map and (d) rendered model after touchup.

• The difference in specular reflectance. The football is a rigid spherical object with a high-gloss surface; for testing purposes it is slightly deflated so that it can be placed firmly on the desktop without a stand. The brown stand is an object made of cardboard, but wrapped in reflective brown packing paper. Reflectance in the Football experiment is more severe than in the Stand experiment; however, due to the spherical surface of the football, the high reflectance is focused onto a single point. Note that in figure 7.4 the error caused by the high-gloss surface has already been filtered out by applying a smoothing filter to the scanned data, so the user touchup can be spared.

Figure 7.2: The projector-camera pair setup. The shaded part is the 'dead' area that cannot be illuminated by the projector but is in the viewing range of the camera.

• The difference in the projection light required. As mentioned above, the high-gloss surface of the football causes an overall increase in pixel values across the image, so an adjustment of the projection brightness is required to stop the white balance of the captured image becoming so high that the texture details are lost. The opposite needs to be done for the Stand experiment. In the implementation, a projection brightness of 100 is used for the Football, and 200 for the Stand.

• Both cases suffer from shadows and occlusions, but in slightly different ways. In the Football experiment, it can be seen from figure 7.4 that the only obvious depth inaccuracy occurs near the bottom rim of the sphere. This is because a small part of the desktop always stays out of the illumination but is in the viewing range of the camera (see figure 7.2). For the Stand, the case is different: the object is wrapped in a material that hardly reflects any light. It is illustrated in figure 7.3 that all the planes nearly parallel to the projection rays are severely affected, because not enough projection light is reflected from the surface to the camera plane. There are also two small areas of depth inaccuracy caused by shadows, but these can easily be corrected using the touchup tool provided.

(a) Depth map (b) Colour map (c) Depth map (d) Colour map (e) Depth map (f) Colour map
Figure 7.3: Shape acquisition test: Stand. Left column: depth maps; right column: the corresponding textures.


(a) Depth map (b) Colour map (c) Depth map (d) Colour map (e) Depth map (f) Colour map
Figure 7.4: Shape acquisition test: Football. Left column: depth maps; right column: the corresponding textures.


It is noticed in table 7.2 that the Stand is the only test with unerasable capture errors. This is because the erroneous parts are too big for the median filter to handle.
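As an illustration only (a minimal sketch, not the filter actually used in the system; the depth-map representation and the window size are assumptions), the following C++ fragment shows the kind of k x k median filter referred to here. A window of width k can only remove blemishes narrower than roughly half the window, which is why an error region as large as the one in the Stand test survives the filtering and has to be counted as unerasable.

#include <algorithm>
#include <vector>

// Apply a k x k median filter to a depth map stored row-major in 'depth'.
// An error region wider than about k/2 pixels still supplies the majority
// of the samples in every window that covers its centre, so the median
// cannot remove it; only small spikes and isolated holes are corrected.
std::vector<int> medianFilterDepth(const std::vector<int>& depth,
                                   int width, int height, int k /* odd */)
{
    std::vector<int> out(depth);            // border pixels are left untouched
    const int r = k / 2;
    std::vector<int> window;
    window.reserve(k * k);

    for (int y = r; y < height - r; ++y) {
        for (int x = r; x < width - r; ++x) {
            window.clear();
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx)
                    window.push_back(depth[(y + dy) * width + (x + dx)]);
            std::nth_element(window.begin(),
                             window.begin() + window.size() / 2,
                             window.end());
            out[y * width + x] = window[window.size() / 2];
        }
    }
    return out;
}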

7.2.3 The Cushion and Human Body Experiment

We compare the results of the Cushion and Human Body experiments together because of the similarities they share. In both experiments, fewer views are used. For the cushion, front and back are the only two views captured, as it is hard to place the cushion in other orientations. When the human body is being measured, we lower the table first (for the reason stated in section 7.1.2) and then the tester lies on the table. Three views are captured: one facing left, one facing right and the third facing up.

In these two tests, the object surfaces are continuous and convex, hence the problems we had in figures 7.4 and 7.3 do not occur here. However, 'holes' in the depth images are found at the eyes and tail of the fish, and at part of the human's hair, which are all black areas. After studying the captured Gray-coded stripe images, it is found that all those areas appear black (precisely, with pixel values of 0) in the observed image whether they are illuminated by the white or the black projection. As a result, they stay 0 in the subtraction image of the positive and negative images, and are labelled as background pixels.
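The labelling rule just described can be summarised by the following minimal sketch. It is an illustration rather than the system's actual routine; the threshold value and the flat-array image representation are assumptions.

#include <cstddef>
#include <vector>

enum PixelLabel { STRIPE_ONE, STRIPE_ZERO, BACKGROUND };

// Classify each pixel from a positive/negative Gray-code pair.
// 'pos' is captured under the projected pattern, 'neg' under its inverse.
// A surface that reflects no light gives pos and neg both close to 0, so
// the difference falls below the threshold and the pixel is treated as
// background, which is exactly the behaviour seen at the black areas.
std::vector<PixelLabel> labelStripePixels(const std::vector<int>& pos,
                                          const std::vector<int>& neg,
                                          int threshold /* assumed */)
{
    std::vector<PixelLabel> label(pos.size(), BACKGROUND);
    for (std::size_t i = 0; i < pos.size(); ++i) {
        const int diff = pos[i] - neg[i];
        if (diff > threshold)
            label[i] = STRIPE_ONE;   // lit by the white part of the pattern
        else if (diff < -threshold)
            label[i] = STRIPE_ZERO;  // lit by the white part of the inverse
        // |diff| <= threshold: no usable signal, leave as BACKGROUND
    }
    return label;
}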

(a) Depth map (front view, before touchup) (b) Colour map (front view) (c) Depth map (back view) (d) Colour map (back view)
Figure 7.5: Shape acquisition test: Cushion. Left column: depth maps; right column: the corresponding textures.


(a) Depth map (b) Colour map (c) Depth map (d) Colour map
Figure 7.6: Shape acquisition test: Human Body. Left column: depth maps; right column: the corresponding textures.

7.3 Correspondence Finding

The test framework for evaluating correspondence finding is set up as follows. For each object test, we pick two adjacent views and run the correspondence program on the image pair. Depth and point set data are touched up if there are any obvious errors, before we start finding the correspondences.

As introduced earlier, when doing the corner detection the user is provided with a facility to randomise a parameter set and run the program, and the instant results are projected onto the desktop for inspection. The exact values of the parameters, such as the search range, the eigenvalue threshold or the window size for local aggregation, are all hidden from the user. While repeating the process by randomising the parameter set, it is not necessarily the parameter set which yields the most corners that is chosen as the optimal one. The user is advised to use his own judgement by looking at the results projected onto the desktop. This is similar to debugging a C program on a local PC, the only difference being that in this application the user does not have to know anything about the technical details, which are hidden. Therefore, we apply the same rule of 'how to choose the optimal parameter set' in the test framework, to simulate the user's behaviour.
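This randomise-and-inspect loop can be pictured with the minimal sketch below. It is an illustration, not the implementation used in the system: the parameter fields, their ranges and the two callbacks (runAndProject and userAccepts) are assumptions introduced purely for the example; the text above only states that the search range, eigenvalue threshold and aggregation window are randomised and hidden, and that the user judges the projected result rather than the corner count.

#include <functional>
#include <random>

// One candidate parameter set for the corner detector (fields and ranges
// below are assumptions for illustration only).
struct CornerParams {
    int    searchRange;     // pixels
    double eigenThreshold;  // minimum eigenvalue accepted
    int    aggWindow;       // local aggregation window size (odd)
};

CornerParams randomParams(std::mt19937& rng)
{
    std::uniform_int_distribution<int>     range(5, 40);
    std::uniform_real_distribution<double> eig(0.001, 0.05);
    std::uniform_int_distribution<int>     win(1, 5);
    return CornerParams{ range(rng), eig(rng), 2 * win(rng) + 1 };
}

// Repeatedly draw a parameter set, run the detector and project the result.
// The set with the most corners is NOT automatically kept; the loop stops
// only when the user accepts what is projected onto the desktop.
CornerParams tuneByInspection(const std::function<int(const CornerParams&)>& runAndProject,
                              const std::function<bool(int)>& userAccepts)
{
    std::mt19937 rng(std::random_device{}());
    for (;;) {
        CornerParams p = randomParams(rng);
        int corners = runAndProject(p);   // detect corners, then project them
        if (userAccepts(corners))
            return p;
    }
}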

Object        Corners        Corners         No. of            User adjustment   Total time
              (left image)   (right image)   correspondences   required?         spent (minutes)
Cushion       n/a            n/a             n/a               Y                 1
Football      42             51              18                Y                 1
Stand         87             105             29                N                 2.5
Owl           206            197             30                Y                 4
Human Body    102            113             39                N                 2

Table 7.3: Evaluation: building correspondences.

Table 7.3 shows the test results.

It is noticed that the first test, Cushion, does not have results for the number of corners detected or the number of correspondences built. This is because only two views are captured for the cushion, one the top view and one the bottom view. Although these two captured views complete the object model, they share no overlapping part. Therefore it is meaningless to run the correspondence search between the two images. In the test, we skip the corner extraction and correlation steps and go straight into the tuning. The tuning task is straightforward too, as all the user has to do is turn the second view over (rotate it by 180°).
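That turning-over step amounts to a 180° rotation of the second point set about a vertical axis through its centroid, as in the minimal sketch below (the choice of axis and the point representation are assumptions; the text above only specifies a 180° rotation).

#include <vector>

struct Point3 { double x, y, z; };

// Turn the back-view point set over by rotating it 180 degrees about a
// vertical axis through its centroid. Rotating by pi about the y axis maps
// an offset (dx, dy, dz) from the centroid to (-dx, dy, -dz).
void flipAboutVerticalAxis(std::vector<Point3>& pts)
{
    if (pts.empty()) return;

    double cx = 0.0, cy = 0.0, cz = 0.0;
    for (const Point3& p : pts) { cx += p.x; cy += p.y; cz += p.z; }
    cx /= pts.size(); cy /= pts.size(); cz /= pts.size();

    for (Point3& p : pts) {
        p.x = cx - (p.x - cx);   // negate the x offset from the centroid
        p.z = cz - (p.z - cz);   // negate the z offset from the centroid
        // y (the height) is unchanged by a rotation about the vertical axis
    }
}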

For the rest of the objects, more time is usually spent on the bigger objects. The correspondence search in the Stand and Human Body tests works very well, hence the initial trial rotation and translation vectors given by the computer are accepted without further user adjustment. The Owl experiment takes longer: lots of corner points are detected but only a small portion of them are found to match. It can be seen from figure 7.7 that only about half the percentage of detected corners is turned into correspondences, compared to the other objects.

(a) (b)
Figure 7.7: Number of extracted corner points and matched correspondences.


7.4 Conclusions

In this chapter, five objects are used as the test objects to evaluate the system performance within a controlled test framework. Although many more objects have been tested in this research, the five listed here are the most representative ones illustrating the impact of different objects on the results. This includes the surface reflectance of the objects, their texture, convexity and concavity, rigidity, and the level of depth continuity across the surface.

Two key components of the system, shape acquisition via structured light scanning and point set registration from point correspondences, are tested. Statistics and experimental results give diagnoses of, and possible solutions to, the problems caused by the aforementioned challenges, and provide the foundation on which future research can be built.


Chapter 8

Conclusions

8.1 Summary

All of the chapters presented in this thesis contain their own introductions and conclusions. Apart from the Introduction, the Background and this Conclusions chapter itself, the rest of the thesis is summarised as follows:

• Chapter 3 Calibration
Methods for complete calibration of the VAE system are presented. This includes a full calibration of the projector-camera system for their intrinsic and extrinsic parameters, and the calibration of a plane-to-plane homography between the rendered projector plane and the captured image plane, induced by a third plane.

• Chapter 4 Shape Acquisition
A Gray-coded structured light scan is implemented for acquiring depth information. It is then extended and adapted to tackle the practical issues raised, before being incorporated into the whole VAE framework.

• Chapter 5 Registration of Point Sets
A framework for 3D point set registration is presented in this chapter. A conventional image registration technique is used to find corresponding points between a pair of 2D images, and the established correspondences are propagated from 2D to register the point sets in 3D space (a small sketch of this propagation step is given after this list). This framework is shown to work not only on planar surfaces, but also on arbitrary objects with the user's assistance in a VAE system, where no ground-truth information is known a priori.

• Chapter 6 System Design
This is the core of this research. A new system design is presented in this chapter, for 3D input by working collaboratively with the PC in a VAE. The proposed system is cheap to maintain with off-the-shelf hardware, and easy to deploy, requiring minimal configuration of the projector-camera pair. The system presented is intended not to be restricted to the research laboratory environment.

• Chapter 7 User Experiments
Major components of the system are evaluated in this chapter, with controlled test frameworks.
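As a small illustration of the propagation step summarised in the Chapter 5 item above (a sketch under assumed types, not the implementation described in that chapter), the fragment below lifts matched 2D pixel positions to pairs of 3D points through a hypothetical lookup3D helper that returns the reconstructed 3D coordinate at a pixel; the resulting 3D pairs are what the rigid transformation between the two point sets is then estimated from.

#include <functional>
#include <utility>
#include <vector>

struct Point2 { int x, y; };
struct Point3 { double x, y, z; };

// Given matched 2D pixel positions between two views, collect the
// corresponding pairs of 3D points. lookup3D is an assumed helper that
// returns the 3D point reconstructed at a pixel of the given view
// (from the structured-light scan of that view).
std::vector<std::pair<Point3, Point3> >
propagateTo3D(const std::vector<std::pair<Point2, Point2> >& matches2D,
              const std::function<Point3(const Point2&, int view)>& lookup3D)
{
    std::vector<std::pair<Point3, Point3> > pairs3D;
    pairs3D.reserve(matches2D.size());
    for (const auto& m : matches2D)
        pairs3D.push_back(std::make_pair(lookup3D(m.first, 0),
                                         lookup3D(m.second, 1)));
    return pairs3D;
}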

8.2 Discussions

System-wise, one of the most important design goals is to allow users to bring their objects to be input, walk up to the VAE and start the task without worrying about the technical details of computer vision or how to produce the code to do so. We aim to create an environment where the computer and its attached vision equipment work as an assistant to the user, while the user always makes the final call on key decisions based on the feedback from this interactive collaboration. Higher-cost equipment such as HMDs and touch screens, and other customised tools such as markers and gloves, are all avoided, as the system presented here is designed not only for laboratory purposes, but also for home and office environments and other open public spaces such as schools and museums.


8.3 Future Work

Techniques employed in this research are evaluated in separate chapters. Although the framework itself contains techniques that are already widely used in the field, it brings these techniques together in a new, practical and efficient way. But as mentioned before, many of the system elements would benefit from further improvement and optimisation.

There are planned improvements for the techniques used in the system. In calibration, manual adjustment of the photometric settings of the camera and the projector is not only inconvenient but also inefficient. Further development of the calibration framework would include automatic photometric calibration.

Once automated photometric calibration is feasible, it might be sensible to exploit colour-based structured light techniques which allow real-time scanning of depth information. There are also other planned improvements for the shape acquisition framework, as described in section 4.6.1.

Assigning the user more power and initiative in the point set registration stage would be another big step forward because, if appropriately designed and implemented, it would steer the registration process more quickly and efficiently towards the optimal result, while the user's leading role is still maintained.

As mentioned in section 6.5.1, robust fingertip detection and touchup in 3D space are regarded as two major improvements for future work. Successful fingertip detection would not only simplify the user interface by reducing the number of interactive buttons required, but also offer a new dimension of user interaction, as locating a point would be much easier either on a physical object or on a virtual element. 3D touchup could effectively be a consequence of deploying fingertip detection, and it would be a big boost if the user were allowed to manipulate the rendered object model using his bare hands as if he were touching the real object.

Regarding the user tests carried out in chapter 7, they are still mainly at a descriptive stage. The next task in measuring system performance would be to get testers from a variety of backgrounds, from computer vision academics to people with little experience of the field, to characterise the system both behaviourally and experimentally.




Appendix A

Declarations for class CButton

#pragma once


// number of buttons
#define NUM_BUTTONS 59

// default threshold used for inner region, if button calibration is skipped
#define BUTTON_TH_INNER 10.00

// default threshold used for outer region, if button calibration is skipped
#define BUTTON_TH_OUTER 5.0

// the time period a button stays highlighted for, in milliseconds
#define BUTTON_INT 500


// Top-left corners of all buttons
int button_pos[NUM_BUTTONS*2] =
{
    954, 618,  // 0: lock // 60x40
    954, 538,  // 1: save
    954, 458,  // 2: SL
    954, 378,  // 3: SL_repeat
    954, 298,  // 4: exit

    10, 588,   // 5: thumbnail (80 x 60)
    10, 508,   // 6: thumbnail (80 x 60)
    10, 428,   // 7: thumbnail (80 x 60)
    10, 348,   // 8: thumbnail (80 x 60)
    10, 268,   // 9: thumbnail (80 x 60)
    10, 188,   // 10: thumbnail (80 x 60)

    80, 698,   // 11: INSPECT MODE
    150, 698,  // 12: TOUCHUP MODE
    220, 698,  // 13: CORRESPONDENCE MODE
    290, 698,  // 14: visualization mode

    380, 698,  // 15: up
    460, 698,  // 16: down
    540, 698,  // 17: left
    620, 698,  // 18: right
    700, 698,  // 19: in
    780, 698,  // 20: out

    370, 698,  // 21: v_expand
    440, 698,  // 22: v_shrink
    510, 698,  // 23: h_expand
    580, 698,  // 24: h_shrink
    660, 698,  // 25: roi_up
    730, 698,  // 26: roi_down
    800, 698,  // 27: roi_left
    870, 698,  // 28: roi_right

    370, 673,  // 29: touchpad (150 x 90)
    530, 698,  // 30: double cursor speed
    600, 698,  // 31: push button

    884, 698,  // 32: manual search
    954, 698,  // 33: mouse assisted

    880, 698,  // 34: param
    954, 698,  // 35: proceed

    370, 673,  // 36: R
    370, 723,  // 37: T

    450, 698,  // 38: R_x+
    520, 698,  // 39: R_x-
    590, 698,  // 40: R_y+
    660, 698,  // 41: R_y-
    730, 698,  // 42: R_z+
    800, 698,  // 43: R_z-

    450, 698,  // 44: T_x+
    520, 698,  // 45: T_x-
    590, 698,  // 46: T_y+
    660, 698,  // 47: T_y-
    730, 698,  // 48: T_z+
    800, 698,  // 49: T_z-

    870, 698,  // 50: x1, x2, x4, x8

    450, 698,  // 51: pointset0
    520, 698,  // 52: pointset1
    590, 698,  // 53: pointset2
    660, 698,  // 54: pointset3
    730, 698,  // 55: pointset4
    800, 698,  // 56: pointset5

    880, 698,  // 57: no
    870, 618   // 58: tuning pose
};


// Button IDs
enum BUTTON_ID
{
    SYS_LOCK,
    SYS_SAVE,
    SYS_SL,
    SYS_SL2,
    SYS_EXT,

    THUMB_0,
    THUMB_1,
    THUMB_2,
    THUMB_3,
    THUMB_4,
    THUMB_5,

    MODE_INSPECT,
    MODE_TOUCHUP,
    MODE_CORRESP,
    MODE_VISUAL,

    CTRL_UP,
    CTRL_DOWN,
    CTRL_LEFT,
    CTRL_RIGHT,
    CTRL_IN,
    CTRL_OUT,

    CTRL_ROI_VEXPAND,
    CTRL_ROI_VSHRINK,
    CTRL_ROI_HEXPAND,
    CTRL_ROI_HSHRINK,
    CTRL_ROI_UP,
    CTRL_ROI_DOWN,
    CTRL_ROI_LEFT,
    CTRL_ROI_RIGHT,

    CTRL_TOUCHPAD,
    CTRL_DOUBLE_SPEED,
    CTRL_PUSHBUTTON,

    CTRL_MANUAL,
    CTRL_MOUSE,

    CTRL_PARAM,
    CTRL_PROCEED,

    CTRL_R,
    CTRL_T,

    CTRL_R_XP,
    CTRL_R_XM,
    CTRL_R_YP,
    CTRL_R_YM,
    CTRL_R_ZP,
    CTRL_R_ZM,

    CTRL_T_XP,
    CTRL_T_XM,
    CTRL_T_YP,
    CTRL_T_YM,
    CTRL_T_ZP,
    CTRL_T_ZM,

    CTRL_CHANGE_SPEED,

    CTRL_SELECT_0,
    CTRL_SELECT_1,
    CTRL_SELECT_2,
    CTRL_SELECT_3,
    CTRL_SELECT_4,
    CTRL_SELECT_5,

    CTRL_NO,
    CTRL_TUNING_POSE,
};



//--------------------------------------------
// CButton class declaration
//--------------------------------------------

class CButton
{
private:
    CvRect mProRect;       // button position/size in projector image
    CvRect mCamRect;       // button position/size in camera image
    CvRect mCamInnerRect;  // inner region for button push detection

    char *mpImageName;     // name of the image to be loaded for the button
    char *mpHelpText1;     // help text 1st line
    char *mpHelpText2;     // help text 2nd line

    bool mFlagActive;      // a flag that indicates whether the current button is engaged or not
    bool mFlagHighlighted; // a flag that indicates whether the current button is highlighted or not

    // Constructor
    CButton();

    // Destructor
    ~CButton();

public:

    // Based on the size in the projector image, calculate the buttons' expected positions in the camera image
    void SetSize(int px, int py, int pxsize, int pysize);

    // Initialise nth button
    void Initialise(int n);

    // On given image, get inner region avg
    double GetInnerAvg(picture_of_int *inpic);

    // On given image, get outer region avg
    double GetOuterAvg(picture_of_int *inpic);

    bool Pressed();
    bool Released();
    void Flash();
    void Highlight();
    void Dehighlight();

    void Attach();
    void Detach();
    void AttachText();
    void DetachText();
    void AttachNewText(char *inText1, char *inText2);

    // Black text with white background, as opposed to normal text
    void AttachInverseText();

    void DrawButtonBoundary(colour_picture &inpic);
};

Listing A.1: Header: Button.h


Appendix B

Declarations for class CPointSet

#pragma once

#include <vector>
#include "XMLParser.h"

using namespace std;

typedef std::vector<CvMat*>   CvMat_vector;    // element types assumed
typedef std::vector<CvScalar> CvScalar_vector; // element types assumed



//--------------------------------------------
// CPointSet class declaration
//--------------------------------------------
class CPointSet
{
private:

    //------------------------
    // Main data
    //------------------------
    int mLength;             // total number of points
    CvMat *mpObjectPoints;   // 3D coordinates
    CvMat *mpImagePoints;    // 2D positions
    CvScalar *mpColour;      // colour information

    int mLength_bk;
    CvMat *mpObjectPoints_bk;
    CvMat *mpImagePoints_bk;
    CvScalar *mpColour_bk;

    //------------------------
    // Matrices
    //------------------------
    CvMat *mpCentroid;
    CvMat *mpRvec;                  // 3x1 instant rotation vector
    CvMat *mpTvec;                  // 3x1 instant translation vector
    CvMat *mpRvecInter[NUM_VIEWS];  // 3x1 inter-pointset rotation vectors
    CvMat *mpTvecInter[NUM_VIEWS];  // Same as above, but vectors for translation
    int mMergedGroup;               // which group this point set is merged to: -1 for non-merge, 0 for group0, 1 for group1, and so on...

    //------------------------
    // Rendered images
    //------------------------
    picture_of_int *mpImageBwPic;    // black and white model
    colour_picture *mpImageColorPic; // model attached with colour information


    //-------------------------------------------------
    // Constructor, Destructor
    //-------------------------------------------------
    CPointSet();
    ~CPointSet();

    // Overloaded operator, for point set replication
    CPointSet& operator=(CPointSet& param);

public:
    //-------------------------------------------------
    // Primary functions
    //-------------------------------------------------

    // Load point set from XML
    void LoadXML(char *fileName);

    // Save point set to XML
    void SaveXML(char *fileName);

    // Reallocate both front data and backup data
    void ReallocateAllMemory(int len, int len_bk);

    // Reallocate memory for front data with size of len
    void ReallocateFrontMemory(int len);

    // Reallocate memory for backup data with size of len
    void ReallocateBackMemory(int len);

    // Replace front data with backup
    void ResetFromBackup();

    // Save front data into backup
    void SaveToBackup();

    // Default -1 means list all data; otherwise list the nth element
    void List(int index=-1);

    // Given a 2D image coordinate, find the point in the point set, and return its index
    int GetIndex(int xin, int yin);

    // Cut off out-of-boundary points and zero-depth points
    void RestrictSize(int size);

    // Slim with voxel quantisation
    void Slim(int objSize, int voxSize);


    //-------------------------------------------------
    // Point set transform in 3D
    //-------------------------------------------------
    void UpdateCentroid();
    void Rotate();
    void Translate();

    // Rotate + Translate + UpdateCentroid
    void FullTransform();

    // Rotate about the WCS origin
    void RotateAboutOrigin();

    // Theta rotation about unit vector (x, y, z)
    void RotateThetaAboutVector();

    // Manually fine tune rotation or translation. flag: -1, do nothing; flag: 1~6 for rotation; flag: 7~12 for translation
    void StepAdjustRorT(int flag=-1);


    //-------------------------------------------------
    // Plotting and display
    //-------------------------------------------------

    // Draw rendered point set into an image for display
    void DrawBw(int flagTopHalf=0, int flagInterp=0, int interpStep=1);

    // Draw rendered point set into an image for display (with colour info attached)
    void DrawColor(int flagTopHalf=0, int flagInterp=0, int interpStep=1);
};

Listing B.1: Header: PointSet.h


Appendix C

Declarations for class CView

#pragma once


#include "PointSet.h"
#include "Cursor.h"

// number of views (maximum allowed)
#define NUM_VIEWS 6

// number of views to be tested, debug mode
#define TESTING_VIEWS 5



//--------------------------------------------
// CView class declaration
//--------------------------------------------

class CView
{
private:
    int mViewIndex;               // index of the current view

    // Four sub images for display
    picture_of_int *mpDepthPic;   // depth map
    picture_of_int *mpTextPic;    // texture map
    colour_picture *mpModelPic;   // rendered model
    colour_picture *mpColourPic;  // colour map

    // 4 sub rects, each half the size of 640x480
    CvRect mDepthRect, mTextRect, mModelRect, mColourRect;

    // Thumbnail position
    CvRect mThumbRect;

    // buffer image for fast push and pop of the central display area
    colour_picture *mpCentralDisplayPic;

    // Cursor member, for cursor rendering and positioning
    CCursor mCursor;

    // Point set member
    CPointSet *mPointset;
    // flag indicating the current tuning mode (rotation or translation)
    bool mFlagFineTuneRorT;
    // if the point set of the current view is merged away to other views, set it true
    bool flag_PointSetMergedAway;

    // ROI for image registration (left image)
    CvRect mCorrespROIRect1;
    // ROI for image registration (right image)
    CvRect mCorrespROIRect2;

    // Constructor
    CView(int);
    // Destructor
    ~CView();

public:

    //--------------------------------------------
    // primary functions
    //--------------------------------------------

    // Initialise the current view, allocate memory, assign positions
    void Initialise();

    // Get ROI based on object dimension and point set centroid, then work out
    // the estimated area in which the object is going to appear in the
    // observed image, and crop it.
    void PrepareThumbnail(char* fname);

    // Attach four sub images
    void AttachDisplay();

    void PushCentralDisplay();
    void PopCentralDisplay();
    void ClearCentralDisplay();
    void FadeCentralDisplay();
    void UnfadeCentralDisplay();

    // Attach small box on thumbnail and big box on central display,
    // draw all connection lines
    void AttachBox(int Rval, int Gval, int Bval);
    void DetachBox();


    //--------------------------------------------
    // Touchup mode
    //--------------------------------------------

    // Adjust rendered model picture, based on incoming flag
    // n = 0~5: up, down, left, right, in, out
    void AdjustFijipic(int n);

    // Do TouchUp on the depth image, based on the current cursor location.
    // This will change the contents of the depth data, point set data, and
    // colour map, all with backup. Once done, set flagTouchUpModified = true
    void TouchUp();

    bool flagTouchUpModified;


    //--------------------------------------------
    // Correspondence mode
    //--------------------------------------------

    // Called when the user selects the 'from' and 'to' images for registration
    void UpdateCorrespSelectionDisplay(int selection);

    // Same as above, just remove everything completely (without any repairs)
    void RemoveCorrespThumbMainDisplayCompletely();

    // System gives trial ROI selections
    void AutoSelectRODisplay();

    // Select the ROI of the chosen images, slide them into the centre for a better view
    void SlideImages();


    //--------------------------------------------
    // Visualize Mode
    //--------------------------------------------

    // Default is -1: do nothing; flag 0~5: for rotations; flag 6~11: for translations
    void UpdateVisualPanelArea(int flag = -1);

};

Listing C.1: Header: View.h
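Similarly, the sketch below illustrates one plausible sequence of calls on the public CView interface. Since the constructor is declared private, the CView instance is assumed to be created and owned elsewhere; the function name ShowViewExample and the thumbnail file name are hypothetical.

#include "View.h"

// Hypothetical usage sketch (not taken from the thesis code base).
// The CView object is assumed to be constructed elsewhere, since its
// constructor is private in Listing C.1.
void ShowViewExample(CView &view)
{
    char thumbName[] = "thumb.bmp";     // hypothetical thumbnail file name

    view.Initialise();                  // allocate memory, assign display positions
    view.PrepareThumbnail(thumbName);   // crop the estimated object area for the thumbnail
    view.AttachDisplay();               // attach the four sub images
    view.AttachBox(255, 0, 0);          // draw the highlight boxes in red
    view.UpdateVisualPanelArea(0);      // flag 0~5: rotations; 6~11: translations
    view.DetachBox();                   // remove the highlight boxes again
}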

