
Human-Computer Collaboration
in Video-Augmented Environment
for 3D Input

Lijiang Li

Doctor of Philosophy Dissertation

University of York

Department of Electronics

May 2008


Declaration

Except where otherwise stated in the text, this dissertation is the result of my own independent work and investigation, and is not the outcome of work done in collaboration.

Other sources are acknowledged by footnotes giving explicit references. A bibliography is appended at the end of this thesis.

This dissertation is not substantially the same as any I have submitted for a degree or diploma or any other qualification at any other university. No part of this dissertation has already been, or is currently being, submitted for any such degree, diploma or other qualification.


Abstract

The role of the computer has gradually changed from merely a tool to an assistant to the human. Equipping computers with I/O devices and sensors makes them aware of the surrounding world and capable of interacting with humans. Video cameras and data projectors are ideally suited as such devices, especially as dramatic drops in their manufacturing costs have made them more and more popular. A new type of user interface has emerged in which video signals are used as an augmentation to enhance the physical world; hence the name Video-Augmented Environment.

This thesis presents a design for human-computer interaction in a VAE for 3D input. It begins by introducing an automated and efficient method for fully calibrating the projector-camera system. Shape acquisition techniques are discussed, and one particular technique based on structured light systems is adapted for capturing depth information. A user-guided approach for registering depth information scanned from different parts of the target object is then introduced. Finally, a practical realisation of a Video-Augmented Environment is presented, combining the techniques discussed earlier.

Overall, the VAE designed in this thesis demonstrates the feasibility of completing computer vision tasks in a human-computer collaborative environment, and shows the potential and viability of being deployed not only in the laboratory but also in office and home environments.


Acknowledgements

Completing a PhD is a marathon event, and I would not have been able to complete this journey without the support and encouragement of countless people over the last four years.

First and foremost, I would like to express my deep and sincere gratitude to my supervisor, Professor John Robinson, Head of the Department of Electronics, University of York. His wide knowledge and expertise have been invaluable to me, while his personal guidance and constructive criticism have provided a good basis for my research and this thesis.

Many thanks to Justen Hyde and Daniel Parnham for providing the OpenIllusionist framework, from which the frame grabber originated, and for their help with a variety of other implementation issues. I wish to express my thanks to my lab partners Matthew Day and Eddie Munday for many inspiring talks and for their participation in the user experiments. My warm thanks are due to Owen Francis and the other CSG group members for their assistance.

During my placement with the FCG team at British Telecommunications in Ipswich, I collaborated with many colleagues, and I wish to extend my warmest thanks to Dr Li-Qun Xu, Ian Kegel and all those who helped me with my work. Their insights and comments were of great value during my placement, and I look forward to a continuing collaboration with the FCG team in the near future.

Finally, I owe my most loving thanks to my mum, for single-handedly raising me over the last twenty years. I would not be where I am without her support, her constant instilling of confidence in me, and most importantly her love.

Lijiang Li
York, UK, May 2008


Contents

Abstract 3
Acknowledgement 4

1 Introduction 19
1.1 Problem Statement 19
1.2 Terminologies 21
1.2.1 Augmented Reality and Virtual Reality 21
1.2.2 Video-Augmented Environments 21
1.3 Goals 22
1.4 Thesis Organisation 24
1.5 Contributions 25

2 Background and Prior Art 28
2.1 Image based 3D capture methods for depth estimation 29
2.1.1 Feature Based Methods 30
2.1.2 Optical Flow Based Methods 31
2.2 Active Shape Acquisition Methods 33
2.2.1 The Use of Structured Light System 35
2.3 Video-Augmented Environments (VAEs) 35
2.3.1 Related example VAEs in the past 36
2.3.2 Previous work at York 41
2.4 Conclusions 50

3 Calibration 51
3.1 Introduction 51
3.2 Background 55
3.3 Calibration Parameters 57
3.3.1 Intrinsic Parameters 57
3.3.2 The Reduced Camera Model 61
3.3.3 Extrinsic Parameters 62
3.3.4 Full Model 64
3.4 Calibrate Camera-Projector Pair 65
3.4.1 World Coordinate System 65
3.4.2 Methodology 66
3.4.3 Data Collection 67
3.4.4 Choice of colour 70
3.4.5 Camera Calibration 73
3.4.6 Projector Calibration 74
3.5 Plane to Plane Calibration 78
3.6 Conclusions 82
3.6.1 Future Work 83

4 Shape Acquisition 87
4.1 Introduction 87
4.2 Background 89
4.3 Gray Codification 93
4.3.1 Gray Code Patterns 93
4.3.2 Pattern Generation 95
4.3.3 Codification Mechanism 98
4.4 Practical Issues 101
4.4.1 Image Levels 101
4.4.2 Limited Camera Resolution 102
4.4.3 Inverse subtraction 104
4.4.4 Adaptive thresholding 107
4.5 Depth from Triangulation 109
4.5.1 Final Captured Data 112
4.6 Conclusions 116
4.6.1 Future Work 118

5 Registration of Point Sets 121
5.1 Introduction 123
5.2 Background 125
5.2.1 Rotations and Translations in 3D 125
5.2.2 A Singular Value Decomposition (SVD) Based Least Square Fitting Method 126
5.3 Image Registration 127
5.3.1 Corner Detector 127
5.3.2 Normalised Cross Correlation 129
5.3.3 Outlier Removals 131
5.4 Fusion 136
5.4.1 Data structure of a point set 136
5.4.2 Point set fusion with voxel quantisation 137
5.4.3 User Assisted Tuning 141
5.5 Rendering A Rotating Object 143
5.6 Conclusions 145
5.6.1 Future Work 146

6 System Design 148
6.1 Introduction 148
6.2 Widgets Provided for Interaction 151
6.2.1 Introduction 151
6.2.2 Background 154
6.2.3 Practical Issues 156
6.2.4 Implementation of Pushbutton 157
6.2.5 Implementation of Touchpad 164
6.3 User interface 166
6.4 Main Utilities 169
6.4.1 Overview 170
6.4.2 Mode 1: Inspect 172
6.4.3 Mode 2: Touchup 175
6.4.4 Mode 3: Correspondence 179
6.4.5 Mode 4: Visualisation 186
6.5 Conclusions 193
6.5.1 Future Work 194

7 System Evaluation 195
7.1 Test Objects 196
7.1.1 An Overview 196
7.1.2 Object Descriptions 196
7.2 Shape Acquisition 198
7.2.1 The Owl Experiment 200
7.2.2 The Football and Stand Experiment 200
7.2.3 The Cushion and Human Body Experiment 206
7.3 Correspondences Finding 208
7.4 Conclusions 212

8 Conclusions 213
8.1 Summary 213
8.2 Discussions 215
8.3 Future Work 216

A Declarations for class CButton 230
B Declarations for class CPointSet 236
C Declarations for class CView 240


List of Figures

1.1 Mixed Reality. 22
2.1 Optical flow of approaching objects. 31
2.2 The DigitalDesk. (image courtesy of the Computer Laboratory, University of Cambridge) 37
2.3 An image of the BrightBoard. (image courtesy of the Computer Laboratory, University of Cambridge) 39
2.4 User interacts with the ALIVE system. (image courtesy of the MIT Media Lab) 41
2.5 The LivePaper system in use. (image courtesy of the Visual Systems Lab, University of York) 42
2.6 The LivePaper applications. (image courtesy of the Visual Systems Lab, University of York) 43
2.7 Snapshots of Penpets in action. (image courtesy of the Visual Systems Lab, University of York) 45
2.8 Audio d-touch interface (the augmented musical stave). (image courtesy of the Computer Laboratory, University of Cambridge) 47
2.9 Snapshots of Robot Ships in action. (image courtesy of the Visual Systems Lab, University of York) 49
3.1 Calibration objects. (image courtesy of [109]) 53
3.2 Principal points. Bottom right subimage is the imaging plane. 58
3.3 The distortion effects. 60
3.4 Transformation from world to camera coordinate system. 62
3.5 Flow chart of the camera-projector pair calibration. (diagram of image processing after the projections and captures are done) 68
3.6 Extraction of the projected pattern from the mixed one. 71
3.7 Extraction of the projected pattern from the mixed one (a closer look). 72
3.8 Extraction of the projected pattern from the mixed one (a closer look). 74
3.9 Pixel values of an image captured from a plain desktop. (bottom two showing the red channel only) 85
4.1 A 9-level Gray-coded image. (only a slice from each image is shown here, to illustrate the change between adjacent codewords) 94
4.2 Comparison: minimum level of Gray-coded and binary-coded images needed to encode 16 columns. 95
4.3 Point-line triangulation. 98
4.4 Binary encoded pattern divides the surface into many sub-regions. 99
4.5 Stripes being projected onto a fluffy doll. (10 level Gray coded stripes) 100
4.6 The alias effect causing errors in depth map. 103
4.7 3D plots of figure 4.6. 105
4.8 Inverse subtraction of original image and its flipped version. 106
4.9 The inverse subtraction: the football experiment. 108
4.10 Depth map. 113
4.11 Colour texture. 114
4.12 Scattered point set in 3D. (re-sampled at every 2 millimetre) 115
4.13 Scattered point set in 3D, attached with colour information. (re-sampled at every 2 millimetre) 116
4.14 Illustration of camera limited resolution. 118
5.1 A routine of point set registration. 125
5.2 Corner detection. 130
5.3 NCC results. 132
5.4 NCC results (periodic pattern). 133
5.5 Robust estimation. (inliers shown by red connecting lines) 135
5.6 Robust estimation. (inliers shown by index numbers) 136
5.7 Data structure of a point set. 137
5.8 Voxel quantisation of the large data set. 138
5.9 Different quantisation level by choosing different voxel size. 139
5.10 The captured objects of figure 5.12. 140
5.11 The captured objects of figure 5.12. 141
5.12 The quantisation effect of choosing different voxel size on the total point set size. 142
5.13 Manual tuning of point sets registration. 143
5.14 Different rendered views. (top: rendered range images; bottom: rendered object attached with colour texture) 144
6.1 A snapshot with touchpad and buttons. 153
6.2 A captured image showing an object is being scanned. 154
6.3 Finger detection. 157
6.4 Button calibration. 159
6.5 The True Positive Rate (TPR) and False Positive Rate (FPR) of button push detection. 160
6.6 The projected buttons and their observations in camera image. (The red blocks only indicate the area to be monitored). 163
6.7 Fingertip detection using background segmentation algorithm. 165
6.8 A screen shot of the working environment. 167
6.9 Screen shot of the system start-up state. 171
6.10 Owl experiment, 3 views captured, current on view 1. 174
6.11 Owl experiment, 3 views captured, current on view 0, model rotated. 174
6.12 The row index picture of the first view (the brighter pixel values correspond to higher rows in the projection image). 177
6.13 The touchup result of 6.10. 178
6.14 Correspondence Mode: two images are selected as 'from' and 'to'. 181
6.15 Correspondence Mode: Regions of Interest (ROIs) are selected. 181
6.16 Correspondence Mode: extracted corners. 183
6.17 Correspondence Mode: correlated and improved point correspondences. 183
6.18 Correspondence Mode: visualised point sets tuning, with controllable rotation and translation. 185
6.19 Correspondence Mode: two point sets are fused. 186
6.20 The Visualisation Mode. 187
6.21 View 2 and 3 fused together. View completes the left wing of the owl. 191
6.22 View 2 and 3 fused together. 192
6.23 Fusion of view 2, 3, and 4. 192
7.1 Shape acquisition test: Owl. Top two: before touchup; bottom two: after touchup. 201
7.2 The projector-camera pair setup. The shaded part is the 'dead' area that can not be illuminated by the projector but is in the viewing range of the camera. 202
7.3 Shape acquisition test: Stand. Left column: depth maps; right column: the corresponding textures. 204
7.4 Shape acquisition test: Football. Left column: depth maps; right column: the corresponding textures. 205
7.5 Shape acquisition test: Cushion. Left column: depth maps; right column: the corresponding textures. 207
7.6 Shape acquisition test: Human Body. Left column: depth maps; right column: the corresponding textures. 208
7.7 Number of extracted corner points and matched correspondence. 211


List of Tables

4.1 10 level Gray code look-up table. 97
6.1 Grouping status of the point sets at different stages. 190
7.1 An overview of the objects used for the tests. 197
7.2 Evaluation: depth capture error, and their corrections. 199
7.3 Evaluation: building correspondences. 210


List of Acronyms

AD Absolute Intensity Differences
ALIVE Artificial Life Interactive Video Environment
AR Augmented Reality
DOF Degree of Freedom
FOE Focus of Expansion
FOV Field of View
FPR False Positive Rate
FRF Fast Rejection Filter
GUI Graphical User Interface
HCI Human-Computer Interface
HMD Head Mounted Display
MAD Mean Absolute Difference
MR Mixed Reality
MSE Mean Squared Error
NCC Normalised Cross-Correlation
PCA Principal Component Analysis
PTZ Pan-Tilt-Zoom
RANSAC RANdom SAmple Consensus
ROI Region of Interest
SD Squared Intensity Differences
TPR True Positive Rate
TUI Tangible User Interface
SVD Singular Value Decomposition
UI User Interface
VAE Video-Augmented Environment
VR Virtual Reality
WCS World Coordinate System
WTA Winner-Takes-All
WWW World Wide Web
XML Extensible Markup Language


Chapter 1

Introduction

1.1 Problem Statement

The goal of computer vision is to make useful decisions about physical objects and scenes based on sensed images [89]. Therefore, it is almost always necessary to describe or model these objects in some way from images. It is safe to say there is no better way than reconstructing 3D models from 2D images, because 3D vision is natural to humans and can therefore provide structural information in perhaps the most obvious way perceived by humans.

Over recent years, researchers and scientists have been fascinated by the possibility of building intelligent machines or vision systems which are capable of understanding the physical world and representing it in 3D space. They are also keen to bring these vision systems into people's day-to-day lives, and to use them to bridge the gap between the physical world where humans live and the virtual world the computer generates.

This research is inspired by that context. We aim to develop a vision system for efficient 3D shape input. From data input to finally building a complete 3D model of the target object, the process may take several captures of the object positioned in different orientations, and tasks such as error removal and data fusion are carried out in a human-computer collaborative way, in an environment that mixes real-world objects with augmented video signals.

It is also important that all hardware used in the system is day-to-day equipment that is easy to obtain in an office environment. Inexpensive peripherals and easy-to-use software are used so that the system can be applied in various environments, especially targeting museum exhibitions and home gamers.

In conclusion, the system should not only accomplish the 3D shape input task but also efficiently and collectively utilise the skills of the human and the power of the computer, in a visual environment subject to illumination changes, where physical objects and virtual elements co-exist.

1.2 Terminologies

1.2.1 Augmented Reality and Virtual Reality

Virtual Reality (VR) is a synthetic world in which we interact with virtual objects generated by computers or other equipment, instead of the real objects surrounding us in the real world. Augmented Reality (AR), sometimes known as Mixed Reality (MR), mixes the real physical world and the world of VR by enhancing the real world with augmented virtual information.

1.2.2 Video-Augmented Environments

In this thesis, a VAE is a kind of projector-camera system in which a user's interactions with objects and projections are interpreted by a vision system, leading to changes in the augmented signals. It is a specific type of AR, where the augmentation could be anything from overlaying instructions to generating a virtual object that appears to exist in the physical world and responds to the environment according to the human's instructions. In this way, objects can appear to be augmented [52, 73], or the user can manipulate graphical data by gesture [63, 13, 30]. A significant property of such systems is that 3D objects and projected images are combined in a single mixed environment.

Figure 1.1: Mixed Reality.

1.3 Goals

We contend that in many non-interactive vision problems, a valid and sometimes superior solution can be attained through a human user or users collaborating with automated analysis. Previous work at York has reported applications in fast panorama construction [80], AudioPhotoDesk [41], d-touch [29] and movie footage logging [70]; this thesis considers 3D object acquisition in a human-computer collaborative way. The importance of the user in the VAE presented in this thesis is highlighted, as it enables 3D modelling without expensive presentation systems (e.g. servos etc.).

Any such system must combine vision and interaction techniques with the design goal of higher efficiency than a purely automated system that requires passive human operator time. The one presented in this thesis uses the projector-camera pair of a VAE to acquire range images, then user interaction in the augmented environment to identify corresponding points for building a full 3D model. The automation can then take over again to suggest prototype registrations for further adjustment by the user. There are also simple facilities for touching up range images. The result is an efficient 3D acquisition system that can be deployed without conventional input devices such as keyboards, mice, or laser pointers.

In short, this work aims:

• To analyse and extend the use of video as an input device.

• To devise and implement different image and video processing components that make an augmented reality 3D input device possible.

• To design a human-computer collaborative system for inputting 3D shapes, and evaluate its performance.


1.4 Thesis Organisation

The rest of the thesis is organised as follows.

A literature review with a history of the major image based 3D capture methods and the prior art of VAEs is given in chapter 2.

Chapter 3 introduces the calibration of the projector-camera pair as the system configuration stage. The calibration serves two purposes: it gives the internal and external geometry of the projector-camera system, and it provides the bi-directional transform between the projection signals and their observations in the camera image.

Chapter 4 is concerned with the technique used for shape acquisition as the first stage of data input from real objects.

In chapter 5 we introduce the method for extracting 3D information from the scanned range images and for fusing the 2.5D data into a complete 3D model.

Chapter 6 gives the workflow built from the aforementioned key components. The interactive user interface is also presented in this chapter.

Experimental results and performance evaluation from the user tests are given in chapter 7.

Chapter 8 draws conclusions and outlines possible future work.

1.5 Contributions

The work described in this thesis is the development of a tabletop-based VAE for fast 3D input via collaborative work between the human and the computer.

In chapter 3, a fully automatic method for calibration of the camera-projector pair is proposed and implemented. The work is inspired by a widely used Matlab-based camera calibration toolbox, which is extended and converted to C++ to make it capable of calibrating the projector-camera system in a fully automatic manner. The Matlab toolbox is also used off-line to manually evaluate and validate the calibration results from our own automatic method. Initial testing of the calibration data suggests it is not only suitable for tabletop-based monitoring, but also capable of supporting full 3D applications.

In chapter 4, Gray-coded structured light projection is implemented for the acquisition of 2.5D depth maps. Despite the method itself being well established and widely used, efforts are made to incorporate it into the interactive VAE system and to overcome the issues raised in practice, such as ever-changing lighting conditions and the varied surface reflections caused by different object materials. Problems such as the large distance between the camera-projector pair and the projection surface, and the aliasing effect caused by the limited camera capture resolution, are tackled as well.

A framework for 3D point set registration is developed in chapter 5. It begins with the conventional image registration method for planar objects, then extends it to work for arbitrary objects in the VAE, with no a priori ground truth information available.

This thesis proposes a new system design for inputting 3D shapes using an interactive VAE. This is a major contribution and it is detailed in chapter 6. The proposed system is cheap to maintain with off-the-shelf hardware, and easy to deploy with minimal configuration of the projector-camera pair. It is also not restricted to controlled laboratory environments.

The designed system also allows multi-user collaboration, and a user can walk up to the VAE and interact with bare hands, without the Head Mounted Display (HMD), gloves or markers that most current VAE systems rely on. The top-down projection mechanism is also user-friendly, as it dramatically reduces the chance of the user's eyes being hurt by the bright projection light.

Although the system proposed here contains techniques that are already widely used in the field, it brings them together in a new, practical and efficient way. Very few such systems can be deployed outside restricted laboratory environments, and at a very low cost, by avoiding expensive hardware such as touch screens and HMDs. Initial test results show that it provides a solid foundation for future research in this field, and it opens up many possibilities for promising future work.


Chapter 2

Background and Prior Art

This research combines 3D shape acquisition with video augmented reality. Shape acquisition is a tool for capturing 2.5D depth information, and it can be used repeatedly to build a complete 3D model from a set of different views of the object being measured. The VAE is an augmented reality in which projected visual signals are used to augment the real world. The background to both areas is reviewed here.


2.1 Image based 3D capture methods for depth estimation

Humans visually perceive depth using both of their eyes. A simple experiment illustrates this: if one tries to touch the tips of two pens together with one eye closed, it is almost impossible to succeed. The same thing happens when moving a finger towards a wall with one eye closed; it is very hard to visually judge the distance between the fingertip and the wall. The reason is not hard to explain: humans rely on binocular stereopsis for visual depth perception.

The root of the word stereopsis, stereo, comes from the Greek word stereos, meaning firm or solid [100]. With stereo vision a solid object is perceived in three spatial dimensions, width, height and depth, which are geometrically represented as the X, Y and Z axes. During the perception process, each human eye captures its own view and the two separate images are sent on to the brain for processing. When the two images arrive simultaneously at the back of the brain, they are united into one 2.5D representation based on their similarities, giving the human an observation in three dimensions. In the field of computer vision, this human ability for depth perception using binocular stereopsis has been modelled by two displaced cameras that obtain 3D information about the investigated scene.
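As a reminder of why two displaced views are enough (a textbook relation, not a result of this thesis): for a rectified pair of cameras with focal length $f$ and baseline $B$, a scene point imaged at horizontal positions $x_l$ and $x_r$ has disparity $d = x_l - x_r$ and lies at depth

\[ Z = \frac{fB}{d}, \]

so once corresponding points in the two images are identified, depth follows directly from the disparity. Establishing those correspondences is the subject of the methods reviewed next.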


2.1.1 Feature Based Methods

A feature based stereo matching algorithm produces a depth map that best describes the shape of the surfaces in the scene via a set of matching features used as correspondences. The correspondences are typically found from points, lines, corners, contours or other distinguishing features extracted from both of the observed images [37].

During matching, the most commonly used matching criteria are the pixel-based Squared Intensity Differences (SD) [1, 55] and Absolute Intensity Differences (AD) [55], which are sometimes averaged to give the Mean Squared Error (MSE) and Mean Absolute Difference (MAD). Other widely used traditional matching costs include Normalised Cross-Correlation (NCC), which is similar to the MSE, and binary matching costs such as edges [20, 43] or the sign of the Laplacian [71]. More recently, various robust measures [10, 11, 86] have been proposed to limit the influence of mismatches. Once the matching costs are computed, local and window-based methods are used to aggregate the cost by summing or averaging over a support region. In local methods, the final disparity at each pixel is chosen as the one associated with the minimum cost value; this is often known as the Winner-Takes-All (WTA) approach. This brings a limitation: uniqueness of matches is enforced for only one of the two images, because pixels in the second image might be pointed to by multiple points from the first image, or vice versa. Efficient global methods such as max-flow [82] and graph-cut [91, 17] have been proposed to solve this optimisation problem and have produced promising results.

Comprehensive reviews of the aforementioned techniques are provided in [57, 87, 47].
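To make the local, window-based WTA aggregation described above concrete, the following sketch (illustrative only, not code from this thesis; the image layout, window radius and disparity search range are assumed) computes a disparity map for a rectified grey-scale pair by minimising the summed squared intensity difference (SD) over a square support window:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Minimal winner-takes-all block matching on a rectified grey-scale pair.
// 'left' and 'right' are row-major images of size width x height.
// Window radius and disparity range are illustrative choices.
std::vector<int> wtaDisparity(const std::vector<uint8_t>& left,
                              const std::vector<uint8_t>& right,
                              int width, int height,
                              int maxDisparity = 64, int radius = 3)
{
    std::vector<int> disparity(width * height, 0);
    for (int y = radius; y < height - radius; ++y) {
        for (int x = radius; x < width - radius; ++x) {
            long bestCost = std::numeric_limits<long>::max();
            int bestD = 0;
            for (int d = 0; d <= maxDisparity && x - d >= radius; ++d) {
                long cost = 0; // summed squared intensity differences (SD)
                for (int dy = -radius; dy <= radius; ++dy)
                    for (int dx = -radius; dx <= radius; ++dx) {
                        int a = left [(y + dy) * width + (x + dx)];
                        int b = right[(y + dy) * width + (x + dx - d)];
                        cost += long(a - b) * (a - b);
                    }
                if (cost < bestCost) { bestCost = cost; bestD = d; } // winner takes all
            }
            disparity[y * width + x] = bestD;
        }
    }
    return disparity;
}
```

Global methods such as max-flow and graph-cut replace the independent per-pixel minimisation in the inner loop with an optimisation over the whole disparity field.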

2.1.2 Optical Flow Based Methods

Optical flow based methods recover structure information from the optical flow observed in two images of a moving rigid object, or in two images taken from different points of view of a stationary object.

Optical flow is the distribution of velocities of movement of brightness patterns in an image, where the brightness patterns can be objects but normally refer to pixels for further processing [49]. It can arise from relative motion between objects and the observer, which means it can come from either a moving camera imaging a static scene or objects moving in front of the camera. In either setting, more than one image is taken and the optical flow is computed to estimate the 3D locations of the features of interest.

Figure 2.1: Optical flow of approaching objects. (a) first image; (b) second image; (c) observed flow.

Figure 2.1 simulates the ground and an object approaching the observer at different relative speeds and in different directions. Points on the ground may have the same instantaneous velocity, but when they are perceived by the human eye their images cross the retina with different velocities and directions. The velocities are represented as rays that share the same vanishing point, called the Focus of Expansion (FOE). The FOE of the ground lies within the image and is easy to find, but the FOE of the moving object is located outside the image.

In recent approaches, optical flow [49, 65, 7, 1, 40, 10, 11] is widely used to estimate dense correspondences between consecutive frames. In [49] a gradient-based method is presented to compute the optical flow, while there are also feature-based [35, 12, 92] and correlation-based methods [8, 90, 61]. Once the correspondences are established, the 3D locations of the corresponding features can be computed if information about the camera is known.
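For reference, the standard brightness-constancy formulation underlying gradient-based methods such as [49] (stated here in its usual textbook form, not as a derivation from this thesis) assumes the brightness $I(x, y, t)$ of a moving pattern is conserved, so a first-order expansion gives the optical flow constraint equation

\[ I_x u + I_y v + I_t = 0, \]

where $(u, v)$ is the flow vector and $I_x$, $I_y$, $I_t$ are the spatial and temporal image derivatives. Since this is a single equation in two unknowns, gradient-based methods add a smoothness or local-constancy assumption to recover the flow field.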

One of the most comprehensive discussions and evaluations of existing optical flow computation methods is given in [5].


2.2 Active Shape Acquisition Methods

The techniques discussed in section 2.1 cover most of the scenarios in computer vision for depth estimation, but under certain circumstances things can still be done in a proactive way to enhance performance. For example, when measuring a white object with no texture at all, it is often hard to extract distinctive features or to compute the optical flow. In this case it would be helpful to manually put some marks on the object, such as squares or triangles, to help locate the interest points. Similarly, in an augmented environment, controlled lighting can replace those squares and triangles as an aid to identifying more interesting features.

Imagine waving a pen over the inspected object under a constant light to cast shadows across the scene. The shadow is expected to be a thin line, but is deformed by the shape of the surface underneath. Can structural information be retrieved from the deformed shadows? Another example is turning on different lights in the room and using a camera to monitor an object under the different lighting conditions; again, is there structural information induced in the observed images?

One of the active methods is photometric stereo [106], which can estimate the surface orientation of an object by using several images taken from the same viewpoint but under distinct illumination from different directions. Under most circumstances, the surfaces being measured are assumed to obey Lambert's cosine law, which states that the irradiance (i.e. light emitted or perceived) is proportional to the cosine of the angle between the surface normal and the light source direction, and this relationship can be represented by the reflectance map [107, 108]. A big advantage of the photometric stereo method is that it can be used as a texture classifier. For instance, suppose a surface with many protuberant horizontal curved stripes is imaged under a single constant illumination. If the surface is rotated by 90 degrees while the lighting remains the same (i.e. in strength and orientation), conventional texture based correspondence matching fails because of the large change in the appearance of the surface. For these types of applications, photometric stereo is the answer, provided the rotation is known.
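To make the Lambertian model concrete (a standard formulation, not taken verbatim from this thesis): if a surface point with albedo $\rho$ and unit normal $\mathbf{n}$ is lit in turn by $k \geq 3$ distant light sources with known directions $\mathbf{l}_1, \ldots, \mathbf{l}_k$, the measured intensities are $I_i = \rho\,\mathbf{l}_i^\top \mathbf{n}$. Stacking the light directions into a matrix $L$ and the intensities into a vector $\mathbf{I}$ gives

\[ \mathbf{I} = L\mathbf{g}, \qquad \mathbf{g} = \rho\mathbf{n}, \qquad \mathbf{g} = (L^\top L)^{-1} L^\top \mathbf{I}, \]

from which $\rho = \lVert\mathbf{g}\rVert$ and $\mathbf{n} = \mathbf{g}/\lVert\mathbf{g}\rVert$ are recovered at every pixel.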

Another active technique for shape acquisition is structured light [81, 45, 6, 15, 84]. In structured light systems, a projector completely replaces one of the cameras of a stereo vision system. With the projector projecting light patterns such as dots, lines, grids or stripes onto the object surface, the illumination sources of these projected signals are all known in the projector space. At the same time, a camera captures the illuminated scene as the observer. By projecting one or a set of known image patterns, it is possible to uniquely label each pixel in the image observed by the camera.

Unlike the stereo vision methods introduced in section 2.1, which rely on the accuracy of matching algorithms, structured light establishes the geometric relationship automatically, by a direct mapping from the codewords assigned to each pixel to their corresponding coordinates in the source pattern.
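Chapter 4 details the codification actually used in this work; purely as a generic sketch (the binary-reflected Gray code below is standard, and the helper names and pattern layout are illustrative assumptions rather than this thesis's implementation), a projector column index can be converted to a Gray codeword whose bit planes define the stripe patterns to project:

```cpp
#include <cstdint>
#include <vector>

// Standard binary-reflected Gray code: adjacent columns differ in exactly one bit.
inline uint32_t toGray(uint32_t n)   { return n ^ (n >> 1); }

// Recover the column index from a decoded Gray codeword.
inline uint32_t fromGray(uint32_t g) {
    for (uint32_t mask = g >> 1; mask != 0; mask >>= 1) g ^= mask;
    return g;
}

// Build one stripe pattern per bit plane: pattern[b][x] is 1 (bright) or 0 (dark)
// for projector column x in the b-th projected image.
std::vector<std::vector<int>> grayStripePatterns(int columns, int bits) {
    std::vector<std::vector<int>> pattern(bits, std::vector<int>(columns));
    for (int x = 0; x < columns; ++x) {
        uint32_t code = toGray(static_cast<uint32_t>(x));
        for (int b = 0; b < bits; ++b)
            pattern[b][x] = (code >> (bits - 1 - b)) & 1u; // most significant bit first
    }
    return pattern;
}
```

A camera pixel that decodes to codeword g then maps straight back to projector column fromGray(g), which is the direct codeword-to-coordinate mapping referred to above.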

A detailed discussion of structured light techniques is presented in chapter 4.

2.2.1 The Use of Structured Light System

This research presents a projector-camera based VAE system. Although some controlled lighting is available, such as turning the lights on and off or adjusting the blinds, it is not controllable to the extent required for photometric stereo. With the projector-camera pair available, structured light fits well in terms of hardware requirements, and it is used for the initial capture of depth information in this research. Following this, stereo-matching correspondence methods are applied to fuse the depth data captured from different views of the object being measured.

2.3 Video-Augmented Environments (VAEs)

A VAE is a visual environment where physical objects from the real world and virtual elements co-exist coherently. Data projectors are normally used in VAEs to augment the real objects by projecting video signals onto the scene. The visual environment is also monitored by the camera, so that the VAE system can detect changes and respond by changing the projections.

Over the last decade various VAE systems have been developed for different purposes. We list some related example VAEs to show their range and diversity.

2.3.1 Related example VAEs in the past

DigitalDesk

In the early 90s, one of the earliest projects in the history of VAEs emerged as Wellner's DigitalDesk [102, 104, 103] (figure 2.2(a)). A major feature of the project is the blurring of the boundary between physical paper and electronic documents. DigitalDesk also tackles the problem of calibrating the multiple input (cameras) and output (projector) devices, so as to enable a planar mapping between their individual coordinate systems.

Figure 2.2: The DigitalDesk. (a) Hardware setup. (b) The DigitalDesk. (image courtesy of the Computer Laboratory, University of Cambridge)

In DigitalDesk, a projector and one or more cameras are mounted above the desk, sharing a common view area. On the desk, a user can place normal day-to-day objects such as papers, books and mugs. The desk also has the characteristics of a workstation, in that the projector and camera(s) are connected to a PC and the system can (1) read the documents placed on the desktop; (2) monitor a user's activity at the desk; and (3) project video signals such as images and annotations down onto the desk surface.

Inspired by DigitalDesk, a number of prototype applications have been built. For example, the PaperPaint application [104] allows copying and pasting of images and text from paper documents laid on the desk into electronic versions. The DigitalDesk Calculator [102] (figure 2.2(b)) enables mathematical operations on numeric data contained in paper documents, providing the user with a virtual calculator by projecting a set of buttons alongside the paper documents. Another application is Marcel [72], where the user can point a finger at words in a French document and the pointed words are translated into English, with the translation projected alongside the original French word.

BrightBoard

BrightBoard [93] explores the use of an ordinary whiteboard as a computer interface. A vision system is developed to monitor what is happening on the board. A major difference of BrightBoard from other VAE systems is that it is not designed to continuously respond to the captured images. Instead, events are only triggered whenever the system detects a significant change, such as the user obstructing the whiteboard.

A few commands are provided (previously written with marker pens) on the board, with a square check box alongside each command (figure 2.3). For each check box, the system monitors the square as the active area and detects when the zone becomes significantly darker or lighter, which corresponds to the conclusion that a mark has been made on the board or erased, respectively.

Figure 2.3: An image of the BrightBoard. (image courtesy of the Computer Laboratory, University of Cambridge)

After the initial success of the prototype, instead of expanding the system into a monolithic application with more and more features, the developers decided to simplify BrightBoard into a whiteboard-based control panel from which scripts and external programs can be activated, such as printing and saving what is written on the whiteboard, emailing images of the board, or passing them on to other programs for further processing.

However, one of the limitations of the system is that calibration is not involved at any stage. The system relies on the fact that the active areas in the camera image are crudely fixed. Once the camera or the board itself is moved, the system needs to be reconfigured.


Artificial Life interactive Video Environment (ALIVE)

The ALIVE [67, 68] system was developed at the MIT Media Lab, inspired by the ideas behind Myron Krueger’s VideoPlace [59, 60]. A large projection screen, roughly the same height as a human, is placed vertically on the ground. A camera is fixed on the top edge of the screen and monitors the user, who stands right in front of the screen and is free to move about. In the observed image, the background is cut off so that only the foreground image of the user is conserved. It is then incorporated into a different scene (e.g. a different room), mixed with some animated creatures (figure 2.4). Users can interact with the computer-generated creatures either by their movement or by instructions expressed through their gestures.

To enable this type of interaction, the user’s 3D position in the physical world has to be known. With a couple of assumptions, this can be achieved even with a single camera. First, the relative position and orientation of the camera with respect to the floor need to be known. Also, the user is assumed to be standing on the floor at all times, so that simply locating the user’s lowest point in the observed image gives an approximate estimate of his/her position in the room.

Figure 2.4: User interacts with the ALIVE system. (image courtesy of the MIT Media Lab)

2.3.2 Previous work at York

At the Visual System Group, University of York, the main research interest in VAEs concerns image input and analysis technologies that are resilient to lighting changes and shadowing. Sufficiently fast VAE implementations are aimed at supporting richly interactive applications.

A number of practical VAE applications have been designed and implemented within the group [52, 73, 29, 74]. Prior to this research, one of the most recent is Robot Ships, developed for the National Museum of Scotland’s Connect gallery.


LivePaper

A recent system [52] by Robinson and Robertson provides a VAE in which individual sheets of pages, cards and books are placed on an instrumented tabletop to activate their enhancement. It appears to the user as if the paper has additional properties with new visual and auditory features.

Figure 2.5: The LivePaper system in use. (image courtesy of the Visual Systems Lab, University of York)

A sheet of paper is detected through boundary extraction in an observed image, and then the projector displays the associated augmentations according to the contents of the current page recognised by the system. The augmented video signals remain projected onto the page. An interactive menu is provided beside the page to offer finger-triggered functionality.

A number of sample applications have been developed to illustrate the feasibility of the LivePaper system. These applications include an architectural visualisation tool (figure 2.6(a)) which projects a 3D hidden-line rendering of walls onto a page, page sharing, remote collaboration (figure 2.6(b)), and World Wide Web (WWW) page viewing. From the user’s perspective, all of these applications are attributes of the particular page, not features of the tabletop.

Figure 2.6: The LivePaper applications. (a) The architectural visualisation application. (b) The collaborative drawing application. (image courtesy of the Visual Systems Lab, University of York)

Another application of LivePaper is an audio player. When a page such as a business card is laid on the desk, the player begins playing an audio clip, whose playback can be controlled by the user by pressing the projected buttons.

PenPets

The PenPets application, developed by O’Mahony and Robinson [73], runs on a VAE called SketchTop which supports rich interaction through sketching, augmented physical objects and mobile virtual objects.

SketchTop is a whiteboard mounted horizontally at desk height, together with other physical objects that can be augmented. Problems are encountered in some of the other whiteboard-based VAE systems. First, the whiteboard is placed horizontally because a vertically mounted whiteboard cannot support augmented objects other than the video signal itself. Second, markings are static once written on the whiteboard, so the literality of interaction that comes through registering augmented signals to moving objects is lost. SketchTop was designed to solve both these problems and thereby provide rich interactions via static-but-erasable writings.

The focus of the SketchTop demonstration is Penpets, an artificial life application in which virtual animals roam the augmented surface, running into objects and triggering events subject to their various behavioural models.

Figure 2.7: Snapshots of Penpets in action. (a) A maze-solving agent tries to find its way out while the user modifies the structure of the maze. (b) Moving an agent with a fishnet-like tool. (image courtesy of the Visual Systems Lab, University of York)

Figure 2.7 shows two snapshots of Penpets in action. The agent demonstrated in figure 2.7(a) has hazard detection and maze-solving abilities. The tunnels and walls on the whiteboard are drawn by users, therefore users can easily hinder the agents by opening up new exits, closing old ones, or tapering the current lane in which the agent is travelling. Figure 2.7(b) shows an agent being carried to another part of the environment by a fishnet-like tool.

Based on different behaviour models, further SketchTop applications such as a circuit simulator, a traffic simulator, and sketchable pinball (using agents as balls) have been implemented. Another interesting implementation simulates the agents’ culinary interests by providing a means of recognising different objects such as an apple, cheese, or a teapot.

Audio d-touch

Audio d-touch [29, 28] uses a consumer-grade web camera and customisable block objects with markers attached to provide an interactive Tangible User Interface (TUI) for a variety of time-based musical tasks such as sequencing, drum editing and collaborative composition. Three musical applications have been reported by previous research in the group: the augmented musical stave (figure 2.8), the tangible drum machine, and the physical sequencer. Although there is no data projector in this system, Audio d-touch is very similar to other standard VAEs, the only difference being that the video signals projected by the projector are replaced by the audio signals from the speakers.

TUIs are a recent research field in Human-Computer Interfaces (HCIs). Compared to a Graphical User Interface (GUI), where users interact with virtual objects represented on a screen through mouse and keyboard to control and represent digital information, in a TUI physical objects are used in real space to achieve the same goals. Grasping a physical object is equivalent to grasping a piece of digital information, and normally different objects represent different pieces of information of the virtual model. As feedback, the computer output is usually presented in the same physical environment to sustain the perceptual link between the physical and virtual objects.

In Audio d-touch the user can create patterns and beats. This is realised by mapping physical quantities to musical parameters such as timbre and frequency. The visual part of the system tracks the position of the control objects with a web-cam by means of a robust image fiducial recognition algorithm. Technical details of the fiducial algorithms can be found in [27, 75].

Figure 2.8: Audio d-touch interface (the augmented musical stave). (image courtesy of the Computer Laboratory, University of Cambridge)

Figure 2.8 shows one of the Audio d-touch applications: the augmented musical stave. Only the interactive surface is shown in the figure; the web camera is mounted vertically above the surface and a pair of speakers are placed to the side – all connected to a PC. In the augmented stave, physical representations of musical notes can be placed on a stave drawn on an A4 sheet of paper, for either teaching score notation or composing melodies. The interactive objects are rectangular blocks, each of which is labelled with a fiducial symbol correlated to a variety of musical notes. Once the notes are placed on the stave, the corresponding sounds are played by the computer. Various musical parameters such as the pitch, the duration (quavers, crotchets, minims, etc.) and the playing sequence are decided by the position of the object on the musical stave.

Prototypes of the designed instruments have been tested by a group of people with different musical backgrounds, ranging from music academics to amateurs with little experience in music composition. Each enjoyed interacting with the instruments and managed to make interesting compositions.

Robot Ships

Robot Ships is a commercial application developed as a featured exhibition for the Connect Gallery [101] at the National Museums of Scotland in Edinburgh.


Designed with VAE technology, Robot Ships turns a tabletop into a stretch of ocean, upon which robotic boats work together to clean up oil spills. An audience walks up to the tabletop, reaches onto it, and becomes part of the interactive environment to create various events (figure 2.9(a)).

Figure 2.9: Snapshots of Robot Ships in action. (a) The user sinking an oil tanker for the workers to start the clean-up work. (b) A screen shot showing the workers starting to clean up the toxic spill that has been located by a scout. (image courtesy of the Visual Systems Lab, University of York)

On the biological scale, the idea behind Robot Ships is inspired by combining user assistance and a work force to solve environmental tasks. In this case, the scout ship is first sent out to search for toxic spills, and upon finding one it returns to the central control rig. On its way back, it navigates around the obstructions and leaves a series of trail points. Cleanup worker ships are then dispatched. Without knowing the location of the spill, the workers rely only on the trail points left by the scout. Because the workers do not know where they are heading, and instead use only their limited viewing cone, they are more easily manipulated by the audience. As the entire interface is on a round table which is used by reaching over it, it is open to all ages and to multi-user collaboration.

Robot Ships is a VAE that runs on top of the OpenIllusionist framework [50], independently developed by previous members of the Visual Systems Lab, Justen Hyde and Dan Parnham. More details of Robot Ships and OpenIllusionist are given in [74].

2.4 Conclusions

There are many other good VAEs apart from those aforementioned. Here we have only introduced some of the pioneering and well-known VAEs, and the related previous work carried out in our Visual System Lab.

At present many research groups continue their work on VAEs, and some of the related individual contributions will be reviewed in more detail at the appropriate stages later in this thesis.


Chapter 3

Calibration

3.1 Introduction

In a camera-projector based VAE system, the different components have their own coordinate systems: the camera coordinate system, the projector coordinate system, and the World Coordinate System (WCS) within which the real objects are placed. For accurately measuring the objects’ placement on the tabletop using the structured light scanning method, it is vital to have a reliable calibration process so that the internal and external geometry of the camera and the projector are known. When a user interacts with the augmented signals projected onto the desktop, there is a need to sustain the coherent spatial relationship between the physical objects and the virtual elements in a continuously changing visual environment.

For example, if a light dot is projected onto the centre of the desktop, it will not necessarily appear at the centre of the observed image. Therefore the original location of the light dot in the projector image and its observed position in the captured image need to be correlated, so that the system knows where to look for it in the captured image. Furthermore, if the light dot is projected onto an object, the 3D position of the illuminated point on the object in the real world might need to be measured. In this case, the internal geometry of the camera and the projector needs to be known, to establish the mapping between pixels and real-world measurements and to determine to what extent the image is distorted due to lens imperfections. The recovery of all the necessary information is called the calibration process.

This chapter addresses this calibration problem.

Calibration task

The objective of the camera calibration process is to find the internal parameters (a series of parameters that a camera has inherently) and the external parameters (the position of the camera and its orientation relative to the World Coordinate System (WCS)).


Calibration principle

Calibrating the camera requires measurements of a set of 3D points and their image correspondences [37]. The most common way to do this is to have the camera observe a 2D planar pattern consisting of multiple coplanar points, with the pattern shown to the camera in different views. Alternatively, a 3D rig marked with ground truth points can also be used as the calibration object. The same principle applies to the projector calibration, although it is implemented in a slightly different way.

Camera calibration

In practice, a black and white checkerboard plane is usually chosen as the calibration object because it offers a set of known points as ground truth points straightaway, although there are other types of calibration objects that can be used [109]. In this research a 20 × 20 checkerboard is used as the calibration object.

Figure 3.1: Calibration objects. (a) 3D rig. (b) 2D planar object. (c) 1D object with marked points. (image courtesy of [109])

Projector calibration

When calibrating the projector we aim for the same set of parameters, the internal and external parameters of the projector. Unlike the camera, the projector already has a set of 2D points as ground truth, since the pattern to be projected is a known image, but their 3D correspondences (the 3D positions of their projections) are unknown. Finding these 3D locations is essential so that there are two sets of points available to complete the projector calibration. Therefore it is a prerequisite that the camera is calibrated first, to provide the transform of these unknown 3D points from the camera image space to the real-world coordinate system.

2D plane to plane calibration

The user interface of the collaborative system designed in this research for inputting 3D is based on a plane (i.e. the table top). Therefore a precise registration between the image space of the camera and the rendered space of the projector is desired, so that the spatial relationship between the projected signals and their observed images is sustained. To work out this plane-to-plane geometry it is not necessary that the internal parameters of the camera and the projector are known. The method of this plane-to-plane calibration is introduced in a later part of this chapter.

The rest of this chapter is structured as follows. In section 3.2 we review other related work. In section 3.3 we explain the calibration parameters and give the formalised full calibration model. In section 3.4 we introduce the implementation of calibrating the camera and the projector, respectively. A method of 2D plane-to-plane calibration is presented in section 3.5. Conclusions are given in section 3.6.

3.2 Background

During the past decade camera calibration has received a lot of attention because it is strongly related to many computer vision applications such as stereo vision, motion detection, structure from motion, and robotics [99, 37, 39, 48, 111].

One of the most used methods is Tsai’s camera calibration method [99], which is suitable for a wide range of applications because it deals with both planar and non-planar calibration objects and makes it possible to calibrate the internal and external parameters separately. This is important because in some cases the internal parameters are known (provided by the manufacturer), so that one can fix the internal parameters of the camera and carry out iterative non-linear optimisation only on the external parameters.

The conventional calibration process can be costly in terms of time and effort, and calibration objects might not always be available. This inspires self-calibration methods which use the horizon line and vanishing points estimated from structural information such as landscapes or buildings [26, 79]. These methods are often used in computer vision tasks based on single view geometry or in video surveillance applications [31]. Lv et al. [66] approach the camera self-calibration problem using positions extracted from a single walking man via PCA analysis, to estimate the vanishing points indirectly. No rigid calibration target is needed for the aforementioned approaches; however, they are more online-oriented and not very practical for our table-top VAE applications.

In this research, we first carry out the camera calibration process using the Matlab toolbox developed by Bouguet [14]. This Matlab toolbox was developed by Jean-Yves Bouguet at the California Institute of Technology and its C implementation is also available in the Open Source Computer Vision Library [51]. The toolbox is then extended and converted to C++ to make it capable of calibrating the projector-camera system. In the off-line process using the Matlab toolbox, the projections and captures are done in a first stage and the captured images are processed on a local PC in a separate second stage. An online calibration program was then developed in C++, which takes about two minutes to calibrate the camera-projector pair in a fully automatic manner using 20 different poses of the calibration board.


3.3 Calibration Parameters

3.3.1 Intrinsic Parameters

The internal camera model is described by a set of parameters known as intrinsic parameters. These parameters represent the internal geometry of the camera.

A matrix formed by the camera intrinsic parameters is known as the camera matrix, or K matrix, which relates a 3D scene point (X, Y, Z)^T and its projection (x, y, 1)^T in the 2D image plane:

\[
w' \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{3.1}
\]

where the camera matrix K is

\[
K = \begin{pmatrix} f_{c1} & \alpha \times f_{c1} & c_1 \\ 0 & f_{c2} & c_2 \\ 0 & 0 & 1 \end{pmatrix} \tag{3.2}
\]

All related parameters that compose K are explained as follows.

fc is the focal length, represented as a 2 × 1 vector. It is in units of horizontal and vertical pixels. Both components are normally equal to each other. However, when the camera CCD array is not square, fc1 is slightly different from fc2. Therefore the camera model handles non-square pixels, and fc1/fc2 is called the aspect ratio.

cc is the principal point, represented as a 2 × 1 vector (c1, c2); it describes how the projection centre is positioned in the image. As shown in figure 3.2, a 3D point (X, Y, Z, 1)^T is projected onto the imaging plane, its projection being (x, y, 1)^T. When this is represented in UV space (the 2D image coordinates), the following relationship holds:

\[
\begin{cases} u = x + c_1 \\ v = y + c_2 \end{cases} \tag{3.3}
\]

Figure 3.2: Principal points. The bottom right subimage is the imaging plane.

Generally the principal point cc is considered to be at the centre of projection, but not precisely so, because there is always a slight decentring effect in camera design. This defect can be taken care of by accurate camera calibration.


α is the skew coefficient, a scalar which encodes the angle between the X and Y axes of the imaging plane. It equals zero when the X and Y axes are perpendicular; just as the aspect ratio fc1/fc2 handles non-square pixels, the skew coefficient α handles non-rectangular pixels.

kc is a 5 × 1 distortion vector. Although kc is not directly included in the intrinsic matrix for perspectively transforming points between different coordinate systems, it still plays a part in the camera internal geometry. The lens distortion model was first introduced by Brown in 1966 [18] and is called the “Plumb Bob” model. There are three types of lens distortion: radial, tangential and decentring distortion, with radial distortion being the most commonly known and most distinguished. The full distortion is modelled as follows.

For an image point (x, y),

\[
\begin{pmatrix} x_d \\ y_d \end{pmatrix} = \left(1 + k_{c1} r^2 + k_{c2} r^4 + k_{c5} r^6\right) \begin{pmatrix} x \\ y \end{pmatrix} + dx \tag{3.4}
\]

where

\[
r^2 = x^2 + y^2 \tag{3.5}
\]

and

\[
dx = \begin{pmatrix} 2 k_{c3} x y + k_{c4} (r^2 + 2x^2) \\ k_{c3} (r^2 + 2y^2) + 2 k_{c4} x y \end{pmatrix} \tag{3.6}
\]

The term dx is the tangential distortion. It is due to the imperfect centring of lens components and other manufacturing defects; therefore tangential distortion is also known as decentring distortion. The radial distortion is more visible, being affected by three entries of the distortion vector, kc1, kc2 and kc5. Because of the concavity of the lens, pixels further away from the image centre suffer more severe distortion, and the amount of distortion is monotonically increasing with the factor x^2 + y^2. This effect is illustrated in figure 3.3.

Figure 3.3: The distortion effects. (a), (b) Distorted images. (c), (d) Original images.
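To make the distortion model concrete, the following is a minimal C++ sketch (an illustration, not the implementation used in this work) that applies equations 3.4–3.6 to a normalised image point. The coefficient values and the test point in main() are invented for demonstration only.

```cpp
#include <array>
#include <cstdio>

// Apply the "Plumb Bob" distortion of equations 3.4-3.6 to a normalised
// image point (x, y).  kc = {kc1, kc2, kc3, kc4, kc5}.
std::array<double, 2> distort(double x, double y, const std::array<double, 5>& kc)
{
    const double r2 = x * x + y * y;                          // equation 3.5
    const double radial = 1.0 + kc[0] * r2 + kc[1] * r2 * r2
                              + kc[4] * r2 * r2 * r2;         // radial factor of 3.4
    // Tangential (decentring) term dx of equation 3.6.
    const double dx0 = 2.0 * kc[2] * x * y + kc[3] * (r2 + 2.0 * x * x);
    const double dx1 = kc[2] * (r2 + 2.0 * y * y) + 2.0 * kc[3] * x * y;
    return { radial * x + dx0, radial * y + dx1 };            // equation 3.4
}

int main()
{
    const std::array<double, 5> kc = { -0.25, 0.08, 0.001, -0.0005, 0.0 };  // invented values
    const auto d = distort(0.3, -0.2, kc);
    std::printf("distorted point: (%f, %f)\n", d[0], d[1]);
    return 0;
}
```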


3.3.2 The Reduced Camera Model

The above optical model is not always required for currently manufactured cameras. In practice, the 6th order radial + tangential distortion model is often not considered in full. A few reductions are possible.

• Nowadays most cameras on the market have pretty good optical systems, and it is hard to find lenses with imperfect centring. Therefore the tangential distortion can be discarded. The skew coefficient α is often assumed to be zero for the same reason.

• For cameras with good optical systems or standard Field of View (FOV) lenses (non wide-angle lenses), it is not necessary to push the lens distortion model to high orders. Commonly a second order radial distortion is used.

• In some instances, such as when the calibration data is not sufficient (e.g. using only two or three images for calibration), it is an option to set the principal point cc at the centre of the image, ((nx − 1)/2, (ny − 1)/2), and reject the aspect ratio fc1/fc2 (set it to 1). However, when sufficient images are used for calibration, this reduction is not necessary.

Therefore, the reduced camera model can be defined as:

\[
K = \begin{pmatrix} f_{c1} & 0 & c_1 \\ 0 & f_{c2} & c_2 \\ 0 & 0 & 1 \end{pmatrix} \tag{3.7}
\]


with distortion modelled as:

\[
\begin{pmatrix} x_d \\ y_d \end{pmatrix} = \left(1 + k_{c1} r^2\right) \begin{pmatrix} x \\ y \end{pmatrix} \tag{3.8}
\]

where r^2 = x^2 + y^2.

3.3.3 Extrinsic Parameters

Figure 3.4: Transformation from world to camera coordinate system.

Figure 3.4 is an example of how a triangle in world coordinate space is imaged. Let (Xw, Yw, Zw)^T be an object point (the blue point in the picture) whose 3D position in the camera coordinate system is (Xc, Yc, Zc)^T. Let the point (x, y, f)^T be its projection (the red point in the picture) on the imaging plane, where f is the focal length.

The rotation matrix R and the translation vector T characterise the 3D transformation of a scene point from world coordinates to camera coordinates,

\[
\begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = R \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} + T \tag{3.9}
\]

where R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector between the two system origins in 3D space.

After the scene point is transferred from world into camera coordinates, its 2D image point is given by

\[
x = f \, \frac{X_c}{Z_c}, \qquad y = f \, \frac{Y_c}{Z_c} \tag{3.10}
\]

Rotation matrix

The three main rotation parameters Rx, Ry, Rz, also known as the pan, tilt and yaw angles, are the Euler angles of the rotation from the world to the camera coordinate system around the three major axes. They are represented by a 3 × 3 rotation matrix R,

\[
R = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \tag{3.11}
\]


where

\[
\begin{aligned}
r_{11} &= \cos(R_y)\cos(R_z) && (3.12) \\
r_{12} &= \cos(R_z)\sin(R_x)\sin(R_y) - \cos(R_x)\sin(R_z) && (3.13) \\
r_{13} &= \sin(R_x)\sin(R_z) + \cos(R_x)\cos(R_z)\sin(R_y) && (3.14) \\
r_{21} &= \cos(R_y)\sin(R_z) && (3.15) \\
r_{22} &= \sin(R_x)\sin(R_y)\sin(R_z) + \cos(R_x)\cos(R_z) && (3.16) \\
r_{23} &= \cos(R_x)\sin(R_y)\sin(R_z) - \cos(R_z)\sin(R_x) && (3.17) \\
r_{31} &= -\sin(R_y) && (3.18) \\
r_{32} &= \cos(R_y)\sin(R_x) && (3.19) \\
r_{33} &= \cos(R_x)\cos(R_y) && (3.20)
\end{aligned}
\]
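Equations 3.12–3.20 correspond to composing the three elementary rotations as Rz · Ry · Rx. As a cross-check, the matrix can be built directly from the angles; the short C++ sketch below is an illustration only (the angles in main() are invented), not the calibration code used in this work.

```cpp
#include <cmath>
#include <cstdio>

// Build the 3x3 rotation matrix of equations 3.12-3.20 from the Euler
// angles Rx, Ry, Rz (pan, tilt, yaw), i.e. the composition Rz * Ry * Rx.
void eulerToRotation(double Rx, double Ry, double Rz, double R[3][3])
{
    const double cx = std::cos(Rx), sx = std::sin(Rx);
    const double cy = std::cos(Ry), sy = std::sin(Ry);
    const double cz = std::cos(Rz), sz = std::sin(Rz);

    R[0][0] = cy * cz;  R[0][1] = cz * sx * sy - cx * sz;  R[0][2] = sx * sz + cx * cz * sy;
    R[1][0] = cy * sz;  R[1][1] = sx * sy * sz + cx * cz;  R[1][2] = cx * sy * sz - cz * sx;
    R[2][0] = -sy;      R[2][1] = cy * sx;                 R[2][2] = cx * cy;
}

int main()
{
    double R[3][3];
    eulerToRotation(0.1, -0.2, 0.3, R);   // illustrative angles in radians
    for (int i = 0; i < 3; ++i)
        std::printf("%8.4f %8.4f %8.4f\n", R[i][0], R[i][1], R[i][2]);
    return 0;
}
```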

Translation vector

\[
T = \begin{pmatrix} T_x \\ T_y \\ T_z \end{pmatrix} \tag{3.21}
\]

3.3.4 Full Model

Combining the camera intrinsic and extrinsic parameters gives the full projection model, which performs the transform of a scene point (Xw, Yw, Zw)^T from the World Coordinate System (WCS) to the camera coordinate system (Xc, Yc, Zc)^T, and then to the 2D imaging space (x, y)^T, as shown in equation 3.22. By representing all the points in their homogeneous form, the above transform relationships can be formalised as

\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K \begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = K(R\,|\,T) \begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix} \tag{3.22}
\]

where K(R|T) is a 3 × 4 projection matrix.

So to calibrate the camera, it is necessary to estimate both the intrinsic and extrinsic parameters, and the distortion model. This can be done by matching a set of ground truth points from the calibration object with their correspondences in the observed image.
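To illustrate the full model of equation 3.22, the following C++ sketch projects a world point into pixel coordinates using K, R and T. It is illustrative only: the pose and intrinsic values are invented, and lens distortion (equations 3.4–3.6) is omitted for brevity.

```cpp
#include <cstdio>

// Project a world point (Xw, Yw, Zw) to a pixel (u, v) using the model of
// equations 3.9 and 3.22: camera point = R * world point + T, followed by
// perspective division and application of the K matrix of equation 3.2.
struct Intrinsics { double fc1, fc2, alpha, c1, c2; };

void projectPoint(const double R[3][3], const double T[3], const Intrinsics& K,
                  const double Pw[3], double& u, double& v)
{
    double Pc[3];
    for (int i = 0; i < 3; ++i)                      // equation 3.9
        Pc[i] = R[i][0] * Pw[0] + R[i][1] * Pw[1] + R[i][2] * Pw[2] + T[i];

    const double x = Pc[0] / Pc[2];                  // normalised coordinates
    const double y = Pc[1] / Pc[2];                  // (equation 3.10 with f folded into K)

    u = K.fc1 * x + K.alpha * K.fc1 * y + K.c1;      // apply K (equation 3.2)
    v = K.fc2 * y + K.c2;
}

int main()
{
    const double R[3][3] = { {1, 0, 0}, {0, 1, 0}, {0, 0, 1} };   // illustrative pose
    const double T[3]    = { 0.0, 0.0, 500.0 };
    const Intrinsics K   = { 800.0, 800.0, 0.0, 320.0, 240.0 };   // invented values
    const double Pw[3]   = { 100.0, 50.0, 0.0 };

    double u, v;
    projectPoint(R, T, K, Pw, u, v);
    std::printf("pixel: (%f, %f)\n", u, v);
    return 0;
}
```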

3.4 Calibrate Camera-Projector Pair

3.4.1 World Coordinate System

The camera extrinsic parameters are not inherent parameters of the camera. The rotation and translation only represent the current camera pose in reference to the world coordinate system chosen by the user. Without a world coordinate system, or a reference coordinate system, the extrinsic parameters are meaningless. Therefore, a world coordinate system needs to be chosen first as a reference to describe the relative camera position. In our system the whiteboard is chosen to be the world reference frame.

More specifically, by laying a checkerboard flat on the table plane, the checkerboard plane is chosen as the XOY plane of the world coordinate system, with its bottom and leftmost edges taken as the X and Y axes. The surface normal vector pointing from the bottom-left corner of the checkerboard is chosen as the Z axis. Thus the origin of the WCS is arbitrary in X and Y, depending on how and where the checkerboard was laid.

3.4.2 Methodology

Before we can calibrate the camera and projector pair, a set of calibration images is needed. In this research we use 20 images for camera calibration and 20 images for projector calibration, each pair being captured from a different angle of the whiteboard.

The main methodology is to take an image of a known 3D pattern as ground truth. Then, in the captured image, one selects a set of points of that pattern as interest points, and uses the 2D coordinate information of those interest points along with their 3D matching points as correspondences to calibrate the camera. Normally this process is iterated by orienting the calibration pattern at different angles to increase accuracy.

The projector is calibrated in a similar way. A pre-designed pattern with ground truth information is projected onto a surface (which is regarded as lying in world coordinate space), and the projection is monitored by the calibrated camera. Since at this point the camera is already calibrated, with the captured image and the full camera model we can recover the 3D information of the projected pattern. This 3D information, together with prior knowledge of the pre-designed 2D pattern, forms a correspondence, and hence the projector can be calibrated from these two sets of points in a “reversed camera” way.

Figure 3.5 shows the flow chart of the whole calibration process. The diagram shows the whole process after the data collection stage is done, during which the black patterns are projected onto the cyan checkerboard and images are taken at the same time.

3.4.3 Data Collection

We use a printed checkerboard as the camera calibration target, and we let the projector project another checkerboard as the projector calibration target. As mentioned in section 3.3, the camera calibration results – particularly the camera extrinsic parameters – are needed to perform the transformation of the observed projected pattern from camera coordinate space to world coordinate space. Therefore, when the printed pattern is being captured, we have to make sure a projected pattern is captured as well, with the base plane staying in exactly the same pose, to maintain accuracy.

Figure 3.5: Flow chart of the camera-projector pair calibration. (diagram of the image processing after the projections and captures are done)

However, this is not easy if the user has to slide the printed checkerboard in and out every time the checkerboard changes orientation, and manually it is very hard to hold the base plane firmly stationary while performing these activities. It might require one tester to hold the board still while another handles the sheet. For this reason, a mechanism that allows us to take a picture of two superimposed checkerboards and extract one from the other is desired, to prevent any slight movement of the base plane. This is possible by choosing appropriate colours for the checkerboards.

We use a cyan-white checkerboard for the printed pattern, and a blue-black checkerboard for the projected pattern. Cyan and white have very similar blue components under white ambient light. Therefore, in a captured image with both checkerboards present, by inspecting the blue channel the cyan checkerboard is barely seen and the blue checkerboard can be extracted.

On the other hand, blue and black grids have near-zero red components. This means that by superimposing a blue-black checkerboard onto a cyan-white one, no components are added in the red channel. This property allows us to extract the cyan-white checkerboard out of the superimposed version easily. In figure 3.6, the top image shows the captured image of the superimposed checkerboards. The bottom two images are the extracted checkerboards.
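A minimal sketch of this channel separation is given below. It assumes an interleaved 8-bit RGB buffer and a fixed threshold of 128, both simplifications chosen for illustration rather than details of the actual filtering used in the system.

```cpp
#include <vector>
#include <cstdint>

// Separate the two superimposed checkerboards from an RGB image buffer
// (8 bits per channel, interleaved R,G,B).  The blue channel isolates the
// projected blue-black pattern (cyan and white print contribute similarly
// in blue), while the red channel isolates the printed cyan-white pattern
// (blue and black projection add almost nothing in red).
void separatePatterns(const std::vector<std::uint8_t>& rgb, int width, int height,
                      std::vector<std::uint8_t>& projectedMask,
                      std::vector<std::uint8_t>& printedMask)
{
    const std::size_t n = static_cast<std::size_t>(width) * height;
    projectedMask.assign(n, 0);
    printedMask.assign(n, 0);

    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t r = rgb[3 * i + 0];
        const std::uint8_t b = rgb[3 * i + 2];
        projectedMask[i] = (b > 128) ? 255 : 0;  // strong blue => projected blue square
        printedMask[i]   = (r > 128) ? 255 : 0;  // strong red  => white square of the print
    }
}

int main()
{
    // Two illustrative pixels: cyan print under blue projection, and plain white print.
    const std::vector<std::uint8_t> rgb = { 30, 200, 220,   230, 230, 210 };
    std::vector<std::uint8_t> projected, printed;
    separatePatterns(rgb, 2, 1, projected, printed);
    return 0;
}
```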


3.4.4 Choice of colour

Getting the printed pattern from the mixed image is simple, because when it is captured the projected pattern is switched off. More effort is required to extract the projected pattern from the mixed pattern, and the key is to find the difference between the blue projected area and the black projected area under the interference of the pattern printed on the whiteboard.

Zhang [109] chooses red and blue for the printed and projected patterns respectively, because of their distinctively different RGB values. In practice, other factors such as the surface reflection and the room lighting conditions need to be considered. After evaluating colour combinations we choose cyan as the colour for the printed pattern instead of red, and figure 3.6 shows its performance.

Figure 3.7 gives a closer look at the mixed area. In figure 3.7(a), areas A and C are the non-projected areas (the projection is zero), but A appears yellowish because the surface absorbs part of the ambient light, and C appears darker as it is affected by the blue grid on the printed sheet. D and B are the blue projection areas, but B is affected by the printed pattern in the same way. The task is to differentiate areas A and C from D and B by exploiting their colour channels. The immediate finding is that the printed cyan colour has very little effect on the blue channel of the captured image – A and C have very little blue component, and B and D have heavy blue channels, despite B and C being the areas where the surface is printed as cyan. The extraction result is shown in figure 3.7(b). The same cannot be applied to the red-blue method (figure 3.7(c),(d)), where the printed red area appears fully red in the observed image regardless of whether it is mixed with the blue projection or not.

Figure 3.6: Extraction of the projected pattern from the mixed one. (a) Blue and cyan mixed pattern. (b) Extracted blue pattern. (c) Blue and red mixed pattern. (d) Extracted blue pattern.

This method was also tested under different ambient illuminations. In general, experiments conducted when sufficient daylight is available outperform those conducted during the night, and this is mostly reflected in the failure to extract all the corners successfully because of less satisfactory results from the cyan and blue colour filtering. This is because during the night the room lighting needs to be turned on to illuminate the physical checkerboard while the projection is off, and this contributes negatively to the colour filtering at a later stage, as the fluorescent lamps disturb the colour channels more than sunlight does. When the points are extracted automatically, any captured image without enough corner points will be rejected (e.g. precisely 81 inner corner points are expected from a 10 × 10 checkerboard). Disqualifying more images leads to degradation of the accuracy of the calibration.

Figure 3.7: Extraction of the projected pattern from the mixed one (a closer look). (a) Blue and cyan mixed pattern. (b) Extracted blue pattern. (c) Blue and red mixed pattern. (d) Extracted blue pattern.


3.4.5 Camera Calibration

An automated process is implemented. All the user needs to do is to hold the whiteboard, which has a physical checkerboard pattern attached, at one pose for a short period (around 2 seconds) so that the camera can take two pictures with the projection turned on and off, and then re-position the board into a different orientation, as long as the whole printed checkerboard pattern stays within the common FOV of the camera and the projector.

1. After the image capture stage, the colour-filtered images as shown in the bottom left image of figure 3.6, captured from ten different orientations of the whiteboard, are used as the camera calibration images.

2. For each image, the user manually clicks the four top corners of the checkerboard. The user is also prompted to input the physical grid size of the checkerboard to set up the units of the world coordinate system. The grid numbers and the inner cross points are located automatically after the four top corners are given.

3. Normally the lens distortion can be tolerated at this stage, as the distortion model will be estimated later using the camera intrinsic parameters. In case of severe lens distortion, the user is advised to give an initial guess for the first order distortion factor kc1. The system will then take the guess and locate the corners more precisely, as shown in figure 3.8.

4. After corner points are extracted for all input images, the user can deploy the camera calibration. By defining the checkerboard plane as the world coordinate XOY plane and the first point the user clicked (the bottom-left corner) as the world coordinate origin, the 3D points of all corners are known (a sketch of generating such ground-truth points follows figure 3.8). The calibration parameters are first initialised, and then optimised by redoing the calibration using the improved reprojected corners based on the estimated camera parameters.

Figure 3.8: Extraction of the projected pattern from the mixed one (a closer look). (a) Blue and cyan mixed pattern. (b) Extracted blue pattern.
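As referenced in step 4 above, once the physical grid size is known the ground-truth world coordinates of the inner corners follow immediately, with Z = 0 on the board plane. The C++ sketch below is an illustration of generating such points; the corner counts and the 25 mm grid spacing are assumed values, not those of the actual board.

```cpp
#include <vector>
#include <cstdio>

struct Point3 { double X, Y, Z; };

// Generate the ground-truth world coordinates of the inner checkerboard
// corners.  The board plane is the world XOY plane, so Z = 0 for every
// corner; squareSize is the physical grid size entered by the user.
std::vector<Point3> checkerboardCorners(int cornersX, int cornersY, double squareSize)
{
    std::vector<Point3> pts;
    pts.reserve(static_cast<std::size_t>(cornersX) * cornersY);
    for (int j = 0; j < cornersY; ++j)
        for (int i = 0; i < cornersX; ++i)
            pts.push_back({ i * squareSize, j * squareSize, 0.0 });
    return pts;
}

int main()
{
    // 9 x 9 inner corners (a 10 x 10 checkerboard) with an assumed 25 mm grid.
    const auto pts = checkerboardCorners(9, 9, 25.0);
    std::printf("%zu corner points, first = (%.1f, %.1f, %.1f)\n",
                pts.size(), pts[0].X, pts[0].Y, pts[0].Z);
    return 0;
}
```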

3.4.6 Projector Calibration

By the time the projector is calibrated, calibration of the camera is already done. Therefore, the calibration images used for projector calibration (in our case, 10 blue checkerboard images) first go through an ‘Undistort’ stage before being used as input images for corner extraction. The two-dimensional distortion vector is used to remove distortion from the images.

The first few steps of projector calibration are the same as for the camera: read images, extract corners.

The extracted corners here cannot be used directly for calibration. They are the corner points in the captured image of the projected checkerboard. The information we need is the 3D coordinates of the corners of the projected pattern. Now the camera model can be used to perform these transformations.

In theory it is impossible to recover a 3D scene point merely from its 2D projection in the image plane, because given the projection in the image, its original 3D point could be anywhere along the projection ray if the scene structure is unknown. However, in our case all the points we are trying to recover lie on the checkerboard plane, which is chosen as the XOY plane of the WCS; that means Z = 0 for all of them. This relationship holds for all the different poses of the checkerboard, as the instantaneous plane in which the printed checkerboard lies is assumed to be the XOY plane of the WCS.

Technically, there is a different WCS for each tilt of the plane. This does not affect the final calibration result, because for N tilts there are N sets of different rotation and translation vectors. Geometrically, each of them only represents the relative geometry with respect to its temporary WCS, and only one set of rotation and translation vectors is used to estimate the final extrinsic parameters – the one from the view where the whiteboard is laid flat on the tabletop, as that is the configuration on which the VAE runs.

Let (x, y) be the image point whose 3D coordinate in the world coordinate system we are trying to recover, given the camera calibration parameters and the constraint Z = 0.

$$ \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K(R|T) \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} \qquad (3.23) $$

Here ≈ means equal up to a scale, so we replace it with a non-zero factor w:

$$ w \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = K(R|T) \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} \qquad (3.24) $$

Replacing K(R|T) with the 3 × 4 projection matrix P,

$$ P = K(R|T) = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix} \qquad (3.25) $$

From Equ. 3.24 and 3.25, we have

$$ w \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} \qquad (3.26) $$


Cancelling out the scale factor w by dividing the first and second rows by the third row of Equ. 3.26,

$$ x = \frac{p_{11}X + p_{12}Y + p_{14}}{p_{31}X + p_{32}Y + p_{34}} \qquad (3.27) $$

$$ y = \frac{p_{21}X + p_{22}Y + p_{24}}{p_{31}X + p_{32}Y + p_{34}} \qquad (3.28) $$

From Equ. 3.27 and 3.28, X and Y in Equ. 3.23 can be solved:

$$ X = \frac{(xp_{34} - p_{14})(p_{22} - yp_{32}) - (yp_{34} - p_{24})(p_{12} - xp_{32})}{(p_{11} - xp_{31})(p_{22} - yp_{32}) - (p_{21} - yp_{31})(p_{12} - xp_{32})} \qquad (3.29) $$

$$ Y = \frac{(xp_{34} - p_{14})(p_{21} - yp_{31}) - (yp_{34} - p_{24})(p_{11} - xp_{31})}{(p_{12} - xp_{32})(p_{21} - yp_{31}) - (p_{22} - yp_{32})(p_{11} - xp_{31})} \qquad (3.30) $$

A program was written by the author to implement all the calculations above. Given an extracted point (x, y) from a corner point in the observed blue pattern, with the camera already calibrated, its position (X, Y) in the world coordinate space is located from Equ. 3.29 and 3.30.
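The author's program is not listed in this thesis; the following is a minimal NumPy sketch of the same calculation, implementing Equ. 3.29 and 3.30 directly. The projection matrix in the example is hypothetical and serves only to exercise the function.

```python
import numpy as np

def backproject_to_plane(x, y, P):
    """Recover the (X, Y) world coordinates of an image point (x, y),
    assuming the point lies on the Z = 0 plane (Equ. 3.29 and 3.30)."""
    p11, p12, p13, p14 = P[0]
    p21, p22, p23, p24 = P[1]
    p31, p32, p33, p34 = P[2]
    X = ((x*p34 - p14)*(p22 - y*p32) - (y*p34 - p24)*(p12 - x*p32)) / \
        ((p11 - x*p31)*(p22 - y*p32) - (p21 - y*p31)*(p12 - x*p32))
    Y = ((x*p34 - p14)*(p21 - y*p31) - (y*p34 - p24)*(p11 - x*p31)) / \
        ((p12 - x*p32)*(p21 - y*p31) - (p22 - y*p32)*(p11 - x*p31))
    return X, Y

# Hypothetical projection matrix P = K(R|T), for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
RT = np.hstack([np.eye(3), np.array([[10.0], [20.0], [500.0]])])
P = K @ RT
print(backproject_to_plane(352.0, 272.0, P))   # prints (10.0, 0.0)
```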

Since the projection pattern (the blue checkerboard) is pre-designed, its corner points are all known. Together with the calculated 3D corners of the projected pattern, the projector can be calibrated in the same way as the camera. The estimated distortion vector kc for the projector is very close to zero, therefore the projector is assumed to have zero distortion.
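To make the data flow concrete, the sketch below shows how the projector could then be calibrated as a reverse camera: the recovered world corners (X, Y, 0) of each view act as object points, and the known corner positions in the projector's own pattern image act as image points. Using OpenCV's calibrateCamera here is an assumption made for illustration, not the thesis implementation, and all names are hypothetical.

```python
import cv2
import numpy as np

def calibrate_projector(world_corners_per_view, pattern_corners, projector_size):
    """Treat the projector as a reverse camera: the recovered 3D corners of
    the projected checkerboard (Z = 0 in each view's WCS) are the object
    points, and the known corner coordinates in the projector's own pattern
    image are the observed image points (identical for every view)."""
    object_points = [np.asarray(w, np.float32) for w in world_corners_per_view]
    image_points = [np.asarray(pattern_corners, np.float32)] * len(object_points)
    rms, Kp, kc_p, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, projector_size, None, None)
    return Kp, kc_p, rvecs, tvecs, rms
```

Note that calibrateCamera accepts a different set of object points for every view, which matches the per-tilt world coordinate systems discussed in section 3.4.6.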



3.5 Plane to Plane Calibration

The whole user interface of our collaborative system for 3D input is based on a plane (i.e. the tabletop). Therefore a precise estimation of the projective transform between the projector and the camera for this plane is desired, because we need constant, real-time monitoring of the augmented signals in the captured frames and responses to them. Although the calibration data previously worked out could be used, a more straightforward and accurate matching is preferred.

A homography matrix is modelled to represent this matching. A homography is a 3 × 3 non-singular matrix which defines a homogeneous linear transformation from one plane to another. Although there is no direct projective transform between the projector plane and the camera imaging plane in general, a homography still exists between these two planes because it is induced by a reference plane, which is the whiteboard in our case. Estimating the homography can be regarded as a 2D calibration process between the projector plane and the camera plane. A homography has 9 entries but only 8 degrees of freedom, being constrained by ||H|| = 1 so that it only carries out an up-to-scale matching.

Let the model plane (i.e. the whiteboard) coincide with the XOY plane of the world coordinate system. A 3D point on the model plane is then Pw = (Xw, Yw, 0, 1)^T, with its observed point in the camera plane Pc = (xc, yc, 1)^T and its projection source point in the projector plane Pp = (xp, yp, 1)^T. Similar to Equ. 3.23 and 3.24, we have

$$ \begin{pmatrix} x_c \\ y_c \\ 1 \end{pmatrix} \approx K_c(R_c|T_c) \begin{pmatrix} X_w \\ Y_w \\ 0 \\ 1 \end{pmatrix} = K_c \begin{pmatrix} r_{c1} & r_{c2} & t_c \end{pmatrix} \begin{pmatrix} X_w \\ Y_w \\ 1 \end{pmatrix} \qquad (3.31) $$

where r_{ci} denotes the i-th column of the camera rotation matrix R_c and t_c denotes the translation vector T_c.

The homography H_{wc} from the world plane to the camera plane can be expressed as

$$ H_{wc} \approx K_c \begin{pmatrix} r_{c1} & r_{c2} & t_c \end{pmatrix} \qquad (3.32) $$

Likewise, the homography H_{wp} from the world plane to the projector plane is

$$ H_{wp} \approx K_p \begin{pmatrix} r_{p1} & r_{p2} & t_p \end{pmatrix} \qquad (3.33) $$

Substitution of Equ. 3.32 and 3.33 into 3.31 yields

$$ P_c \approx H_{wc} P_w \qquad (3.34) $$

$$ P_p \approx H_{wp} P_w \qquad (3.35) $$

From Equ. 3.34 and 3.35, it is not hard to see that the two points Pc and Pp are still related by a projective transform, albeit one induced by a third plane:

$$ P_c \approx H_{pc} P_p \qquad (3.36) $$

where the homography from the projector plane to the camera plane is

$$ H_{pc} = H_{wc} H_{wp}^{-1} \qquad (3.37) $$


However, it can also be seen from Equ. 3.37 that this homography H_{pc} holds the current camera-projector relationship if and only if the reference plane is not changed. This is known as the plane-to-plane homography induced by a third plane. During our calibration, tilting the whiteboard 20 times yields 20 different homographies between the camera space and the projector space. Similar to the discussion in section 3.4.6, only the homography induced by the flat-placed whiteboard is of interest, because once the VAE is up and running the whiteboard is fixed onto the tabletop.

To solve the homography, all participating frames first go through the distortion removal stage using the calibrated camera internal model and distortion parameters. Keeping the same notation as Equ. 3.36, and introducing the scale factor w, Equ. 3.36 can be rewritten as

$$ \begin{pmatrix} w x_c \\ w y_c \\ w \end{pmatrix} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{pmatrix} \begin{pmatrix} x_p \\ y_p \\ 1 \end{pmatrix} \qquad (3.38) $$

Using the same method as in section 3.4.6 (Equ. 3.27 and 3.28) to cancel out w,

$$ x_c = \frac{h_1 x_p + h_2 y_p + h_3}{h_7 x_p + h_8 y_p + h_9} \qquad (3.39) $$

$$ y_c = \frac{h_4 x_p + h_5 y_p + h_6}{h_7 x_p + h_8 y_p + h_9} \qquad (3.40) $$

Each point gives two equations; thus, to solve H, which has 8 degrees of freedom (DOF), a minimum of 4 points is needed. With N ≥ 4 points,

$$ \begin{pmatrix}
x_{p1} & y_{p1} & 1 & 0 & 0 & 0 & -x_{p1}x_{c1} & -y_{p1}x_{c1} & -x_{c1} \\
0 & 0 & 0 & x_{p1} & y_{p1} & 1 & -x_{p1}y_{c1} & -y_{p1}y_{c1} & -y_{c1} \\
x_{p2} & y_{p2} & 1 & 0 & 0 & 0 & -x_{p2}x_{c2} & -y_{p2}x_{c2} & -x_{c2} \\
0 & 0 & 0 & x_{p2} & y_{p2} & 1 & -x_{p2}y_{c2} & -y_{p2}y_{c2} & -y_{c2} \\
\vdots & & & & & & & & \vdots \\
x_{pn} & y_{pn} & 1 & 0 & 0 & 0 & -x_{pn}x_{cn} & -y_{pn}x_{cn} & -x_{cn} \\
0 & 0 & 0 & x_{pn} & y_{pn} & 1 & -x_{pn}y_{cn} & -y_{pn}y_{cn} & -y_{cn}
\end{pmatrix}
\begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_9 \end{pmatrix} = 0 \qquad (3.41) $$

Let the 2N × 9 matrix in Equ. 3.41 be A. This becomes a typical problem of finding the least-squares solution of an over-determined system, minimising ||AH|| subject to ||H|| = 1. H could be solved by expanding the measurement matrix A into a square matrix and inverting it; we used an alternative solution, which obtains H by finding the eigenvector corresponding to the least eigenvalue of A^T A [4].
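As an illustration, the following NumPy sketch assembles the matrix of Equ. 3.41 and recovers H as the eigenvector of A^T A with the least eigenvalue; the helper names are illustrative and are not taken from the thesis code.

```python
import numpy as np

def estimate_homography(proj_pts, cam_pts):
    """Estimate H such that cam ~ H * proj (Equ. 3.36) from N >= 4
    correspondences, by stacking Equ. 3.41 and taking the eigenvector of
    A^T A with the smallest eigenvalue."""
    rows = []
    for (xp, yp), (xc, yc) in zip(proj_pts, cam_pts):
        rows.append([xp, yp, 1, 0, 0, 0, -xp * xc, -yp * xc, -xc])
        rows.append([0, 0, 0, xp, yp, 1, -xp * yc, -yp * yc, -yc])
    A = np.asarray(rows, float)
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    H = eigvecs[:, 0].reshape(3, 3)              # eigenvector of the least eigenvalue
    return H / np.linalg.norm(H)                 # enforce ||H|| = 1

def apply_homography(H, x, y):
    """Map a point through H, dividing by the scale factor w (Equ. 3.39, 3.40)."""
    w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
    return ((H[0, 0] * x + H[0, 1] * y + H[0, 2]) / w,
            (H[1, 0] * x + H[1, 1] * y + H[1, 2]) / w)
```

The returned matrix maps projector points to camera points as in Equ. 3.36; numpy.linalg.inv of it gives the reverse mapping used for the two-way transform described below.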

The solution to equation 3.41 is the homography between the camera and projector planes. It holds the transform in equation 3.38 from a point (xp, yp, 1)^T in the projection image to its observation (xc, yc, 1)^T in the camera image. The transform the other way round, from (xc, yc, 1)^T to (xp, yp, 1)^T, is given by the inverse of this homography. In this way, a two-way transform between the projection source and the camera observation of any augmentation in the VAE is available at any time.


3.6 Conclusions

This chapter began with an introduction to the fundamentals of the conventional camera calibration technique, followed by a detailed implementation of the camera calibration process using the Matlab toolbox designed by previous researchers. The method was then extended to calibrate the projector as a reverse camera. Finally, a fully automated method was implemented to calibrate the projector-camera system, and it is used by the VAE framework in this research.

The proposed method provides a means of estimating the internal and external parameters of the camera and the projector in an automated way. It is fast, efficient, and requires little intervention in the scene from the tester. A colour filtering technique is also proposed so that the physical printed pattern and the projected pattern can be extracted from the mixed observation while both remain firmly attached to the same surface plane. This effectively relieves the user of the duty of manually manipulating the calibration objects, such as sliding the physical pattern in and out to avoid its superimposition with the projected pattern.

A method of plane-to-plane calibration is presented in section 3.5. The result of this calibration is used once the VAE is up and running, to sustain the spatial relationship between the virtual augmentations and their observation in the camera image. This ensures a quick and reliable mapping for the VAE to monitor changes in the interactive environment, and to respond to them by augmenting the scene with corresponding video signals.


Although a comprehensive test has not been carried out, the proposed methods have been used reliably by the VAE system designed in this research over the past two years. Results from section 6.2.4 in chapter 6 suggest that accurate button locating is achieved, estimated only from the calibration results of the projector-camera pair, without doing any local image processing in the observed image to detect the button positions. The results were therefore positive and warrant further research into the use of this method.

3.6.1 Future Work

This chapter is concerned with the calibration process, which estimates the intrinsic and extrinsic parameters of the projector-camera pair and provides an accurate registration between the camera image space and the projector rendering space, but only at the geometric level.

To deal with the lighting situation, photometric camera settings such as brightness, contrast, exposure, and white balance are manually tuned and evaluated before the calibration. The photometric parameters of the projector are also pre-set. For example, to project a blue-black checkerboard pattern, the blue channel of the rendered image is set to full illumination (i.e. 255). One might wonder: is 255 the optimal value for the brightness in all scenarios?


A similar problem is also encountered in chapter 4, where a plain white image is projected onto the interface to illuminate the object being measured, so that the camera can take an image to serve as the colour map. In daytime, when sufficient ambient light is available, the image can be taken without any illumination from the projector. However, in the evening with the lights off, projector illumination is essential while capturing an image of the object, because it is the only light source. Furthermore, ambient light that is too strong also affects the projection, because it can over-illuminate the scene and weaken the projection signals. Therefore, choosing a universal brightness level of projector illumination for all the aforementioned scenarios can be problematic.


Figure 3.9: Pixel values of an image captured from a plain desktop. (a) projector brightness = 0; (b) projector brightness = 128; (c) red pixel values of (a); (d) red pixel values of (b). (The bottom two show the red channel only.)

Figure 3.9 shows an example of different projector illuminations. The top two images were captured with the projection brightness at 0 and 128 respectively. The bottom two are the corresponding distributions of pixel values across the planar surface (only the red channels are shown; the green and blue channels have similar distributions). The average pixel value in (d) is higher than in (c), as expected. In both images a slope is noticed, because the top of the desktop is closer to the window and hence receives more ambient light. When the projection brightness is set to 128 in (b), a reflection is caused, which shows up as a spike in the bottom-centre part of (d).

In this research, the photometric settings of the camera and the projector are both manually tuned until the camera can see the projections reasonably well. Future development of the calibration framework could include automatic photometric calibration which adjusts the camera and the projector lighting. Having a projector-camera pair is a big advantage for photometric calibration, because it makes it feasible to self-adjust the projector brightness by analysing the observed image, and to self-adjust the camera by evaluating the image quality captured under different projector illuminations.

Previous researchers at York [74] have proposed a means of photometric calibration, as a preliminary framework on which future research can build.


Chapter 4

Shape Acquisition

4.1 Introduction

Shape acquisition is one of the key topics in computer vision. The human visual ability to perceive depth using binocular stereopsis has been modelled with two displaced cameras to obtain range information of the scene, as described earlier in chapter 2. The principle of this computer vision task is to establish correspondences, in other words the matching points, between two or more images. In this thesis structured light is utilised as an active method to obtain range information with the help of a camera-projector pair. In VAE applications, it is always required that the structure information is extracted quickly and efficiently, so that collaborative work between the user, the PC and the video sensors is feasible. This can be fulfilled by structured light because of its flexibility, rapidity, and efficiency.

This chapter aims to provide an overview of structured light solutions, and then explain one particular method that is used in the later parts of this thesis. New contributions have been made to tackle issues arising in practice, such as the aliasing effect caused by limited camera resolution, and dealing with challenging surface materials on some of the objects. The chapter begins by considering different scenarios for the investigated method; a specification is then defined with the most practical subset of parameters given the hardware currently available in the lab.

It is acknowledged that a full 3D description is not achieved by a single structured light projection, nor with a single camera, which can only see part of the object. By changing the pose or position of the target object it is possible to build the 3D model (see chapter 5), but each structured light projection only gives depth information, which is often referred to as a 2.5D model. However, this aspect is beyond the scope of this chapter and will be introduced in later chapters.

The rest of the chapter is organised as follows. A review of the existing methods and recent research on structured light systems is presented in section 4.2. Section 4.3 introduces the codification scheme chosen for our application and the generation of the projection image stack with the associated look-up table. This is followed by section 4.3.3, where we discuss how the correspondence is established. Practical issues in the real world and hardware limitations are considered in section 4.4, where experimental results are also presented to validate the solutions proposed to tackle the problems. Section 4.5 explains depth calculation via triangulation. We then draw conclusions in section 4.6.

4.2 Background

Structured light projection systems use a projector which can project a light pattern such as dots, lines, grids or stripes onto the object surface, and a camera which captures the illuminated scene. By projecting one or a set of image patterns, it is possible to uniquely label each pixel in the image observed by the camera. Unlike stereo vision methods, which rely on the accuracy of matching algorithms, structured light establishes the geometric relationship automatically, by directly mapping the codewords assigned to each pixel to their corresponding coordinates in the source pattern. Comprehensive literature reviews and taxonomies of structured light systems can be found in [81, 45, 6, 15, 84].

The simplest way to label each pixel is to project a 2D grey ramp and a solid white pattern onto the measured surface, as tried by Carrihill et al. and Chazan et al. [21, 23]. By taking the ratio of the two observed images, the brightness at each pixel determines the pixel's corresponding coordinate in the original grey ramp image. However, this method is too sensitive to noise: slight variations in surface reflection and lighting cause brightness mismeasurement, which results in substantial triangulation errors. Therefore, more sophisticated codification schemes need to be considered.

One of the most commonly used strategies is temporal coding, where a set of images is successively projected onto the surface to be measured. In 1982, Posdamer and Altschuler [76] were the first to propose a projection of n images to encode 2^n stripes with a plain binary code. The resultant codewords are n-bit binary codes formed of 0s and 1s, with the more significant bits associated with earlier pattern images and the less significant bits with later ones. The symbol 0 corresponds to black intensity for a pixel in the observed image and 1 corresponds to full illumination. In this way the number of stripes doubles from each pattern image to the next.

Sato et al. [84] used Gray codes instead of plain binary. The Gray code has the advantage that successive codewords have unit Hamming distance, which makes the codification more robust. Trobina [97] presented a binary threshold model to improve the scheme: a Gray code is used, but the binary threshold between black and white in the observed image is determined for every pixel independently. This is achieved by taking a pair of full white and full black images at the beginning; the per-pixel threshold is the mean of the grey levels observed in the full white and full black images. More recently, Rocchini [81] proposed a method to address the problem of localising the stripe transitions in Gray code images. They encode the stripes with blue and red instead of black and white, with a green slit of pixels between every two stripes to help find the zero-crossings of the transitions at stripe boundaries.
aries.<br />

The aforementioned schemes often employ binary codes and use a coarse-to-fine paradigm. This eases the segmentation of the image patterns, and the codewords can normally be generated by thresholding the observed image stack. However, a number of patterns need to be projected, and problems are caused by the top-level patterns with very narrow stripes – too narrow for the camera to perceive.

Combining Gray code methods with phase shift methods answers this problem [9, 83, 105, 45, 98]. This is achieved by reducing the range resolution of the source patterns (i.e. using fewer levels of Gray code patterns to avoid narrow stripes) and compensating by exploiting spatial neighbourhood information, which is done by periodically shifting the pattern in every projection to distinguish the codewords of pixels falling into the same stripe. The limitation of these methods is that, by using shifted versions of the patterns, more images need to be projected and the total projection time increases considerably.

In the direction of using fewer images, to make it feasible to measure moving scenes, Boyer and Kak [16] employ colour patterns to encode more information into the codewords. They propose a colour stripe pattern where each group of consecutive stripes has a unique colour intensity configuration. Caspi et al. [22] use a colour generalisation of Gray codes. Davies and Nixon [33] use a colour dot pattern, but with a spatial window configuration similar to Boyer and Kak's [16]. Chen et al. [24] and Zhang et al. [109, 110] propose a stereo vision based method that only requires one image; the underlying idea of their methods is to use more than one camera and to solve the correspondences between stripe edges through dynamic programming.

These colour based methods have the capability of measuring quasi-stationary or moving scenes, since fewer images are used; however, there are restraints as well. Some of them use more than one camera, which requires extra work to calibrate the camera pair with the projector. Others require the measured surface to have uniform reflectance over all three RGB channels in order to extract the colour information accurately, so they are more suitable for certain applications such as monitoring hand gestures.

The Gray coded structured light codification scheme is adopted in this thesis because of its simplicity and robustness. Colour or phase based methods have their own strengths; however, we aim to develop a VAE system which can be deployed in various environments such as offices, museums, libraries, or other open environments. The system considered here is not designed only for laboratory use, where the projector-camera system is normally set up close to the interactive surface. We consider a top-down setup in which the vision sensor is relatively far away (high up) from the projection surface, and low-end cameras such as ordinary web cameras have difficulty picking up colour details at such a distance. In this context, with a few adaptations made to enhance the performance of the Gray coded structured light method, it yields reasonable results.

4.3 Gray Codification

4.3.1 Gray Code Patterns

Images with Gray coded stripes are used in this work. All images are stacked sequentially in the time domain. In figure 4.1 one slice from each image level is taken out and aligned spatially from the bottom up, purely to illustrate the codeword changes between adjacent image levels.

Figure 4.1: A 9-level Gray-coded image. (Only a slice from each image is shown here, to illustrate the change between adjacent codewords.)

Some of the advantages have already been mentioned in section 4.2, and here are a few other reasons for using this scheme. First, compared to dot and line patterns, stripe patterns offer high-resolution range information by labelling a dense and even distribution of 3D points over the scene. Second, the black and white coded pattern is more resilient to variation in surface reflectance than colour based methods, and with proper adaptations it handles objects with challenging materials (as will be discussed later in section 4.4). Finally, Gray-coded images have advantages over plain binary coded images, being less sensitive to errors and using wider stripes at higher levels (see figure 4.2). This is a desirable property, as it causes less interference between neighbouring stripes.

Figure 4.2: Comparison: minimum number of levels of Gray-coded and binary-coded images needed to encode 16 columns. (a) 4-bit plain binary code, top level stripes are 1 pixel wide. (b) 4-bit Gray code, top level stripes are two pixels wide.

4.3.2 Pattern Generation

The pattern generation stage is off-line and serves two purposes: to generate a Gray coded image stack and then to create a look-up table for later codification use. This is only carried out once, and both are held locally.

The stack of Gray-coded images is prepared in a temporal paradigm. All images are coded only with a one-dimensional Gray code, as the point-line correspondence is sufficient to solve for depth information; the reason for this will be explained later in section 4.5. Because of the binary nature of the Gray code, the pattern generation is straightforward. It can be considered as recreating a square wave, doubling the frequency and halving the wavelength at each image level along the time axis. For a data projector projecting images with a resolution of 1024 × 768, a 10-level Gray code is needed to make sure that:

1. all neighbouring rows or columns have different code words,

2. all rows or columns have unique code words.

Consider a 10-level horizontally Gray-coded image stack. During the look-up table generation, instead of assigning a 10-bit code value to each row number, all possible decimal code values are listed and then attached to the row numbers. By doing this, during the table look-up stage later on, each incoming pixel with a 10-bit code word can be matched to its corresponding row number more quickly by using the code's decimal value. In horizontal coding (row-wise coding) for a 1024 × 768 image, some code words do not exist after the whole image stack is coded, and these are attached with -1. A section of the look-up table for a 10-level Gray code will look like table 4.1 (a sketch of this generation step is given after the table). In vertical coding (column-wise), all 1024 columns are assigned a valid positive decimal code value.


Row    Decimal (Binary)
0      767 (1011111111)
1      766 (1011111110)
2      764 (1011111100)
...    ...
510    427 (0110101011)
511    426 (0110101010)
512    -1
513    -1
...    ...
1022   84 (0001010100)
1023   85 (0001010101)

Table 4.1: 10 level Gray code look-up table.
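A minimal sketch of this off-line stage is shown below: it builds a 10-level horizontally Gray-coded stripe stack for a 1024 × 768 projector together with a decimal-code-to-row look-up table. The bit assignment used here is the standard reflected binary Gray code of the row index, which is illustrative only and does not reproduce the exact codeword values listed in table 4.1.

```python
import numpy as np

LEVELS, WIDTH, HEIGHT = 10, 1024, 768

# Gray code of each row index: g = r XOR (r >> 1).
rows = np.arange(HEIGHT)
gray = rows ^ (rows >> 1)

# One stripe image per level: bit k of the Gray code decides black or white,
# with the more significant bits assigned to the earlier images.
patterns = []
for level in range(LEVELS):
    bit = (gray >> (LEVELS - 1 - level)) & 1
    stripe = np.where(bit[:, None] == 1, 255, 0).astype(np.uint8)   # (768, 1)
    patterns.append(np.repeat(stripe, WIDTH, axis=1))               # (768, 1024)

# Look-up table: decimal Gray code value -> row number,
# -1 where no projector row carries that codeword.
lookup = np.full(2 ** LEVELS, -1, dtype=np.int32)
lookup[gray] = rows
```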


For implementation, only a one-dimensional Gray code image set needs to be generated. As can be seen from figure 4.3, once the correspondence between the 2D point p in the camera plane and the stripe l in the projector plane is established via the Gray code, the 3D object point P is found as the intersection of a ray and a plane. The mathematical justification of the 1D Gray code is presented in section 4.5.

Figure 4.3: Point-line triangulation.

4.3.3 Codification Mechanism

The projection procedure consists of projecting a series of light patterns so that every encoded point in the observed image is identified by its sequence of intensities, which can be coded as a string of binary values (figure 4.4).

Figure 4.4: Binary encoded pattern divides the surface into many sub-regions.

The capture process starts with taking a snapshot of the scene with no projection. In severe lighting conditions, such as a dark room, uniform lighting from the projector can be used to help illuminate the scene. The level of projection brightness can vary depending on the current lighting condition, ranging from zero brightness to full white illumination. This first captured image serves as the colour texture map in the final representation of the current pose.

After the first shot, the whole image stack is projected sequentially, and images of the illuminated scene are captured in the same order (figure 4.5). Coding the binary image stack is similar to coding the Gray-coded pattern images. For a pixel with 2D image coordinates (x, y) in a 10-level image stack, a binary code word is formed from the binary values observed at the same position along the time axis, and its decimal representation is used to look up the corresponding row number in the projector space (table 4.1).

Figure 4.5: Stripes being projected onto a fluffy doll (10-level Gray coded stripes). (a) level = 4; (b) level = 5; (c) level = 6; (d) level = 7.


By iterating this approach across the whole observed image, each pixel is first labelled with a 10-bit binary code word, and then attached with a row number – representing its original position in the projector image, as if the projection ray were reversed. A dense point-line correspondence map is thus established. Using an appropriate triangulation method, the scene point (X, Y, Z, 1)^T can be recovered, as discussed in section 4.5.
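A sketch of the decoding step is shown below, assuming the thresholded binary image stack (see sections 4.4.3 and 4.4.4) and the look-up table from the pattern generation stage are already available; the array layout is an assumption made for illustration.

```python
import numpy as np

def decode_row_numbers(binary_stack, lookup):
    """binary_stack: (levels, H, W) array of 0/1 values, with level 0 holding
    the most significant bit of each pixel's codeword. Returns an (H, W) map
    of projector row numbers (-1 where the codeword matches no row)."""
    levels = binary_stack.shape[0]
    codes = np.zeros(binary_stack.shape[1:], dtype=np.int32)
    for level in range(levels):
        codes = (codes << 1) | binary_stack[level].astype(np.int32)
    return lookup[codes]
```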

4.4 Practical Issues

4.4.1 Image Levels

To eliminate ambiguities in the table look-up, it is important that no two rows (or columns) share the same codeword, so that for every single pixel in the observed image there can be only one row (column) in the projector image matching that pixel. Therefore, to explicitly code the images projected by a data projector with resolution set at 1024 × 768, a log2(1024) = 10 level Gray code is used to encode the pattern images, ensuring that each row or column is assigned a unique codeword. By doing this, it is possible to perform the table look-up for the observed image solely from the binary output image stack.

An alternative is to use fewer patterns so that thin stripes are avoided. However, there are a few drawbacks to this. First, because not enough bits are used, there will be groups of pixels sharing the same codeword. Locating either the stripe centres or the edges between neighbouring stripes then involves finding zero-crossings to determine the flip position between black and white stripes, which is not easy because of the blooming effect of the white stripes when observed in the camera. Secondly, stripe centres are not always perceivable, depending on the convexity of the measured surface and the presence of depth discontinuities. Furthermore, even if the stripe centres and edges are successfully located, interpolation needs to be done to estimate the points in between; otherwise the density of the range information is compromised.

Therefore, the maximum level of Gray code is found to be essential. Since thin stripes are then inevitable, further adaptations are considered to maintain robustness.

4.4.2 Limited Camera Resolution

A good example of the problems caused by the camera is the aliasing effect. As illustrated in the experiment of measuring a brick shown in figure 4.6a, after distortion recovery the stripe image at level 5 is clean. However, at image level 10 the thin stripes are almost invisible; instead, curly wave effects appear in the observed image (figure 4.6b), and the resultant depth map is affected too (figure 4.6c).

Figure 4.6: The aliasing effect causing errors in the depth map. (a) level = 5; (b) level = 10 (aliasing appears); (c) depth map without plane subtraction; (d) depth map with plane subtraction.

To alleviate this problem, we simply run a scan on the plain desktop with no object placed on it. The depth map of the plain surface is used as a base surface, which is subtracted from every depth map estimated later on to compensate for this defect (figure 4.6(d)). Although the resultant depth map of the object surface is slightly affected, the background noise (mostly from the tabletop) is removed. This is the simplest and quickest way to alleviate the aliasing problem without replacing the capture device with a more expensive one or changing the system setup.
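A minimal sketch of the base plane subtraction is given below, assuming both depth maps are stored as 2D arrays in millimetres; the small noise-floor clamp is an added assumption for illustration and is not part of the method described above.

```python
import numpy as np

def subtract_base_plane(depth_map, base_plane, noise_floor=1.0):
    """Subtract the depth map of the empty desktop (scanned once) from a new
    scan. Residuals smaller than noise_floor (assumed millimetres) are treated
    as background ripple from aliasing and clamped to zero."""
    relief = depth_map - base_plane
    relief[np.abs(relief) < noise_floor] = 0.0
    return relief
```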


Figure 4.7 gives a better visualisation by plotting the surface in 3D. The graph was generated by sampling the data every 20 pixels in both the x and y dimensions. It is clear that after the base plane subtraction, the uneven background is flattened.

Figure 4.7: 3D plots of figure 4.6. (a) Before the subtraction; (b) After the subtraction.

4.4.3 Inverse subtraction

For various reasons, the captured image stack cannot be used straightaway to determine whether the investigated pixels are on (illuminated) or off (not illuminated) at each level: there are different texture and reflectance properties across the scene, the ambient light is inconsistent, and the varying projection light adds further variation to the lighting condition.

For example, the theoretical threshold between white (255) and black (0) is 128, but in reality this is never the case. A pixel from a dark object can still appear close to 0 brightness even when it is illuminated by a full white projection. However, if the image taken under full white projection has the image taken under full black projection subtracted from it, every pixel has a positive value in the difference image, whether it belongs to a black or a white object.

To address this issue in our system, for each level of projection one original pattern and its inverted version (the black-white flipped image) are projected, and the observed image is subtracted from its inverse image to yield an image with positive and negative values. This brings the optimal black-white threshold value close to zero (figure 4.8). As a result, thresholding is done on the subtracted images instead of the original versions.

Figure 4.8: Inverse subtraction of the original image and its flipped version.

In figure 4.9, a football with black stripes is being scanned. As can be seen from the picture, there are glares (white spots) caused by the projector light and the reflective surface of the football itself. Figures 4.9(b) and (c) show the images captured when the level-4 stripe image and its flipped version are projected onto the surface, respectively. The thresholded output of (b), shown in figure 4.9(d), has obvious errors, because the black pattern belonging to the football itself stays black whether it is illuminated by white or black projection light. An optimal threshold is also very hard to choose, because it is object dependent and can be affected by the lighting conditions. Figure 4.9(e) is the subtraction of (b) and (c), with white pixels standing for positive values of the subtraction image, black pixels for negative values and grey pixels for values close to zero. Figure 4.9(f) is the binary output of (e), with white for ones and black for zeros, which is a better version of (d).

Figure 4.9: The inverse subtraction: the football experiment. (a) texture map; (b) stripe image (positive, level 4); (c) stripe image (negative, level 4); (d) threshold of (b), t = 100; (e) subtraction of (b) and (c); (f) binary image of (e).
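The subtraction itself is a one-liner once both frames are available; the sketch below only fixes a signed integer type so that negative differences are preserved.

```python
import numpy as np

def inverse_subtraction(img_positive, img_negative):
    """Difference of the frame captured under the original pattern and the
    frame captured under its inverted pattern. Illuminated pixels come out
    positive, dark pixels negative, uncertain pixels near zero."""
    return img_positive.astype(np.int16) - img_negative.astype(np.int16)
```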

4.4.4 Adaptive thresholding

Trobina [97] (see section 4.2) tries to improve thresholding accuracy by fixing a different threshold value for each pixel, based on the white-to-black reflectance ratio calculated from a solid white projection and a full black projection. There are a few concerns when this is carried out in practice. Unlike with a laser scanner, the observed brightness of a given point on the measured surface depends on the neighbouring projection rays around it. In the high frequency stripe images especially, where each black or white stripe occupies only two rows or columns, it is never guaranteed that a point falling into a black stripe will appear the same as when the scene is illuminated by full black.

To cope with this uncertainty, a three-level adaptive thresholding is used instead of binary thresholding. A dead zone around zero is introduced to deal with uncertainties; its size is set empirically. For any pixel with brightness outside the dead zone, the normal binary threshold is applied. Otherwise, the pixel is further inspected at the next image level. Pixels falling into the dead zone at two successive levels are rejected as background points, and they are not processed further in the remaining levels.


This is inspired by one of the properties of Gray coded images: no pixel is located at a stripe transition at two consecutive levels (see figures 4.1 and 4.2). This means any uncertain pixel encountered at one level can be verified by its appearance at the same position in the previous image, and it can be classified as a background or shadowed point if no clean-cut decision (either black or white) can be made in two consecutive image levels.
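A simplified sketch of the three-level decision for one image level is given below. The dead-zone width shown is a hypothetical value; in the real system it is set empirically, and pixels inside the dead zone are re-examined at the next level rather than assigned a bit immediately.

```python
import numpy as np

def classify_level(diff, prev_uncertain, dead_zone=20):
    """Classify one level of the subtracted stack (see inverse_subtraction).
    Returns (bits, uncertain, rejected):
      bits      - 1 where the pixel is confidently white, 0 where black
      uncertain - pixels inside the dead zone at this level
      rejected  - pixels inside the dead zone at this level AND the previous one"""
    uncertain = np.abs(diff) < dead_zone
    bits = (diff > 0).astype(np.uint8)
    rejected = uncertain & prev_uncertain
    return bits, uncertain, rejected
```

When iterating over the levels, prev_uncertain starts as an all-False mask, and the rejected masks are accumulated so that rejected pixels are excluded from the final codeword map.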

4.5 Depth from Triangulation

In some cases [6], where the camera and projector have the same orientation (strictly facing the same direction) and their displacement is known (a controlled displacement, for example with both mounted onto a fixed rail), the coordinates of a 3D point can be estimated through simplified triangulation without the external geometry of the projector and camera. However, this is not considered in our application, since it requires highly customised hardware.

A general purpose triangulation method for structured light systems is considered here, where the camera and the projector can be turned to arbitrary angles and both are properly calibrated in an earlier stage. For details of the calibration of a projector-camera system, please refer to chapter 3.

Let (x, y) be the 2D point currently being investigated. To recover its 3D coordinate (X, Y, Z), we build the full projection model (cf. equation 3.25) using homogeneous coordinates [42],

$$ w \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (4.1) $$

where the matrix C = K_c(R_c|T_c) is the 3 × 4 camera projection matrix and w is a scale factor.

To cancel out the scale factor w, divide the first row of eq 4.1 by the third,

$$ (c_{11} - xc_{31})X + (c_{12} - xc_{32})Y + (c_{13} - xc_{33})Z + (c_{14} - xc_{34}) = 0 \qquad (4.2) $$

Dividing the second row by the third in eq 4.1,

$$ (c_{21} - yc_{31})X + (c_{22} - yc_{32})Y + (c_{23} - yc_{33})Z + (c_{24} - yc_{34}) = 0 \qquad (4.3) $$

If the point (x, y) corresponds to the point (m, n) in the projector plane, then similarly


$$ w' \begin{pmatrix} m \\ n \\ 1 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (4.4) $$

Using the same method to cancel out the scale factor w',

$$ (p_{11} - mp_{31})X + (p_{12} - mp_{32})Y + (p_{13} - mp_{33})Z + (p_{14} - mp_{34}) = 0 \qquad (4.5) $$

$$ (p_{21} - np_{31})X + (p_{22} - np_{32})Y + (p_{23} - np_{33})Z + (p_{24} - np_{34}) = 0 \qquad (4.6) $$

From equations 4.2, 4.3 and 4.6, we have

$$ \begin{pmatrix}
c_{11} - xc_{31} & c_{12} - xc_{32} & c_{13} - xc_{33} & c_{14} - xc_{34} \\
c_{21} - yc_{31} & c_{22} - yc_{32} & c_{23} - yc_{33} & c_{24} - yc_{34} \\
p_{21} - np_{31} & p_{22} - np_{32} & p_{23} - np_{33} & p_{24} - np_{34}
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = 0 \qquad (4.7) $$

This now becomes a problem of solving a set of linear equations. The first matrix in eq 4.7 is often referred to as the measurement matrix A. The vector (X, Y, Z, 1)^T is solved by finding the eigenvector with the least eigenvalue of the matrix A^T A [4].

Equivalently, equation 4.7 can also be constructed from equations 4.2, 4.3 and 4.5. Choosing either of the two equations 4.5 and 4.6 yields the same result, which shows that the structured light projection only needs to be done in one dimension, either horizontally or vertically. Using both of them is not recommended, as it doubles the capture time while only providing an over-determined linear equation system. The final system therefore only uses horizontal stripes.
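As an illustration, the following NumPy sketch implements the triangulation of Equ. 4.7, assuming the 3 × 4 camera and projector projection matrices from chapter 3 are available as arrays C and P.

```python
import numpy as np

def triangulate_point(x, y, n, C, P):
    """Intersect the camera ray through pixel (x, y) with the projector plane
    of row n (Equ. 4.2, 4.3 and 4.6). C and P are the 3x4 camera and projector
    projection matrices."""
    A = np.array([
        C[0] - x * C[2],          # Equ. 4.2
        C[1] - y * C[2],          # Equ. 4.3
        P[1] - n * P[2],          # Equ. 4.6 (horizontal stripes encode rows)
    ])
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    X = eigvecs[:, 0]             # eigenvector of the least eigenvalue
    return X[:3] / X[3]           # back to inhomogeneous (X, Y, Z)
```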

4.5.1 Final Captured Data

After each successful structured light scan, the following data are captured and saved into memory for further processing. Figures 4.10 to 4.13 show the rendered data in the form of images. The scattered 3D point set can be rendered at any arbitrary pose; figures 4.12 and 4.13 show it rendered at one pose.

112


Figure 4.10: Depth map.<br />

113


Figure 4.11: Colour texture.<br />

114


Figure 4.12: Scattered po<strong>in</strong>t set <strong>in</strong> 3D. (re-sampled at every 2 millimetre)<br />

115


Figure 4.13: Scattered po<strong>in</strong>t set <strong>in</strong> 3D, attached with colour <strong>in</strong>formation.<br />

(re-sampled at every 2 millimetre)<br />

4.6 Conclusions

This chapter introduces a method for acquiring depth information using a structured light system. After studying the existing codification schemes, Gray coded structured light is used in this research for its simplicity and robustness. A variety of problems were encountered during implementation, and solutions are provided to tackle them. Preliminary experimental results suggest that the proposed techniques positively enhance the system performance.

First, we justify it is essential to use maximum level <strong>of</strong> Gray coded im-<br />

ages both theoretically and experimentally. This is at the risk <strong>of</strong> mak<strong>in</strong>g<br />

the stripes too th<strong>in</strong> to detect by the camera, which has limited resolution<br />

and mounted high above the desktop surface.<br />

Secondly, because of the large distance between the ceiling-mounted camera and the tabletop, and the limited capture resolution of the camera, stripes that become too thin cause an aliasing effect in the observed images (figure 4.6). When a single-pixel-wide line is projected, it can be observed in the camera image as a blend of three or four neighbouring lines. When multiple lines that are close to each other are projected, the observed lines are likely to mix with each other (figure 4.14). This not only causes a visible aliasing effect but also assigns multiple lines to a single 2D image pixel. A base plane subtraction method is proposed to deal with this challenge caused by the aliasing effect.

Thirdly, inverse subtraction and adaptive thresholding are combined to perform robust codeword generation. This is a significant boost to the codification: the object surface colour is no longer a concern, while these techniques keep the optimal threshold between 0s and 1s close to zero.
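A minimal sketch of the inverse-subtraction idea is given below. It is an illustration only, assuming NumPy and greyscale camera frames; the adaptive thresholding and confidence masking used in the actual system are not reproduced here.

```python
import numpy as np

def decode_bit(pattern_img, inverse_img, threshold=0.0):
    """Classify each camera pixel as bit 1 or 0 for one Gray-code level.

    The scene is captured once under the stripe pattern and once under its
    inverse; their difference is largely independent of the surface colour,
    so the decision threshold can stay close to zero. Pixels whose
    difference is too small to be trusted could additionally be marked
    invalid by a separate confidence mask (not shown).
    """
    diff = pattern_img.astype(np.float32) - inverse_img.astype(np.float32)
    return (diff > threshold).astype(np.uint8)
```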

(a) The projector image: lines are single-pixel-wide and three pixels apart from each other. (b) The observed image: one thicker line is observed instead of three clean-cut lines.

Figure 4.14: Illustration of the camera's limited resolution.

Finally, it is geometrically and mathematically justified (figure 4.3 and section 4.5) that the structured light projection is only required to run in one dimension, either horizontal or vertical, provided a proper triangulation method is used.

4.6.1 Future Work

With the current projector-camera setup, the system performance is mostly hindered by the limited capture resolution of the camera and the distance between the ceiling-mounted camera and the tabletop. However, once the VAE is set up and running, it is not possible to change these factors. Therefore, efforts need to be made in other areas to compensate for these negative contributions.

In section 4.4.2 a base plane subtraction method is proposed to compensate for the aliasing effect caused by the aforementioned limitations. This method is still preliminary and has its own limitations, the most significant being that the subtraction is restricted to a planar surface (in this case, the tabletop). It is incapable of modelling the artifacts caused by aliasing on arbitrary object surfaces. Future research could investigate this area further to properly model this distortion.

It is claimed in this chapter (section 4.4.1) that the maximum possible level of Gray coded images should be used to uniquely label every row or column in the rendered projection image and to avoid codeword sharing. This is based on the fact that a dense depth map is required in this shape acquisition process. In certain applications where only a sparse depth map is needed, it is possible to use fewer levels of stripe images. Sparse depth information can be recovered at the stripe transitions or at located stripe centres, and a significant advantage is that the camera is not forced to capture a scene illuminated by thin stripes that are beyond its resolution.

Future work on photometric calibration, discussed in the previous chapter (section 3.6), also relates to the development of structured light systems. A successful calibration of the photometric properties of the camera could lead to the use of colour-based structured light systems. Since colour-based methods normally use fewer images (sometimes just one), this opens up the possibility of turning shape acquisition into a real-time process. This would be an attractive feature for the VAE, and with real-time depth scanning capability many applications could be built within the VAE framework.



Chapter 5

Registration of Point Sets

Creating a 3D model of a real object is a multi-stage process, because cameras only deliver data from one view of the target object at a time. Obtaining a complete model requires either the scanner to shoot from different views to cover the whole object, or, equivalently, moving the object relative to a stationary scanner. Whichever scenario is chosen, registration of the scanned data from different views is required. This chapter is focused on this subject.

After each structured light scan, a cloud of point samples from the surface of an object is obtained. Placing the object in different positions on the tabletop, or in different orientations towards the camera, yields several point sets which together are expected to cover the whole surface of the object to be measured. The objective of registration is to fuse these clouds together by estimating the transformations between them, placing all the data into the same reference frame for visualisation or further processing.

The process of point set fusion begins with 2D image registration on the colour textures of the two participating views, where interest points are first extracted by corner detectors and then correlated. Once the 2D correspondences are established, the 3D coordinates of the matched points are used as control points to estimate the rotation and translation in 3D space between the two sets of points. The estimated rotation and translation are used as an initial guess to perform a trial merge, by warping one point set onto the other in 3D space based on the estimated transform. The user has the final decision of whether to accept this trial given by the computer, or to manually improve the fusion of the point sets by tuning them into different poses in a virtual environment using the augmented tools.

The whole process combines automated image processing and human interaction. For example, tasks such as 2D image registration or exhaustive searching for transformation vectors are executed by automated processes, while the final tuning and merging is handed over to human interaction. This is not only because humans are chosen to be the decision makers, but also because this is what humans are good at – spotting where things go wrong and responding in an effective way. In the rest of this chapter, this is explained in detail.

5.1 Introduction

Assume there exist two point sets {mi} and {di}, i = 1, 2, ..., N, and the correspondences between them are already established, either from ground truth or by matching the point sets in 3D space. We name {mi} the model points and {di} the data points. If they are both from the same model, the objective is to find the relative rotation and translation from the data points to the model points, so that in 3D space they are related by

di = Rmi + T + ei   (5.1)

where R is the 3 × 3 rotation matrix, T is the 3 × 1 translation vector and ei is a noise vector. Solving for the optimal solutions R̂ and T̂ that map the two point sets is a least-squares minimisation problem:

$$ \sum_{i=1}^{N} \left\| d_i - \hat{R} m_i - \hat{T} \right\|^2 \qquad (5.2) $$

Because the correspondences between the point sets are unknown a priori, the most straightforward method to register two point sets is exhaustive search in 3D space. However, this method faces challenges in processing time, convergence speed, and falling into local minima. It is not complex to implement, but it consumes a lot of processing power and is therefore not suitable for VAE systems.

Using calibrated motion is another routine to solve the registration, but this brings new problems too. To control either the movement of the scanners or the object to be measured, additional hardware such as rails and turntables is inevitable. The scanner may require extra calibration as well. More importantly, in the context of a VAE it is desired that the restriction to controlled motion be lifted, so that the object to be measured can be freely moved into any pose in 3D space under the guidance of the user.

In this research, the routine chosen to fuse two point sets incorporates three stages: 2D planar image registration (section 5.3), point set registration using corresponded features with a voxel based quantisation process (section 5.4), and rendering (section 5.5). They are discussed separately in the rest of this chapter. When more than two views are presented, the problem is reduced to a chain of pairwise registrations.

5.2 Background

Figure 5.1: A routine of point set registration.

5.2.1 Rotations and Translations in 3D

There are several common ways to build a rotation matrix. The most frequently documented representation is to rotate a point around one of the three coordinate axes. The advantage of this representation is that the generated 3 × 3 rotation matrix can be applied directly to 3D points through matrix manipulation. To rotate a point around the X, Y, and Z axes, we have:

$$ R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{bmatrix} \qquad (5.3) $$

$$ R_y = \begin{bmatrix} \cos(\phi) & 0 & \sin(\phi) \\ 0 & 1 & 0 \\ -\sin(\phi) & 0 & \cos(\phi) \end{bmatrix} \qquad (5.4) $$

$$ R_z = \begin{bmatrix} \cos(\psi) & -\sin(\psi) & 0 \\ \sin(\psi) & \cos(\psi) & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (5.5) $$

where θ, φ, and ψ are the rotations around the X, Y, and Z axes respectively. More detailed discussions of rotation in 3D are given in [44], [47].

5.2.2 An SVD Based Least Squares Fitting Method

SVD is one of the most significant topics in linear algebra, and it has considerable theoretical and practical value [54, 62, 95]. A very important feature of SVD is that it can be performed on any real matrix. The result of the decomposition is to factor a matrix A into three matrices U, S, V such that A = USV^T, where U and V are orthogonal matrices and S is a diagonal matrix. SVD is also a common tool used to solve least-squares problems (section 3.5, section 4.5).

Arun, Huang and Blostein [3] proposed a method of computing the 3D rotation matrix and translation vector by performing the Singular Value Decomposition (SVD) of the 3 × 3 correlation matrix, which is built as follows,

$$ H = \sum_{i=1}^{N} m_{c,i} \, d_{c,i}^T \qquad (5.6) $$

where m_{c,i} and d_{c,i} are obtained by translating the original data sets mi and di (equation 5.2) to the origin.

The SVD of the correlation matrix is H = USV^T, and the optimal rotation matrix is first computed from

R̂ = VU^T   (5.7)

The computation of R̂ is also known as the Orthogonal Procrustes Problem [88].

The optimal translation is the vector that aligns the centroids of the point sets di and mi, which is

T̂ = d̄ − m̄   (5.8)

Of course T̂ = d̄ − R̂m̄ holds too, because rotating a point set about the origin does not change the centroid of the point set itself.
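The following Python sketch illustrates this fitting method under the assumption of NumPy arrays of corresponded points; it is not the thesis implementation, and the reflection check at the end is a standard refinement not discussed in the text above.

```python
import numpy as np

def fit_rigid_transform(m, d):
    """Estimate the rotation R_hat and translation T_hat mapping the model
    points m onto the data points d (both N x 3 arrays with known one-to-one
    correspondences), following the SVD method of Arun et al. sketched above.
    """
    m_bar, d_bar = m.mean(axis=0), d.mean(axis=0)
    m_c, d_c = m - m_bar, d - d_bar               # translate both sets to the origin
    H = m_c.T @ d_c                               # 3 x 3 correlation matrix (eq. 5.6)
    U, S, Vt = np.linalg.svd(H)
    R_hat = Vt.T @ U.T                            # eq. 5.7
    if np.linalg.det(R_hat) < 0:                  # guard against a reflection
        Vt[-1] *= -1
        R_hat = Vt.T @ U.T
    T_hat = d_bar - R_hat @ m_bar                 # aligns the two centroids
    return R_hat, T_hat
```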

5.3 Image Registration

5.3.1 Corner Detector

Building a dense correspondence map from a pair of input images is not practical considering the amount of computation involved. Therefore the first step is to choose a set of distinctive points as interest points from both input images. To find these interest points, a Harris corner detector [46] is applied to the textures. The corner detector uses the following structure matrix to evaluate whether a given pixel is a corner or not.

$$ G = \begin{bmatrix} \sum_{w} f_x^2 & \sum_{w} f_x f_y \\ \sum_{w} f_x f_y & \sum_{w} f_y^2 \end{bmatrix} = Q \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} Q^T \qquad (5.9) $$

The second part of equation 5.9 is the eigen-decomposition of the matrix to its left; fx and fy are the first derivatives in the horizontal and vertical directions respectively, and w is the aggregation window. For the two output eigenvalues, λ1 ≥ λ2, and the Harris corner detector defines that when λ2 ≫ 0 the pixel can be interpreted as a corner within a certain region. Even though the sign of the eigenvalues contains information about the local gradient, we are not interested in it here, as our purpose is simply to find the points of interest.

To implement the corner detection algorithm, fx and fy are first computed from the convolution of the original image with two derivative kernels

$$ D_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \qquad (5.10) $$

$$ D_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \qquad (5.11) $$

The G matrix is constructed for each pixel from the derivatives, and it is aggregated over the neighbouring pixels. The two eigenvalues of the G matrix are then computed and the smaller of the two is stored. A pixel is considered a corner if it has the largest stored eigenvalue in the given area and the value is greater than a threshold. This step is repeated for all pixels in both input images.

The whole process is shown in figure 5.2. A pair of images of a corridor is used to better illustrate the extraction of corner points. w1 is the aggregation window for the structure matrix G (equ. 5.9). w2 is the local evaluation window within which the pixel with the biggest λ2 is considered a corner candidate. λ is the threshold for λ2: the pixel is considered a corner if λ2 > λ.
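The following sketch gathers these steps into a single routine. It is illustrative only, assuming NumPy and SciPy and a greyscale image; the window sizes and threshold are placeholder defaults rather than the values used in the system.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter, maximum_filter

def harris_corners(img, w1=5, w2=9, lam=1e-3):
    """Corner detection as described above: build the structure matrix G
    per pixel, keep the smaller eigenvalue, and accept local maxima above
    the threshold lam."""
    Dx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], float)  # eq. 5.10
    Dy = Dx.T                                                   # eq. 5.11
    fx, fy = convolve(img, Dx), convolve(img, Dy)
    # Aggregate the entries of G over the w1 x w1 window.
    Sxx = uniform_filter(fx * fx, w1)
    Syy = uniform_filter(fy * fy, w1)
    Sxy = uniform_filter(fx * fy, w1)
    # Smaller eigenvalue of the symmetric 2x2 matrix [[Sxx, Sxy], [Sxy, Syy]].
    trace_half = (Sxx + Syy) / 2.0
    root = np.sqrt(((Sxx - Syy) / 2.0) ** 2 + Sxy ** 2)
    lam2 = trace_half - root
    # Keep pixels that are the local maximum of lam2 within w2 and above lam.
    corners = (lam2 == maximum_filter(lam2, w2)) & (lam2 > lam)
    return np.argwhere(corners)
```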

5.3.2 Normalised Cross Correlation

After the interest points are detected in the input image pair, correspondences are found using Normalised Cross Correlation (NCC) [77]. For each interest point in the left image, we look for its maximum correlation in the right image using the NCC cost function below,

$$ NCC = \frac{\sum_{(x,y)\in W} (f_1(x, y) - \bar{f_1})(f_2(x, y) - \bar{f_2})}{\sqrt{\sum_{(x,y)\in W} (f_1(x, y) - \bar{f_1})^2 \sum_{(x,y)\in W} (f_2(x, y) - \bar{f_2})^2}} \qquad (5.12) $$

where fk(x, y) is the k-th image block, f̄k is the average value of the block, and W is the size of the search window.

(a) Original image. (b) Gaussian smoothed image. (c) First x derivatives. (d) First y derivatives. (e) Eigenvalue image. (f) Detected corners.

Figure 5.2: Corner detection.

To implement the algorithm, we first take an interest point from the left image and construct an N × N block centred on it. We then calculate the NCC between the current block and all the corner points encountered in the right image, within the search range W. The corner point with the maximum NCC value is assigned as the corresponding point. This process is repeated for all the interest points in the left image.
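A minimal sketch of this matching loop is shown below, assuming NumPy, greyscale images, and corner lists produced by the detector above; block size N and search range W are illustrative values.

```python
import numpy as np

def ncc(block1, block2):
    """Normalised cross correlation (eq. 5.12) between two equal-sized blocks."""
    a = block1 - block1.mean()
    b = block2 - block2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_corners(left, right, corners_left, corners_right, N=11, W=64):
    """For each left-image corner, return the right-image corner inside the
    W x W search window with the highest NCC score. The resulting putative
    matches are not yet one-to-one (see section 5.3.3)."""
    r = N // 2
    matches = []
    for (y1, x1) in corners_left:
        block1 = left[y1 - r:y1 + r + 1, x1 - r:x1 + r + 1]
        best, best_score = None, -1.0
        for (y2, x2) in corners_right:
            if abs(y2 - y1) > W // 2 or abs(x2 - x1) > W // 2:
                continue  # outside the search window
            block2 = right[y2 - r:y2 + r + 1, x2 - r:x2 + r + 1]
            if block1.shape != block2.shape:
                continue  # skip blocks clipped by the image border
            score = ncc(block1, block2)
            if score > best_score:
                best, best_score = (y2, x2), score
        if best is not None:
            matches.append(((y1, x1), best))
    return matches
```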

Results are shown in figures 5.3 and 5.4. Choosing different sizes of the search window yields different results. Especially when periodic patterns are involved, the checkerboard for example, the result is far less accurate if an inappropriate search window size is chosen. Furthermore, at this stage the correspondences are not one-to-one: for a corner point in the right image, it is likely that more than one point from the left image has found it as the best match. The details of mismatch removal are discussed in section 5.3.3.

5.3.3 Outlier Removals

(a) Search window: W = 64. (b) Search window: W = 256.

Figure 5.3: NCC results.

With the correspondences given, we feed them into the correlation matrix (equ. 5.6) so that the rotation matrix and translation vector can be estimated. To do this reliably, outliers need to be removed. The RANdom SAmple Consensus (RANSAC) algorithm [38] is widely used for robust fitting of models in the presence of data outliers. The algorithm keeps randomly selecting data items and using them to estimate the data model until a good fit is found or the maximum number of iterations is reached. Only the data that satisfy certain criteria are considered meaningful. The choice of criteria depends on the data to be measured; for example, it can be the Euclidean distance of a point to the centroid of a cloud of points, the disparity in brightness of a group of windowed pixels, or another cost function.

(a) W = 64. (b) W = 256.

Figure 5.4: NCC results (periodic pattern).

In this work, since the transform between the two observed images can be encapsulated in a 3 × 3 homography matrix, the RANSAC algorithm is implemented with the following adaptations:

1. Start with the putative correspondences computed from NCC (section 5.3.2).

2. Repeat steps 3-7 N times, with N being updated using algorithm 4.5 from [47].

3. Select a random sample of 4 correspondences and check the data for colinearity. If the sample is degenerate, reselect.

4. Compute the homography H using the method presented in section 3.5.

5. Calculate the distance for each of the putative correspondences, d = d(mi, m̂i)² + d(di, d̂i)², where m̂i and d̂i are the transformed points based on the estimated homography H.

6. Count the putative correspondences consistent with the current H, using the criterion that the distance calculated in step 5 is no greater than an empirical threshold. The qualifying correspondences are the inliers.

7. If the number of inliers for the current H is the largest so far, update H and the set of inliers consistent with H.

8. On completion, choose the group of inliers associated with the best H found.

9. Re-calculate H using all the remaining inliers.

In general, because the homography H is estimated from 4 randomly selected correspondences in each loop, even the best such estimate still needs to be refined by computing the homography once again with all the qualified inliers from the putative correspondences. However, in this work we are only focused on choosing reliable correspondences rather than recovering the 2D projective transform between the images; as a result, step 9 is not necessary and can be omitted.
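A simplified sketch of this adapted loop is given below. It assumes NumPy arrays of matched 2D points, uses a generic DLT in place of the homography method of section 3.5, replaces the adaptive iteration count and the symmetric transfer error with a fixed count and a one-way reprojection error, and omits the colinearity check; it is meant only to illustrate the structure of steps 1–8.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform for a 3x3 homography from >= 4 point pairs
    (a stand-in for the method of section 3.5)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)

def apply_h(H, pts):
    """Apply a homography to an N x 2 array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_inliers(left_pts, right_pts, n_iters=1000, threshold=3.0):
    """Keep the largest set of correspondences consistent with one homography."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(left_pts), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(left_pts), 4, replace=False)
        H = estimate_homography(left_pts[sample], right_pts[sample])
        err = np.linalg.norm(apply_h(H, left_pts) - right_pts, axis=1)
        inliers = err < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```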

(a) T = 100: 235 putative correspondences after NCC, 142 inliers. (b) Rectified image pair.

Figure 5.5: Robust estimation (inliers shown by red connecting lines).

(a) T = 50: 80 putative correspondences after NCC, 80 inliers. (b) Rectified image pair.

Figure 5.6: Robust estimation (inliers shown by index numbers).

5.4 Fusion

5.4.1 Data structure of a point set

The data structure of a point set is depicted in figure 5.7. For each point, the following information is stored: its index in the data array, its 3D world coordinates (X, Y, Z), its 2D image coordinates (x, y), and its colour information in RGB channels.
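A possible rendering of that per-point record is sketched below; the field names are illustrative and are not prescribed by the text or by figure 5.7.

```python
from dataclasses import dataclass

@dataclass
class ScanPoint:
    """One entry of the point-set structure depicted in figure 5.7."""
    index: int                            # position in the data array
    world: tuple[float, float, float]     # 3D world coordinates (X, Y, Z)
    pixel: tuple[int, int]                # 2D image coordinates (x, y)
    colour: tuple[int, int, int]          # RGB colour of the pixel
```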

Figure 5.7: Data structure of a point set.

5.4.2 Point set fusion with voxel quantisation

For each single view, a point set is obtained by estimating the 3D positions of the foreground pixels in the captured image. All background parts, such as the tabletop and non-projected areas, have non-positive depth and are rejected. The size of the point set is therefore the total number of pixels that have positive depth in the corresponding depth image.

The data size can sometimes be huge. Figure 5.8(a) shows the point set of a fluffy doll, which measures roughly 600mm in height, width and depth. The resulting point set contains 34056 points, many of which are very close to their neighbouring points in 3D space. This causes redundancy and increases the burden of rendering the point set or transforming it in 3D. A voxel quantisation method is presented here to deal with this problem.

(a) The point set. (b) Voxel quantisation.

Figure 5.8: Voxel quantisation of the large data set.

For each point set, we keep two copies in memory. One copy is the original data set, where all the points are saved as a backup so that no information is lost. The other copy is a slimmed version for display and other front-end purposes. The slimming begins with an estimation of the space the point set occupies in 3D, by computing the centroid of the point set and the furthest points along the X, Y and Z axes. A cube of the estimated required size is then constructed to contain the whole point set, and it is divided into voxels, which are smaller cubes (figure 5.8(b)). All 3D points falling into the same voxel are averaged into one point, and voxels with no points falling into them are not considered.
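The following is a minimal sketch of this quantisation step, assuming NumPy and an N × 3 array of world coordinates; the averaging of the associated colours and the explicit bounding-cube construction are left out.

```python
import numpy as np

def voxel_quantise(points, voxel_size):
    """Slim a point set by averaging all points that fall into the same
    cubic voxel of side 'voxel_size' (in the same units as the points)."""
    origin = points.min(axis=0)                          # corner of the bounding volume
    keys = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Group the points by voxel index and average each group.
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(uniq), 3))
    counts = np.zeros(len(uniq))
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]                        # one averaged point per voxel
```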


Bigger and fewer voxels give a coarser quantisation and less detail (figure 5.9). A point set with an original size of 34056 points is slimmed using voxel sizes of s = 1mm and s = 10mm respectively. As the voxel size increases, the point set becomes more and more sparse.

(a) s = 1mm, 32097 points. (b) s = 10mm, 4373 points.

Figure 5.9: Different quantisation levels obtained by choosing different voxel sizes.

Figure 5.12 shows that the choice of voxel size can be object independent. The football, the fluffy owl, and the vase placed in different orientations all have different sizes and surface structures (figures 5.10 and 5.11). In figure 5.12(a), the total-points curve for the owl starts very high but drops dramatically; this is because the physical size of this object is much bigger than that of the other three objects tested. Comparing figures 5.12(a) and (b), it is not hard to see that the total size of a point set has very little impact on the proportion of data lost through voxel quantisation. In the graph with the percentage curves, it can be seen that all four objects drop in a similar manner as the voxel size increases.


(a) Football. (b) Point set of (a). (c) Owl. (d) Point set of (c).

Figure 5.10: The captured objects of figure 5.12.

Looking at figure 5.12(b), a universal voxel size of 2mm can be chosen to conserve over 80% of the original data, while choosing a voxel size of 5mm throws half of the information away. This is particularly useful because the voxel size can be decided by how much the data from different views overlap. The redundancy can be reduced to a minimum if an appropriate voxel size is chosen.

(a) Vase (horizontal shot). (b) Point set of (a). (c) Vase (vertical shot). (d) Point set of (c).

Figure 5.11: The captured objects of figure 5.12.

(a) Total points. (b) Percentage of the original data size.

Figure 5.12: The quantisation effect of choosing different voxel sizes on the total point set size.

5.4.3 User Assisted Tuning

As discussed earlier, the transform between the two point sets in 3D space can be estimated using the SVD based fitting algorithm (section 5.2.2) from the set of matching points computed in section 5.3. Before committing to saving the estimated transform, the user is given the chance to manually tune the point sets. This process is also visualised, and the tuning result is instantly reflected on the desktop, as shown in figure 5.13.

Further discussion of this interactive tuning and of the multiple point set registration scenario is presented in more detail in sections 6.4.4 and 6.4.5.

Figure 5.13: Manual tuning of point set registration.

5.5 Rendering A Rotating Object

Rotating an object about the WCS origin carries the risk of moving the object out of the camera's field of view, so the common way to visualise the 3D data is to rotate it about its centroid, as if the object were placed on a turntable. For each object point, the instantaneously changing world coordinates X′, Y′, Z′ are projected onto the 2D camera space using the calibrated camera pose,


$$ K(R|T) \begin{pmatrix} X' \\ Y' \\ Z' \\ 1 \end{pmatrix} \simeq \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} \qquad (5.13) $$

Here x′, y′ are the moving 2D coordinates in the camera space. We then attach the colour information associated with the current point (figure 5.7).
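A possible realisation of this rendering step is sketched below, assuming NumPy, a calibrated intrinsic matrix K and pose (R, T), and a turntable spin built with Rodrigues' formula, which is not part of the text above and is used here only for illustration.

```python
import numpy as np

def render_positions(points, K, R, T, angle, axis=np.array([0.0, 0.0, 1.0])):
    """Project a point set, rotated about its own centroid, into camera pixel
    coordinates using the calibrated pose (eq. 5.13)."""
    c = points.mean(axis=0)
    # Rotation about 'axis' by 'angle' (Rodrigues' formula).
    k = axis / np.linalg.norm(axis)
    Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    R_spin = np.eye(3) + np.sin(angle) * Kx + (1 - np.cos(angle)) * (Kx @ Kx)
    rotated = (points - c) @ R_spin.T + c        # spin about the centroid
    cam = rotated @ R.T + T                      # world -> camera frame
    pix = cam @ K.T                              # apply the intrinsics
    return pix[:, :2] / pix[:, 2:3]              # dehomogenise to (x', y')
```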

Figure 5.14: Different rendered views (top: rendered range images; bottom: rendered object attached with colour texture).


5.6 Conclusions

In this chapter a framework is presented for the fusion of two 3D point sets, in other words, the registration between two views. This is a combination of conventional automated 2D image registration, 3D point set registration, and user-guided human-computer collaborative work in a VAE. The proposed framework correlates two sets of 3D data captured from different views of the same object, ideally with an overlapping part shared between the two views. The registration framework can be iterated to perform the fusion of multiple views.

The process begins with 2D image registration on the colour textures of the two participating views, where interest points are first extracted by corner detectors and then correlated using Normalised Cross-Correlation (NCC). Once the 2D correspondences are built, the 3D coordinates of the matched points are used to estimate the transform in 3D space between these two sets of points using Singular Value Decomposition (SVD) and the Orthogonal Procrustes solution [88]. The estimated rotation and translation are used as an initial guess to perform a trial merge, by warping one point set onto the other in 3D space based on the estimated rotation and translation. The user has the final decision of whether to accept this trial given by the computer, or to manually improve the fusion of the point sets by tuning them into different poses in a virtual environment using the augmented tools.

In addition to the registration itself, a voxel quantisation mechanism is proposed and implemented to reduce data redundancy and speed up rendering. This quantisation is particularly desirable in the multiple point set fusion scenario, where the data redundancy is relatively large because of the overlapping areas between a number of point sets. Preliminary results also show that the optimal quantisation level is only affected by the choice of voxel size, and is object independent.

5.6.1 Future Work

Although reasonable results can be achieved using automated registration followed by the user's manual tuning, the two participating views should have a fair amount of overlapping area, otherwise the registration results can become very poor. This is the main cause of the extra data storage, and performance can be affected when measuring objects with clean-cut surfaces such as a rectangular box. A feature based image registration also means it is hard to work with objects that have very little texture.

Future work includes possible improvements in several areas:

• First, during the process of image registration, we deliberately aim to hide as many technical details as possible from the user, while still providing a means of working towards optimal results by adjusting the parameter settings randomly within a closed interval. However, the interface could be elaborated to give the user more targeted initiative over the parameter settings. For example, offering the user a choice of 'fewer corner points' or 'more tolerant cross-correlation' would be a more presentable approach than a simple randomised repetition.

• Second, the visualisation during tuning can be improved (figure 5.13). The user could be provided with a means of inspecting the point sets being merged from a variety of angles, to help with the merge. This is particularly helpful when fusing two pieces which share little overlapping area, for example two halves of a sphere.

• Last but not least, there is a possibility of depth information being used to establish corresponding points when there is a lack of texture across the surface. This can be regarded as using the depth map as an alternative feature to the texture. Although the prospect of using depth information for image registration faces the challenge of depth inaccuracies (e.g. those caused by depth discontinuities), it is expected that an appropriately combined use of the depth information and the texture information would yield positive results.


Chapter 6

System Design

6.1 Introduction

In chapters 4 and 5, we discussed the shape acquisition stage and the post-processing of the scanned data. They are both computer vision tasks performed separately. In this chapter we address the design of a system that incorporates these two components into a complete and interactive system. The system provides the following:

1. An automatically generated and maintained platform on which the data are visualised.

2. A planar surface with real objects and video augmented signals.

3. Widget tools for enabling user-computer interactions, without the need for traditional input devices such as a mouse, keyboard or laser pointer.

4. Accurate automated facilities, with ease of use and correctability, where the user decides when, where and how to utilise them.

The most important feature of the system presented is that the user plays an active role in the interactions. The user makes the final call on what is to be done next, by giving various instructions using the tools provided. Typical functionality includes range map touch-up, rejection of a scan, capturing a snapshot, and more. Apart from triggering various computer vision tasks, the user also decides what part of the collected data is to be displayed. The central display area is limited and not all the scanned data will be used. More detailed discussions of the user interface are presented in section 6.4.

On the other hand, the computer itself offers the user help information, either in a visualised way or in the form of text messages. The help information can be a brief summary of the current data, offering the user different options about what might be the next move or how to trigger these events. However, this is a user guided, user centralised system, so the user still has the final call under all circumstances.


The calibration stage introduced in chapter 3, however, has to be a stand-alone step and cannot be carried out in this augmented environment, because (a) it is normally performed prior to everything else if the camera-projector system is uncalibrated; (b) the interpretation of human gestures requires an accurate mapping between the augmented projections and the observed images; and (c) once the calibration is done, there is no need to perform it again unless the positioning of the projector-camera system or the table setup has been changed.

The rest of this chapter is organised as follows. In section 6.2, two widgets are introduced. They are implemented to simulate two of the most frequently used gestures in user-machine interaction: the button push and the touchpad slide. The background and some practical issues arising during implementation are discussed as well. In section 6.3 the main user interface of the system is introduced. Some of the main utilities and functionality are presented in section 6.4. Section 6.5 presents the conclusions.

6.2 Widgets Provided for Interaction

6.2.1 Introduction

Where a vision system is used as the interactive device in a man-machine collaboration, it is desirable to have an efficient way for the user to give orders without having to turn to traditional input devices. In this research, tabletop interaction is normally concerned with hands rather than other parts of the human body or other pointing devices. Therefore the hand gesture is the most frequently used behaviour for the user to give instructions.

The most common gesture is the button push, used to trigger an event. In a vision system, a button push does not necessarily require physical contact with the desktop surface. Without the presence of a touch screen or other contact sensors, it is hard to visually detect whether the user's hand has touched the interface or not. The method discussed here is to monitor the area of interest over consecutive frames to analyse whether the button has been pushed, kept pressed, or released.

Pointing is also realised as another widget in this system, equivalent to a touchpad on a laptop. When the pointing device is engaged, a rectangle in the control area is assigned as a touchpad, while a cursor is rendered in the data area. The user can slide their finger across the touchpad as if they were working on a laptop. The fingertip movement in the observed images is analysed, and the system responds to it by changing the display location of the augmented cursor.


Figure 6.1(a) shows an image to be projected. The green rectangle in the middle bottom section of the interface is the touchpad. The bottom image shows the user using the touchpad with the left hand and pointing at a button with the right hand.


(a) A projected image. (b) The observed image.

Figure 6.1: A snapshot with touchpad and buttons.

Figure 6.2 shows an object being scanned to obtain the 2.5D depth map. While the scan is being performed, the projection image space (shown in figure 6.1(a)) is replaced with a 1024 × 768 Gray coded stripe image. After the scan is finished, the menus and control buttons reappear in the interactive interface.

Figure 6.2: A captured image showing an object being scanned.

6.2.2 Background

Most current finger detection techniques can be classified into three main categories.

The majority of these techniques rely on background differencing [69, 63, 72] for the initial stage of image processing. In [69], Malik and Laszlo develop a vision-based input device which allows for hand interactions with desktop PCs. They use a pair of cameras to provide the 3D positions of a user's fingertips, and locate the fingertip and its orientation by segmenting the foreground hand regions from the background. Parnham [74] proposes a technique involving a combination of plane calibration and shadow removal via the analysis of the invariance image. Letessier and Bérard [63] present a technique that combines a method for image differencing with a fingertip detection algorithm named the Fast Rejection Filter (FRF). The FRF is a set of rules for classifying hand pixels and non-hand pixels; however, it is only concerned with detecting fingertips and not the hand shape, and is therefore unable to detect fingers that are pressed together.

unable to detect f<strong>in</strong>gers that are pressed together.<br />

Some other techniques make use of skin colour detection. In [2], a colour detection method is presented using a Bayesian classifier [36] plus a small set of training data; a curvature analysis algorithm is then applied to the detected contours to determine peaks which could correspond to fingertips. Quek et al. [78] develop a system named FingerMouse which allows finger pointing to replace the mouse in controlling a desktop PC. Their method involves segmentation via a probabilistic colour table look-up that requires training, and a Principal Component Analysis (PCA) based fingertip detection algorithm.

Using a mask to perform template matching is another way to detect fingertips. There are techniques where researchers use markers [34, 32] and gloves [96, 19]. Some researchers use fiducials [56] as the pointing device, which also falls into this category.


Apart from the aforementioned main categories, an alternative is to use more expensive hardware such as a thermoscopic or infra-red camera to provide a clean binary image for further processing [58, 85].

6.2.3 Practical Issues

A few practical issues have to be addressed before background differencing based finger detection techniques can be used in this system. Finger detection for use in a VAE application is different from that used in a conventional vision system. First, it must be resilient to the effects of various lighting conditions, especially the projections. Second, it has to be efficient, so as to be responsive without adversely affecting the performance of the rest of the system. Third, a user should be able to walk up to the tabletop and begin interacting without the need for extra equipment such as markers or gloves. Last, it must provide interactions without conventional input devices such as a mouse and keyboard, and without the need for more expensive tabletop touch-screens, which means that the move and click behaviours usually provided by a mouse need to be addressed.

Given the factors stated above, template matching based methods, which might require extra training, are not suitable for this application. Moreover, although both move and click can be detected in a single paradigm of fingertip detection by responding to the instantaneous fingertip location, processing the whole image for each frame is not efficient. Robust background segmentation techniques usually involve analysing the pixel classifications by modelling them as a Mixture of Gaussians [25, 94, 53] over a few adjacent frames. This inevitably causes processing overhead and affects the overall system performance.

In this research, click is the dominant interactive gesture, and we therefore model it as a button-push action, with a number of virtual buttons provided within the interface (figure 6.1). Move is realised by designating an area as a touchpad and switching it on and off depending on whether the locating device is required for the current function; only the designated touchpad area is processed instead of the whole frame.

6.2.4 Implementation of Pushbutton

Figure 6.3: Finger detection.

Our approach to realising the pushbutton widget is to divide the button into two areas (figure 6.3). Area A is the inner area where fingers are most likely to be placed, and it is roughly the same size as a human fingertip. Area B is the outer area.

Let At0 be the average luminance over area A at time t0, and At1 the average at time t1; then the average luminance change over this time period is

∆A = At0 − At1   (6.1)

Similarly, for area B we have

∆B = Bt0 − Bt1   (6.2)

We can define a button as being touched if |∆A| > w1 and |∆B| < w2, where w1 and w2 are both positive thresholds.

To detect button press and release events, the sign of ∆A needs to be considered. Because human skin absorbs a larger proportion of the incident light than the desktop surface (in this case a more reflective whiteboard), the finger appears significantly darker than the background in the image observed from the camera. By taking into account the sign of ∆A rather than only its absolute value, we can distinguish the button press event from the button release event. The advantage of this appearance-based finger detection is that it is robust to changes in lighting conditions and to accidental occlusions.
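As a minimal sketch of the dual-region test described above, the following Python fragment assumes the camera frames arrive as greyscale NumPy arrays and that the inner and outer button regions have already been mapped into camera coordinates; the function and parameter names are illustrative rather than taken from the thesis implementation.

    import numpy as np

    def region_mean(frame, region):
        """Average luminance inside a rectangular region (x, y, w, h)."""
        x, y, w, h = region
        return float(frame[y:y + h, x:x + w].mean())

    def button_event(prev_frame, curr_frame, inner, outer, thresh_inner, thresh_outer):
        """Return 'press', 'release' or None for one pair of consecutive frames.

        inner/outer are the camera-space rectangles of areas A and B;
        thresh_inner and thresh_outer play the role of the two positive
        thresholds used with equations (6.1) and (6.2).
        """
        dA = region_mean(prev_frame, inner) - region_mean(curr_frame, inner)
        dB = region_mean(prev_frame, outer) - region_mean(curr_frame, outer)

        # The outer region must stay stable, otherwise the change is more likely
        # a hand sweeping across the button than a deliberate push.
        if abs(dB) >= thresh_outer or abs(dA) <= thresh_inner:
            return None
        # The finger is darker than the board, so luminance drops on a press
        # (dA = A_t0 - A_t1 > 0) and rises again on a release.
        return 'press' if dA > 0 else 'release'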

In an early version [64] of our finger detection system the button area was monitored as a single region; the dual-region approach is more reliable. We tested the new approach over a continuous period of more than 24 hours, during which it survived extreme changes in lighting conditions such as sunrise, sunset, the blinds being pulled up and down, and lights being switched on and off. The buttons were never mistriggered.

Button calibration

The two thresholds w1 and w2 introduced above are set in different ways. w1, which controls the outer region, is set empirically to a small value so that the outer region of the button is intolerant to noise, which makes the button less likely to be triggered accidentally. The inner region is where the finger is normally pressed.

Figure 6.4: Button calibration. (a) The projected button. (b) The observed button (no finger). (c) The observed button (finger pressed).

To determine the threshold w2 for the inner region, a quick calibration process is run at start-up. First, a button is projected onto the surface (figure 6.4). The system takes an image of the projected button and works out the average pixel value of the inner region, say v1; in practice, v1 can be averaged over a small time period ∆t. A help message is then displayed advising the user to press the button, and v2 is taken as the average pixel value of the inner region over a similar small time period. Then w2 = v1 − v2.

Although v1 and v2 are averaged over a period of time, that period is still short compared with how long the system will be up and running. Therefore, in practice a tolerance factor t is applied, and w′1 = w1·t and w′2 = w2·t are used as the final threshold values.
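A minimal sketch of this calibration step, under the same assumptions as the earlier fragment (the averaging burst length and the tolerance factor are free parameters here, not values prescribed by the thesis):

    import numpy as np

    def average_inner_value(frames, inner):
        """Mean pixel value of the inner button region over a short burst of frames."""
        x, y, w, h = inner
        return float(np.mean([f[y:y + h, x:x + w].mean() for f in frames]))

    def calibrate_button(frames_empty, frames_pressed, inner, tolerance=0.6):
        """Return the scaled inner-region threshold w2' = w2 * t.

        frames_empty   : frames of the projected button with no finger (gives v1)
        frames_pressed : frames captured while the user presses it     (gives v2)
        """
        v1 = average_inner_value(frames_empty, inner)
        v2 = average_inner_value(frames_pressed, inner)
        w2 = v1 - v2              # expected luminance drop caused by the finger
        return w2 * tolerance     # tolerance factor t as described above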

Figure 6.5: The TPR and FPR of button push detection.


Figure 6.5 shows the effect of different tolerance factors t on the button detection performance. We also study the improvement of the dual-region method over the previous implementation, in which the average pixel value across the whole button region was used.

The test framework is designed as follows. For each method, we first evaluate its TPR by repeatedly pressing the button and recording the rate of successful detection. A hand is then waved randomly over the button, using various types of gesture, and the rate of mis-triggering is recorded as the FPR. For both experiments, 100 repetitions of the same action are used.

The top graph shows that with the old method, although increasing the tolerance factor decreases the FPR, it does so at the expense of the TPR. Even when the tolerance factor is increased until the TPR drops to near 60%, the FPR is still far too high at 40%. The new method shows promising results, thanks to its dual-region design (figure 6.3 on page 157), which effectively reduces the chance of the button being hit accidentally. The FPR of the new method is kept below 10% in the bottom graph, while the TPR stays above 80% with the tolerance factor set below 0.6.

All curves in both the top and bottom graphs show a similar downward trend as the tolerance factor increases. This is expected, because a smaller tolerance factor lowers the threshold values for both the inner and the outer region, which ultimately makes both positive detections and mis-detections more likely.


Button observation

For each button, the position and size are fixed in the projection image: once a button is defined, it is assigned a constant 2D position and size (length and width). The position and size of the button in the observed image depend on the camera and projector setup. Since there is a plane-to-plane projective transform between the camera space and the projector space, induced by the desktop as a third plane (section 3.5), once a button is placed in the source image its appearance (position and size) in the observed image is known. Figure 6.6 illustrates a projection image and its observed camera image.
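The mapping can be applied as in the following sketch, which assumes the 3 × 3 projector-to-camera homography H induced by the desktop plane is already available from the calibration of chapter 3 (the function name and the axis-aligned simplification are illustrative):

    import numpy as np

    def observed_button_region(H, x, y, w, h):
        """Map a button rectangle from projector coordinates into the camera
        image using the plane-induced homography H, and return an enclosing
        axis-aligned rectangle (x, y, w, h) to be monitored."""
        corners = np.array([[x, y, 1.0], [x + w, y, 1.0],
                            [x + w, y + h, 1.0], [x, y + h, 1.0]])
        mapped = (H @ corners.T).T
        mapped = mapped[:, :2] / mapped[:, 2:3]      # perspective division
        x0, y0 = mapped.min(axis=0)
        x1, y1 = mapped.max(axis=0)
        return int(x0), int(y0), int(x1 - x0), int(y1 - y0)

The red blocks in figure 6.6 indicate exactly such monitored areas in the camera image.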

Figure 6.6: The projected buttons and their observations in the camera image. (a) The projected buttons. (b) The observed buttons. (The red blocks only indicate the areas to be monitored.)


6.2.5 Implementation of Touchpad

Real-time segmentation of moving regions in image sequences is done by background subtraction. The simplest approach is to threshold the difference between the current image and an image taken earlier without any moving objects. However, dealing with lighting conditions that change over time requires more sophisticated processing.

As discussed in section 6.2.3, a separate rectangular area is assigned and a constant pattern is projected onto it as the touchpad. This area is monitored, and the background subtraction algorithm is applied only to that area in the observed frames.

The Mixture of Gaussians based adaptive background modelling method [25] is used to generate a foreground mask for each frame. In this application the detected foreground regions are fingers, sometimes with part of the palm included. Unlike most vision systems, we do not explicitly segment the foreground blobs, because the only information needed from the foreground region is the fingertip, and the finger is assumed to always point upwards.

Figure 6.7 shows the result of the background segmentation algorithm on four different occasions. From left to right, column-wise, the images are captured when: 1. only one finger is present; 2. two fingers are present; 3. part of the palm is included; 4. the whole upper hand is included. The fingertip is finally located at the top-middle position of the most dominant blob in the resultant foreground region.
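A minimal sketch of this touchpad step is given below, using OpenCV's MOG2 background subtractor as a stand-in for the adaptive Mixture of Gaussians model of [25]; the ROI handling and the top-middle fingertip rule follow the description above, while the specific parameter values are only illustrative.

    import cv2
    import numpy as np

    subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

    def fingertip_in_touchpad(frame, pad):
        """Return the fingertip position (x, y) in frame coordinates, or None.

        pad is the touchpad rectangle (x, y, w, h); only this region is processed.
        """
        x, y, w, h = pad
        roi = frame[y:y + h, x:x + w]
        mask = subtractor.apply(roi)                       # foreground mask for the ROI only
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        blob = max(contours, key=cv2.contourArea)          # most dominant foreground blob
        pts = blob.reshape(-1, 2)
        top_row = pts[:, 1].min()                          # finger assumed to point up
        tip_x = int(pts[pts[:, 1] == top_row][:, 0].mean())  # top-middle of the blob
        return x + tip_x, y + int(top_row)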

Figure 6.7: Fingertip detection using the background segmentation algorithm. (a) Original image. (b) Background region. (c) Foreground region. (d) Detected fingertip.


6.3 User interface

Since the whole system is based on interactions, it is important to have a well designed interface through which the user can give instructions and receive feedback from the computer. It must therefore be understandable, streamlined, and easy to use. Two principles are followed closely in the design of the user interface. First, the data area is maximised so that all relevant information and data can be presented. Second, the various controls are grouped efficiently into different sections while taking up as little space as possible. We are also aware that not all control units need to be revealed at the same time, which saves the limited desktop space.

The user interface itself is a 1024 × 768 image projected onto the desktop surface. Figure 6.8 shows a screen shot of the working environment. It is divided into five areas.

Left column

The left column is the preview area where the thumbnails are listed. Only the thumbnails of views that have already been scanned are displayed here. The user can switch between views by pressing the corresponding thumbnails; the currently investigated view is highlighted with a red frame.

Figure 6.8: A screen shot of the working environment.

Right column

The right column is the area for system controls. These are the most important system-wide controls, so they stay on display throughout the whole process. From the bottom up, they are Lock, Snapshot, Scan, Re-Scan, and Exit. The user might want to Lock the current desktop when the target object needs to be re-positioned manually, or when the tabletop is going to be unattended for some time, so that the buttons are not triggered accidentally. When the desktop is locked, all buttons except the Lock button are unresponsive until it is unlocked by the user. Pressing the Scan button starts a new structured light projection, which takes over the system. When it is done, all relevant information such as the texture map and depth map is displayed in the central area and the system returns to idle; a thumbnail of the scan is also displayed in the left column. Re-Scan is similar to Scan, the only difference being that pressing Re-Scan erases the data from the previous shape input. This is useful when a structured light process has been disturbed, which can result in unexpectedly large errors in the scanned data; the data is deleted prior to the next scan to save memory. At the top of this column is an Exit button to quit the whole system.

Bottom left panel

The bottom left area contains four mode buttons: Inspect, Touchup, Correspondence, and Visualise. Once a mode button is pressed, it stays highlighted and the system engages the corresponding mode. Relevant guide messages appear above the control panel to briefly introduce what can be done in this mode, or sometimes to advise the user of the next possible steps. The user can hit the same mode button again to quit the current mode, or simply press another mode button to switch directly to a different mode. A detailed discussion of the individual modes is given in section 6.4.

Bottom right panel

The content displayed in the bottom right section depends on the mode currently engaged.

Central display area

The central area holds the main display. Normally, all data displayed in the central area is from the same view. This area is composed of four sub-pictures: the depth map, the texture, the colour texture, and a rendered model with the texture map attached to the depth map.


6.4 Main Utilities

In this section the main utilities of the system are introduced. They not only function individually but also work collectively as a whole to perform the 3D input task under the user's instructions. Although some of the utilities require certain steps to be completed first, there is no fixed order in which they must be used; the user can switch between these modes at any time, based on what needs to be done next. If an illegal operation is invoked, a warning message appears to advise the user of the correct options.

We now briefly describe how the system works as an overview, then discuss the individual utilities via a scenario example to illustrate how they perform their individual tasks.


6.4.1 Overview

Figure 6.9 shows a screen shot of the system start-up projection. On the left hand side, a few place holders are attached, each representing one view. This is where the thumbnails of the captured views will be placed after the user runs a structured light scan. On the right hand side are the attached system buttons, which can be hit at any time during the process. The Lock button is placed at the bottom for the user's convenience, to lock the screen so that it is temporarily unresponsive to the user's instructions. Four mode buttons are also shown at the bottom left; at this stage, however, they do not invoke any applications because there is no captured data to be processed yet.

At the bottom centre, a button with a small red area is attached and flashes. A help message is displayed above the button to inform the user of the button calibration, with a five second count-down. After the count-down, the user is expected to put a finger in the designated area to perform the button calibration, and the system chooses an optimal value for the button push detection threshold based on the current room lighting, the projection illumination level, and this specific person's skin colour. A detailed discussion of this calibration process is given in section 6.2.4.

A quick structured light scan is done right after the button calibration, as a plane calibration step (section 4.4.2). The scan button (the third button from the bottom in the right column, the one with the black and white stripes) flashes to remind the user to capture data before any processing can be carried out.

Figure 6.9: Screen shot of the system start-up state.

Once a scanned view is captured, some contents of the screen are updated. A thumbnail of the current view is attached to the appropriate place in the left column; it serves as an identification of the view it represents. The user can switch between different views to perform processing tasks by pressing the corresponding thumbnails. The captured data is visualised in the central display area in different forms: the depth map, a rendered 3D partial model, the texture map, and the colour map.

Various tasks can be performed right after a view is captured. In general, there are four main modes the user can switch into:

• The Inspect Mode, for checking the captured data without changing the data itself. The user can inspect the data not only on the depth map itself but also through a manipulable rendered 3D model.

• The Touchup Mode, for touching up the depth map if an obvious error is believed to have occurred.

• The Correspondence Mode, for finding matching points, estimating the transform between two views, and fusing the two views together. At least two captured views are required for this mode.

• The Visualisation Mode, for visualising the built 3D model. The user can visualise the final 3D model that has been built, check which view contributes to a certain part of the object, and see how well the views are fused together by switching any of the views on and off.

From section 6.4.2 to 6.4.5, an owl object is used in an example scenario to show the usage of these utilities, both individually and collectively.

6.4.2 Mode 1: Inspect


In Inspect Mode, the user adjusts the orientation of the selected rendered model for viewing or checking purposes. The first four arrow buttons rotate the rendered model in 3D space (pan and tilt), while the two rightmost buttons adjust the magnitude gain of the rendered model to further inspect the surface.

Normally the very first move after a scan is to switch to this mode, to examine the accuracy of the estimated depth map and see whether there are any outstanding errors, which can be caused by surface discontinuities, shadows, reflectance artifacts or other disturbances occurring during the scan. The Inspect Mode does not involve any processing of the collected data, but works closely with the other modes; one can switch to this mode at any time for inspection purposes. It is sometimes helpful to switch to a different view, if available, to double check an identified error and gain more confidence.

Figure 6.10: Owl experiment, 3 views captured, currently on view 1.

Figure 6.11: Owl experiment, 3 views captured, currently on view 0, model rotated.


Figure 6.10 shows the projected display after three views are captured, with view 1 currently selected. In the depth map, two white spots are observed and initially identified as an obvious error. The error is more obvious in the top right picture, where it is rendered in 3D with the colour map attached. The two spikes seen in that picture correspond to the two bright spots found in the depth map, and this can be further confirmed by rotating the rendered model to a more suitable angle (figure 6.11), where it can be clearly seen that the two spikes come from the side of the owl's left foot. These spikes come from two tiny spots on the owl's right leg (the one underneath), where the projector fails to illuminate that little area although it is within the view of the camera.

Once the error is identified and confirmed, the user can move on to Touchup Mode to correct it, after which they can switch back to inspect the results again, although this is entirely the user's choice.

6.4.3 Mode 2: Touchup

Touchup Mode gives the user the opportunity to manually touch up the depth map and improve the view, without having to adjust the system parameters or run the shape acquisition stage again. Although this mode does not provide a sophisticated, detailed correction mechanism for the depth map, it does offer a tool for the user to alleviate or erase the most obvious errors based on their own judgement. Once a capture error has been clearly visualised in the Inspect Mode, this correction tool is simple to use, fast, and effective.

In this mode, different functional buttons are provided: a touch pad for locating the cursor and a push button to commit the change. A speed control button is also provided to adjust the cursor speed. The cursor can be positioned quickly near the error point with fast cursor movement; once it is close, slower cursor movement can be used to pinpoint the error spot. The cursor is restricted to the depth map sub-window.

The same owl object is used as an example to illustrate the touchup process. First an error point in the depth map is identified in the Inspect Mode, as shown in figures 6.10 and 6.11. The error actually occurs in the codification stage, where the codewords of a group of pixels are wrongly built and hence the table look-up result for those pixels is incorrect. Figure 6.12 shows a row index image, which is the result of the codification table look-up. In the row index image, the value of a pixel corresponds to the row of the projection image by which it is illuminated, and brighter pixels correspond to higher rows. This image is an off-line inspection used during debugging and is not shown to the user.

Figure 6.12: The row index picture of the first view (brighter pixel values correspond to higher rows in the projection image).

The touch-up process executes a median filter on the area located by the cursor once the commit button is hit. The median filter is very effective for the type of salt-and-pepper noise in this example. The result of the touchup is not only shown on the depth map; it is also reflected instantly on the rendered model in the image to its right (figure 6.13), as the two are synchronised throughout the process. It can be clearly seen that the spikes in the rendered image caused by the depth error are no longer present, compared to figure 6.11. (Note that the big increase in the brightness level of the depth maps between figures 6.13 and 6.11 is caused by scaling: all displayed depth maps are re-scaled to 0-255, otherwise all pixels exceeding 255 would appear as full white.)
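The touch-up operation amounts to replacing a small window of the depth map around the cursor with its median-filtered version; a minimal sketch is shown below (the window radius and kernel size are illustrative parameters, not values from the thesis).

    import numpy as np

    def touch_up(depth, cx, cy, radius=8, ksize=3):
        """Median-filter a small window of the depth map centred on the cursor (cx, cy)."""
        h, w = depth.shape
        x0, x1 = max(cx - radius, 0), min(cx + radius, w)
        y0, y1 = max(cy - radius, 0), min(cy + radius, h)
        patch = depth[y0:y1, x0:x1]
        filtered = patch.copy()
        k = ksize // 2
        for py in range(patch.shape[0]):
            for px in range(patch.shape[1]):
                ys, ye = max(py - k, 0), min(py + k + 1, patch.shape[0])
                xs, xe = max(px - k, 0), min(px + k + 1, patch.shape[1])
                filtered[py, px] = np.median(patch[ys:ye, xs:xe])  # suppress salt-and-pepper spikes
        repaired = depth.copy()                 # keep the original untouched as a backup
        repaired[y0:y1, x0:x1] = filtered
        return repaired

Keeping the unmodified depth map as a backup matches the Yes-or-No choice described below: rejecting the change simply restores the backup data.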

Once the touchup is done, the user is advised to switch back into the Inspect Mode, tune the 3D model into a better pose to double check the questioned part, and see whether any other part of the object needs to be corrected.

The changes made by the median filter to the depth map are also applied to the corresponding 3D point set of the current view. On exiting the touchup process, the user has a final Yes-or-No choice of whether to accept the change permanently. If No is selected, the modified part is restored from the backup data; otherwise, the updated data replaces the old version and participates in further processing.

Figure 6.13: The touchup result of figure 6.10.


6.4.4 Mode 3: Correspondence

Correspondence Mode follows the work flow introduced in sections 5.3 and 5.4. It is named Correspondence Mode because it starts by finding the matching points between an image pair, and the correspondences hold the key to the initial guess of the transform between the two views. This initial guess provides the user with a trial fuse, which can be further adjusted. A minimum of two views is required to perform this task.

While all the back-end image processing tasks were discussed earlier in chapter 5, here we are concerned with the interface and with how to incorporate the back-end process into a collaborative environment. The main principle sustained here is to perform the whole point set fusion process with the user as the decision maker and the computer merely as a work force and a source of guidance.

During the process of image registration and point set fusion, a set of parameters is used at each step. Although a default set of trial parameters works for most scenarios, different objects have different properties (e.g. size, texture, surface reflection) and it is difficult to find the best set of parameters for an individual object. For example, to register a pair of images of a periodic pattern such as a checkerboard (figure 5.4), choosing too big a search window confounds the NCC with mismatches; on the other hand, if the search window is not big enough, the right correspondence might not be found in a largely displaced image pair. Therefore we provide a randomised mechanism that lets the user find those optimal parameters without being exposed to too many technical details. The underlying idea is to keep it simple, and keep it visualised.

The process begins by listing the views that have been scanned. The user is advised to choose two views as the 'from' image and the 'to' image for image registration, in order to transfer the 'from' point set towards the 'to' point set (figure 6.14). If any two views have already been registered previously, a red connection line underneath indicates so. The colour texture maps of the two selected views participate in the registration.

Instead of taking the whole of the two selected images, the system crops the images with a ROI (figure 6.15) based on the expected position and size, both estimated from the object size derived from the point set in 3D space and from the camera imaging geometry (these are all available because the camera-projector pair is calibrated, and the centroid of the object and its minimum and maximum extents along the X, Y, Z axes can all be worked out from the 3D point set). Giving the user the option of choosing the ROI has another purpose: non-rigid objects can be partially deformed while being positioned into different poses, so these deformed parts are ideally excluded from participating in the correspondence matching.
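As an illustration of how such a ROI can be derived from the calibrated geometry, the sketch below projects the 3D bounding box of the point set into the image with an assumed 3 × 3 intrinsic matrix K, with the points already expressed in the camera frame; this is a simplified stand-in for the estimate described above, not the thesis implementation.

    import numpy as np

    def roi_from_pointset(points_3d, K, margin=20):
        """Project the axis-aligned 3D bounding box of an N x 3 point set into
        the image and return an enclosing ROI (x, y, w, h) with a small margin."""
        mins, maxs = points_3d.min(axis=0), points_3d.max(axis=0)
        corners = np.array([[x, y, z] for x in (mins[0], maxs[0])
                                      for y in (mins[1], maxs[1])
                                      for z in (mins[2], maxs[2])])
        uv = (K @ corners.T).T
        uv = uv[:, :2] / uv[:, 2:3]               # perspective division
        x0, y0 = uv.min(axis=0) - margin
        x1, y1 = uv.max(axis=0) + margin
        return int(x0), int(y0), int(x1 - x0), int(y1 - y0)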

Figure 6.14: Correspondence Mode: two images are selected as 'from' and 'to'.

Figure 6.15: Correspondence Mode: ROIs are selected.


After the image pair with ROIs is chosen, the images are enlarged and displayed at the centre of the desktop to show better detail. Three image processing tasks are then performed: corner detection, cross-correlation, and outlier exclusion. The implementation details were discussed earlier in sections 5.3.1 - 5.3.3. While these image processing tasks are performed (figures 6.16 - 6.17), all system parameters are hidden from the user, but the user still has the option to adjust the parameters and re-do the current step with a new set of parameters. At each step of the aforementioned image processing tasks, a set of default parameters pre-set with empirical values is loaded, with the resulting output instantly reflected on the desktop. Every parameter also comes with an allowed range from which it can be randomly selected. If the user is satisfied with the result yielded by the current parameter set, he or she can hit the Proceed button (the one with a tick) and move on to the next step. Otherwise, the user can use the Adjust button (the one with two gears) to select a new combination of parameters, randomly drawn from the allowed closed intervals. This process is repeated until a satisfying result is shown on the desktop before moving on to the next step.
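The Adjust button therefore only needs to re-draw each parameter from its allowed interval; a minimal sketch of this randomised mechanism is given below (the parameter names and ranges are purely illustrative, not the values used by the system).

    import random

    # Allowed closed intervals for the hidden parameters (illustrative values only).
    PARAMETER_RANGES = {
        "corner_quality": (0.01, 0.10),
        "search_window":  (11, 41),       # pixels, odd sizes preferred
        "ncc_threshold":  (0.6, 0.95),
    }

    def adjust_parameters():
        """Return a fresh random parameter combination, as the Adjust button does."""
        params = {}
        for name, (low, high) in PARAMETER_RANGES.items():
            if isinstance(low, int):
                value = random.randint(low, high)
                if name == "search_window" and value % 2 == 0:
                    value += 1             # keep the correlation window odd
            else:
                value = random.uniform(low, high)
            params[name] = value
        return params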

Reasonable outcomes are often achieved at the first attempt. The user is advised to repeat the process a few times using different settings to compare the results, or sometimes to work towards the possibility of an even better result. However, all the parameters are restricted to being randomised only, not directly controllable, to comply with our principle of keeping it simple by leaving all the technical details hidden.

Figure 6.16: Correspondence Mode: extracted corners.

Figure 6.17: Correspondence Mode: correlated and improved point correspondences.


The established correspondences may still not be good enough. This is to be expected when the two participating images are highlighted: there are parts of the left image that appear perspectively deformed in the other image, or that sometimes do not exist at all because of the viewpoint change. Other challenges include lack of texture on the measured object, surface reflections caused by the bright projection light, and deformed parts of objects such as stuffed animals. Further discussion of how to tackle these problems is given in chapter 7.

In figure 6.17, where the correspondences are shown, pressing the Proceed button commits to using the current point correspondences as control points for estimating the rotation and translation vectors. The estimation is a quick process, taking less than a second, after which the second point set is transformed towards the other using the estimated rotation and translation. This is a trial registration of the two point sets, suggested by the system as an initial guess.
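The estimation itself is part of the chapter 5 back-end; as an illustration of how a rotation and translation can be obtained from matched 3D control points, the sketch below uses the standard SVD-based least-squares alignment (a generic example, not necessarily the estimator used in chapter 5).

    import numpy as np

    def estimate_rigid_transform(src, dst):
        """Least-squares R and t such that dst ≈ (R @ src.T).T + t,
        given two N x 3 arrays of corresponding 3D control points."""
        src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
        H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance matrix
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T                             # guard against reflections
        t = dst_c - R @ src_c
        return R, t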

The user can accept this registration by pressing the Proceed button again, or further adjust the positions manually. By switching between the R and T buttons, each of which comes with a set of six buttons for rotating a point set about its centroid around the X, Y, Z axes or translating it along them, the engaged point set can be manipulated rotation-wise and translation-wise respectively (figure 6.18).

During the course of tuning, the first point set (on the left) is used as a reference while the second one is transformed towards it. A final solution is considered to be reached (figure 6.19) once the overlapping areas of the two point sets coincide.
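The six R buttons and six T buttons correspond to small incremental rotations of the point set about its own centroid and translations along the axes; a minimal sketch of these two operations is shown below (the angle and step sizes would be chosen by the interface, not fixed here).

    import numpy as np

    def rotate_about_centroid(points, rx, ry, rz):
        """Rotate an N x 3 point set about its own centroid by Euler angles
        (radians) around the X, Y and Z axes, as the R buttons do."""
        cx, sx = np.cos(rx), np.sin(rx)
        cy, sy = np.cos(ry), np.sin(ry)
        cz, sz = np.cos(rz), np.sin(rz)
        Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
        Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
        Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
        centroid = points.mean(axis=0)
        return (points - centroid) @ (Rz @ Ry @ Rx).T + centroid

    def translate(points, tx, ty, tz):
        """Translate the point set along the X, Y and Z axes, as the T buttons do."""
        return points + np.array([tx, ty, tz])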

Figure 6.18: Correspondence Mode: visualised point set tuning, with controllable rotation and translation.

Figure 6.19: Correspondence Mode: two point sets are fused.

6.4.5 Mode 4: Visualisation

Although the captured data can be visualised by different means in any of the three modes introduced earlier, Visualisation Mode offers the facility to visualise, through 360 degrees, the complete 3D model built through the previous work. In this mode, the controls are not as sophisticated as in the other modes – all the scanned views are listed in the bottom centre control panel area, represented by resized mini versions of their colour textures (figure 6.20). The rendered object is displayed at the centre of the display area, slowly rotating about its centroid as if placed on a turntable.

Note that in Correspondence Mode two point sets are only registered (i.e. the rotation and translation vectors between them are worked out); no point set data is changed. In this mode, all selected point sets are merged together (i.e. one point set is transformed towards the other so that they are in the same coordinate space and share the same centroid).

Figure 6.20: The Visualisation Mode.

Apart from viewing, the only other operation the user can perform in the Visualisation Mode is turning different views on or off, by pressing the corresponding buttons, in order to inspect the 3D model of the measured object. All the views that are turned on are first fused using the estimated transforms between them, worked out previously in the Correspondence Mode. More than one view can be turned on at the same time, or even all of the views (if all the necessary transform information is available) – and this is possible only in this mode. If no view is selected, nothing is displayed.

However, not all of the views can be selected arbitrarily and fused together. A few ground rules apply when choosing views to be fused:

• If two views are to be selected, they must either be registered in the Correspondence Mode (i.e. the transform vectors between them are available), or both be registered with the same third view.

• Registration relay is also allowed (e.g. if views 1 and 2, 2 and 3, and 3 and 4 are all registered, then views 1 and 4 are registered too).

• All inter-registered views are categorised into the same group, and only views from the same group can be visualised at the same time.

The reason behind the above rules is that any two registered views can be regarded as having a path between them – the rotation and translation vectors. Suppose the rotation vector from view A to view B is R_AB = (θ, φ, ψ) and its translation vector is T_AB = (Tx, Ty, Tz); then the rotation and translation vectors from view B to view A are R_BA = (−θ, −φ, −ψ) and T_BA = (−Tx, −Ty, −Tz). This relationship propagates across multiple views: as long as there is no stand-alone view that is not registered to any of the others, there is always a path of transforms by which a view can be transformed to any other view's orientation and position.
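These grouping rules, including registration relay, amount to tracking connected components over the registered view pairs; a minimal union-find sketch (illustrative, not the thesis data structure) is:

    class ViewGroups:
        """Track which scanned views can be fused together: registering two
        views joins their groups, and registration relay (1-2 and 2-3
        registered implies 1 and 3 are in the same group) follows for free."""

        def __init__(self, n_views):
            self.parent = list(range(n_views))

        def find(self, v):
            while self.parent[v] != v:
                self.parent[v] = self.parent[self.parent[v]]   # path compression
                v = self.parent[v]
            return v

        def register(self, a, b):
            """Call after views a and b have been registered in Correspondence Mode."""
            self.parent[self.find(a)] = self.find(b)

        def can_fuse(self, views):
            """True only if every selected view belongs to the same group."""
            return len({self.find(v) for v in views}) <= 1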

Table 6.1 gives an example of the propagation of this relationship. We again consider the scenario used earlier in this chapter, in which five different views of the owl are scanned while the sixth view has not yet been captured. It starts from stage 0, where none of the five views is registered to any other and the views are not grouped. At stage 1, view 1 and view 2 are registered, so they are labelled as group 1; a red connection line is drawn between them to indicate this relationship. At stage 2, view 3 and view 4 are registered as a new group, group 2, indicated by a green connection line underneath. Up to this point there are two separate groups among the five scanned views, each indicated by a different colour, advising the user that a view from the red group and a view from the green group, or the stand-alone view 5, cannot be displayed together, because there is no way to fuse them. After stage 3, a new registration is completed between views 1 and 5, and the same grouping process is carried out. The situation changes completely after stage 4, in which views 2 and 3 are registered: this registration brings the two groups into one. In other words, a registration between any two views, one from each of two different groups, results in the same merged grouping.


Stage   View (from)   View (to)   Number of groups
0       n/a           n/a         0
1       1             2           1
2       3             4           2
3       1             5           2
4       2             3           1

(The 'Relationship lines' column of the original table contained the coloured connection-line diagrams and is not reproduced here.)

Table 6.1: Grouping status of the point sets at different stages.

Figures 6.21, 6.22, and 6.23 show the process of a model of the owl being built from three central views. By fusing views 2 and 3 together and visualising the fused model, it can be seen from figure 6.21 that the right-facing object in view 2 completes the left wing, which is partially invisible in view 3 where the object faces straight up. However, when the same model is rotated around its centroid until its right side is exposed, it is clear that the right wing of the current model is missing data.

We notice that the object in view 4 faces left and its right side is visible, while still sharing a fair amount of overlapping area with view 3. By fusing view 4 into the model previously built from views 2 and 3, another part of the object is filled in, as shown in figure 6.23.

Figure 6.21: Views 2 and 3 fused together. View 2 completes the left wing of the owl.

Figure 6.22: Views 2 and 3 fused together.

Figure 6.23: Fusion of views 2, 3, and 4.


6.5 Conclusions

In this chapter we presented a working and user friendly interface for the VAE system designed in this research. This interactive interface is a mixed environment of real objects and projected signals, in which the user's interactions with these objects and projections are captured and interpreted, and responded to through adjusted projections. The techniques introduced in chapters 4 and 5 are both integral parts of the designed system, while efficient monitoring of the interactive surface and accurate response to it rely on the explicit calibration presented in chapter 3.

Two widgets were introduced and implemented to simulate two of the most frequently used gestures in human-computer interaction: the button push for triggering events and the touchpad slide for positioning.

Four major facilities are provided to accomplish the task of 3D input, with which the user can inspect the captured data from different view angles, point out and correct errors, manipulate the projection signals, and finally build and visualise the complete 3D model. Other tools, such as a desktop lock-down and a snapshot tool, are also provided for practical use during the process.


6.5.1 Future Work

In an interactive user interface, an easy-to-use and efficient interaction tool is always desired. Future work on fingertip detection could benefit the system: provided robust finger detection is implemented across the whole projection area, touch-up would become much easier, as the user could point a finger directly at the questionable area.

Drag and drop of virtual elements on the desktop is another possible extension of the finger detection. Previous work at York [74] yields promising results and lays the foundation for future work in this area.

As a final inspection of the built 3D model, the visualisation mode (section 6.4.5) could be further elaborated. A possible implementation of touch-up in 3D space would be a big plus, as this is the stage where errors are likely to be rediscovered. Efficient and quick responses are needed to correct those errors on the rendered model straightaway, in a visualised way, rather than by repeatedly going back to the 2D models.


Chapter 7

System Evaluation

Most of the techniques used in this research have already been evaluated and justified at the appropriate stages earlier in the thesis. In this chapter, we present informal user tests to evaluate the system performance. In particular, the system performance on different test objects is evaluated, to provide insight into how best to achieve good results in the presence of technical challenges and practical issues.


7.1 Test Objects

7.1.1 An Overview

An overview of the objects used for the experiments is given in table 7.1. Each object is represented by a thumbnail, an object name, and a brief description.

7.1.2 Object Descriptions

The objects chosen for the user tests cover a variety of sizes, colours, and surface materials. For example, the owl appeared in previous chapters as the example object because it presents various challenges to the techniques presented in the early part of this thesis. It has both convex and concave regions across its surface, which easily cause shadows when illuminated from certain angles. The owl itself does not lack texture, but its fluffy surface complicates the texture mapping, because the same texture can appear totally different due to the inter-reflections caused by the uneven surface. Furthermore, the back of the owl completely lacks texture.

Other test objects present different technical challenges. The football is an example of high specular reflectance. Although the system is not designed for human body measurement, because of the top-down projector-camera setup, we still ran a test to evaluate how well the system performs on such an object and to see where it could be improved. During the human body test, the table top is lowered; this is not a computer vision driven decision, but purely to comply with health and safety regulations.

Object       Description
Cushion      A small soft cushion with bright colour texture. A small turtle is attached to the right side, but the tropical fish is just a 2D pattern.
Football     A small spherical object, slightly deflated so that it stands on the table by itself. The surface has high specular reflection.
Stand        A mid-sized object made of cardboard and wrapped with brown packing paper, hardly reflecting any light.
Owl          A fairly big stuffed animal. It has a soft and fluffy surface, and part of its body deforms when its pose is changed.
Human Body   A user lying on the desktop. Rigidity is not guaranteed, as the relative position between the head and the upper body can change from one pose to another.

Table 7.1: An overview of the objects used for the tests. (The thumbnail column of the original table is not reproduced here.)


In the rest of this chapter, test frameworks are designed to test the individual main techniques proposed and to evaluate their performance on various types of objects. The system is then evaluated as a whole.

7.2 Shape Acquisition

In this section, the performance of shape acquisition using structured light is evaluated on the different objects. Most of the techniques involved in a structured light scan were either discussed or experimentally tested in chapter 4, but it is still unclear how these separate pieces work as a whole. This section addresses that issue.


Object       No. of views   Initial error (per view)   Error after touchup (per view)   Initial diagnosis
Cushion      2              5 (2.5)                    0 (0)                            black part of the object surface
Football     5              16 (3.2)                   0 (0)                            common field of view problem (regions that can only be seen from the camera)
Stand        5              36 (7.2)                   5 (1)                            surface reflection
Owl          5              6 (1.2)                    0 (0)                            concave parts of the surface fail to be illuminated by the projector because of occlusion
Human Body   3              4 (1.3)                    0 (0)                            distance from the object to the projector-camera pair

Table 7.2: Evaluation: depth capture errors and their corrections.

Table 7.2 lists the performance of the shape acquisition process on objects of different size, shape, and surface. It also shows the amount of effort required to touch up the most obvious errors until all captured depth information is reasonably accurate upon visual inspection. The numbers shown in the table are the numbers of parts (e.g. spikes, jumps, holes, etc.) believed to be errors; the numbers in brackets are the average number of errors per view. The third column is the initial error in the captured depth maps, and the fourth column is the number of unerasable errors remaining after the user touch-up. Initial diagnoses of the possible reasons for the errors are listed in the last column, to be justified further.
justified.<br />

7.2.1 The Owl Experiment

Generally speaking, the best depth capture result comes from the Owl experiment. Despite the owl being the second biggest of the five objects tested, it has a more continuous surface, and the camera and the projector share a close common viewing area of the surface (i.e. where the projector can reach is where the camera can see, and vice versa). The only obvious inaccurate measurement is at the concave part at the bottom of the owl's feet. The erroneous part, seen as a bright dot in figure 7.1(a), is tiny and can easily be erased by a single touch-up.

7.2.2 The Football and Stand Experiments

In this section, two objects are tested together so that they can be compared. There are a few dissimilarities between the objects in the Football and Stand experiments. The capture results of three views are listed for each of these two experiments, in figures 7.4 and 7.3.

Figure 7.1: Shape acquisition test: Owl. (a) Depth map and (b) rendered model before touchup; (c) depth map and (d) rendered model after touchup.

• The difference in specular reflectance. The football is a rigid spherical object with a high-gloss surface; for testing purposes it is slightly deflated so that it can be placed firmly on the desktop without a stand. The brown stand is an object made of cardboard, but wrapped in reflective brown packing paper. Reflectance in the Football experiment is more severe than in the Stand experiment; however, due to the spherical surface of the football, the high reflectance is focused onto a single point. Note that in figure 7.4 the error caused by the high-gloss surface has already been filtered out by applying a smoothing filter to the scanned data, so the user touchup can be spared.

Figure 7.2: The projector-camera pair setup. The shaded part is the 'dead' area that cannot be illuminated by the projector but is in the viewing range of the camera.

• The difference in the projection light required. As mentioned above, the high-gloss surface of the football causes an overall increase in pixel values across the image, so an adjustment of the projection brightness is required to stop the white balance of the captured image becoming so high that the texture details are lost. The opposite needs to be done for the Stand experiment. In the implementation, a projection brightness of 100 is used for the Football, and 200 for the Stand.

• Both cases suffer from shadows and occlusions, but in slightly different ways. In the Football experiment, it can be seen from figure 7.4 that the only obvious depth inaccuracy occurs near the bottom rim of the sphere. This is because a small part of the desktop always stays out of the illumination but is in the viewing range of the camera (see figure 7.2). For the Stand, the case is different: the object is wrapped in a material that hardly reflects any light. It is illustrated in figure 7.3 that all the planes nearly parallel to the projection rays are severely affected, because not enough projection light is reflected from the surface to the camera plane. There are also two small areas of depth inaccuracy caused by shadows, but these can easily be corrected using the touchup tool provided.

(a) Depth map (b) Colour map (c) Depth map (d) Colour map (e) Depth map (f) Colour map
Figure 7.3: Shape acquisition test: Stand. Left column: depth maps; right column: the corresponding textures.


(a) Depth map (b) Colour map (c) Depth map (d) Colour map (e) Depth map (f) Colour map
Figure 7.4: Shape acquisition test: Football. Left column: depth maps; right column: the corresponding textures.


It is noticed in table 7.2 that the Stand is the only test with unerasable capture errors. This is because the erroneous parts are too big for the median filter to handle.
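As an illustration only (a minimal sketch, not the filter actually used in the system; the depth-map representation and the window size are assumptions), the following C++ fragment shows the kind of k x k median filter referred to here. A window of width k can only remove blemishes narrower than roughly half the window, which is why an error region as large as the one in the Stand test survives the filtering and has to be counted as unerasable.

#include <algorithm>
#include <vector>

// Apply a k x k median filter to a depth map stored row-major in 'depth'.
// An error region wider than about k/2 pixels still supplies the majority
// of the samples in every window that covers its centre, so the median
// cannot remove it; only small spikes and isolated holes are corrected.
std::vector<int> medianFilterDepth(const std::vector<int>& depth,
                                   int width, int height, int k /* odd */)
{
    std::vector<int> out(depth);            // border pixels are left untouched
    const int r = k / 2;
    std::vector<int> window;
    window.reserve(k * k);

    for (int y = r; y < height - r; ++y) {
        for (int x = r; x < width - r; ++x) {
            window.clear();
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx)
                    window.push_back(depth[(y + dy) * width + (x + dx)]);
            std::nth_element(window.begin(),
                             window.begin() + window.size() / 2,
                             window.end());
            out[y * width + x] = window[window.size() / 2];
        }
    }
    return out;
}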

7.2.3 The Cushion and Human Body Experiment

We compare the results of the Cushion and Human Body experiments together because of the similarities they share. In both experiments, fewer views are used. For the cushion, front and back are the only two views captured, as it is hard to place the cushion in other orientations. When the human body is being measured, we lower the table first (for the reason stated in section 7.1.2) and then the tester lies on the table. Three views are captured: one facing left, one facing right and the third facing up.

In these two tests, the object surfaces are continuous and convex, hence the problems we had in figures 7.4 and 7.3 do not occur here. However, 'holes' in the depth images are found at the eyes and tail of the fish, and at part of the human's hair, which are all black areas. After studying the captured Gray-coded stripe images, it is found that all those areas appear black (precisely, with pixel values of 0) in the observed image whether they are illuminated by the white or the black projection. As a result, they stay 0 in the subtraction image of the positive and negative images, and are labelled as background pixels.
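The labelling rule just described can be summarised by the following minimal sketch. It is an illustration rather than the system's actual routine; the threshold value and the flat-array image representation are assumptions.

#include <cstddef>
#include <vector>

enum PixelLabel { STRIPE_ONE, STRIPE_ZERO, BACKGROUND };

// Classify each pixel from a positive/negative Gray-code pair.
// 'pos' is captured under the projected pattern, 'neg' under its inverse.
// A surface that reflects no light gives pos and neg both close to 0, so
// the difference falls below the threshold and the pixel is treated as
// background, which is exactly the behaviour seen at the black areas.
std::vector<PixelLabel> labelStripePixels(const std::vector<int>& pos,
                                          const std::vector<int>& neg,
                                          int threshold /* assumed */)
{
    std::vector<PixelLabel> label(pos.size(), BACKGROUND);
    for (std::size_t i = 0; i < pos.size(); ++i) {
        const int diff = pos[i] - neg[i];
        if (diff > threshold)
            label[i] = STRIPE_ONE;   // lit by the white part of the pattern
        else if (diff < -threshold)
            label[i] = STRIPE_ZERO;  // lit by the white part of the inverse
        // |diff| <= threshold: no usable signal, leave as BACKGROUND
    }
    return label;
}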

(a) Depth map (front view, before touchup) (b) Colour map (front view) (c) Depth map (back view) (d) Colour map (back view)
Figure 7.5: Shape acquisition test: Cushion. Left column: depth maps; right column: the corresponding textures.


(a) Depth map (b) Colour map (c) Depth map (d) Colour map
Figure 7.6: Shape acquisition test: Human Body. Left column: depth maps; right column: the corresponding textures.

7.3 Correspondence Finding

The test framework for evaluating correspondence finding is set up as follows. For each object test, we pick two adjacent views and run the correspondence program on the image pair. Depth and point set data are touched up if there are any obvious errors, before we start finding the correspondences.

As introduced earlier, when doing the corner detection the user is provided with a facility to randomise a parameter set and run the program, and the instant results are projected onto the desktop for inspection. The exact values of the parameters, such as the search range, the eigenvalue threshold or the window size for local aggregation, are all hidden from the user. While repeating the process by randomising the parameter set, it is not necessarily the parameter set which yields the most corners that is chosen as the optimal one. The user is advised to use his own judgement by looking at the results projected onto the desktop. This is similar to debugging a C program on a local PC, the only difference being that in this application the user does not have to know anything about the technical details, which are hidden. Therefore, we apply the same rule of 'how to choose the optimal parameter set' in the test framework, to simulate the user's behaviour.
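This randomise-and-inspect loop can be pictured with the minimal sketch below. It is an illustration, not the implementation used in the system: the parameter fields, their ranges and the two callbacks (runAndProject and userAccepts) are assumptions introduced purely for the example; the text above only states that the search range, eigenvalue threshold and aggregation window are randomised and hidden, and that the user judges the projected result rather than the corner count.

#include <functional>
#include <random>

// One candidate parameter set for the corner detector (fields and ranges
// below are assumptions for illustration only).
struct CornerParams {
    int    searchRange;     // pixels
    double eigenThreshold;  // minimum eigenvalue accepted
    int    aggWindow;       // local aggregation window size (odd)
};

CornerParams randomParams(std::mt19937& rng)
{
    std::uniform_int_distribution<int>     range(5, 40);
    std::uniform_real_distribution<double> eig(0.001, 0.05);
    std::uniform_int_distribution<int>     win(1, 5);
    return CornerParams{ range(rng), eig(rng), 2 * win(rng) + 1 };
}

// Repeatedly draw a parameter set, run the detector and project the result.
// The set with the most corners is NOT automatically kept; the loop stops
// only when the user accepts what is projected onto the desktop.
CornerParams tuneByInspection(const std::function<int(const CornerParams&)>& runAndProject,
                              const std::function<bool(int)>& userAccepts)
{
    std::mt19937 rng(std::random_device{}());
    for (;;) {
        CornerParams p = randomParams(rng);
        int corners = runAndProject(p);   // detect corners, then project them
        if (userAccepts(corners))
            return p;
    }
}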

Object        Corners        Corners         No. of            User adjustment   Total time
              (left image)   (right image)   correspondences   required?         spent (minutes)
Cushion       n/a            n/a             n/a               Y                 1
Football      42             51              18                Y                 1
Stand         87             105             29                N                 2.5
Owl           206            197             30                Y                 4
Human Body    102            113             39                N                 2

Table 7.3: Evaluation: building correspondences.

Table 7.3 shows the test results.

It is noticed that the first test, Cushion, does not have results for the number of corners detected or the number of correspondences built. This is because only two views are captured for the cushion, one the top view and one the bottom view. Although these two captured views complete the object model, they share no overlapping part. Therefore it is meaningless to run the correspondence search between the two images. In the test, we skip the corner extraction and correlation steps and go straight into the tuning. The tuning task is straightforward too, as all the user has to do is turn the second view over (rotate it by 180°).
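That turning-over step amounts to a 180° rotation of the second point set about a vertical axis through its centroid, as in the minimal sketch below (the choice of axis and the point representation are assumptions; the text above only specifies a 180° rotation).

#include <vector>

struct Point3 { double x, y, z; };

// Turn the back-view point set over by rotating it 180 degrees about a
// vertical axis through its centroid. Rotating by pi about the y axis maps
// an offset (dx, dy, dz) from the centroid to (-dx, dy, -dz).
void flipAboutVerticalAxis(std::vector<Point3>& pts)
{
    if (pts.empty()) return;

    double cx = 0.0, cy = 0.0, cz = 0.0;
    for (const Point3& p : pts) { cx += p.x; cy += p.y; cz += p.z; }
    cx /= pts.size(); cy /= pts.size(); cz /= pts.size();

    for (Point3& p : pts) {
        p.x = cx - (p.x - cx);   // negate the x offset from the centroid
        p.z = cz - (p.z - cz);   // negate the z offset from the centroid
        // y (the height) is unchanged by a rotation about the vertical axis
    }
}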

For the rest of the objects, more time is usually spent on the bigger objects. The correspondence search in the Stand and Human Body tests works very well, hence the initial trial rotation and translation vectors given by the computer are accepted without further user adjustment. The Owl experiment takes longer: lots of corner points are detected but only a small portion of them are found to match. It can be seen from figure 7.7 that only about half the percentage of detected corners is turned into correspondences, compared to the other objects.

(a) (b)
Figure 7.7: Number of extracted corner points and matched correspondences.


7.4 Conclusions

In this chapter, five objects are used as the test objects to evaluate the system performance within a controlled test framework. Although many more objects have been tested in this research, the five listed here are the most representative ones illustrating the impact of different objects on the results. This includes the surface reflectance of the objects, their texture, convexity and concavity, rigidity, and the level of depth continuity across the surface.

Two key components of the system, shape acquisition via structured light scanning and point set registration from point correspondences, are tested. Statistics and experimental results give diagnoses of, and possible solutions to, the problems caused by the aforementioned challenges, and provide the foundation on which future research can be built.


Chapter 8

Conclusions

8.1 Summary

All of the chapters presented in this thesis contain their own introductions and conclusions. Apart from the Introduction, the Background and this Conclusions chapter itself, the rest of the thesis is summarised as follows:

• Chapter 3 Calibration
Methods for complete calibration of the VAE system are presented. This includes a full calibration of the projector-camera system for their intrinsic and extrinsic parameters, and the calibration of a plane-to-plane homography between the rendered projector plane and the captured image plane, induced by a third plane.

• Chapter 4 Shape Acquisition
A Gray-coded structured light scan is implemented for acquiring depth information. It is then extended and adapted to tackle the practical issues raised, before being incorporated into the whole VAE framework.

• Chapter 5 Registration of Point Sets
A framework for 3D point set registration is presented in this chapter. A conventional image registration technique is used to find corresponding points between a pair of 2D images, and the established correspondences are propagated from 2D to register the point sets in 3D space (a small sketch of this propagation step is given after this list). This framework is shown to work not only on planar surfaces, but also on arbitrary objects with the user's assistance in a VAE system, where no ground-truth information is known a priori.

• Chapter 6 System Design
This is the core of this research. A new system design is presented in this chapter, for 3D input by working collaboratively with the PC in a VAE. The proposed system is cheap to maintain with off-the-shelf hardware, and easy to deploy, requiring minimal configuration of the projector-camera pair. The system presented is intended not to be restricted to the research laboratory environment.

• Chapter 7 User Experiments
Major components of the system are evaluated in this chapter, with controlled test frameworks.
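As a small illustration of the propagation step summarised in the Chapter 5 item above (a sketch under assumed types, not the implementation described in that chapter), the fragment below lifts matched 2D pixel positions to pairs of 3D points through a hypothetical lookup3D helper that returns the reconstructed 3D coordinate at a pixel; the resulting 3D pairs are what the rigid transformation between the two point sets is then estimated from.

#include <functional>
#include <utility>
#include <vector>

struct Point2 { int x, y; };
struct Point3 { double x, y, z; };

// Given matched 2D pixel positions between two views, collect the
// corresponding pairs of 3D points. lookup3D is an assumed helper that
// returns the 3D point reconstructed at a pixel of the given view
// (from the structured-light scan of that view).
std::vector<std::pair<Point3, Point3> >
propagateTo3D(const std::vector<std::pair<Point2, Point2> >& matches2D,
              const std::function<Point3(const Point2&, int view)>& lookup3D)
{
    std::vector<std::pair<Point3, Point3> > pairs3D;
    pairs3D.reserve(matches2D.size());
    for (const auto& m : matches2D)
        pairs3D.push_back(std::make_pair(lookup3D(m.first, 0),
                                         lookup3D(m.second, 1)));
    return pairs3D;
}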

8.2 Discussions

System-wise, one of the most important design goals is to allow users to bring their objects to be input, walk up to the VAE and start the task without worrying about the technical details of computer vision or how to produce the code to do so. We aim to create an environment where the computer and its attached vision equipment work as an assistant to the user, while the user always makes the final call on key decisions based on the feedback from this interactive collaboration. Higher-cost equipment such as HMDs and touch screens, and other customised tools such as markers and gloves, are all avoided, as the system presented here is designed not only for laboratory purposes, but also for home and office environments and other open public spaces such as schools and museums.


8.3 Future Work

Techniques employed in this research are evaluated in separate chapters. Although the framework itself contains techniques that are already widely used in the field, it brings these techniques together in a new, practical and efficient way. But as mentioned before, many of the system elements would benefit from further improvement and optimisation.

There are planned improvements for the techniques used in the system. In calibration, manual adjustment of the photometric settings of the camera and the projector is not only inconvenient but also inefficient. Further development of the calibration framework would include automatic photometric calibration.

Once automated photometric calibration is feasible, it might be sensible to exploit colour-based structured light techniques which allow real-time scanning of depth information. There are also other planned improvements for the shape acquisition framework, as described in section 4.6.1.

Assigning the user more power and initiative in the point set registration stage would be another big step forward because, if appropriately designed and implemented, it would steer the registration process more quickly and efficiently towards the optimal result, while the user's leading role is still maintained.

As mentioned in section 6.5.1, robust fingertip detection and touchup in 3D space are regarded as two major improvements for future work. Successful fingertip detection would not only simplify the user interface by reducing the number of interactive buttons required, but also offer a new dimension of user interaction, as locating a point would be much easier either on a physical object or on a virtual element. 3D touchup could effectively be a consequence of deploying fingertip detection, and it would be a big boost if the user were allowed to manipulate the rendered object model using his bare hands as if he were touching the real object.

Regarding the user tests carried out in chapter 7, they are still mainly at a descriptive stage. The next task in measuring system performance would be to get testers from a variety of backgrounds, from computer vision academics to people with little experience of the field, to characterise the system both behaviourally and experimentally.




Appendix A

Declarations for class CButton

#pragma once


// number of buttons
#define NUM_BUTTONS 59

// default threshold used for inner region, if button calibration is skipped
#define BUTTON_TH_INNER 10.00

// default threshold used for outer region, if button calibration is skipped
#define BUTTON_TH_OUTER 5.0

// the time period a button stays highlighted for, in milliseconds
#define BUTTON_INT 500


// Top-left corners of all buttons
int button_pos[NUM_BUTTONS*2] =
{
    954, 618,  // 0: lock // 60x40
    954, 538,  // 1: save
    954, 458,  // 2: SL
    954, 378,  // 3: SL_repeat
    954, 298,  // 4: exit

    10, 588,   // 5: thumbnail (80 x 60)
    10, 508,   // 6: thumbnail (80 x 60)
    10, 428,   // 7: thumbnail (80 x 60)
    10, 348,   // 8: thumbnail (80 x 60)
    10, 268,   // 9: thumbnail (80 x 60)
    10, 188,   // 10: thumbnail (80 x 60)

    80, 698,   // 11: INSPECT MODE
    150, 698,  // 12: TOUCHUP MODE
    220, 698,  // 13: CORRESPONDENCE MODE
    290, 698,  // 14: visualization mode

    380, 698,  // 15: up
    460, 698,  // 16: down
    540, 698,  // 17: left
    620, 698,  // 18: right
    700, 698,  // 19: in
    780, 698,  // 20: out

    370, 698,  // 21: v_expand
    440, 698,  // 22: v_shrink
    510, 698,  // 23: h_expand
    580, 698,  // 24: h_shrink
    660, 698,  // 25: roi_up
    730, 698,  // 26: roi_down
    800, 698,  // 27: roi_left
    870, 698,  // 28: roi_right

    370, 673,  // 29: touchpad (150 x 90)
    530, 698,  // 30: double cursor speed
    600, 698,  // 31: push button

    884, 698,  // 32: manual search
    954, 698,  // 33: mouse assisted

    880, 698,  // 34: param
    954, 698,  // 35: proceed

    370, 673,  // 36: R
    370, 723,  // 37: T

    450, 698,  // 38: R_x+
    520, 698,  // 39: R_x-
    590, 698,  // 40: R_y+
    660, 698,  // 41: R_y-
    730, 698,  // 42: R_z+
    800, 698,  // 43: R_z-

    450, 698,  // 44: T_x+
    520, 698,  // 45: T_x-
    590, 698,  // 46: T_y+
    660, 698,  // 47: T_y-
    730, 698,  // 48: T_z+
    800, 698,  // 49: T_z-

    870, 698,  // 50: x1, x2, x4, x8

    450, 698,  // 51: pointset0
    520, 698,  // 52: pointset1
    590, 698,  // 53: pointset2
    660, 698,  // 54: pointset3
    730, 698,  // 55: pointset4
    800, 698,  // 56: pointset5

    880, 698,  // 57: no
    870, 618   // 58: tuning pose
};


// Button IDs
enum BUTTON_ID
{
    SYS_LOCK,
    SYS_SAVE,
    SYS_SL,
    SYS_SL2,
    SYS_EXT,

    THUMB_0,
    THUMB_1,
    THUMB_2,
    THUMB_3,
    THUMB_4,
    THUMB_5,

    MODE_INSPECT,
    MODE_TOUCHUP,
    MODE_CORRESP,
    MODE_VISUAL,

    CTRL_UP,
    CTRL_DOWN,
    CTRL_LEFT,
    CTRL_RIGHT,
    CTRL_IN,
    CTRL_OUT,

    CTRL_ROI_VEXPAND,
    CTRL_ROI_VSHRINK,
    CTRL_ROI_HEXPAND,
    CTRL_ROI_HSHRINK,
    CTRL_ROI_UP,
    CTRL_ROI_DOWN,
    CTRL_ROI_LEFT,
    CTRL_ROI_RIGHT,

    CTRL_TOUCHPAD,
    CTRL_DOUBLE_SPEED,
    CTRL_PUSHBUTTON,

    CTRL_MANUAL,
    CTRL_MOUSE,

    CTRL_PARAM,
    CTRL_PROCEED,

    CTRL_R,
    CTRL_T,

    CTRL_R_XP,
    CTRL_R_XM,
    CTRL_R_YP,
    CTRL_R_YM,
    CTRL_R_ZP,
    CTRL_R_ZM,

    CTRL_T_XP,
    CTRL_T_XM,
    CTRL_T_YP,
    CTRL_T_YM,
    CTRL_T_ZP,
    CTRL_T_ZM,

    CTRL_CHANGE_SPEED,

    CTRL_SELECT_0,
    CTRL_SELECT_1,
    CTRL_SELECT_2,
    CTRL_SELECT_3,
    CTRL_SELECT_4,
    CTRL_SELECT_5,

    CTRL_NO,
    CTRL_TUNING_POSE,
};



//--------------------------------------------
// CButton class declaration
//--------------------------------------------

class CButton
{
private:
    CvRect mProRect;       // button position/size in projector image
    CvRect mCamRect;       // button position/size in camera image
    CvRect mCamInnerRect;  // inner region for button push detection

    char *mpImageName;     // name of the image to be loaded for the button
    char *mpHelpText1;     // help text 1st line
    char *mpHelpText2;     // help text 2nd line

    bool mFlagActive;      // a flag that indicates whether the current button is engaged or not
    bool mFlagHighlighted; // a flag that indicates whether the current button is highlighted or not

    // Constructor
    CButton();

    // Destructor
    ~CButton();

public:

    // Based on the size in the projector image, calculate the buttons' expected positions in the camera image
    void SetSize(int px, int py, int pxsize, int pysize);

    // Initialise nth button
    void Initialise(int n);

    // On given image, get inner region avg
    double GetInnerAvg(picture_of_int *inpic);

    // On given image, get outer region avg
    double GetOuterAvg(picture_of_int *inpic);

    bool Pressed();
    bool Released();
    void Flash();
    void Highlight();
    void Dehighlight();

    void Attach();
    void Detach();
    void AttachText();
    void DetachText();
    void AttachNewText(char *inText1, char *inText2);

    // Black text with white background, as opposed to normal text
    void AttachInverseText();

    void DrawButtonBoundary(colour_picture &inpic);
};

Listing A.1: Header: Button.h


Appendix B

Declarations for class CPointSet

#pragma once

#include <vector>
#include "XMLParser.h"

using namespace std;

typedef std::vector<CvMat*>   CvMat_vector;    // element types assumed
typedef std::vector<CvScalar> CvScalar_vector; // element types assumed



//--------------------------------------------
// CPointSet class declaration
//--------------------------------------------
class CPointSet
{
private:

    //------------------------
    // Main data
    //------------------------
    int mLength;             // total number of points
    CvMat *mpObjectPoints;   // 3D coordinates
    CvMat *mpImagePoints;    // 2D positions
    CvScalar *mpColour;      // colour information

    int mLength_bk;
    CvMat *mpObjectPoints_bk;
    CvMat *mpImagePoints_bk;
    CvScalar *mpColour_bk;

    //------------------------
    // Matrices
    //------------------------
    CvMat *mpCentroid;
    CvMat *mpRvec;                  // 3x1 instant rotation vector
    CvMat *mpTvec;                  // 3x1 instant translation vector
    CvMat *mpRvecInter[NUM_VIEWS];  // 3x1 inter-pointset rotation vectors
    CvMat *mpTvecInter[NUM_VIEWS];  // Same as above, but vectors for translation
    int mMergedGroup;               // which group this point set is merged to: -1 for non-merge, 0 for group0, 1 for group1, and so on...

    //------------------------
    // Rendered images
    //------------------------
    picture_of_int *mpImageBwPic;    // black and white model
    colour_picture *mpImageColorPic; // model attached with colour information


    //-------------------------------------------------
    // Constructor, Destructor
    //-------------------------------------------------
    CPointSet();
    ~CPointSet();

    // Overloaded operator, for point set replication
    CPointSet& operator=(CPointSet& param);

public:
    //-------------------------------------------------
    // Primary functions
    //-------------------------------------------------

    // Load point set from XML
    void LoadXML(char *fileName);

    // Save point set to XML
    void SaveXML(char *fileName);

    // Reallocate both front data and backup data
    void ReallocateAllMemory(int len, int len_bk);

    // Reallocate memory for front data with size of len
    void ReallocateFrontMemory(int len);

    // Reallocate memory for backup data with size of len
    void ReallocateBackMemory(int len);

    // Replace front data with backup
    void ResetFromBackup();

    // Save front data into backup
    void SaveToBackup();

    // Default -1 means list all data; otherwise list the nth element
    void List(int index=-1);

    // Given a 2D image coordinate, find the point in the point set, and return its index
    int GetIndex(int xin, int yin);

    // Cut off out-of-boundary points and zero-depth points
    void RestrictSize(int size);

    // Slim with voxel quantisation
    void Slim(int objSize, int voxSize);


    //-------------------------------------------------
    // Point set transform in 3D
    //-------------------------------------------------
    void UpdateCentroid();
    void Rotate();
    void Translate();

    // Rotate + Translate + UpdateCentroid
    void FullTransform();

    // Rotate about the WCS origin
    void RotateAboutOrigin();

    // Theta rotation about unit vector (x, y, z)
    void RotateThetaAboutVector();

    // Manually fine tune rotation or translation. flag: -1, do nothing; flag: 1~6 for rotation; flag: 7~12 for translation
    void StepAdjustRorT(int flag=-1);


    //-------------------------------------------------
    // Plotting and display
    //-------------------------------------------------

    // Draw rendered point set into an image for display
    void DrawBw(int flagTopHalf=0, int flagInterp=0, int interpStep=1);

    // Draw rendered point set into an image for display (with colour info attached)
    void DrawColor(int flagTopHalf=0, int flagInterp=0, int interpStep=1);
};

Listing B.1: Header: PointSet.h


Appendix C

Declarations for class CView

#pragma once


#include "PointSet.h"
#include "Cursor.h"

// number of views (maximum allowed)
#define NUM_VIEWS 6

// number of views to be tested, debug mode
#define TESTING_VIEWS 5



//--------------------------------------------
// CView class declaration
//--------------------------------------------

class CView
{
private:
    int mViewIndex;               // index of the current view

    // Four sub images for display
    picture_of_int *mpDepthPic;   // depth map
    picture_of_int *mpTextPic;    // texture map
    colour_picture *mpModelPic;   // rendered model
    colour_picture *mpColourPic;  // colour map

    // 4 sub rects, each half the size of 640x480
    CvRect mDepthRect, mTextRect, mModelRect, mColourRect;

    // Thumbnail position
    CvRect mThumbRect;

    // buffer image for fast push and pop of the central display area
    colour_picture *mpCentralDisplayPic;

    // Cursor member, for cursor rendering and positioning
    CCursor mCursor;

    // Point set member
    CPointSet *mPointset;
    // flag indicating the current tuning mode (rotation or translation)
    bool mFlagFineTuneRorT;
    // if the point set of the current view is merged away to other views, set it true
    bool flag_PointSetMergedAway;

    // ROI for image registration (left image)
    CvRect mCorrespROIRect1;
    // ROI for image registration (right image)
    CvRect mCorrespROIRect2;

    // Constructor
    CView(int);
    // Destructor
    ~CView();

public:

    //--------------------------------------------
    // primary functions
    //--------------------------------------------

    // Initialise the current view, allocate memory, assign positions
    void Initialise();

    // Get ROI based on object dimension and point set centroid, then work out
    // the estimated area in which the object is going to appear in the
    // observed image, and crop it.
    void PrepareThumbnail(char* fname);

    // Attach four sub images
    void AttachDisplay();

    void PushCentralDisplay();
    void PopCentralDisplay();
    void ClearCentralDisplay();
    void FadeCentralDisplay();
    void UnfadeCentralDisplay();

    // Attach small box on thumbnail and big box on central display,
    // draw all connection lines
    void AttachBox(int Rval, int Gval, int Bval);
    void DetachBox();


    //--------------------------------------------
    // Touchup mode
    //--------------------------------------------

    // Adjust rendered model picture, based on incoming flag
    // n = 0~5: up, down, left, right, in, out
    void AdjustFijipic(int n);

    // Do TouchUp on the depth image, based on the current cursor location.
    // This will change the contents of the depth data, point set data, and
    // colour map, all with backup. Once done, set flagTouchUpModified = true
    void TouchUp();

    bool flagTouchUpModified;


    //--------------------------------------------
    // Correspondence mode
    //--------------------------------------------

    // Called when the user selects the 'from' and 'to' images for registration
    void UpdateCorrespSelectionDisplay(int selection);

    // Same as above, just remove everything completely (without any repairs)
    void RemoveCorrespThumbMainDisplayCompletely();

    // System gives trial ROI selections
    void AutoSelectRODisplay();

    // Select the ROI of the chosen images, slide them into the centre for a better view
    void SlideImages();


    //--------------------------------------------
    // Visualize Mode
    //--------------------------------------------

    // Default is -1: do nothing; flag 0~5: for rotations; flag 6~11: for translations
    void UpdateVisualPanelArea(int flag = -1);

};

Listing C.1: Header: View.h
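Similarly, the sketch below illustrates one plausible sequence of calls on the public CView interface. Since the constructor is declared private, the CView instance is assumed to be created and owned elsewhere; the function name ShowViewExample and the thumbnail file name are hypothetical.

#include "View.h"

// Hypothetical usage sketch (not taken from the thesis code base).
// The CView object is assumed to be constructed elsewhere, since its
// constructor is private in Listing C.1.
void ShowViewExample(CView &view)
{
    char thumbName[] = "thumb.bmp";     // hypothetical thumbnail file name

    view.Initialise();                  // allocate memory, assign display positions
    view.PrepareThumbnail(thumbName);   // crop the estimated object area for the thumbnail
    view.AttachDisplay();               // attach the four sub images
    view.AttachBox(255, 0, 0);          // draw the highlight boxes in red
    view.UpdateVisualPanelArea(0);      // flag 0~5: rotations; 6~11: translations
    view.DetachBox();                   // remove the highlight boxes again
}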

