Human-Computer Collaboration
in Video-Augmented Environment
for 3D Input

Lijiang Li

Doctor of Philosophy Dissertation
University of York
Department of Electronics
May 2008
Declaration

Except where otherwise stated in the text, this dissertation is the result of my own independent work and investigation, and is not the outcome of work done in collaboration. Other sources are acknowledged by footnotes giving explicit references. A bibliography is appended at the end of this thesis.

This dissertation is not substantially the same as any I have submitted for a degree or diploma or any other qualification at any other university. No part of this dissertation has already been, or is currently being, submitted for any such degree, diploma or other qualification.
Abstract

The role of the computer has gradually changed from a mere tool to an assistant to the human. Equipping computers with I/O devices and sensors makes them aware of the surrounding world and capable of interacting with humans. Video cameras and data projectors are ideally suited as these sensor devices, and the dramatic drop in their manufacturing costs has made them more and more popular. A new type of user interface has emerged in which video signals are used to augment and enhance the physical world; hence the name Video-Augmented Environment.

This thesis presents a design of human-computer interactions in a VAE for 3D input. It begins by introducing an automated and efficient method for fully calibrating the projector-camera system. Shape acquisition techniques are discussed, and one particular technique based on structured light systems is adapted for capturing depth information. A user-guided approach for registering depth information scanned from different parts of the target object is introduced. Finally, a practical realisation of a Video-Augmented Environment is presented, combining the techniques discussed earlier.

Overall, the VAE designed in this thesis demonstrates the feasibility of completing computer vision tasks in a human-computer collaborative environment, and shows the potential and viability of being deployed not only in the laboratory but also in office and home environments.
Acknowledgements

Completing a PhD is a marathon event, and I would not have been able to complete this journey without the support and encouragement of countless people over the last four years.

First and foremost, I would like to express my deep and sincere gratitude to my supervisor, Professor John Robinson, Head of the Department of Electronics, University of York. His wide knowledge and expertise have been invaluable to me, while his personal guidance and constructive criticism have provided a good basis for my research and this thesis.

Many thanks to Justen Hyde and Daniel Parnham for providing the OpenIllusionist framework, from which the frame grabber originated, and for their help with a variety of other implementation issues. I wish to express my thanks to my lab partner Matthew Day and to Eddie Munday for many inspiring talks and their participation in user experiments. My warm thanks are due to Owen Francis and the other CSG group members for their assistance.

During my placement with the FCG team at British Telecommunications in Ipswich, I collaborated with many colleagues, and I wish to extend my warmest thanks to Dr. Li-Qun Xu, Ian Kegel and all those who helped me with my work. Their insights and comments were of great value during my placement, and I look forward to a continuing collaboration with the FCG team in the near future.

Finally, I owe my most loving thanks to my mum, for single-handedly raising me over the last twenty years. I would not have been where I am without her support, her constant instilling of confidence in me, and most importantly her love.

Lijiang Li
York, UK, May 2008
Contents

Abstract . . . . . . . . 3
Acknowledgements . . . . . . . . 4

1 Introduction . . . . . . . . 19
1.1 Problem Statement . . . . . . . . 19
1.2 Terminologies . . . . . . . . 21
1.2.1 Augmented Reality and Virtual Reality . . . . . . . . 21
1.2.2 Video-Augmented Environments . . . . . . . . 21
1.3 Goals . . . . . . . . 22
1.4 Thesis Organisation . . . . . . . . 24
1.5 Contributions . . . . . . . . 25

2 Background and Prior Art . . . . . . . . 28
2.1 Image based 3D capture methods for depth estimation . . . . . . . . 29
2.1.1 Feature Based Methods . . . . . . . . 30
2.1.2 Optical Flow Based Methods . . . . . . . . 31
2.2 Active Shape Acquisition Methods . . . . . . . . 33
2.2.1 The Use of Structured Light System . . . . . . . . 35
2.3 Video-Augmented Environments (VAEs) . . . . . . . . 35
2.3.1 Related example VAEs in the past . . . . . . . . 36
2.3.2 Previous work at York . . . . . . . . 41
2.4 Conclusions . . . . . . . . 50

3 Calibration . . . . . . . . 51
3.1 Introduction . . . . . . . . 51
3.2 Background . . . . . . . . 55
3.3 Calibration Parameters . . . . . . . . 57
3.3.1 Intrinsic Parameters . . . . . . . . 57
3.3.2 The Reduced Camera Model . . . . . . . . 61
3.3.3 Extrinsic Parameters . . . . . . . . 62
3.3.4 Full Model . . . . . . . . 64
3.4 Calibrate Camera-Projector Pair . . . . . . . . 65
3.4.1 World Coordinate System . . . . . . . . 65
3.4.2 Methodology . . . . . . . . 66
3.4.3 Data Collection . . . . . . . . 67
3.4.4 Choice of colour . . . . . . . . 70
3.4.5 Camera Calibration . . . . . . . . 73
3.4.6 Projector Calibration . . . . . . . . 74
3.5 Plane to Plane Calibration . . . . . . . . 78
3.6 Conclusions . . . . . . . . 82
3.6.1 Future Work . . . . . . . . 83

4 Shape Acquisition . . . . . . . . 87
4.1 Introduction . . . . . . . . 87
4.2 Background . . . . . . . . 89
4.3 Gray Codification . . . . . . . . 93
4.3.1 Gray Code Patterns . . . . . . . . 93
4.3.2 Pattern Generation . . . . . . . . 95
4.3.3 Codification Mechanism . . . . . . . . 98
4.4 Practical Issues . . . . . . . . 101
4.4.1 Image Levels . . . . . . . . 101
4.4.2 Limited Camera Resolution . . . . . . . . 102
4.4.3 Inverse subtraction . . . . . . . . 104
4.4.4 Adaptive thresholding . . . . . . . . 107
4.5 Depth from Triangulation . . . . . . . . 109
4.5.1 Final Captured Data . . . . . . . . 112
4.6 Conclusions . . . . . . . . 116
4.6.1 Future Work . . . . . . . . 118

5 Registration of Point Sets . . . . . . . . 121
5.1 Introduction . . . . . . . . 123
5.2 Background . . . . . . . . 125
5.2.1 Rotations and Translations in 3D . . . . . . . . 125
5.2.2 A Singular Value Decomposition (SVD) Based Least Square Fitting Method . . . . . . . . 126
5.3 Image Registration . . . . . . . . 127
5.3.1 Corner Detector . . . . . . . . 127
5.3.2 Normalised Cross Correlation . . . . . . . . 129
5.3.3 Outlier Removals . . . . . . . . 131
5.4 Fusion . . . . . . . . 136
5.4.1 Data structure of a point set . . . . . . . . 136
5.4.2 Point set fusion with voxel quantisation . . . . . . . . 137
5.4.3 User Assisted Tuning . . . . . . . . 141
5.5 Rendering A Rotating Object . . . . . . . . 143
5.6 Conclusions . . . . . . . . 145
5.6.1 Future Work . . . . . . . . 146

6 System Design . . . . . . . . 148
6.1 Introduction . . . . . . . . 148
6.2 Widgets Provided for Interaction . . . . . . . . 151
6.2.1 Introduction . . . . . . . . 151
6.2.2 Background . . . . . . . . 154
6.2.3 Practical Issues . . . . . . . . 156
6.2.4 Implementation of Pushbutton . . . . . . . . 157
6.2.5 Implementation of Touchpad . . . . . . . . 164
6.3 User interface . . . . . . . . 166
6.4 Main Utilities . . . . . . . . 169
6.4.1 Overview . . . . . . . . 170
6.4.2 Mode 1: Inspect . . . . . . . . 172
6.4.3 Mode 2: Touchup . . . . . . . . 175
6.4.4 Mode 3: Correspondence . . . . . . . . 179
6.4.5 Mode 4: Visualisation . . . . . . . . 186
6.5 Conclusions . . . . . . . . 193
6.5.1 Future Work . . . . . . . . 194

7 System Evaluation . . . . . . . . 195
7.1 Test Objects . . . . . . . . 196
7.1.1 An Overview . . . . . . . . 196
7.1.2 Object Descriptions . . . . . . . . 196
7.2 Shape Acquisition . . . . . . . . 198
7.2.1 The Owl Experiment . . . . . . . . 200
7.2.2 The Football and Stand Experiment . . . . . . . . 200
7.2.3 The Cushion and Human Body Experiment . . . . . . . . 206
7.3 Correspondences Finding . . . . . . . . 208
7.4 Conclusions . . . . . . . . 212

8 Conclusions . . . . . . . . 213
8.1 Summary . . . . . . . . 213
8.2 Discussions . . . . . . . . 215
8.3 Future Work . . . . . . . . 216

A Declarations for class CButton . . . . . . . . 230
B Declarations for class CPointSet . . . . . . . . 236
C Declarations for class CView . . . . . . . . 240
List of Figures

1.1 Mixed Reality. . . . . . . . . 22

2.1 Optical flow of approaching objects. . . . . . . . . 31
2.2 The DigitalDesk. (image courtesy of the Computer Laboratory, University of Cambridge) . . . . . . . . 37
2.3 An image of the BrightBoard. (image courtesy of the Computer Laboratory, University of Cambridge) . . . . . . . . 39
2.4 User interacts with the ALIVE system. (image courtesy of the MIT Media Lab) . . . . . . . . 41
2.5 The LivePaper system in use. (image courtesy of the Visual Systems Lab, University of York) . . . . . . . . 42
2.6 The LivePaper applications. (image courtesy of the Visual Systems Lab, University of York) . . . . . . . . 43
2.7 Snapshots of Penpets in action. (image courtesy of the Visual Systems Lab, University of York) . . . . . . . . 45
2.8 Audio d-touch interface (the augmented musical stave). (image courtesy of the Computer Laboratory, University of Cambridge) . . . . . . . . 47
2.9 Snapshots of Robot Ships in action. (image courtesy of the Visual Systems Lab, University of York) . . . . . . . . 49

3.1 Calibration objects. (image courtesy of [109]) . . . . . . . . 53
3.2 Principal points. Bottom right subimage is the imaging plane. . . . . . . . . 58
3.3 The distortion effects. . . . . . . . . 60
3.4 Transformation from world to camera coordinate system. . . . . . . . . 62
3.5 Flow chart of the camera-projector pair calibration. (diagram of image processing after the projections and captures are done) . . . . . . . . 68
3.6 Extraction of the projected pattern from the mixed one. . . . . . . . . 71
3.7 Extraction of the projected pattern from the mixed one (a closer look). . . . . . . . . 72
3.8 Extraction of the projected pattern from the mixed one (a closer look). . . . . . . . . 74
3.9 Pixel values of an image captured from a plain desktop. (bottom two showing the red channel only) . . . . . . . . 85

4.1 A 9-level Gray-coded image. (only a slice from each image is shown here, to illustrate the change between adjacent codewords) . . . . . . . . 94
4.2 Comparison: minimum level of Gray-coded and binary-coded images needed to encode 16 columns. . . . . . . . . 95
4.3 Point-line triangulation. . . . . . . . . 98
4.4 Binary encoded pattern divides the surface into many subregions. . . . . . . . . 99
4.5 Stripes being projected onto a fluffy doll. (10 level Gray coded stripes) . . . . . . . . 100
4.6 The alias effect causing errors in depth map. . . . . . . . . 103
4.7 3D plots of figure 4.6. . . . . . . . . 105
4.8 Inverse subtraction of original image and its flipped version. . . . . . . . . 106
4.9 The inverse subtraction: the football experiment. . . . . . . . . 108
4.10 Depth map. . . . . . . . . 113
4.11 Colour texture. . . . . . . . . 114
4.12 Scattered point set in 3D. (re-sampled at every 2 millimetres) . . . . . . . . 115
4.13 Scattered point set in 3D, attached with colour information. (re-sampled at every 2 millimetres) . . . . . . . . 116
4.14 Illustration of camera limited resolution. . . . . . . . . 118

5.1 A routine of point set registration. . . . . . . . . 125
5.2 Corner detection. . . . . . . . . 130
5.3 NCC results. . . . . . . . . 132
5.4 NCC results (periodic pattern). . . . . . . . . 133
5.5 Robust estimation. (inliers shown by red connecting lines) . . . . . . . . 135
5.6 Robust estimation. (inliers shown by index numbers) . . . . . . . . 136
5.7 Data structure of a point set. . . . . . . . . 137
5.8 Voxel quantisation of the large data set. . . . . . . . . 138
5.9 Different quantisation levels by choosing different voxel sizes. . . . . . . . . 139
5.10 The captured objects of figure 5.12. . . . . . . . . 140
5.11 The captured objects of figure 5.12. . . . . . . . . 141
5.12 The quantisation effect of choosing different voxel sizes on the total point set size. . . . . . . . . 142
5.13 Manual tuning of point set registration. . . . . . . . . 143
5.14 Different rendered views. (top: rendered range images; bottom: rendered object attached with colour texture) . . . . . . . . 144

6.1 A snapshot with touchpad and buttons. . . . . . . . . 153
6.2 A captured image showing an object being scanned. . . . . . . . . 154
6.3 Finger detection. . . . . . . . . 157
6.4 Button calibration. . . . . . . . . 159
6.5 The True Positive Rate (TPR) and False Positive Rate (FPR) of button push detection. . . . . . . . . 160
6.6 The projected buttons and their observations in the camera image. (The red blocks only indicate the area to be monitored.) . . . . . . . . 163
6.7 Fingertip detection using background segmentation algorithm. . . . . . . . . 165
6.8 A screen shot of the working environment. . . . . . . . . 167
6.9 Screen shot of the system start-up state. . . . . . . . . 171
6.10 Owl experiment, 3 views captured, current on view 1. . . . . . . . . 174
6.11 Owl experiment, 3 views captured, current on view 0, model rotated. . . . . . . . . 174
6.12 The row index picture of the first view. (the brighter pixel values correspond to higher rows in the projection image) . . . . . . . . 177
6.13 The touchup result of 6.10. . . . . . . . . 178
6.14 Correspondence Mode: two images are selected as 'from' and 'to'. . . . . . . . . 181
6.15 Correspondence Mode: Regions of Interest (ROIs) are selected. . . . . . . . . 181
6.16 Correspondence Mode: extracted corners. . . . . . . . . 183
6.17 Correspondence Mode: correlated and improved point correspondences. . . . . . . . . 183
6.18 Correspondence Mode: visualised point sets tuning, with controllable rotation and translation. . . . . . . . . 185
6.19 Correspondence Mode: two point sets are fused. . . . . . . . . 186
6.20 The Visualisation Mode. . . . . . . . . 187
6.21 View 2 and 3 fused together. View completes the left wing of the owl. . . . . . . . . 191
6.22 View 2 and 3 fused together. . . . . . . . . 192
6.23 Fusion of views 2, 3, and 4. . . . . . . . . 192

7.1 Shape acquisition test: Owl. Top two: before touchup; bottom two: after touchup. . . . . . . . . 201
7.2 The projector-camera pair setup. The shaded part is the 'dead' area that cannot be illuminated by the projector but is in the viewing range of the camera. . . . . . . . . 202
7.3 Shape acquisition test: Stand. Left column: depth maps; right column: the corresponding textures. . . . . . . . . 204
7.4 Shape acquisition test: Football. Left column: depth maps; right column: the corresponding textures. . . . . . . . . 205
7.5 Shape acquisition test: Cushion. Left column: depth maps; right column: the corresponding textures. . . . . . . . . 207
7.6 Shape acquisition test: Human Body. Left column: depth maps; right column: the corresponding textures. . . . . . . . . 208
7.7 Number of extracted corner points and matched correspondences. . . . . . . . . 211
List of Tables

4.1 10 level Gray code look-up table. . . . . . . . . 97

6.1 Grouping status of the point sets at different stages. . . . . . . . . 190

7.1 An overview of the objects used for the tests. . . . . . . . . 197
7.2 Evaluation: depth capture errors and their corrections. . . . . . . . . 199
7.3 Evaluation: building correspondences. . . . . . . . . 210
List of Acronyms

AD Absolute Intensity Differences
ALIVE Artificial Life Interactive Video Environment
AR Augmented Reality
DOF Degree of Freedom
FOE Focus of Expansion
FOV Field of View
FPR False Positive Rate
FRF Fast Rejection Filter
GUI Graphical User Interface
HCI Human-Computer Interface
HMD Head Mounted Display
MAD Mean Absolute Difference
MR Mixed Reality
MSE Mean Squared Error
NCC Normalised Cross-Correlation
PCA Principal Component Analysis
PTZ Pan-Tilt-Zoom
RANSAC RANdom SAmple Consensus
ROI Region of Interest
SD Squared Intensity Differences
SVD Singular Value Decomposition
TPR True Positive Rate
TUI Tangible User Interface
UI User Interface
VAE Video-Augmented Environment
VR Virtual Reality
WCS World Coordinate System
WTA Winner-Takes-All
WWW World Wide Web
XML Extensible Markup Language
Chapter 1

Introduction

1.1 Problem Statement

The goal of computer vision is to make useful decisions about physical objects and scenes based on sensed images [89]. Therefore, it is almost always necessary to describe or model these objects in some way from images. Arguably there is no better way to do this than reconstructing 3D models from 2D images, because 3D vision is natural to humans and can therefore provide structural information in the way humans perceive it most readily.
Over recent years, researchers and scientists have been fascinated by the possibility of building intelligent machines or vision systems capable of understanding the physical world and representing it in 3D space. They are also keen to bring these vision systems into people's day-to-day lives, and to use them to bridge the gap between the physical world where humans live and the virtual world the computer generates.

This research is inspired by this context. We aim to develop a vision system for efficient 3D shape input. From data input to finally building the complete 3D model of the target object, it may take several captures of the object positioned in different orientations, and tasks such as error removal and data fusion are carried out in a human-computer collaborative way, in an environment that mixes real-world objects with augmented video signals.

It is also important that all hardware used in the system is day-to-day equipment that is easy to obtain in an office environment. Inexpensive peripherals and easy-to-use software are used so that the system can be applied in various environments, especially targeting museum exhibitions and home gamers.

In conclusion, the system should not only accomplish the 3D shape input task but also efficiently and collectively utilise the skills of the human and the power of the computer, in a visual environment subject to illumination changes where physical objects and virtual elements co-exist.
1.2 Terminologies

1.2.1 Augmented Reality and Virtual Reality

Virtual Reality (VR) is a synthetic world in which we interact with virtual objects generated by computers or other equipment, instead of the real objects that surround us in the real world. Augmented Reality (AR), sometimes known as Mixed Reality (MR), mixes the real physical world with the world of VR by enhancing the real world with augmented virtual information.

1.2.2 Video-Augmented Environments

In this thesis, a VAE is a kind of projector-camera system in which a user's interactions with objects and projections are interpreted by a vision system, leading to changes in the augmented signals. It is a specific type of AR, where the augmentation could be anything from overlaid instructions to a virtual object that appears to exist in the physical world and responds to the environment according to the human's instructions. In this way, objects can appear to be augmented [52, 73], or the user can manipulate graphical data by gesture [63, 13, 30]. A significant property of such systems is that 3D objects and projected images are combined in a single mixed environment.

Figure 1.1: Mixed Reality.
1.3 Goals<br />
We contend that <strong>in</strong> many non-<strong>in</strong>teractive vision problems, a valid and<br />
sometimes superior solution can be atta<strong>in</strong>ed through a human user or<br />
users collaborat<strong>in</strong>g with automated analysis. Previous work at York has<br />
reported applications <strong>in</strong> fast panorama construction [80], AudioPhotoDesk<br />
[41], d-touch [29], movie footage logg<strong>in</strong>g [70] and this thesis considers 3D<br />
22
object acquisition, in a human-computer collaborative way. The importance of the user in the VAE presented in this thesis is highlighted, as it enables 3D modelling without expensive presentation systems (e.g. servos). Any such system must combine vision and interaction techniques with the design goal of higher efficiency than a purely automated system that requires passive human operator time. The system presented in this thesis uses the projector-camera pair of a VAE to acquire range images, then user interaction in the augmented environment to identify corresponding points for building a full 3D model. The automation can then take over again to suggest prototype registrations for further adjustment by the user. There are also simple facilities for touching up range images. The result is an efficient 3D acquisition system that can be deployed without conventional input devices such as keyboards, mice, or laser pointers.
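The "suggest a prototype registration" step can be illustrated with a standard least-squares rigid fit over user-identified point correspondences. This is a generic Kabsch-style sketch with synthetic data, not the implementation described later in the thesis:

```python
import numpy as np

def rigid_registration(P, Q):
    """Least-squares rigid transform (R, t) aligning point set P to Q,
    given correspondences P[i] <-> Q[i] (Kabsch method)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                              # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])     # avoid reflection
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

# Toy check: rotate a tetrahedron 90 degrees about z and shift it.
P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
Q = P @ Rz.T + np.array([2.0, 0.0, -1.0])
R, t = rigid_registration(P, Q)
print(np.allclose(P @ R.T + t, Q))   # → True
```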
In short, this work aims:
• To analyse and extend the use of video as an input device.
• To devise and implement different image and video processing components that make an augmented reality 3D input device possible.
• To design a human-computer collaborative system for inputting 3D shapes, and evaluate its performance.
1.4 Thesis Organisation
The rest of the thesis is organised as follows.
A literature review covering the history of major image-based 3D capture methods and the prior art of VAEs is given in chapter 2.
Chapter 3 introduces the calibration of the projector-camera pair as the system configuration stage. The calibration serves two purposes: it gives the internal and external geometry of the projector-camera system, and it provides the bi-directional transform between the projection signals and their observations in the camera image.
Chapter 4 is concerned with the technique used for shape acquisition as the first stage of data input from real objects.
In chapter 5 we introduce the method for extracting 3D information from the scanned range images, and how to fuse the 2.5D data into a complete 3D model.
Chapter 6 gives the workflow built from the aforementioned key components. The interactive user interface is also presented in this chapter.
Experimental results and performance evaluation from the user test are given in chapter 7.
Chapter 8 draws conclusions and outlines possible future work.
1.5 Contributions
The work described in this thesis is the development of a tabletop-based VAE for fast 3D input via collaborative work between the human and the computer.
In chapter 3, a fully automatic method for calibration of the camera-projector pair is proposed and implemented. The work is inspired by a widely used Matlab-based camera calibration toolbox, which is extended and converted to C++ to make it capable of calibrating the projector-camera system in a fully automatic manner. The Matlab toolbox is also used off-line to manually evaluate and validate the calibration results from our own automatic method. Initial testing of the calibration data suggests it is not only suitable for tabletop-based monitoring, but also capable of supporting full 3D applications.
In chapter 4, Gray-coded structured light projection is implemented for the acquisition of two-and-a-half-dimensional (2.5D) depth maps. Despite the method itself being well-established and widely used, efforts are made to incorporate it into the interactive VAE system and overcome the issues raised in practice, such as the ever-changing lighting conditions and the varied surface reflections caused by different object materials. Problems such as the large distance between the camera-projector pair and the projection surface, and the aliasing effect caused by limited camera capture resolution, are tackled as well.
A framework for 3D point set registration is developed in chapter 5. It begins with the conventional image registration method for planar objects, then extends it to work for arbitrary objects in the VAE, with no a priori ground truth information available.
This thesis proposes a new system design for inputting 3D shapes using an interactive VAE system. This is a major contribution and it is detailed in chapter 6. The proposed system is cheap to maintain with off-the-shelf hardware, and easy to deploy with minimal configuration of the projector-camera pair. It is also not restricted to controlled laboratory environments.
The designed system also allows multi-user collaboration, and a user is able to walk up to the VAE and use their bare hands, without the Head Mounted Display (HMD), gloves or markers that most current VAE systems rely on. The top-down projection arrangement is also user-friendly, as it dramatically reduces the chance of the user's eyes being hurt by the bright projection light.
Although the system proposed here contains techniques that are already widely used in the field, it brings them together in a new, practical and efficient way. Very few comparable systems can be deployed both outside restricted laboratory environments and at very low cost, by avoiding expensive hardware such as touch screens and HMDs. Initial test results show it provides a solid foundation for future research in this field, and opens up many promising directions for future work.
Chapter 2
Background and Prior Art
This research combines 3D shape acquisition with video augmented reality. The shape acquisition is a tool for capturing 2.5D depth information, and it can be used repeatedly to build a complete 3D model from a set of different views of the object being measured. The VAE is an augmented reality in which projected visual signals augment the real world. The background to both areas is reviewed here.
2.1 Image-based 3D capture methods for depth estimation
Humans perceive depth visually using both of their eyes. A simple experiment shows this: if one tries to point the tips of two pens towards each other with one eye closed, it is almost impossible to succeed. The same thing happens when moving a finger towards a wall with one eye closed; it becomes very hard to judge visually the distance between the fingertip and the wall. The reason is that humans rely on binocular stereopsis for visual depth perception.
The root of the word stereopsis, stereo, comes from the Greek word stereos, meaning firm or solid [100]. With stereo vision a solid object is perceived in three spatial dimensions, width, height and depth, which are geometrically represented as the X, Y and Z axes. During the perception process, each human eye captures its own view and the two separate images are sent on to the brain for processing. When the two images arrive simultaneously at the back of the brain, they are united into one 2.5D representation based on their similarities, giving the human an observation in three dimensions. In the field of computer vision, the human ability for depth perception using binocular stereopsis has been modelled by two displaced cameras to obtain 3D information about the investigated scene.
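The two-camera model admits a one-line triangulation: for a rectified pair with focal length f (in pixels) and baseline B, a point observed with disparity d between the two images lies at depth Z = fB/d. A minimal sketch with illustrative numbers:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth for a rectified two-camera rig: a scene point
    imaged with disparity d lies at depth Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")          # zero disparity = point at infinity
    return focal_px * baseline_m / disparity_px

# Illustrative rig: 800 px focal length, 10 cm baseline.
print(depth_from_disparity(8.0, 800.0, 0.10))    # → 10.0 (metres)
print(depth_from_disparity(16.0, 800.0, 0.10))   # → 5.0
```

Note the inverse relationship: halving the depth doubles the disparity, which is why stereo precision degrades with distance.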
2.1.1 Feature Based Methods
A feature-based stereo matching algorithm produces a depth map that best describes the shape of the surfaces in the scene via a set of matching features used as correspondences. The correspondences are usually found from points, lines, corners, contours or other distinguishing features extracted from both of the observed images [37].
During matching, the most commonly used matching cost functions are the pixel-based Squared Intensity Differences (SD) [1, 55] and Absolute Intensity Differences (AD) [55], which are sometimes averaged into the Mean Squared Error (MSE) and Mean Absolute Difference (MAD). Other widely used traditional matching costs include Normalised Cross-Correlation (NCC), which is similar to the MSE, and binary matching costs based on features such as edges [20, 43] or the sign of the Laplacian [71]. More recently, various robust measures [10, 11, 86] have been proposed to limit the influence of mismatches. Once the matching costs are computed, local and window-based methods aggregate the cost by summing or averaging over a support region. In local methods, the final disparity at each pixel is the one associated with the minimum cost value; this is often known as the Winner-Takes-All (WTA) approach. A resulting limitation is that uniqueness of matches is enforced for only one of the two images: pixels in the second image might be pointed to by multiple points from the first image, or vice versa. Efficient global methods such as max-flow [82] and graph-cut [91, 17] have been proposed to solve this optimisation problem and have produced promising results.
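As an illustration of local cost aggregation followed by WTA selection, the toy matcher below uses absolute-difference costs on a pair of scanlines; the window size, disparity range and intensities are all invented for illustration:

```python
def wta_disparity(left, right, window=1, max_disp=3):
    """Local stereo matching on two scanlines (lists of intensities):
    aggregate absolute-difference costs over a window, then pick the
    winner-takes-all disparity at each pixel."""
    n = len(left)
    disparities = []
    for x in range(n):
        best_d, best_cost = 0, float("inf")
        for d in range(min(max_disp, x) + 1):        # candidate shifts
            cost = 0.0
            for dx in range(-window, window + 1):    # window aggregation
                xl, xr = x + dx, x - d + dx
                if 0 <= xl < n and 0 <= xr < n:
                    cost += abs(left[xl] - right[xr])
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities

left  = [10, 10, 80, 80, 10, 10]
right = [10, 80, 80, 10, 10, 10]   # same scene, shifted by one pixel
print(wta_disparity(left, right))  # → [0, 1, 1, 1, 1, 0]
```

The textureless end pixels fall back to disparity 0, showing why local WTA needs distinctive intensity structure to match reliably.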
Comprehensive reviews of the aforementioned techniques are provided in [57, 87, 47].
2.1.2 Optical Flow Based Methods
Optical flow based methods recover structure information from the optical flow observed between two images of a moving rigid object, or between two images taken from different points of view of a stationary object.
Optical flow is the distribution of velocities of movement of brightness patterns in an image, where the brightness patterns can be objects but normally refer to pixels for further processing [49]. It arises from relative motion between objects and the observer: either a moving camera imaging a static scene, or objects moving in front of the camera. In either setting, more than one image is taken and optical flow is computed to estimate the 3D locations of the features of interest.
(a) first image (b) second image (c) observed flow
Figure 2.1: Optical flow of approaching objects.
As seen in figure 2.1, the simulation has the ground and an object approaching the observer at different relative speeds and directions. Some of the points on the ground have the same instantaneous velocities, but when they are perceived by human eyes, their images cross the retina with different velocities and directions. All the velocities are represented as rays sharing the same vanishing point, called the Focus of Expansion (FOE). The FOE of the ground is within the image and easy to find, but for the moving object, its FOE is located outside of the image.
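The FOE construction can be written down directly: each flow vector defines a ray through its image point, and a least-squares intersection of those rays estimates the FOE. A toy sketch on a synthetic expanding flow field (all coordinates illustrative):

```python
def focus_of_expansion(points, flows):
    """Least-squares intersection of flow rays: each flow vector v at
    image point p defines a line through p along v; under pure camera
    translation all such lines meet at the focus of expansion."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (vx, vy) in zip(points, flows):
        nx, ny = -vy, vx                  # normal to the flow direction
        a11 += nx * nx; a12 += nx * ny; a22 += ny * ny
        d = nx * px + ny * py             # n . p for this line
        b1 += nx * d; b2 += ny * d
    det = a11 * a22 - a12 * a12           # solve the 2x2 normal equations
    return ((a22 * b1 - a12 * b2) / det,
            (a11 * b2 - a12 * b1) / det)

# Toy field expanding from (50, 40): each flow points away from the FOE.
pts = [(60, 40), (50, 70), (20, 10), (80, 90)]
flo = [(px - 50, py - 40) for px, py in pts]
print(focus_of_expansion(pts, flo))   # → (50.0, 40.0)
```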
In recent approaches, optical flow [49, 65, 7, 1, 40, 10, 11] is widely used to estimate the dense correspondence derived from consecutive frames. In [49] a gradient-based method is presented to compute the optical flow, while there are also feature-based [35, 12, 92] and correlation-based methods [8, 90, 61]. Once the correspondence is established, the 3D locations of the corresponding features can then be computed if information about the camera is known.
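One well-known gradient-based formulation, Lucas-Kanade, solves the brightness-constancy constraint Ix·u + Iy·v + It = 0 in a least-squares sense over a window. The sketch below is a generic single-window version on a synthetic shifted image, illustrative only and not the specific method of [49]:

```python
def lucas_kanade(I1, I2):
    """Single-window Lucas-Kanade flow: build the 2x2 least-squares
    system from the brightness-constancy constraint Ix*u + Iy*v + It = 0,
    using central-difference gradients averaged over both frames."""
    h, w = len(I1), len(I1[0])
    a11 = a12 = a22 = b1 = b2 = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (I1[y][x+1] - I1[y][x-1] + I2[y][x+1] - I2[y][x-1]) / 4.0
            iy = (I1[y+1][x] - I1[y-1][x] + I2[y+1][x] - I2[y-1][x]) / 4.0
            it = I2[y][x] - I1[y][x]
            a11 += ix * ix; a12 += ix * iy; a22 += iy * iy
            b1 -= ix * it;  b2 -= iy * it
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# Quadratic test image translated by exactly (1, 1) pixel.
I1 = [[x * x + y * y for x in range(8)] for y in range(8)]
I2 = [[(x - 1) ** 2 + (y - 1) ** 2 for x in range(8)] for y in range(8)]
print(lucas_kanade(I1, I2))   # → (1.0, 1.0)
```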
One of the most comprehensive discussions and evaluations of existing optical flow computation methods is given in [5].
2.2 Active Shape Acquisition Methods
The techniques discussed in section 2.1 cover most of the scenarios in computer vision for depth estimation, but under certain circumstances things can still be done in a proactive way to help enhance performance. For example, when measuring a white object with no texture at all, it is often hard to extract distinguishing features or compute the optical flow. In this case it would be helpful to manually put some marks on the object, such as squares or triangles, to help locate the interest points. Similarly, in an augmented environment, controlled lighting can replace those squares and triangles as an aid to identifying more interesting features.
Imagine waving a pen over the inspected object under a constant light to cast shadows across the scene. The shadow is expected to be a thin line, but is deformed by the shape of the surface underneath. Can structural information be retrieved from the deformed shadows? Another example is turning on different lights in a room and using a camera to monitor an object under the different lighting conditions; again, is there any structural information induced in the observed images?
One of the active methods is photometric stereo [106], which can estimate the object surface orientation from several images taken from the same viewpoint but under distinct illumination from different directions. Under most circumstances, the surfaces being measured are assumed to obey Lambert's cosine law, which states that the irradiance (i.e. light emitted or perceived) is proportional to the cosine of the angle between the surface normal and the light source direction; this relationship can be represented by the reflectance map [107, 108]. A big advantage of the photometric stereo method is that it can be used as a texture classifier. For instance, suppose a surface with many protuberant horizontal curved stripes is imaged under a single constant illumination. If the surface is rotated by 90 degrees while the lighting remains the same (i.e. in strength and orientation), conventional texture-based correspondence matching fails because of the big change in the appearance of the surface. For these types of applications, photometric stereo is the answer, provided the rotation is known.
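Lambert's law makes the recovery linear: stacking the k observed intensities I_k = albedo · (n · l_k) gives a linear system for g = albedo · n, solvable with three or more known light directions. A sketch for a single surface point, with synthetic light directions and albedo:

```python
import numpy as np

def photometric_stereo(light_dirs, intensities):
    """Recover albedo and surface normal of a Lambertian point from
    images under k >= 3 known, distant light directions:
    I_k = albedo * dot(n, l_k)  =>  solve L g = I for g = albedo * n."""
    L = np.asarray(light_dirs, dtype=float)
    I = np.asarray(intensities, dtype=float)
    g, *_ = np.linalg.lstsq(L, I, rcond=None)   # least-squares solve
    albedo = np.linalg.norm(g)
    return albedo, g / albedo

# Toy point: albedo 0.8, normal (0, 0, 1), four synthetic unit lights.
lights = [(0, 0, 1), (0.6, 0, 0.8), (0, 0.6, 0.8), (-0.6, 0, 0.8)]
n_true = np.array([0.0, 0.0, 1.0])
obs = [0.8 * np.dot(n_true, l) for l in lights]
albedo, n = photometric_stereo(lights, obs)
print(round(albedo, 3), np.round(n, 3))
```

With real images the same solve is run per pixel, after discarding shadowed or saturated observations.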
Another active technique for shape acquisition is structured light [81, 45, 6, 15, 84]. In structured light systems, a projector completely replaces one of the cameras in a stereo vision system. With the projector projecting light patterns such as dots, lines, grids or stripes onto the object surface, the illumination sources of these projected signals are all known in projector space. At the same time, a camera captures the illuminated scene as the observer. By projecting one or a set of known image patterns, it is possible to uniquely label each pixel in the image observed by the camera.
Unlike the stereo vision methods introduced in section 2.1, which rely on the accuracy of matching algorithms, structured light automatically establishes the geometric relationship by mapping the codewords assigned to each pixel directly to their corresponding coordinates in the source pattern.
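The codeword labelling can be sketched with reflected binary Gray codes: each projector column receives a codeword, one stripe image is projected per bit plane, and a camera pixel is labelled by decoding the bit sequence it observes. A toy 8-column version (real systems add thresholding, e.g. against inverse patterns):

```python
def gray_code(n):
    """Standard reflected binary Gray code of integer n."""
    return n ^ (n >> 1)

def gray_decode(g):
    """Invert the Gray code back to the pattern-column index."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def stripe_patterns(width, bits):
    """One black/white stripe image per bit plane: pattern[b][x] is the
    b-th bit of the Gray codeword assigned to projector column x."""
    return [[(gray_code(x) >> b) & 1 for x in range(width)]
            for b in range(bits)]

# 8 projector columns, 3 bit-plane patterns; a camera pixel that observes
# the bit sequence of its illuminating column can be decoded directly.
pats = stripe_patterns(8, 3)
x = 5
observed = [pats[b][x] for b in range(3)]         # bits seen at the pixel
code = sum(bit << b for b, bit in enumerate(observed))
print(gray_decode(code))   # → 5
```

Gray codes are preferred over plain binary because adjacent columns differ in only one bit, so a decoding error at a stripe boundary displaces the label by at most one column.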
A detailed discussion of structured light techniques is presented in chapter 4.
2.2.1 The Use of a Structured Light System
This research presents a projector-camera based VAE system. Although some controlled lighting is available, such as turning the lights on and off or adjusting the blinds, it is not controllable to the extent required for photometric stereo. With the projector-camera pair available, structured light fits well in terms of hardware requirements, and is used for the initial capture of depth information in this research. Following this, stereo-matching correspondence methods are applied to fuse the depth data captured from different views of the object being measured.
2.3 Video Augmented Environments (VAEs)
A VAE is a visual environment where physical objects from the real world and virtual elements co-exist coherently. Data projectors are normally used in VAEs to augment the real objects by projecting video signals onto the scene. The visual environment is also monitored by the camera, so that the VAE system can detect changes and respond by altering the projections.
Over the last decade various VAE systems have been developed for different purposes. We list some related example VAEs to show their range and diversity.
2.3.1 Related example VAEs in the past
DigitalDesk
In the early 90s, one of the earliest projects in the history of VAEs emerged as Wellner's DigitalDesk [102, 104, 103] (figure 2.2(a)). A major feature of the project is the blurring of the boundary between physical paper and electronic documents. DigitalDesk also tackles the problem of calibrating the multiple input (camera) and output (projector) devices so as to enable the planar mapping between their individual coordinate systems.
(a) Hardware setup. (b) The DigitalDesk
Figure 2.2: The DigitalDesk. (image courtesy of the Computer Laboratory, University of Cambridge)
In DigitalDesk, a projector and one or more cameras are mounted above the desk, sharing a common view area. On the desk, a user can place normal day-to-day objects such as papers, books and mugs. The desk also has the characteristics of a workstation: the projector and camera(s) are connected to a PC, and the system can (1) read the documents placed on the desktop; (2) monitor a user's activity at the desk; (3) project video signals such as images and annotations down onto the desk surface.
Inspired by DigitalDesk, a number of prototype applications have been built. For example, the PaperPaint application [104] allows copying and pasting of images and text from paper documents laid on the desk into electronic versions. The DigitalDesk Calculator [102] (figure 2.2(b)) enables mathematical operations on numeric data contained in paper documents, providing the user with a virtual calculator by projecting a set of buttons alongside the paper documents. Another application is Marcel [72], where users can point their finger at words in a French document and the pointed-at words are translated into English, which is subsequently projected alongside the original French word.
BrightBoard<br />
BrightBoard [93] explores the use of an ordinary whiteboard as a computer interface. A vision system is developed to monitor what is happening on the board. A major difference between BrightBoard and other VAE systems is that it is not designed to respond continuously to the captured images. Instead, events are only triggered whenever the system detects a significant change, such as the user obstructing the whiteboard.
A few commands (previously written with marker pens) are provided on the board, with a square check box alongside each command (figure 2.3). For each check box, the system monitors the square as the active area and detects when the zone becomes significantly darker or lighter, which corresponds to a mark being made on the board or erased, respectively.
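That darker/lighter test can be sketched as a mean-intensity threshold over the active area. The box coordinates, image and threshold below are invented for illustration, not BrightBoard's actual values:

```python
def box_state(image, box, dark_threshold=80):
    """Decide whether a check-box region is 'marked': average the pixel
    intensities inside the active area and compare with a threshold."""
    x0, y0, x1, y1 = box
    total = count = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            total += image[y][x]
            count += 1
    return "marked" if total / count < dark_threshold else "clear"

# 4x4 toy image; the active area is the dark top-left 2x2 corner.
img = [[20, 20, 200, 200],
       [20, 20, 200, 200],
       [200, 200, 200, 200],
       [200, 200, 200, 200]]
print(box_state(img, (0, 0, 2, 2)))   # → marked
```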
Figure 2.3: An image of the BrightBoard. (image courtesy of the Computer Laboratory, University of Cambridge)
After the initial success of the prototype, instead of expanding the system into a monolithic application with more and more features, the developers decided to simplify BrightBoard into a whiteboard-based control panel from which scripts and external programs can be activated, such as printing and saving what is written on the whiteboard, emailing images of the board or passing them on to other programs for further processing.
However, one of the limitations of the system is that calibration is not involved at any stage. The system relies on the active areas in the camera image being crudely fixed: once the camera or the board itself is moved, the system needs to be reconfigured.
Artificial Life interactive Video Environment (ALIVE)
The ALIVE [67, 68] system was developed at the MIT Media Lab, inspired by the ideas behind Myron Krueger's VideoPlace [59, 60]. A large projection screen, roughly the height of a human, is placed vertically on the ground. A camera fixed on the top edge of the screen monitors the user, who stands in front of the screen and is free to move about. In the observed image, the background is cut away so that only the foreground image of the user is kept. This is then incorporated into a different scene (e.g. a different room) mixed with animated creatures (figure 2.4). The user can interact with the computer-generated creatures through either their movement or instructions expressed by gestures.
To enable this type of interaction, the user's 3D position in the physical world has to be known. With a couple of assumptions, this can be achieved even with a single camera. First, the relative position and orientation of the camera with respect to the floor need to be known. Also, the user is assumed to be standing on the floor at all times, so that simply locating the user's lowest point in the observed image can approximately estimate his or her position in the room.
Figure 2.4: User interacts with the ALIVE system. (image courtesy of the MIT Media Lab)
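Under the two assumptions above, the estimate reduces to mapping the user's lowest foreground pixel through the known camera-to-floor geometry; for a planar floor that mapping is a homography. The sketch below is generic, and the matrix in it is a made-up stand-in for a calibrated one:

```python
def to_floor(u, v, H):
    """Map an image point (e.g. the user's lowest foreground pixel) to
    floor coordinates through a 3x3 camera-to-floor homography H,
    applying the usual projective division."""
    X = H[0][0] * u + H[0][1] * v + H[0][2]
    Y = H[1][0] * u + H[1][1] * v + H[1][2]
    W = H[2][0] * u + H[2][1] * v + H[2][2]
    return X / W, Y / W

# Illustrative homography: uniform scale and offset only.
H = [[0.01, 0.0, -1.0],
     [0.0, 0.01, -1.5],
     [0.0, 0.0, 1.0]]
print(to_floor(320, 460, H))   # ≈ (2.2, 3.1) metres on the floor plane
```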
2.3.2 Previous work at York
At the Visual System Group, University of York, the main research interest in VAEs concerns image input and analysis technologies that are resilient to lighting changes and shadowing. Sufficiently fast VAE implementations are intended to support richly interactive applications.
A number of practical VAE applications have been designed and implemented within the group [52, 73, 29, 74]. Prior to this research, one of the most recent was Robot Ships, developed for the National Museum of Scotland's Connect gallery.
LivePaper<br />
A recent system [52] by Robinson and Robertson provides a VAE in which individual sheets of paper, cards and books are placed on an instrumented tabletop to activate their enhancement. It appears to the user as if the paper has additional properties, with new visual and auditory features.
Figure 2.5: The LivePaper system in use. (image courtesy of the Visual Systems Lab, University of York)
A sheet of paper is detected through boundary extraction in an observed image, and the projector then displays the associated augmentations according to the contents of the current page as recognised by the system. The augmented video signals remain projected onto the page. An interactive menu is provided beside the page to offer finger-triggered functionality.
A number of sample applications have been developed to illustrate the feasibility of the LivePaper system. These include an architectural visualisation tool (figure 2.6(a)) which projects a 3D hidden-line rendering of walls onto a page, page sharing, remote collaboration (figure 2.6(b)), and World Wide Web (WWW) page viewing. From the user's perspective, all of these applications are attributes of the particular page, not features of the tabletop.
(a) The architectural visualisation application.
(b) The collaborative drawing application.
Figure 2.6: The LivePaper applications. (image courtesy of the Visual Systems Lab, University of York)
Another application of LivePaper is an audio player. When a page such as a business card is laid on the desk, the player begins playing an audio clip, whose playback can be controlled by the user by pressing the projected buttons.
PenPets
PenPets, developed by O'Mahony and Robinson [73], is an application running on a VAE called SketchTop, which supports rich interaction through sketching, augmented physical objects and mobile virtual objects.
SketchTop is a whiteboard mounted horizontally at desk height, together with other physical objects that can be augmented. It addresses two problems encountered in some of the other whiteboard-based VAE systems. First, the whiteboard is placed horizontally because a vertically mounted whiteboard cannot support augmented objects other than the video signal itself. Second, markings are static once written on the whiteboard, so the literality of interaction that comes from registering augmented signals to moving objects is lost. SketchTop was designed to solve both of these problems and thereby provide rich interaction via static-but-erasable writings.
The focus of the SketchTop demonstration is Penpets, an artificial life application in which virtual animals roam the augmented surface, running into objects and triggering events subject to their various behavioural models.

(a) A maze-solving agent tries to find its way out while the user modifies the structure of the maze. (b) Moving an agent with a fishnet-like tool.

Figure 2.7: Snapshots of Penpets in action. (image courtesy of the Visual Systems Lab, University of York)
Figure 2.7 shows two snapshots of Penpets in action. The agent demonstrated in figure 2.7(a) has hazard detection and maze-solving abilities. The tunnels and walls on the whiteboard are drawn by users, so users can easily hinder the agents by opening up new exits, closing old ones, or tapering the lane in which an agent is currently travelling. Figure 2.7(b) shows an agent being carried to another part of the environment by a fishnet-like tool.
Based on different behaviour models, further SketchTop applications have been implemented, such as a circuit simulator, a traffic simulator, and sketchable pinball (using the agents as balls). Another interesting implementation simulates the agents' culinary interests by providing a means of recognising different objects such as an apple, cheese, or a teapot.
Audio d-touch
Audio d-touch [29, 28] uses a consumer-grade web camera and customisable block objects with markers attached to provide an interactive Tangible User Interface (TUI) for a variety of time-based musical tasks such as sequencing, drum editing and collaborative composition. Three musical applications have been reported by previous research in the group: the augmented musical stave (figure 2.8), the tangible drum machine, and the physical sequencer. Although there is no data projector in this system, Audio d-touch is very similar to other standard VAEs, the only difference being that the video signals projected by the projector are replaced by audio signals from the speakers.
TUIs are a recent research field in Human-Computer Interaction (HCI). In a Graphical User Interface (GUI), users interact through mouse and keyboard with virtual objects represented on a screen to control and represent digital information; in a TUI, physical objects are used in real space to achieve the same goals. Grasping a physical object is equivalent to grasping a piece of digital information, and normally different objects represent different pieces of information of the virtual model. As feedback, the computer output is usually presented in the same physical environment to sustain the perceptual link between the physical and virtual objects.
In Audio d-touch the user can create patterns and beats. This is realised by mapping physical quantities to musical parameters such as timbre and frequency. The visual part of the system tracks the positions of the control objects with a web camera by means of a robust image fiducial recognition algorithm. Technical details of the fiducial algorithms can be found in [27, 75].
Figure 2.8: Audio d-touch interface (the augmented musical stave). (image courtesy of the Computer Laboratory, University of Cambridge)
Figure 2.8 shows one of the Audio d-touch applications: the augmented musical stave. Only the interactive surface is shown in the figure; the web camera is mounted vertically above the surface and a pair of speakers are placed at the side, all connected to a PC. In the augmented stave, physical representations of musical notes can be placed on a stave drawn on an A4 sheet of paper, either for teaching score notation or for composing melodies. The interactive objects are rectangular blocks, each of which is labelled with a fiducial symbol corresponding to one of a variety of musical notes. Once the notes are placed on the stave, the corresponding sounds are played by the computer. Various musical parameters, such as the pitch, the duration (quavers, crotchets, minims, etc.) and the playing sequence, are determined by the position of the object on the musical stave.
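The kind of position-to-parameter mapping described above can be illustrated with a toy sketch. This is not the actual Audio d-touch code: the stave geometry constants and the MIDI note numbering here are assumptions made purely for illustration.

```python
# Toy sketch of mapping a detected block position on the stave image to
# a musical note. STAVE_TOP_Y and LINE_SPACING are assumed values.
STAVE_TOP_Y = 100     # image row of the top stave line (assumed)
LINE_SPACING = 20     # pixels between adjacent stave lines (assumed)
# MIDI numbers for the nine treble-clef positions F5 down to E4
PITCHES = [77, 76, 74, 72, 71, 69, 67, 65, 64]

def note_for_block(x, y, beat_width=50):
    """Return (midi_pitch, beat_index) for a block detected at (x, y)."""
    # Lines and spaces are half a line spacing apart, so the vertical
    # offset in half-spacings selects the pitch; x selects the beat.
    step = round((y - STAVE_TOP_Y) / (LINE_SPACING / 2))
    step = max(0, min(len(PITCHES) - 1, step))
    return PITCHES[step], x // beat_width
```

For example, a block on the top line two beats in maps to F5 on beat 2, while a block on the bottom line maps to E4 on beat 0.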
Prototypes of the designed instruments have been tested by a group of people with different musical backgrounds, ranging from music academics to amateurs with little experience in music composition. Each enjoyed interacting with the instruments and managed to make interesting compositions.
Robot Ships
Robot Ships is a commercial application developed as a featured exhibition for the Connect Gallery [101] at the National Museums of Scotland in Edinburgh.
Designed with VAE technology, Robot Ships turns a tabletop into a stretch of ocean, upon which robotic boats work together to clean up oil spills. Audience members walk up to the tabletop, reach onto it, and become part of the interactive environment, creating various events (figure 2.9(a)).
(a) A picture showing the user sinking an oil tanker for the workers to start the clean-up work. (b) A screen shot showing the workers starting to clean up the toxic spill that has been located by a scout.

Figure 2.9: Snapshots of Robot Ships in action. (image courtesy of the Visual Systems Lab, University of York)
The idea behind Robot Ships is biologically inspired, combining user assistance with an agent work force to solve environmental tasks. In this case, the scout ship is first sent out to search for toxic spills, and upon finding one it returns to the central control rig. On its way back, it navigates around the obstructions and leaves a series of trail points. Cleanup worker ships are then dispatched. Without knowing the location of the spill, the workers rely only on the trail points left by the scout. Because the workers do not know where they are heading, and instead use only their limited viewing cones, they can be readily manipulated by the audience. As the entire interface is a round table operated by reaching over it, it is open to all ages and to multi-user collaboration.
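The trail-following behaviour described above can be sketched in a few lines. This is an illustrative toy, not the Robot Ships source: the viewing-cone angle, the positions, and the steering rule are all assumptions.

```python
# Illustrative sketch of a worker ship that knows nothing about the
# spill and simply steers towards the nearest trail point that falls
# inside its limited viewing cone.
import math

def next_heading(pos, heading, trail, fov=math.radians(60)):
    """Return a new heading towards a visible trail point, or None."""
    visible = []
    for tx, ty in trail:
        ang = math.atan2(ty - pos[1], tx - pos[0])
        # Wrap the angular difference into (-pi, pi] and test the cone.
        diff = (ang - heading + math.pi) % (2 * math.pi) - math.pi
        if abs(diff) < fov / 2:
            visible.append((math.hypot(tx - pos[0], ty - pos[1]), ang))
    if not visible:
        return None                     # the worker has lost the trail
    dist, ang = min(visible)            # steer to the nearest point
    return ang
```

A trail point directly ahead is followed, while a point behind the worker (outside the cone) is invisible, which is what makes the agents easy for the audience to divert.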
Robot Ships is a VAE that runs on top of the OpenIllusionist framework [50], independently developed by previous members of the Visual Systems Lab, Justen Hyde and Dan Parnham. More details of Robot Ships and OpenIllusionist are given in [74].
2.4 Conclusions
There are many other good VAEs apart from those mentioned above. Here we have introduced only some of the pioneering and well-known VAEs, together with the related previous work carried out in our Visual Systems Lab.

At present many research groups continue their work on VAEs, and some of the related individual contributions will be reviewed in more detail at the appropriate stages later in this thesis.
Chapter 3

Calibration

3.1 Introduction
In a camera-projector based VAE system, the different components have their own coordinate systems: the camera coordinate system, the projector coordinate system, and the World Coordinate System (WCS) within which the real objects are placed. To measure the objects' positions on the tabletop accurately using the structured light scanning method, it is vital to have a reliable calibration process so that the internal and external geometry of the camera and the projector are known. When a user interacts with the augmented signals projected onto the desktop, there is a need to sustain the coherent spatial relationship between the physical objects and the virtual elements in a continuously changing visual environment.
For example, if a light dot is projected onto the centre of the desktop, it will not necessarily appear at the centre of the observed image. Therefore the original location of the light dot in the projector image and its observed position in the captured image need to be correlated, so that the system knows where to look for it in the captured image. Furthermore, if the light dot is projected onto an object, the 3D position of the illuminated point on the object in the real world might need to be measured. In this case, the internal geometry of the camera and the projector must be known to determine, for example, the mapping between pixels and real-world measurements, and the extent to which the image is distorted by lens imperfections. The recovery of all the necessary information is called the calibration process.
This chapter addresses this calibration problem.
Calibration task
The objective of the camera calibration process is to find the internal parameters (a series of parameters that a camera has inherently) and the external parameters (the position of the camera and its orientation relative to the World Coordinate System (WCS)).
Calibration principle
Calibrating the camera requires measurements of a set of 3D points and their image correspondences [37]. The most common way to obtain these is to have the camera observe a 2D planar pattern consisting of multiple coplanar points, with the pattern shown to the camera in different views. Alternatively, a 3D rig marked with ground truth points can also be used as the calibration object. The same principle applies to the projector calibration, although it is implemented in a slightly different way.
Camera calibration
In practice, a black-and-white checkerboard plane is usually chosen as the calibration object because it immediately offers a set of known points as ground truth, although other types of calibration object can be used [109]. In this research a 20 × 20 checkerboard is used as the calibration object.
(a) 3D rig. (b) 2D planar object. (c) 1D object with marked points.

Figure 3.1: Calibration objects. (image courtesy of [109])
Projector calibration
When calibrating the projector we aim for the same set of parameters: the internal and external parameters of the projector. Unlike the camera, the projector already has a set of 2D points as ground truth, since the pattern to be projected is a known image, but their 3D correspondences (the 3D positions of their projections) are unknown. Finding these 3D locations is essential so that two sets of points are available to complete the projector calibration. It is therefore a prerequisite that the camera is calibrated first, to provide the transform of these unknown 3D points from the camera image space to the real-world coordinate system.
2D plane to plane calibration
The user interface of the collaborative system designed in this research for 3D input is based on a plane (i.e. the tabletop). Therefore a precise registration between the image space of the camera and the rendered space of the projector is desired, so that the spatial relationship between the projected signals and their observed images is sustained. To work out this plane-to-plane geometry it is not necessary to know the internal parameters of the camera and the projector. The method of this plane-to-plane calibration is introduced later in this chapter.
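In general, a registration between two views of the same plane can be expressed as a 3 × 3 homography applied in homogeneous coordinates. As a minimal sketch (the matrix values below are made up for illustration, not taken from the system):

```python
# A 3x3 homography H maps a point on one plane (e.g. the projector
# image plane) to its position on another (e.g. the camera image
# plane), with no camera or projector internal parameters involved.
def apply_homography(H, x, y):
    """Map (x, y) through H in homogeneous coordinates."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w          # divide out the homogeneous scale

# The identity homography leaves points unchanged; in practice H would
# be estimated from at least four projected/observed point pairs.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

With the identity matrix a point maps to itself, while a real camera-projector pair yields a general H that absorbs the relative pose of the two devices with respect to the tabletop plane.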
The rest of this chapter is structured as follows. In section 3.2 we review related work. In section 3.3 we explain the calibration parameters and give the formalised full calibration model. In section 3.4 we introduce the implementation of calibrating the camera and the projector, respectively. A method of 2D plane-to-plane calibration is presented in section 3.5. Conclusions are given in section 3.6.
3.2 Background
During the past decade camera calibration has received a lot of attention because it is strongly related to many computer vision applications such as stereo vision, motion detection, structure from motion, and robotics [99, 37, 39, 48, 111].
One of the most widely used methods is Tsai's camera calibration method [99], which is suitable for a broad range of applications because it deals with both planar and non-planar calibration objects and makes it possible to calibrate the internal and external parameters separately. This is important because in some cases the internal parameters are already known (provided by the manufacturer), so that one can fix the internal parameters of the camera and carry out iterative non-linear optimisation only on the external parameters.
The conventional calibration process can be consuming in terms of time and effort, and calibration objects may not always be available. This inspires self-calibration methods which use the horizon line and vanishing points estimated from structural information such as landscapes or buildings [26, 79]. These methods are often used in computer vision tasks based on single-view geometry or in video surveillance applications [31]. Lv et al. [66] approach the camera self-calibration problem by estimating the vanishing points indirectly, using positions extracted from a single walking man via PCA analysis. No rigid calibration target is needed for these approaches; however, they are more online-oriented and not very practical for our tabletop VAE applications.
In this research, we first carry out the camera calibration process using the Matlab toolbox developed by Jean-Yves Bouguet at the California Institute of Technology [14]; a C implementation is also available in the Open Source Computer Vision Library [51]. The toolbox was then extended and converted to C++ to make it capable of calibrating the projector-camera system. In the off-line process using the Matlab toolbox, the projections and captures are done in a first stage and the captured images are processed on a local PC in a separate second stage. An online calibration program was then developed in C++, which takes about two minutes to calibrate the camera-projector pair in a fully automatic manner using 20 different poses of the calibration board.
3.3 Calibration Parameters

3.3.1 Intrinsic Parameters
The internal camera model is described by a set of parameters known as intrinsic parameters. These parameters represent the internal geometry of the camera.
A matrix formed by the camera intrinsic parameters is known as the camera matrix, or the K matrix, which relates a 3D scene point (X, Y, Z)^T to its projection (x, y, 1)^T in the 2D image plane.
\[ w' \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{3.1} \]

where the camera matrix K is

\[ K = \begin{bmatrix} f_{c1} & \alpha f_{c1} & c_1 \\ 0 & f_{c2} & c_2 \\ 0 & 0 & 1 \end{bmatrix} \tag{3.2} \]

All related parameters that compose K are explained as follows.
fc is the focal length, represented as a 2 × 1 vector in units of horizontal and vertical pixels. The two components are normally equal to each other; however, when the camera CCD pixels are not square, fc1 differs slightly from fc2. The camera model therefore handles non-square pixels, and fc1/fc2 is called the aspect ratio.
cc is the principal point, represented as a 2 × 1 vector (c1, c2); it describes where the centre of projection is positioned in the image. As shown in figure 3.2, a 3D point (X, Y, Z, 1)^T is projected onto the imaging plane, its projection being (x, y, 1)^T. When this is represented in UV space (the 2D image coordinates), the following relationship holds:

\[ \begin{cases} u = x + c_1 \\ v = y + c_2 \end{cases} \tag{3.3} \]
Figure 3.2: Principal points. The bottom-right subimage is the imaging plane.
Generally the principal point cc is assumed to be at the centre of projection, but it is not precisely there because there is always a slight decentring effect in camera manufacture. This defect can be accounted for by accurate camera calibration.
α is the skew coefficient, a scalar which encodes the angle between the X and Y axes of the imaging plane. It equals zero when the X and Y axes are perpendicular. Just as the aspect ratio fc1/fc2 handles non-square pixels, the skew coefficient α handles non-rectangular pixels.
kc is a 5 × 1 distortion vector. Although kc is not directly included in the intrinsic matrix used to transform points perspectively between coordinate systems, it still forms part of the camera's internal geometry. The lens distortion model was first introduced by Brown in 1966 [18] and is called the "Plumb Bob" model. There are three types of lens distortion: radial, tangential and decentring distortion, with radial distortion being the most commonly known and the most pronounced. The full distortion is modelled as follows.
For an image point (x, y),

\[ \begin{pmatrix} x_d \\ y_d \end{pmatrix} = (1 + k_{c1} r^2 + k_{c2} r^4 + k_{c5} r^6) \begin{pmatrix} x \\ y \end{pmatrix} + dx \tag{3.4} \]

where

\[ r^2 = x^2 + y^2 \tag{3.5} \]

and

\[ dx = \begin{pmatrix} 2 k_{c3} x y + k_{c4} (r^2 + 2 x^2) \\ k_{c3} (r^2 + 2 y^2) + 2 k_{c4} x y \end{pmatrix} \tag{3.6} \]
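As a concrete sketch of equations 3.4-3.6, the Plumb Bob model can be written as a short function. This is an illustrative implementation, not code from the thesis:

```python
def distort(x, y, kc):
    """Apply the Plumb Bob lens distortion model (equations 3.4-3.6)
    to a normalised image point (x, y); kc = (kc1, kc2, kc3, kc4, kc5)."""
    kc1, kc2, kc3, kc4, kc5 = kc
    r2 = x * x + y * y                              # r^2 = x^2 + y^2
    radial = 1 + kc1 * r2 + kc2 * r2**2 + kc5 * r2**3
    dx = 2 * kc3 * x * y + kc4 * (r2 + 2 * x * x)   # tangential term
    dy = kc3 * (r2 + 2 * y * y) + 2 * kc4 * x * y
    return radial * x + dx, radial * y + dy
```

With all five coefficients zero the point is unchanged; with a positive kc1 alone, a point is pushed radially outwards, and the displacement grows monotonically with r^2, as described below.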
The term dx is the tangential distortion, which arises from imperfect centring of the lens components and other manufacturing defects; tangential distortion is therefore also known as decentring distortion. The radial distortion is more visible, and is governed by three entries of the distortion vector: kc1, kc2 and kc5. Because of the concavity of the lens, pixels further away from the image centre suffer more severe distortion, and the amount of distortion increases monotonically with the factor x² + y². This effect is illustrated in figure 3.3.
(a) Distorted image. (b) Distorted image. (c) Original image. (d) Original image.

Figure 3.3: The distortion effects.
3.3.2 The Reduced Camera Model

The full optical model above is not always required for currently manufactured cameras. In practice, the sixth-order radial plus tangential distortion model is often not used in its entirety, and a few reductions are possible.
• Nowadays most cameras on the market have good optical systems, and it is hard to find lenses with centring imperfections, so the tangential distortion can be discarded. The skew coefficient α is often assumed to be zero for the same reason.

• For cameras with good optical systems or standard Field of View (FOV) lenses (non wide-angle lenses), it is not necessary to push the lens distortion model to high orders. Commonly a second-order radial distortion is used.

• In some instances, such as when the calibration data is insufficient (e.g. using only two or three images for calibration), it is an option to set the principal point cc at the centre of the image, \((\frac{n_x - 1}{2}, \frac{n_y - 1}{2})\), and fix the aspect ratio fc1/fc2 to 1. However, when sufficient images are used for calibration this reduction is not necessary.
Therefore, the reduced camera model can be defined as:

\[ K = \begin{pmatrix} f_{c1} & 0 & c_1 \\ 0 & f_{c2} & c_2 \\ 0 & 0 & 1 \end{pmatrix} \tag{3.7} \]
with distortion modelled as:

\[ \begin{pmatrix} x_d \\ y_d \end{pmatrix} = (1 + k_{c1} r^2) \begin{pmatrix} x \\ y \end{pmatrix} \tag{3.8} \]

where \(r^2 = x^2 + y^2\).

3.3.3 Extrinsic Parameters
Figure 3.4: Transformation from world to camera coordinate system.
Figure 3.4 is an example of how a triangle in the world coordinate space is imaged. Let (Xw, Yw, Zw)^T be an object point (the blue point in the picture) whose 3D position in the camera coordinate system is (Xc, Yc, Zc)^T. Let (x, y, f)^T be its projection (the red point in the picture) on the imaging plane, where f is the focal length.
The rotation matrix R and the translation vector T characterise the 3D transformation of a scene point from world coordinates to camera coordinates,

\[ \begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = R \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} + T \tag{3.9} \]

where R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector between the two system origins in 3D space.
After the scene point is transferred from world into camera coordinates, its 2D image point is given by

\[ \begin{cases} x = f \dfrac{X_c}{Z_c} \\[4pt] y = f \dfrac{Y_c}{Z_c} \end{cases} \tag{3.10} \]

Rotation matrix

The three main rotation parameters Rx, Ry and Rz, also known as the pan, tilt and yaw angles, are the Euler angles of the rotation from the world to the camera coordinate system around the three major axes. They are represented by a 3 × 3 rotation matrix R,

\[ R = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \tag{3.11} \]

where

\[ r_{11} = \cos(R_y) \cos(R_z) \tag{3.12} \]
\[ r_{12} = \cos(R_z) \sin(R_x) \sin(R_y) - \cos(R_x) \sin(R_z) \tag{3.13} \]
\[ r_{13} = \sin(R_x) \sin(R_z) + \cos(R_x) \cos(R_z) \sin(R_y) \tag{3.14} \]
\[ r_{21} = \cos(R_y) \sin(R_z) \tag{3.15} \]
\[ r_{22} = \sin(R_x) \sin(R_y) \sin(R_z) + \cos(R_x) \cos(R_z) \tag{3.16} \]
\[ r_{23} = \cos(R_x) \sin(R_y) \sin(R_z) - \cos(R_z) \sin(R_x) \tag{3.17} \]
\[ r_{31} = -\sin(R_y) \tag{3.18} \]
\[ r_{32} = \cos(R_y) \sin(R_x) \tag{3.19} \]
\[ r_{33} = \cos(R_x) \cos(R_y) \tag{3.20} \]

Translation vector

\[ T = \begin{pmatrix} T_x \\ T_y \\ T_z \end{pmatrix} \tag{3.21} \]

3.3.4 Full Model
Combining the camera intrinsic and extrinsic parameters gives the full projection model, which transforms a scene point (Xw, Yw, Zw)^T from the World Coordinate System (WCS) to the camera coordinate system (Xc, Yc, Zc)^T, and then to the 2D imaging space (x, y)^T, as shown in equation 3.22. By representing all the points in their homogeneous form, the above transform relationships can be formalised as

\[ \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \approx K \begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = K (R \mid T) \begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix} \tag{3.22} \]

where K(R|T) is a 3 × 4 projection matrix.
So to calibrate the camera it is necessary to estimate both the intrinsic and extrinsic parameters, together with the distortion model. This can be done by matching a set of ground truth points from the calibration object with their correspondences in the observed image.
3.4 Calibrate Camera-Projector Pair<br />
3.4.1 World Coordinate System
The camera extrinsic parameters are not inherent parameters of the camera. The rotation and translation only represent the current camera pose with reference to the world coordinate system chosen by the user. Without a world coordinate system or a reference coordinate system, the extrinsic parameters are meaningless. Therefore, a world coordinate system needs to be chosen first as a reference to describe the relative camera position. In our system the white board is chosen as the world reference frame.
More specifically, a checkerboard is laid flat on the table plane, and the checkerboard plane is chosen as the XOY plane of the world coordinate system, with its bottom and leftmost edges taken as the X and Y axes. The surface normal vector pointing from the bottom-left corner of the checkerboard is chosen as the Z axis. Thus the origin of the WCS is arbitrary in X and Y, depending on how and where the checkerboard was laid.
3.4.2 Methodology<br />
Before we can calibrate the camera and projector pair, a set of calibration images is needed. In this research we use 20 images for camera calibration and 20 images for projector calibration, each pair being captured from a different angle of the white board.

The main methodology is to take an image of a known 3D pattern as ground truth. In the captured image one then selects a set of points of that pattern as interest points, and uses the 2D coordinates of those interest points along with their matching 3D points as correspondences to calibrate the camera. Normally this process is iterated by orienting the calibration pattern at different angles to increase accuracy.
The projector is calibrated in a similar way. A pre-designed pattern with ground-truth information is projected onto a surface (which is regarded as lying in world coordinate space), and the projection is observed by the calibrated camera. Since at this point the camera is already calibrated, the captured image and the full camera model allow us to recover the 3D information of the projected pattern. This 3D information, together with prior knowledge of the pre-designed 2D pattern, forms a correspondence, and hence the projector can be calibrated from these two sets of points as a "reversed camera".

Figure 3.5 shows the flow chart of the whole calibration process. The diagram shows the whole process after the data collection stage is done, during which the black patterns are projected onto the cyan checkerboard and images are taken at the same time.
3.4.3 Data Collection<br />
We use a printed checkerboard as the camera calibration target, and we let the projector project another checkerboard as the projector calibration target. As mentioned in section 3.3, the camera calibration results (particularly the camera extrinsic parameters) are needed to transform the observed projected pattern from camera coordinate space to world coordinate space. Therefore, when the printed pattern is being captured, we have to make sure a projected pattern is captured as well, with the base plane staying at exactly the same pose, to maintain accuracy.
Figure 3.5: Flow chart of the camera-projector pair calibration. (diagram of image processing after the projections and captures are done)
However, this is not easy if the user has to slide the printed checkerboard in and out every time the checkerboard changes orientation, and it is very hard to manually hold the base plane firmly stationary while performing these actions. It might require one tester to hold the board still while another handles the sheet. For this reason, a mechanism that allows us to take a picture of two superimposed checkerboards and extract each from the other is desired, to prevent any slight movement of the base plane. This is possible by choosing appropriate colours for the checkerboards.
We use a cyan-white checkerboard for the printed pattern, and a blue-black checkerboard for the projected pattern. Cyan and white have very similar blue components under white ambient light. Therefore, in a captured image containing both checkerboards, by inspecting the blue channel, the cyan checkerboard is barely visible and the blue checkerboard can be extracted.

On the other hand, blue and black grids have near-zero red components. This means that by superimposing a blue-black checkerboard onto a cyan-white one, no components are added in the red channel. This property allows us to extract the cyan-white checkerboard from the superimposed version easily. In figure 3.6, the top image shows the captured image of the superimposed checkerboards. The bottom two images are the extracted checkerboards.
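The channel-splitting idea can be sketched on synthetic data. The sketch below uses illustrative pixel values and grid sizes (not measurements from the real system): a cyan-white print under ambient light, with a blue-black projection added in the blue channel; each pattern is then recovered from a single channel, as in the method described above.

```python
import numpy as np

# Synthetic 8x8 scene: a printed cyan-white checkerboard under ambient light,
# with a blue-black checkerboard projected on top (values are illustrative).
h = w = 8
yy, xx = np.mgrid[0:h, 0:w]
printed = (yy + xx) % 2 == 0               # True where the printed square is cyan
projected = (yy // 2 + xx // 2) % 2 == 0   # True where the projected square is blue

img = np.zeros((h, w, 3))
img[..., 0] = np.where(printed, 0, 200)         # red: cyan reflects no red, white does
img[..., 1] = 200                               # green: similar for cyan and white
img[..., 2] = 200 + np.where(projected, 55, 0)  # blue: the projection adds light

printed_mask = img[..., 0] < 100     # the red channel isolates the printed pattern
projected_mask = img[..., 2] > 220   # the blue channel isolates the projected pattern

assert np.array_equal(printed_mask, printed)
assert np.array_equal(projected_mask, projected)
```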
3.4.4 Choice of colour
Extracting the printed pattern from the mixed image is simple, because it is captured with the projected pattern switched off. More effort is needed to extract the projected pattern from the mixed pattern, and the key is to find the difference between the blue projected area and the black projected area under the interference of the pattern printed on the white board.

Zhang [109] chooses red and blue for the printed and projected patterns respectively, because of their distinctively different RGB values. In practice, other factors such as surface reflection and room lighting conditions need to be considered. After evaluating colour combinations we choose cyan instead of red as the colour for the printed pattern, and figure 3.6 shows its performance.
Figure 3.7 gives a closer look at the mixed area. In figure 3.7(a), areas A and C are non-projected areas (projection is zero), but A appears yellowish because the surface absorbs part of the ambient light, and C appears darker as it is affected by the cyan grid on the printed sheet. D and B are blue projection areas, but B is affected by the printed pattern in the same way. The task is to differentiate areas A and C from D and B by exploiting their colour channels. The immediate observation is that the printed cyan colour has very little effect on the blue channel of the captured image: A and C have very little blue component, while B and D have heavy blue components, despite B and C being areas where the surface is printed cyan. The extraction result is shown in figure 3.7(b). The same cannot be applied to the red-blue method (figure 3.7(c),(d)), where the printed red area appears fully red in the observed image regardless of whether it is mixed with the blue projection or not.

(a) blue and cyan mixed pattern (b) extracted blue pattern
(c) blue and red mixed pattern (d) extracted blue pattern
Figure 3.6: Extraction of the projected pattern from the mixed one.
This method was also tested under different ambient illuminations. In general, experiments conducted when sufficient daylight was available outperformed those conducted during the night; this mostly manifested as failure to extract all the corners successfully, owing to less satisfactory results from the cyan and blue colour filtering. The reason is that during the night the room lighting needs to be turned on to illuminate the physical checkerboard while the projection is off, and this contributes negatively to the colour filtering at a later stage, as the fluorescent lamps disturb the colour channels more than sunlight does. When the points are extracted automatically, any captured image without enough corner points is rejected (e.g. exactly 81 inner corner points are expected from a 10 × 10 checkerboard). Disqualifying more images degrades the accuracy of the calibration.

(a) blue and cyan mixed pattern (b) extracted blue pattern
(c) blue and red mixed pattern (d) extracted blue pattern
Figure 3.7: Extraction of the projected pattern from the mixed one (a closer look).
3.4.5 Camera Calibration<br />
An automated process is implemented. All the user needs to do is hold the whiteboard, with a physical checkerboard pattern attached, at one pose for a short period (around 2 seconds) so the camera can take two pictures with the projection turned on and off, then re-position the board into a different orientation, as long as the whole printed checkerboard pattern stays within the common FOV of the camera and the projector.
1. After the image capture stage, the colour-filtered images (as shown in the bottom-left image of figure 3.6), captured from ten different orientations of the white board, are used as the camera calibration images.

2. For each image, the user manually clicks the four top corners of the checkerboard. The user is also prompted to input the physical grid size of the checkerboard to set up the units of the world coordinate system. Grid numbers and the inner cross points are located automatically after the four top corners are given.

3. Normally the lens distortion can be tolerated at this stage, as the distortion model will be estimated later using the camera intrinsic parameters. In case of severe lens distortion, the user is advised to give an initial guess for the first-order distortion factor kc1. The system will then use this guess to locate the corners more precisely, as shown in figure 3.8.

4. After corner points are extracted for all input images, the user can run the camera calibration. By defining the checkerboard plane as the world coordinate XOY plane and the first point the user clicked (the bottom-left corner) as the world coordinate origin, the 3D points of all corners are known. Calibration parameters are first initialised, and then optimised by redoing the calibration using the improved reprojected corners based on the estimated camera parameters.
(a) blue and cyan mixed pattern (b) extracted blue pattern
Figure 3.8: Extraction of the projected pattern from the mixed one (a closer look).
3.4.6 Projector Calibration<br />
By the time the projector is calibrated, calibration of the camera is already done. Therefore, the calibration images used for projector calibration (in our case, 10 blue checkerboard images) go through an 'Undistort' stage before being used as input images for corner extraction. The two-dimensional distortion vector is used to remove distortion from the images.
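The undistortion step can be sketched with a two-term radial model. In the sketch below the coefficients kc1 and kc2 and the fixed-point inversion are illustrative assumptions, not the exact model used by the toolbox:

```python
import numpy as np

def distort(xn, yn, kc1, kc2):
    """Apply a two-term radial distortion model to normalised coordinates."""
    r2 = xn**2 + yn**2
    s = 1 + kc1 * r2 + kc2 * r2**2
    return xn * s, yn * s

def undistort(xd, yd, kc1, kc2, iters=20):
    """Invert the radial model by fixed-point iteration (a common approach)."""
    xn, yn = xd, yd
    for _ in range(iters):
        r2 = xn**2 + yn**2
        s = 1 + kc1 * r2 + kc2 * r2**2
        xn, yn = xd / s, yd / s
    return xn, yn

kc1, kc2 = -0.2, 0.05            # illustrative distortion coefficients
xd, yd = distort(0.3, -0.1, kc1, kc2)
xn, yn = undistort(xd, yd, kc1, kc2)
assert abs(xn - 0.3) < 1e-6 and abs(yn + 0.1) < 1e-6   # round trip recovers the point
```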
The first few steps of projector calibration are the same as for the camera: read images and extract corners.
The extracted corners here cannot be used directly for calibration: they are the corner points in the captured image of the projected checkerboard, whereas the information we need is the 3D coordinates of the corners of the projected pattern. The camera model can now be used to perform these transformations.
In theory it is impossible to recover a 3D scene point merely from its 2D projection in the image plane, because given the projection in the image, the original 3D point could be anywhere along the projection ray if the scene structure is unknown. However, in our case all the points we are trying to recover lie on the checkerboard plane, which is chosen as the XOY plane of the WCS; that means Z = 0 for all of them. This relationship holds for all the different poses of the checkerboard, as the instantaneous plane in which the printed checkerboard lies is assumed to be the XOY plane of the WCS.
Technically, there is a different WCS for each tilt of the plane. This does not affect the final calibration result, because for N tilts there will be N sets of different rotation and translation vectors. Geometrically, each of them only represents the relative geometry with respect to a temporary WCS, but only one set of rotation and translation vectors will be used to estimate the final extrinsic parameters: the one from the view where the whiteboard is laid flat on the tabletop, as that is the plane upon which the VAE runs.
Let (x, y) be the image point whose 3D coordinate in the world coordinate system we are trying to recover, given the camera calibration parameters and the constraint Z = 0.
(x, y, 1)^T ≈ K(R|T) (X, Y, 0, 1)^T    (3.23)
Here ≈ means equal up to a scale, so we replace it with a non-zero<br />
factor w<br />
w (x, y, 1)^T = K(R|T) (X, Y, 0, 1)^T    (3.24)

Replace K(R|T) with the 3 × 4 projection matrix P:

P = K(R|T) =
    | p11 p12 p13 p14 |
    | p21 p22 p23 p24 |    (3.25)
    | p31 p32 p33 p34 |

From Equ. 3.24 and 3.25, we have

w (x, y, 1)^T = P (X, Y, 0, 1)^T    (3.26)
Cancel out the scale factor w by dividing the first and second rows by the third row of Equ. 3.26:

x = (p11 X + p12 Y + p14) / (p31 X + p32 Y + p34)    (3.27)

y = (p21 X + p22 Y + p24) / (p31 X + p32 Y + p34)    (3.28)

From Equ. 3.27 and 3.28, X and Y in Equ. 3.23 can be solved:

X = [(x p34 − p14)(p22 − y p32) − (y p34 − p24)(p12 − x p32)] / [(p11 − x p31)(p22 − y p32) − (p21 − y p31)(p12 − x p32)]    (3.29)

Y = [(x p34 − p14)(p21 − y p31) − (y p34 − p24)(p11 − x p31)] / [(p12 − x p32)(p21 − y p31) − (p22 − y p32)(p11 − x p31)]    (3.30)
A program was written by the author to implement all the calculations above. Given an extracted point (x, y) from a corner point in the observed blue pattern, with the camera already calibrated, its position (X, Y) in world coordinate space is located from Equ. 3.29 and 3.30.
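This back-projection can be sketched as follows, using the closed forms of Equ. 3.29 and 3.30; the projection matrix values in the round-trip check are hypothetical, not the calibrated parameters of the real system:

```python
import numpy as np

def backproject_to_plane(x, y, P):
    """Recover (X, Y) on the Z = 0 world plane from an image point (x, y),
    using the closed forms of Equ. 3.29 and 3.30 (Cramer's rule)."""
    p11, p12, _, p14 = P[0]
    p21, p22, _, p24 = P[1]
    p31, p32, _, p34 = P[2]
    a1, b1, c1 = p11 - x * p31, p12 - x * p32, x * p34 - p14
    a2, b2, c2 = p21 - y * p31, p22 - y * p32, y * p34 - p24
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# Round trip with a hypothetical projection matrix: project (X, Y, 0), recover it.
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])
R = np.eye(3)
T = np.array([[0.2], [-0.1], [5.0]])
P = K @ np.hstack([R, T])
p = P @ np.array([1.5, -0.7, 0.0, 1.0])
x, y = p[0] / p[2], p[1] / p[2]
X, Y = backproject_to_plane(x, y, P)
assert abs(X - 1.5) < 1e-6 and abs(Y + 0.7) < 1e-6
```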
Since the projection pattern (the blue checkerboard) is pre-designed, its corner points are all known. Together with the calculated 3D corners of the projected pattern, the projector can be calibrated in a similar way to camera calibration. The estimated distortion vector kc for the projector is very close to zero, so the projector is assumed to have zero distortion.
3.5 Plane to Plane Calibration<br />
The whole user interface of our collaborative system for 3D input is based on a plane (i.e. the table top). Therefore, a precise estimate of the projective transform between the projector and the camera for this plane is desired, because we need constant, real-time monitoring of the augmented signals in captured frames, and responses to them. Although the calibration data previously obtained could be used, a more direct and accurate matching is preferred.
A homography matrix is used to model this matching. A homography is a 3 × 3 non-singular matrix which defines a homogeneous linear transformation from one plane to another. Although there is no direct projective transform between the projector plane and the camera imaging plane, a homography still exists between these two planes because it is induced by a reference plane, which in our case is the white board. Estimating the homography can be regarded as a 2D calibration process between the projector plane and the camera plane. A homography has 9 entries but only 8 degrees of freedom, being constrained by ||H|| = 1, so it only carries out an up-to-scale matching.
Let the model plane (i.e. the white board) coincide with the XOY plane of the world coordinate system; then a 3D point on the model plane is Pw = (Xw, Yw, 0, 1)^T, with its observed point in the camera plane Pc = (xc, yc, 1)^T and its projection source point in the projector plane Pp = (xp, yp, 1)^T. Similar to Equ. 3.23 and 3.24, we have
(xc, yc, 1)^T ≈ Kc(Rc|Tc) (Xw, Yw, 0, 1)^T = Kc [rc1 rc2 tc] (Xw, Yw, 1)^T    (3.31)
where rci denotes the i-th column of the camera rotation matrix Rc and tc denotes the column vector of the translation matrix Tc.
The homography Hwc from the world plane to the camera plane can be expressed as

Hwc ≈ Kc [rc1 rc2 tc]    (3.32)

Likewise, the homography Hwp from the world plane to the projector plane is

Hwp ≈ Kp [rp1 rp2 tp]    (3.33)

Substitution of Equ. 3.32 and 3.33 into 3.31 yields

Pc ≈ Hwc Pw    (3.34)

Pp ≈ Hwp Pw    (3.35)
From Equ. 3.34 and 3.35, it is not hard to see that the two points Pc and Pp are themselves related by a projective transform, induced by the third plane:

Pc ≈ Hpc Pp    (3.36)

where the homography from the projector plane to the camera plane is

Hpc = Hwc Hwp^−1    (3.37)
However, it can also be seen from Equ. 3.37 that this homography Hpc holds the current camera-projector relationship if and only if the reference plane is unchanged. This is known as the plane-to-plane homography induced by a third plane. During our calibration, tilting the whiteboard 20 times yields 20 different homographies between the camera space and the projector space. As in the discussion in section 3.4.6, only the homography induced by the flat-placed whiteboard is of interest, because once the VAE is up and running the whiteboard is fixed onto the tabletop.
To solve for the homography, all participating frames first go through the distortion removal stage using the calibrated camera internal model and distortion parameters. Keeping the same notation as Equ. 3.36, and introducing the scale factor w, Equ. 3.36 can be rewritten as
(w xc, w yc, w)^T =
    | h1 h2 h3 |
    | h4 h5 h6 |  (xp, yp, 1)^T    (3.38)
    | h7 h8 h9 |
Using a method similar to that of Equ. 3.27 and 3.28 in section 3.4.6 to cancel out w:

xc = (h1 xp + h2 yp + h3) / (h7 xp + h8 yp + h9)    (3.39)

yc = (h4 xp + h5 yp + h6) / (h7 xp + h8 yp + h9)    (3.40)
Each point gives two equations; thus, to solve H, which has 8 degrees of freedom (DOF), a minimum of 4 points is needed. With N ≥ 4 points,
| xp1  yp1  1  0    0    0  −xp1·xc1  −yp1·xc1  −xc1 |
| 0    0    0  xp1  yp1  1  −xp1·yc1  −yp1·yc1  −yc1 |
| xp2  yp2  1  0    0    0  −xp2·xc2  −yp2·xc2  −xc2 |
| 0    0    0  xp2  yp2  1  −xp2·yc2  −yp2·yc2  −yc2 |  (h1, h2, ..., h9)^T = 0    (3.41)
| ...                                                 |
| xpn  ypn  1  0    0    0  −xpn·xcn  −ypn·xcn  −xcn |
| 0    0    0  xpn  ypn  1  −xpn·ycn  −ypn·ycn  −ycn |
Let the 2N × 9 matrix in Equ. 3.41 be A. This becomes a typical problem of finding the least-squares solution of an over-determined system, minimising the error of AH = 0. H could be obtained by expanding the measurement matrix A to a square matrix and finding its inverse. We used an alternative solution, which obtains H as the eigenvector corresponding to the least eigenvalue of A^T A [4].
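The eigenvector-based solution can be sketched as follows; the homography and point values are illustrative, and numpy's `eigh` stands in for whatever eigen-solver the original program used:

```python
import numpy as np

def estimate_homography(pts_p, pts_c):
    """Estimate H (camera point = H @ projector point, up to scale) by stacking
    the 2N x 9 system of Equ. 3.41 and taking the eigenvector of A^T A that
    corresponds to the smallest eigenvalue."""
    rows = []
    for (xp, yp), (xc, yc) in zip(pts_p, pts_c):
        rows.append([xp, yp, 1, 0, 0, 0, -xp * xc, -yp * xc, -xc])
        rows.append([0, 0, 0, xp, yp, 1, -xp * yc, -yp * yc, -yc])
    A = np.array(rows, dtype=float)
    w, V = np.linalg.eigh(A.T @ A)   # eigenvalues ascending for the symmetric A^T A
    return V[:, 0].reshape(3, 3)     # eigenvector of the least eigenvalue

# Round trip with a known homography and five points (N >= 4 is required).
H_true = np.array([[1.2, 0.1, 5.0], [0.0, 0.9, -3.0], [0.001, 0.002, 1.0]])
pts_p = [(0, 0), (100, 0), (0, 100), (100, 100), (40, 70)]
pts_c = []
for xp, yp in pts_p:
    v = H_true @ np.array([xp, yp, 1.0])
    pts_c.append((v[0] / v[2], v[1] / v[2]))

H = estimate_homography(pts_p, pts_c)
H = H / H[2, 2]                      # remove the arbitrary scale (and sign)
assert np.allclose(H, H_true, atol=1e-4)
```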
The solution to equation 3.41 is the homography between the camera and projector planes. It holds the transform of equation 3.38 from a point (xp, yp, 1)^T in the projection image to its observation (xc, yc, 1)^T in the camera image. The transform the other way round, from (xc, yc, 1)^T to (xp, yp, 1)^T, is given by the inverse of this homography. In this way a two-way transform between projection source and camera observation is available at any time for any augmentation in the VAE.
3.6 Conclusions<br />
This chapter begins by introducing the fundamentals of the conventional camera calibration technique, followed by a detailed implementation of the camera calibration process using the Matlab toolbox designed by previous researchers. The method is then extended to calibrate the projector as a reverse camera. Finally, a fully automated method is implemented to calibrate the projector-camera system, and is used by the VAE framework in this research.
The proposed method provides a means of estimating the internal and external parameters of the camera and the projector in an automated way. It is fast, efficient, and requires little intervention in the scene from the tester. A colour filtering technique is also proposed so that the physical printed pattern and the projected pattern can each be extracted from the mixed version, while both are held firmly on the same surface plane. This effectively relieves the user of manually manipulating the calibration objects, such as sliding the physical pattern in and out to avoid its superimposition with the projected pattern.
A method of plane-to-plane calibration is presented in section 3.5. The result of this calibration is used once the VAE is up and running, to sustain the spatial relationship between the virtual augmentations and their observations in the camera image. This ensures a quick and reliable mapping for the VAE to monitor changes in the interactive environment, and to respond to them by augmenting the scene with corresponding video signals.
Although not comprehensively tested, the proposed methods have been used reliably by the VAE system designed in this research over the past two years. Results from section 6.2.4 in chapter 6 suggest that accurate button locating is achieved, estimated solely from the calibration results of the projector-camera pair, without any local image processing in the observed image to detect the button positions. Hence the results were positive and warrant further research into the use of this method.
3.6.1 Future Work<br />
This chapter is concerned with the calibration process, which estimates the intrinsic and extrinsic parameters of the projector-camera pair and provides an accurate registration between the camera image space and the projector rendering space, but only geometrically.
To deal with the lighting situation, photometric camera settings such as brightness, contrast, exposure, and white balance are manually tuned and evaluated before the calibration. The photometric parameters of the projector are also pre-set. For example, to project a blue-black checkerboard pattern, the blue channel of the rendered image is set to full illumination (i.e. 255). One might wonder: is 255 the optimal value for the brightness in all scenarios?
A similar problem is encountered in chapter 4, where a plain white image is projected onto the interface to illuminate the object being measured, so that the camera can take an image as the colour map. In the daytime, when sufficient ambient light is available, the image can be taken without any illumination from the projector. However, in the evening with the lights off, projector illumination is essential while capturing an image of the object, because it is the only light source. Furthermore, ambient light that is too strong also affects the projection, because it can over-illuminate the scene and weaken the projection signals. Therefore, choosing a universal brightness level of projector illumination for all the aforementioned scenarios can be problematic.
(a) projector brightness = 0 (b) projector brightness = 128
(c) red pixel values of (a) (d) red pixel values of (b)
Figure 3.9: Pixel values of an image captured from a plain desktop. (bottom two showing the red channel only)
Figure 3.9 shows an example of different projector illuminations. The top<br />
two images were captured when the projection brightness was 0 and 128<br />
respectively. The bottom two show the corresponding distributions of the<br />
pixel values across the planar surface (only the red channels are shown;<br />
the green and blue channels have similar distributions). The average pixel<br />
value in (d) is higher than in (c), as expected. In both images a slope is<br />
noticeable because the top of the desktop is closer to the window, so that<br />
part receives more ambient light. When the projection brightness is set at<br />
128 in (b), a reflection is caused, which appears as a spike in the bottom<br />
centre part of (d).<br />
In this research, the photometric settings of the camera and the projector<br />
are both manually tuned until the camera can see the projections<br />
reasonably well. Future development of the calibration framework could<br />
include automatic photometric calibration, which adjusts the camera and<br />
the projector lighting. Having a projector-camera pair is a big advantage<br />
for photometric calibration, because it makes it feasible to self-adjust the<br />
projector brightness by analysing the observed image, while the camera<br />
can be self-adjusted by evaluating the quality of images captured under<br />
different projector illuminations.<br />
Previous researchers at York [74] have proposed a means of photometric<br />
calibration, as a preliminary framework for future research to build on.<br />
Chapter 4<br />
Shape Acquisition<br />
4.1 Introduction<br />
Shape acquisition is one <strong>of</strong> the key topics <strong>in</strong> computer vision. The hu-<br />
man visual ability to perceive depth us<strong>in</strong>g b<strong>in</strong>ocular stereopsis has been<br />
modelled by two displaced cameras to obta<strong>in</strong> the range <strong>in</strong>formation <strong>of</strong> the<br />
scene, as described earlier <strong>in</strong> chapter 2. The pr<strong>in</strong>ciple <strong>of</strong> this computer<br />
vision task is to establish correspondences, or <strong>in</strong> other words the match-<br />
<strong>in</strong>g po<strong>in</strong>ts, between two or more images. In this thesis structured light<br />
is utilised as an active method to obta<strong>in</strong> range <strong>in</strong>formation with the help<br />
of a camera-projector pair. In VAE applications, the structure information<br />
must be extracted quickly and efficiently so that collaborative work<br />
between the user, the PC, and the video sensors is feasible. Structured<br />
light fulfils this requirement because of its flexibility, rapidity, and<br />
efficiency.<br />
This chapter aims to provide an overview <strong>of</strong> structured light solutions,<br />
and then expla<strong>in</strong> one particular method that is used <strong>in</strong> the later parts <strong>of</strong><br />
this thesis. New contributions have been made to tackle issues that arise<br />
in practice, such as the aliasing effect caused by limited camera resolution<br />
and the challenging surface materials of some objects.<br />
The chapter begins by considering different scenarios for the investigated<br />
method; a specification is then defined with the most practical subset of<br />
parameters for the hardware currently available in the lab. It is<br />
acknowledged that a full 3D description is not achieved by a single<br />
structured light projection, nor with a single camera, which can only see<br />
part of the object. By changing the pose or position of the target object<br />
it is possible to build the 3D model (see chapter 5), but each structured<br />
light projection only gives depth information, which is often referred to<br />
as a 2.5D model. However, this aspect is not in the scope of this chapter<br />
and will be introduced in later chapters.<br />
The rest <strong>of</strong> the chapter is organised as follows. A review <strong>of</strong> the exist<strong>in</strong>g<br />
methods and recent research <strong>of</strong> structured light systems is presented <strong>in</strong><br />
section 4.2. Section 4.3 <strong>in</strong>troduces the codification scheme chosen for our<br />
application and the generation <strong>of</strong> the projection image stack with the as-<br />
sociated look-up table. This is followed by section 4.3.3 where we discuss<br />
how the correspondence is established. Practical issues in the real world<br />
and hardware limitations are considered in section 4.4, where experimental<br />
results are also presented to validate the solutions proposed to tackle the<br />
problems. Section 4.5 explains depth calculation via triangulation. Then<br />
we draw conclusions in section 4.6.<br />
4.2 Background<br />
Structured light projection systems use a projector which can project a<br />
light pattern such as dots, l<strong>in</strong>es, grids or stripes onto the object surface,<br />
and a camera which captures the illum<strong>in</strong>ated scene. By project<strong>in</strong>g one or<br />
a set <strong>of</strong> image patterns, it is possible to uniquely label each pixel <strong>in</strong> the im-<br />
age observed by the camera. Unlike stereo vision methods which rely on<br />
the accuracy <strong>of</strong> match<strong>in</strong>g algorithms, structured light automatically estab-<br />
lishes the geometric relationship by direct mapp<strong>in</strong>g from the codewords<br />
assigned to each pixel to their corresponding coordinates in the source<br />
pattern. Comprehensive literature reviews and taxonomies of structured<br />
light systems can be found in [81, 45, 6, 15, 84].<br />
The simplest way to label each pixel is to project a 2D grey ramp and<br />
a solid white pattern onto the measur<strong>in</strong>g surface, tried by Carrihill et al.<br />
and Chazan et al. [21, 23]. By tak<strong>in</strong>g the ratios <strong>of</strong> the two observed im-<br />
ages, the brightness at each pixel determ<strong>in</strong>es the pixel’s correspond<strong>in</strong>g<br />
coord<strong>in</strong>ate <strong>in</strong> the orig<strong>in</strong>al grey ramp image. However, this method is too<br />
sensitive to noise. Slight variation <strong>in</strong> surface reflection and light<strong>in</strong>g will<br />
cause brightness mismeasurement which results <strong>in</strong> substantial triangula-<br />
tion errors. Therefore, more sophisticated codification schemes need to be<br />
considered.<br />
One <strong>of</strong> the most commonly used strategies is temporal cod<strong>in</strong>g, where<br />
a set <strong>of</strong> images are successively projected onto the surface to be measured.<br />
In 1982, Posdamer and Altschuler [76] were the first to propose a projec-<br />
tion <strong>of</strong> n images to encode 2 n stripes with pla<strong>in</strong> b<strong>in</strong>ary code. The resultant<br />
codewords are n bit b<strong>in</strong>ary codes formed by 0s and 1s, with more signif-<br />
icant bits associated with earlier pattern images and less significant bits<br />
associated with later ones. The symbol 0 corresponds to black <strong>in</strong>tensity<br />
level for a pixel in the observed image and 1 corresponds to full<br />
illumination. By doing this, the number of stripes doubles from each<br />
pattern image to the next.<br />
Sato et al. [84] used Gray codes <strong>in</strong>stead <strong>of</strong> pla<strong>in</strong> b<strong>in</strong>ary. The Gray code<br />
has the advantage <strong>of</strong> hav<strong>in</strong>g successive codewords with unit Hamm<strong>in</strong>g<br />
distance, which makes the codification more robust. Trobina [97] presented<br />
a binary threshold model to improve the scheme. A Gray code is still used,<br />
but the binary threshold between black and white in the observed image is<br />
set for every pixel independently. This is achieved by taking a pair of full<br />
white and full black images at the beginning; the per-pixel threshold is<br />
the mean of the grey levels of the two observed images of full white and<br />
full black. In recent years, Rocchini [81] proposed<br />
a method to address the problem <strong>of</strong> localisation <strong>of</strong> the stripe transitions<br />
<strong>in</strong> Gray code images. They encode the stripes with blue and red <strong>in</strong>stead<br />
<strong>of</strong> black and white, with a green slit <strong>of</strong> pixels between every two stripes<br />
to help find the zero-crossings of the transitions at stripe boundaries.<br />
The aforementioned schemes <strong>of</strong>ten employ b<strong>in</strong>ary codes and use a<br />
coarse-to-f<strong>in</strong>e paradigm. This eases the segmentation <strong>of</strong> the image pat-<br />
terns, and the codewords can normally be generated by threshold<strong>in</strong>g the<br />
observed image stack. However, a number of patterns need to be projected,<br />
and problems are caused by the top-level patterns, whose stripes are very<br />
narrow – too narrow for the camera to perceive.<br />
Us<strong>in</strong>g a comb<strong>in</strong>ation <strong>of</strong> Gray code methods and phase shift methods<br />
answers this problem [9, 83, 105, 45, 98]. This is achieved by reduc<strong>in</strong>g the<br />
range resolution <strong>of</strong> the source patterns (i.e. us<strong>in</strong>g fewer levels <strong>of</strong> Gray<br />
code patterns to avoid narrow stripes), and compensat<strong>in</strong>g by exploit<strong>in</strong>g<br />
the spatial neighbourhood <strong>in</strong>formation. This is done by periodically shift-<br />
<strong>in</strong>g the pattern <strong>in</strong> every projection to dist<strong>in</strong>guish the codewords <strong>of</strong> those<br />
pixels falling into the same stripe. The limitation of these methods is<br />
that, by using shifted versions of the patterns, more images need to be<br />
projected and the total projection time increases considerably.<br />
In the direction <strong>of</strong> us<strong>in</strong>g fewer images to make it feasible to measure<br />
mov<strong>in</strong>g scenes, Boyer and Kak [16] employ colour patterns to try to en-<br />
code more <strong>in</strong>formation <strong>in</strong>to the codewords. They propose a colour stripe<br />
pattern where a group <strong>of</strong> consecutive stripes has a unique colour <strong>in</strong>tensity<br />
configuration. Caspi et al. [22] use a colour generalisation <strong>of</strong> Gray codes.<br />
Davies and Nixon [33] use a colour dot pattern but with a similar spatial<br />
w<strong>in</strong>dow configuration to Boyer and Kak’s [16]. Chen et al. [24] and Zhang<br />
et al.[109, 110] propose a stereo vision based method that only requires one<br />
image. The underly<strong>in</strong>g idea <strong>of</strong> their methods is to use more than one cam-<br />
era to solve the correspondences between stripe edges through dynamic<br />
programm<strong>in</strong>g.<br />
These colour-based methods have the capability of measuring quasi-<br />
stationary or moving scenes since fewer images are used; however, there<br />
are constraints as well. Some of them use more than one camera, which<br />
requires extra work to calibrate the camera pair with the projector. Others<br />
require the measuring surface to have uniform reflectance over all three<br />
RGB channels to accurately extract the colour information; they are<br />
therefore more suitable for certain applications such as monitoring hand<br />
gestures.<br />
The Gray coded structured light codification scheme is considered <strong>in</strong><br />
this thesis because <strong>of</strong> its simplicity and robustness. Colour or phase based<br />
methods have their own strengths; however, we aim to develop a VAE<br />
system which can be deployed in various environments such as offices,<br />
museums, libraries, or other open environments. The system considered<br />
here is not designed just for laboratory purposes, where the projector-<br />
camera system is normally set up close to the interactive surface. We<br />
consider a top-down setup in which the vision sensor is relatively far away<br />
(high up) from the projection surface, and low-end cameras such as<br />
ordinary web cameras have difficulty picking up colour details at such a<br />
distance. In this context, with a few adaptations made to enhance its<br />
performance, the Gray-coded structured light method yields reasonable<br />
results.<br />
4.3 Gray Codification<br />
4.3.1 Gray Code Patterns<br />
Images with Gray-coded stripes are used in this work. All images are<br />
actually stacked sequentially in the time domain. In figure 4.1 one slice<br />
from each image level is taken out and the slices are aligned spatially<br />
from the bottom up, purely to illustrate the codeword changes between<br />
adjacent image levels.<br />
Figure 4.1: A 9-level Gray-coded image. (only a slice from each image is<br />
shown here, to illustrate the change between adjacent codewords)<br />
Some <strong>of</strong> the advantages are already mentioned earlier <strong>in</strong> section 4.2,<br />
and here are a few other reasons to use this scheme. First, compared to<br />
dots and l<strong>in</strong>es patterns, stripe patterns <strong>of</strong>fer high resolution range <strong>in</strong>for-<br />
mation by labell<strong>in</strong>g a dense and even distribution <strong>of</strong> 3D po<strong>in</strong>ts over the<br />
scene. Second, the black and white coded pattern is more resilient to<br />
variation in surface reflectance than colour-based methods, and with<br />
proper adaptations it handles objects with challenging materials (as will<br />
be discussed later in section 4.4). Finally, Gray-coded images have<br />
advantages over plain binary-coded images, being less sensitive to errors<br />
and using wider stripes at higher levels (see figure 4.2). This is a<br />
desirable property, as it causes less interference between neighbouring<br />
stripes.<br />
(a) 4-bit pla<strong>in</strong> b<strong>in</strong>ary code, top level<br />
stripes are 1 pixel wide.<br />
(b) 4-bit Gray code, top level stripes are<br />
two pixels wide.<br />
Figure 4.2: Comparison: m<strong>in</strong>imum level <strong>of</strong> Gray-coded and b<strong>in</strong>ary-coded<br />
images needed to encode 16 columns.<br />
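The unit-Hamming-distance property behind these advantages can be checked directly. The sketch below assumes the textbook reflected-binary construction (the exact bit convention used for the thesis' patterns may differ):

```python
def to_gray(n: int) -> int:
    """Standard reflected-binary Gray code of integer n (an assumed
    construction; the thesis' exact bit convention may differ)."""
    return n ^ (n >> 1)

def hamming(a: int, b: int) -> int:
    """Number of bits in which two codewords differ."""
    return bin(a ^ b).count("1")

# Successive Gray codewords always differ in exactly one bit...
for n in range(15):
    assert hamming(to_gray(n), to_gray(n + 1)) == 1
# ...whereas plain binary 7 (0111) -> 8 (1000) flips all four bits.
assert hamming(7, 8) == 4
```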
4.3.2 Pattern Generation<br />
The pattern generation stage is off-line and it serves two purposes: to<br />
generate a Gray-coded image stack and then to create a look-up table for<br />
future codification use. This is only carried out once, and both are held<br />
locally.<br />
The stack of Gray-coded images is prepared in a temporal paradigm. All<br />
images are coded in one-dimensional Gray code only, as point-line<br />
correspondence is sufficient to solve for depth information. The reason for<br />
doing this will be explained later in section 4.5. Because Gray code is<br />
binary, pattern generation is straightforward. It can be considered as<br />
recreating a square wave, doubling the frequency and halving the<br />
wavelength at each image level along the time axis. For a data projector<br />
projecting images with a resolution of 1024 × 768, a 10-level Gray code is<br />
needed to make sure that:<br />
1. all neighbouring rows or columns have different code words;<br />
2. all rows or columns have unique code words.<br />
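This square-wave view of the pattern stack can be sketched as follows, assuming the standard reflected-binary Gray code (the thesis' exact bit convention may differ):

```python
import numpy as np

def gray_stripe_stack(size: int = 768, bits: int = 10) -> np.ndarray:
    """Generate a stack of horizontally striped Gray-code patterns.

    Level k paints row r white iff bit k (MSB first) of r's Gray
    codeword is 1, i.e. each level is a square wave whose frequency
    doubles as the level increases. The reflected-binary code is an
    assumption here; the thesis' bit convention may differ.
    """
    rows = np.arange(size)
    gray = rows ^ (rows >> 1)                         # reflected-binary code
    shifts = np.arange(bits - 1, -1, -1)              # MSB first
    stack = (gray[None, :] >> shifts[:, None]) & 1    # (bits, size) of 0/1
    return (stack * 255).astype(np.uint8)

patterns = gray_stripe_stack(size=16, bits=4)
assert patterns.shape == (4, 16)
# The finest Gray-code level alternates in 2-pixel-wide stripes (figure 4.2b).
assert patterns[3].tolist() == [0, 255, 255, 0, 0, 255, 255, 0,
                                0, 255, 255, 0, 0, 255, 255, 0]
```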
Consider a 10-level horizontally Gray-coded image stack. During look-up<br />
table generation, instead of assigning a 10-bit code value to each row<br />
number, all possible decimal code values are listed and a row number is<br />
attached to each. By doing this, during the later table look-up stage, the<br />
corresponding row number of each incoming pixel with a 10-bit code word<br />
can be found faster, by indexing with its decimal value. In horizontal<br />
(row-wise) coding for a 1024 × 768 image, some code words do not exist<br />
after the whole image stack is coded, and these are assigned -1. A section<br />
of the look-up table for a 10-level Gray code looks like table 4.1. In<br />
vertical (column-wise) coding, all 1024 columns are assigned a valid<br />
positive decimal code value.<br />
Row Decimal (Binary)<br />
0 767 (1011111111)<br />
1 766 (1011111110)<br />
2 764 (1011111100)<br />
... ...<br />
510 427 (0110101011)<br />
511 426 (0110101010)<br />
512 -1<br />
513 -1<br />
... ...<br />
1022 84 (0001010100)<br />
1023 85 (0001010101)<br />
Table 4.1: 10 level Gray code look-up table.<br />
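The decimal-indexed look-up idea can be sketched as below. The textbook reflected-binary construction is used purely for illustration; the bit ordering behind the specific values shown in table 4.1 evidently differs:

```python
def gray_code(n: int) -> int:
    # Standard reflected-binary Gray code (illustrative; the thesis'
    # bit convention behind table 4.1 may differ).
    return n ^ (n >> 1)

def build_lut(num_rows: int, bits: int = 10) -> list:
    """LUT indexed by decimal codeword, giving the projector row.

    Codewords that are never produced are marked -1, as in table 4.1,
    so a pixel whose observed codeword is invalid can be rejected in
    O(1) during decoding.
    """
    lut = [-1] * (1 << bits)
    for row in range(num_rows):
        lut[gray_code(row)] = row
    return lut

lut = build_lut(num_rows=768)          # horizontal (row-wise) coding
assert lut[gray_code(42)] == 42        # a valid codeword maps back to its row
assert lut.count(-1) == 1024 - 768     # 256 codewords never occur
```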
For implementation, only a one-dimensional Gray code image set needs<br />
to be generated. As can be seen from figure 4.3, once the correspondence<br />
between the 2D point p in the camera plane and the stripe l in the<br />
projector plane is established via the Gray code, the 3D object point P<br />
lies at the intersection of a ray and a plane. The mathematical<br />
justification of the 1D Gray code is presented in section 4.5.<br />
Figure 4.3: Po<strong>in</strong>t-l<strong>in</strong>e triangulation.<br />
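In standard notation, the ray-plane intersection can be written compactly (a sketch; the full derivation appears in section 4.5 and may use different symbols):

```latex
% Hypothetical symbols: C is the camera centre, d the direction of the
% back-projected ray through p, and \pi_l the plane swept out by stripe l.
X(t) = C + t\,d, \qquad
\pi_l : \mathbf{n}^{\top} X + w = 0, \qquad
t^{*} = -\,\frac{\mathbf{n}^{\top} C + w}{\mathbf{n}^{\top} d}, \qquad
P = C + t^{*} d .
```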
4.3.3 Codification Mechanism<br />
The projection procedure consists of projecting a series of light patterns<br />
so that every encoded point in the observed image is identified by its<br />
sequence of intensities, which can be coded as a string of binary values<br />
(figure 4.4).<br />
Figure 4.4: Binary encoded pattern divides the surface into many<br />
sub-regions.<br />
The capture process starts with tak<strong>in</strong>g a snapshot <strong>of</strong> the scene with no<br />
projection. In severe light<strong>in</strong>g conditions such as a dark room, uniform<br />
light<strong>in</strong>g from the projector can be considered to help illum<strong>in</strong>ate the scene.<br />
The level <strong>of</strong> projection brightness can vary depend<strong>in</strong>g on the current light-<br />
<strong>in</strong>g condition, rang<strong>in</strong>g from zero brightness to a full white illum<strong>in</strong>ation.<br />
The first captured image serves as the colour texture map <strong>in</strong> the f<strong>in</strong>al rep-<br />
resentation <strong>of</strong> the current pose.<br />
After the first shot, the whole image stack is projected sequentially and<br />
images <strong>of</strong> the illum<strong>in</strong>ated scene are captured <strong>in</strong> the same order (figure 4.5).<br />
Coding the binary image stack is similar to coding the Gray-coded pattern<br />
images. For a pixel with 2D image coordinate (x, y) in a 10-level image<br />
stack, a binary code word is formed from the values at the same position<br />
along the time axis, and its decimal representation is used to look up the<br />
corresponding row number from the projector space in the table<br />
(table 4.1).<br />
(a) level = 4 (b) level = 5<br />
(c) level = 6 (d) level = 7<br />
Figure 4.5: Stripes being projected onto a fluffy doll. (10 level Gray<br />
coded stripes)<br />
By iterating this approach across the whole observed image, each pixel is<br />
first labelled with a 10-bit binary code word, and then attached to a row<br />
number – representing its original position in the projector image, as if<br />
the projection ray were reversed. A dense point-line correspondence map<br />
is thus established. Using an appropriate triangulation method, the scene<br />
point (X, Y, Z, 1)^T can be recovered, as discussed in section 4.5.<br />
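The per-pixel decoding step can be sketched as follows (an illustrative sketch; the toy look-up table here is an identity map, whereas a real one holds Gray-to-row mappings and -1 for unused codewords):

```python
import numpy as np

def decode_codewords(bit_stack: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Turn a (levels, H, W) stack of 0/1 images into projector row numbers.

    More significant bits come from earlier pattern images, as in the
    thesis; pixels whose codeword never occurs decode to -1 via the LUT.
    """
    levels = bit_stack.shape[0]
    weights = 1 << np.arange(levels - 1, -1, -1)      # MSB first
    codes = np.tensordot(weights, bit_stack, axes=1)  # (H, W) decimal codes
    return lut[codes]

# Toy example: a 3-level stack over a 1x2 image, identity LUT for brevity.
stack = np.array([[[1, 0]],
                  [[0, 1]],
                  [[1, 1]]])      # pixel 0 -> 101b = 5, pixel 1 -> 011b = 3
lut = np.arange(8)
rows = decode_codewords(stack, lut)
assert rows.tolist() == [[5, 3]]
```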
4.4 Practical Issues<br />
4.4.1 Image Levels<br />
To eliminate ambiguities in the table look-up, it is important that no two<br />
rows (columns) share the same codeword, so that for every single pixel in<br />
the observed image there can only be one row (column) in the projector<br />
image that matches that pixel. Therefore, to explicitly code the images<br />
projected by a data projector with its resolution set at 1024 × 768, a<br />
log2 1024 = 10 level Gray code is used to encode the pattern image,<br />
making sure each row or column is assigned a unique codeword. By doing<br />
this, it is possible to do the table look-up for the observed image solely<br />
based on the binary output image stack.<br />
An alternative to this is to use fewer patterns so that thin stripes are<br />
avoided. However, this has a few drawbacks. First, because not enough<br />
bits are used, there will be groups of pixels sharing the same codeword.<br />
Locating either the stripe centres or the edges between neighbouring<br />
stripes involves finding zero-crossings to determine the flip position<br />
between black and white stripes, which is not easy because of the<br />
blooming effect of the white stripes observed in the camera. Secondly,<br />
stripe centres are not always perceivable, depending on the convexity of<br />
the measuring surface and the presence of depth discontinuities.<br />
Furthermore, even if the stripe centres and edges are successfully located,<br />
interpolation needs to be done to estimate the other points in between;<br />
otherwise the density of the range information is compromised.<br />
Therefore, the maximum-level Gray code is found to be essential. Since<br />
thin stripes are then inevitable, further adaptations are considered to<br />
maintain robustness.<br />
4.4.2 Limited Camera Resolution<br />
A good example of a problem caused by the camera is the aliasing effect.<br />
As illustrated in the experiment of measuring a brick, shown in figure<br />
4.6a, after distortion recovery the level 5 stripe image is clean. However,<br />
by image level 10 the thin stripes are almost invisible. Instead, wavy<br />
artefacts appear in the observed image (figure 4.6b), and the resultant<br />
depth map is affected too (figure 4.6c).<br />
(a) level = 5. (b) level = 10. (aliasing appears)<br />
(c) depth map without plane subtraction. (d) depth map with plane<br />
subtraction.<br />
Figure 4.6: The aliasing effect causing errors in the depth map.<br />
To alleviate this problem, we simply run a scan on the plain desktop with<br />
no object placed on it. The depth map of the plain surface is used as a<br />
base surface, which is subtracted from every depth map estimated later<br />
on to compensate for this defect (figure 4.6(d)). Although the resultant<br />
depth map of the object surface is slightly distorted by this, the<br />
background noise (mostly from the tabletop) is entirely removed. This is<br />
the simplest and quickest way to alleviate the aliasing problem without<br />
replacing the capture device with a more expensive one or changing the<br />
system setup.<br />
Figure 4.7 gives a better visualisation by plotting the surface in 3D. The<br />
graph was generated by sampling the data every 20 pixels in both the x<br />
and y dimensions. It is clear that after base plane subtraction the uneven<br />
background is flattened.<br />
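The base plane subtraction amounts to a single array operation. A minimal sketch, in which `noise_floor` is a hypothetical tunable for clamping the background to zero:

```python
import numpy as np

def plane_subtract(depth: np.ndarray, base: np.ndarray,
                   noise_floor: float = 0.0) -> np.ndarray:
    """Subtract the empty-desktop depth map from an object scan.

    `base` is captured once by scanning the plain desktop; subtracting
    it flattens the aliasing-induced ripples shared by both scans, and
    values at or below `noise_floor` (a hypothetical tunable, not from
    the thesis) are clamped to zero so the background reads as flat.
    """
    diff = depth - base
    diff[diff <= noise_floor] = 0.0
    return diff

base = np.array([[0.25, 0.5], [0.25, 0.5]])  # ripples on the empty desktop
scan = np.array([[0.25, 0.5], [5.25, 6.0]])  # object occupies the bottom row
flat = plane_subtract(scan, base)
assert flat.tolist() == [[0.0, 0.0], [5.0, 5.5]]
```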
4.4.3 Inverse subtraction<br />
For various reasons, the captured image stack cannot be used straightaway<br />
to determine whether the investigated pixels are on (illuminated) or off<br />
(not illuminated) at each level: texture and reflectance properties differ<br />
across the scene, the ambient light is inconsistent, and the varying<br />
projection light adds further variation to the lighting condition.<br />
For example, the theoretical threshold between white (255) and black (0)<br />
is 128, but in reality this is never the case. A pixel on a dark object can<br />
still appear close to 0 brightness even when it is illuminated by a full<br />
white projection. However, if the image taken under full white projection<br />
has the image taken under full black projection subtracted from it, every<br />
pixel has a positive value in the difference image, regardless of whether<br />
it belongs to a black or a white object.<br />
(a) Before the subtraction. (b) After the subtraction.<br />
Figure 4.7: 3D plots of figure 4.6.<br />
Figure 4.8: Inverse subtraction of the original image and its flipped<br />
version.<br />
To address this issue in our system, for each level of projection, the<br />
original pattern and its inverted version (the black-white flipped image)<br />
are both projected, and the difference of the two observed images is taken<br />
to yield an image with positive and negative values. This brings the<br />
optimal black-white threshold value close to zero (figure 4.8). As a result,<br />
thresholding is done on the subtracted images instead of on the original<br />
versions.<br />
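The inverse subtraction can be sketched in a few lines (a sketch; the signed cast avoids unsigned wrap-around, and the toy values are illustrative only):

```python
import numpy as np

def inverse_subtract(positive: np.ndarray, negative: np.ndarray) -> np.ndarray:
    """Difference of the images seen under a pattern and its flipped version.

    Surface albedo and ambient light contribute equally to both captures,
    so they largely cancel: illuminated pixels come out positive and
    unilluminated ones negative, and the black/white decision reduces to
    a sign test around zero.
    """
    return positive.astype(np.int16) - negative.astype(np.int16)

# A dark object: its albedo roughly halves the projected brightness.
under_pattern = np.array([[120, 10]], dtype=np.uint8)   # left pixel lit
under_flipped = np.array([[10, 120]], dtype=np.uint8)   # right pixel lit
diff = inverse_subtract(under_pattern, under_flipped)
bits = (diff > 0).astype(np.uint8)
assert diff.tolist() == [[110, -110]]
assert bits.tolist() == [[1, 0]]
```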
In figure 4.9, a football with black stripes is being scanned. As can be<br />
seen from the picture, there are glares (white spots) caused by the<br />
projector light and the reflective surface of the football itself. Figures<br />
4.9(b) and (c) show the images captured when the level 4 stripe image<br />
and its flipped version, respectively, are projected onto the surface. It is<br />
noticeable that the threshold output of (b) (figure 4.9(d)) has obvious<br />
errors, because the inherent black pattern on the football itself stays<br />
black whether illuminated by white or black projection light. An optimal<br />
threshold is also very hard to choose, because it is object dependent and<br />
can be affected by the lighting conditions. Figure 4.9(e) is the subtraction<br />
of (b) and (c), with white pixels standing for positive values of the<br />
subtraction image, black pixels for negative values, and grey pixels for<br />
near-zero values. Figure 4.9(f) is the binary output of (e), with white for<br />
ones and black for zeros, which is a better version of (d).<br />
4.4.4 Adaptive threshold<strong>in</strong>g<br />
Trobina [97] (see section 4.2) tries to improve thresholding accuracy by<br />
assigning a different threshold value to each pixel, based on the white to<br />
black reflectance ratio calculated from a solid white projection and a full<br />
black projection. There are a few concerns when this is carried out in<br />
practice.<br />
Unlike a laser scanner, for a certa<strong>in</strong> po<strong>in</strong>t <strong>in</strong> the measur<strong>in</strong>g surface, the<br />
observed brightness depends on the neighbour<strong>in</strong>g projection rays around<br />
itself. Especially <strong>in</strong> the high frequency stripe image, for <strong>in</strong>stance, where<br />
each black and white stripe occupies two rows or columns, it is never guar-<br />
anteed that a particular po<strong>in</strong>t that falls <strong>in</strong>to a black stripe will appear the<br />
same as when the scene is projected by full black.<br />
To cope with this uncertainty, three-level adaptive thresholding is used instead of binary thresholding. A dead zone around zero is introduced to deal with uncertainties, and its size is set empirically. For any pixel with brightness outside the dead zone, the normal binary threshold is applied. Otherwise, the pixel is further inspected at the next image level. Pixels falling into the dead zone at two successive levels are rejected as background points, and they are not processed further in the remaining levels.

Figure 4.9: The inverse subtraction: the football experiment. (a) texture map; (b) stripe image (positive, level 4); (c) stripe image (negative, level 4); (d) threshold of (b), t=100; (e) subtraction of (b) and (c); (f) binary image of (e).
This is inspired by one of the properties of Gray coded images: no pixel is located at a stripe transition at two consecutive levels (see figures 4.1, 4.2). This means any uncertain pixel encountered at one level can be verified by its appearance at the same position in the previous image, and it can be classified as a background or shadowed point if no clean-cut decision (either black or white) can be made in two consecutive image levels.
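The per-pixel logic above can be sketched as follows. This is an illustrative numpy rendering of the rule, not the thesis implementation; the dead-zone width and the label values are assumptions:

```python
import numpy as np

UNKNOWN, BLACK, WHITE = -1, 0, 1

def three_level_threshold(diff, dead_zone=10):
    """Classify each pixel of a positive-minus-negative stripe image.

    diff      : int array, captured(positive stripes) - captured(negative stripes)
    dead_zone : half-width of the uncertainty band around zero (set empirically)
    """
    out = np.full(diff.shape, UNKNOWN, dtype=int)
    out[diff > dead_zone] = WHITE    # clearly brighter under the positive pattern
    out[diff < -dead_zone] = BLACK   # clearly brighter under the negative pattern
    return out                       # UNKNOWN pixels are re-inspected at the next level

def classify_levels(diffs, dead_zone=10):
    """Run the three-level threshold over successive Gray-code levels.

    A pixel that stays in the dead zone at two consecutive levels is
    rejected as background and excluded from the remaining levels.
    """
    h, w = diffs[0].shape
    bits = np.zeros((len(diffs), h, w), dtype=int)
    unknown_streak = np.zeros((h, w), dtype=int)
    alive = np.ones((h, w), dtype=bool)
    for lvl, diff in enumerate(diffs):
        code = three_level_threshold(diff, dead_zone)
        unknown = (code == UNKNOWN) & alive
        unknown_streak = np.where(unknown, unknown_streak + 1, 0)
        alive &= unknown_streak < 2      # two dead-zone hits in a row -> background
        bits[lvl] = np.where(alive & ~unknown, code, UNKNOWN)
    return bits, alive
```

The Gray-code property makes this safe: a pixel cannot sit on a stripe transition at two consecutive levels, so two consecutive dead-zone hits genuinely indicate background or shadow.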
4.5 Depth from Triangulation
In some cases [6] where the camera and projector have the same orientation (strictly facing the same direction) and their displacement is known (controlled displacement, for example both mounted on a fixed rail), the coordinates of a 3D point can be estimated through simplified triangulation without the external geometry of the projector and camera. However, this is not considered in our application, since it requires highly customised hardware.

A general-purpose triangulation method for structured light systems is considered here, where the camera and the projector can be turned to arbitrary angles, and both are properly calibrated at an earlier stage. For details of the calibration of a projector-camera system, please refer to chapter 3.
Let point (x, y) be the 2D point currently being investigated. To recover its 3D coordinates (X, Y, Z), we build the full projection model (equation 3.25) using homogeneous coordinates [42],

\[
w \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} =
\begin{pmatrix}
c_{11} & c_{12} & c_{13} & c_{14} \\
c_{21} & c_{22} & c_{23} & c_{24} \\
c_{31} & c_{32} & c_{33} & c_{34}
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\qquad (4.1)
\]

where the 3 × 4 matrix C = K_c(R_c | T_c) is the camera projection matrix and w is a scale factor.

To cancel out the scale factor w, divide the first row of eq 4.1 by the third,

\[
(c_{11} - x c_{31})X + (c_{12} - x c_{32})Y + (c_{13} - x c_{33})Z + (c_{14} - x c_{34}) = 0 \qquad (4.2)
\]

By dividing the second row by the third in eq 4.1,

\[
(c_{21} - y c_{31})X + (c_{22} - y c_{32})Y + (c_{23} - y c_{33})Z + (c_{24} - y c_{34}) = 0 \qquad (4.3)
\]
If point (x, y) corresponds to (m, n) in the projector plane, similarly

\[
w' \begin{pmatrix} m \\ n \\ 1 \end{pmatrix} =
\begin{pmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34}
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\qquad (4.4)
\]

Using the same method to cancel out the scale factor w',

\[
(p_{11} - m p_{31})X + (p_{12} - m p_{32})Y + (p_{13} - m p_{33})Z + (p_{14} - m p_{34}) = 0 \qquad (4.5)
\]

\[
(p_{21} - n p_{31})X + (p_{22} - n p_{32})Y + (p_{23} - n p_{33})Z + (p_{24} - n p_{34}) = 0 \qquad (4.6)
\]
From equations 4.2, 4.3 and 4.6, we have

\[
\begin{pmatrix}
c_{11} - x c_{31} & c_{12} - x c_{32} & c_{13} - x c_{33} & c_{14} - x c_{34} \\
c_{21} - y c_{31} & c_{22} - y c_{32} & c_{23} - y c_{33} & c_{24} - y c_{34} \\
p_{21} - n p_{31} & p_{22} - n p_{32} & p_{23} - n p_{33} & p_{24} - n p_{34}
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = 0
\qquad (4.7)
\]
This now becomes a problem of solving a set of linear equations. The first matrix in eq 4.7 is often referred to as the measurement matrix A. The vector (X, Y, Z, 1)^T is solved by finding the eigenvector with the least eigenvalue of the matrix A^T A [4].

Equivalently, equation 4.7 can also be constructed from equations 4.2, 4.3 and 4.5. Choosing either of the two equations 4.5 and 4.6 yields the same result, which proves that the structured light projection only needs to be done in one dimension, either horizontally or vertically. Using both of them is not recommended, as it understandably doubles the capture time while only providing an over-determined linear equation system. The final system therefore only uses horizontal stripes.
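The solution of eq 4.7 fits in a few lines of numpy: stack the three rows, then take the right singular vector with the smallest singular value, which is the least eigenvector of A^T A. A minimal sketch (the function name is mine; C and P stand for the calibrated camera and projector projection matrices):

```python
import numpy as np

def triangulate(C, P, x, y, n):
    """Solve eq. 4.7 for the 3D point via the least eigenvector of A^T A.

    C      : 3x4 camera projection matrix
    P      : 3x4 projector projection matrix
    (x, y) : observed camera pixel
    n      : decoded projector row (horizontal stripes give one coordinate)
    """
    A = np.array([
        C[0] - x * C[2],   # (c_1k - x c_3k) row, eq. 4.2
        C[1] - y * C[2],   # (c_2k - y c_3k) row, eq. 4.3
        P[1] - n * P[2],   # (p_2k - n p_3k) row, eq. 4.6
    ])
    # Least eigenvector of A^T A == right singular vector of A
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]    # de-homogenise
```

The decoded projector row n comes from the Gray-code classification of section 4.4; using eq 4.5 (the projector column m) instead would, as argued above, give the same point.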
4.5.1 Final Captured Data

After each successful structured light scan, the following data are captured and saved into memory for further processing. Figures 4.10 to 4.13 show the rendered data in the form of images. The scattered 3D point sets can be rendered at any arbitrary pose; figures 4.12 and 4.13 show them rendered at one pose.
Figure 4.10: Depth map.
Figure 4.11: Colour texture.
Figure 4.12: Scattered point set in 3D (re-sampled every 2 millimetres).
Figure 4.13: Scattered point set in 3D, attached with colour information (re-sampled every 2 millimetres).
4.6 Conclusions
This chapter introduces a method for acquiring depth information using a structured light system. After studying the existing codification schemes, Gray coded structured light is used in this research for its simplicity and robustness. A variety of problems were encountered during implementation, and solutions are provided to tackle them. Preliminary experimental results suggest the proposed techniques positively enhance the system performance.
First, we justify both theoretically and experimentally that it is essential to use the maximum level of Gray coded images. This carries the risk of making the stripes too thin to be detected by the camera, which has limited resolution and is mounted high above the desktop surface.
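For reference, the binary-reflected Gray code behind these stripe patterns can be generated and decoded with two tiny helpers; a minimal sketch (function names are mine, not the thesis implementation):

```python
def gray_encode(i: int) -> int:
    """Binary-reflected Gray code of integer i."""
    return i ^ (i >> 1)

def gray_decode(g: int) -> int:
    """Invert the Gray code by cascading XORs."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i
```

Adjacent codewords differ in exactly one bit, which is the property exploited by the dead-zone verification of section 4.4.4; labelling k projector rows uniquely needs ceil(log2(k)) stripe levels, the "maximum level" argued for above.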
Secondly, because of the large distance between the ceiling-mounted camera and the tabletop, and the limited capture resolution of the camera, stripes that become too thin cause an aliasing effect in the observed images (figure 4.6). When a single-pixel-wide line is projected, it can be observed in the camera image as a combination of three or four neighbouring lines. When multiple lines close to each other are projected, the observed lines are likely to mix with each other (figure 4.14). This not only causes a visual aliasing effect, but also assigns multiple lines to a single 2D image pixel. A base plane subtraction method is proposed to deal with this challenge caused by the aliasing effect.
Thirdly, the inverse subtraction and adaptive thresholding are combined to perform robust codeword generation. This is a big boost to the codification: we are no longer concerned with the object surface colour, while these techniques maintain the optimal threshold for 0s and 1s close to zero.
Figure 4.14: Illustration of the camera's limited resolution. (a) The projector image: lines are single-pixel-wide and three pixels apart from each other. (b) The observed image: one thicker line is observed instead of three clean-cut lines.
Finally, it is geometrically and mathematically justified (figure 4.3 and section 4.5) that the structured light projection is only required to run in one dimension, either horizontal or vertical, provided the proper triangulation method is used.
4.6.1 Future Work
With the current projector-camera setup, the system performance is mostly hindered by the limited capture resolution of the camera and the distance between the ceiling-mounted camera and the tabletop. However, once the VAE is set up and running, it is not possible to change these factors. Therefore, efforts need to be made in other areas to compensate for this negative contribution.
In section 4.4.2 a method of base plane subtraction is proposed to compensate for the aliasing effect caused by the aforementioned defects. This method is still preliminary and has its own limitations. The most significant one is that the subtraction is restricted to the planar surface (in this case, the tabletop); it is incapable of modelling the artifacts caused by aliasing on arbitrary object surfaces. Future research could investigate this area further to properly model this distortion.
It is claimed in this chapter (section 4.4.1) that the maximum possible level of Gray coded images should be used, to uniquely label every row or column in the rendered projection image and to avoid codeword sharing. This is based on the fact that a dense depth map is required in this shape acquisition process. In certain applications where only a sparse depth map is needed, it is possible to use fewer levels of stripe images. Sparse depth information can be recovered at the stripe transitions or located stripe centres, and a big plus is that the camera is not forced to capture the scene illuminated by thin stripes beyond its resolution.
Future work on photometric calibration discussed in the previous chapter (section 3.6) also relates to developments in structured light systems. A successful calibration of the photometric properties of the camera could lead to the use of colour-based structured light systems. Since colour-based methods normally use fewer images (sometimes just one), this opens up the possibility of turning shape acquisition into a real-time process. This would be an attractive feature for the VAE: with real-time depth scanning capability, many applications could be built within the VAE framework.
Chapter 5

Registration of Point Sets
Creating a 3D model of a real object is a multi-stage process, because cameras only deliver data from one view of the target object at a time. To obtain a complete model, either the scanner must shoot from different views to cover the whole object, or equivalently the object must be moved relative to a stationary scanner. Whichever scenario is chosen, registration of the scanned data from the different views is required. This chapter focuses on this subject.
After each structured light scan, a cloud of point samples from the surface of an object is obtained. Placing the object at different positions on the tabletop, or in different orientations towards the camera, yields several point sets, which are expected to cover the whole surface of the object to be measured. The objective of registration is to fuse these clouds together by estimating the transformations between them, placing all the data into the same reference frame for visualisation or further processing.
The process of point set fusion begins with 2D image registration on the colour textures of the two participating views, where the interest points are first extracted by corner detectors and then correlated. Once the 2D correspondences are established, the 3D coordinates of the matched points are used as control points to estimate the rotation and translation in 3D space between the two sets of points. The estimated rotation and translation are used as an initial guess to perform a trial merge, by warping one point set onto the other in 3D space based on the estimated transform. The user has the final decision of whether to accept this trial given by the computer, or to manually improve the fusion of the point sets by tuning them into different poses in a virtual environment using the augmented tools.
The whole process combines automated image processing and human interaction. For example, tasks such as 2D image registration or exhaustive searching for transformation vectors are executed by automated processes, while the final tuning and merging is handed over to human interaction. This is not only because humans are chosen to be the decision makers, but also because this is what humans are good at: spotting where things go wrong and responding in an effective way. The rest of this chapter explains this in detail.
5.1 Introduction
Assume there exist two point sets {m_i} and {d_i}, i = 1, 2, ..., N, and the correspondences between them are already established, either from ground truth or by matching the point sets in 3D space. We name {m_i} the model points and {d_i} the data points. If they are both from the same model, the objective is to find the relative rotation and translation from the data points to the model points, so that in 3D space they are related by

\[
d_i = R m_i + T + e_i \qquad (5.1)
\]

where R is the 3 × 3 rotation matrix, T is the 3 × 1 translation vector and e_i is a noise vector. Solving for the optimal estimates \(\hat{R}\) and \(\hat{T}\) that map the two point sets is a least squares minimisation problem:

\[
\sum_{i=1}^{N} \left\| d_i - \hat{R} m_i - \hat{T} \right\|^{2} \qquad (5.2)
\]
Because the correspondences between the point sets are unknown a priori, the most straightforward method to register two point sets is exhaustive search in 3D space. However, this method faces challenges in processing time, convergence speed, and falling into local minima. It is not complex to implement, but it consumes a lot of processing power and is therefore not suitable for VAE systems.
Using calibrated motion is another routine for solving the registration, but this brings new problems too. To control the movement of either the scanners or the object to be measured, additional hardware such as rails and turntables is inevitable. The scanner may require extra calibration as well. More importantly, in the context of the VAE, it is desired that the restriction to controlled motion be lifted, so that the object to be measured can be freely moved into different poses in 3D space under the guidance of the user.
In this research, the routine we choose to fuse two point sets incorporates three stages: 2D planar image registration (section 5.3), point set registration using corresponded features with a voxel based quantisation process (section 5.4), and rendering (section 5.5). They are discussed separately in the rest of this chapter. When more than two views are presented, the problem is reduced to a chain of pairwise registrations.
5.2 Background

Figure 5.1: A routine of point set registration.
5.2.1 Rotations and Translations in 3D

There are several common ways to build a rotation matrix. The most frequently documented representation is to rotate a point around one of the three coordinate axes. The advantage of this representation is that the generated 3 × 3 rotation matrix can be applied to 3D points by matrix manipulation straight away. To rotate a point around the X, Y, and Z axes, we have:
\[
R_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix} \qquad (5.3)
\]

\[
R_y = \begin{pmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{pmatrix} \qquad (5.4)
\]

\[
R_z = \begin{pmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (5.5)
\]
where θ, φ, and ψ are the rotations around the X, Y, and Z axes respectively. More detailed discussions of rotation in 3D are given in [44], [47].
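As a sanity check on equations 5.3–5.5, the three axis rotations can be written directly in code; a minimal numpy sketch (function names are mine, not the thesis's):

```python
import numpy as np

def rot_x(theta):
    """Rotation by theta around the X axis (eq. 5.3)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0],
                     [0, c, -s],
                     [0, s,  c]])

def rot_y(phi):
    """Rotation by phi around the Y axis (eq. 5.4)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[ c, 0, s],
                     [ 0, 1, 0],
                     [-s, 0, c]])

def rot_z(psi):
    """Rotation by psi around the Z axis (eq. 5.5)."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0],
                     [s,  c, 0],
                     [0,  0, 1]])
```

A general rotation is then composed as, for example, R = rot_z(psi) @ rot_y(phi) @ rot_x(theta); note that matrix composition order matters, so a convention must be fixed.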
5.2.2 An SVD-Based Least Squares Fitting Method

SVD is one of the most significant topics in linear algebra, and it has considerable theoretical and practical value [54, 62, 95]. A very important feature of SVD is that it can be performed on any real matrix. The decomposition factors a matrix A into three matrices U, S, V such that A = USV^T, where U and V are orthogonal matrices and S is a diagonal matrix. SVD is also a common tool for solving least squares problems (section 3.5, section 4.5).
Arun, Huang and Blostein [3] proposed a method of computing the 3D rotation matrix and translation vector through the Singular Value Decomposition (SVD) of the 3 × 3 correlation matrix, which is built as follows,

\[
H = \sum_{i=1}^{N} m_{c,i} \, d_{c,i}^{T} \qquad (5.6)
\]

where m_{c,i} and d_{c,i} are obtained by translating the original data sets m_i and d_i (equation 5.2) to the origin.
The SVD of the correlation matrix is H = USV^T, and the optimal rotation matrix is first computed from

\[
\hat{R} = V U^{T} \qquad (5.7)
\]

The computation of \(\hat{R}\) is also known as the Orthogonal Procrustes Problem [88].

The optimal translation is the vector that aligns the centroids of the point sets d_i and m_i, which is

\[
\hat{T} = \bar{d} - \bar{m} \qquad (5.8)
\]

The uncentred form \(\hat{T} = \bar{d} - \hat{R}\bar{m}\) is equivalent here, because the rotation is estimated between the centred point sets, and rotating a centred point set about the origin does not move its centroid.
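The whole estimation of eqs 5.6–5.8 fits in a few lines of numpy. This is a sketch of the Arun, Huang and Blostein procedure, not the thesis code; the determinant guard against reflections is a standard addition not discussed above, and the translation is returned in the uncentred form:

```python
import numpy as np

def fit_rigid(model, data):
    """Least-squares R, T such that data ≈ model @ R.T + T.

    model, data : (N, 3) arrays of corresponding points.
    """
    m_bar = model.mean(axis=0)
    d_bar = data.mean(axis=0)
    # 3x3 correlation matrix of the centred sets, eq. 5.6
    H = (model - m_bar).T @ (data - d_bar)
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                     # eq. 5.7
    if np.linalg.det(R) < 0:           # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = d_bar - R @ m_bar              # uncentred form of eq. 5.8
    return R, T
```

Given noise-free correspondences the recovery is exact; with noisy data it is the least-squares optimum of eq 5.2.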
5.3 Image Registration

5.3.1 Corner Detector
Building a dense correspondence map from a given pair of input images is not practical, considering the amount of computation involved. Therefore the first step is to choose a set of distinguished points as interest points from both input images. To find these interest points, a Harris corner detector [46] is applied to the textures. The corner detector uses the following structure matrix to evaluate whether a given pixel is a corner or not.
\[
G = \begin{pmatrix}
\sum_{w} f_x^{2} & \sum_{w} f_x f_y \\
\sum_{w} f_x f_y & \sum_{w} f_y^{2}
\end{pmatrix}
= Q \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} Q^{T}
\qquad (5.9)
\]
The second part of equation 5.9 is the eigendecomposition of the matrix to its left; f_x and f_y are the first derivatives in the horizontal and vertical directions respectively, and w is the aggregation window. For the two output eigenvalues, λ_1 ≥ λ_2, and the Harris corner detector defines that when λ_2 ≫ 0 the pixel can be interpreted as a corner within a certain region. Even though the sign of the eigenvalues contains information about the local gradient, we are not interested in it here, as our purpose is to find the points of interest.
To implement the corner detection algorithm, f_x and f_y are first computed from the convolution of the original image with two derivative kernels

\[
D_x = \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix} \qquad (5.10)
\]

\[
D_y = \begin{pmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix} \qquad (5.11)
\]
The G matrix is constructed for each pixel from the derivatives, and it is aggregated over the neighbouring pixels. Then the two eigenvalues of the G matrix are computed, and the smaller of the two is stored. A pixel is considered a corner if it has the biggest stored eigenvalue in the given area and the value is greater than a threshold. This step is repeated for all pixels in both input images.

The whole process is shown in figure 5.2. A pair of images of a corridor is used to better illustrate the extraction of corner points. w1 is the aggregation window for the structure matrix G (eq. 5.9). w2 is the local evaluation window within which the pixel with the biggest λ_2 is considered a corner candidate. λ is the threshold for λ_2: the pixel is considered a corner if λ_2 > λ.
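The per-pixel pipeline above can be sketched in numpy. This is illustrative rather than the thesis implementation: central differences stand in for the D_x, D_y convolutions, neighbourhood sums are done with wrapping shifts, and the default values of w1, w2 and the threshold are assumptions:

```python
import numpy as np

def shift_sum(a, r):
    """Sum over a (2r+1)x(2r+1) neighbourhood (edges wrap; fine for a sketch)."""
    out = np.zeros_like(a)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
    return out

def shift_max(a, r):
    """Maximum over a (2r+1)x(2r+1) neighbourhood."""
    out = a.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out = np.maximum(out, np.roll(np.roll(a, dy, axis=0), dx, axis=1))
    return out

def min_eigen_corners(img, w1=1, w2=2, lam=1000.0):
    """Corner detection by the smaller eigenvalue of the structure matrix G (eq. 5.9).

    w1: aggregation radius for G; w2: local evaluation radius; lam: threshold
    on the smaller eigenvalue. All parameter values here are illustrative.
    """
    img = img.astype(float)
    fx = np.zeros_like(img)
    fy = np.zeros_like(img)
    fx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2  # stand-ins for the Dx, Dy
    fy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2  # kernel convolutions
    gxx = shift_sum(fx * fx, w1)
    gyy = shift_sum(fy * fy, w1)
    gxy = shift_sum(fx * fy, w1)
    # Closed-form eigenvalues of the 2x2 symmetric matrix [[gxx, gxy], [gxy, gyy]]
    half_tr = (gxx + gyy) / 2
    disc = np.sqrt(((gxx - gyy) / 2) ** 2 + gxy ** 2)
    lam2 = half_tr - disc                     # the smaller eigenvalue
    corners = (lam2 > lam) & (lam2 >= shift_max(lam2, w2))
    return corners, lam2
```

On a synthetic step image the smaller eigenvalue peaks only where gradients in both directions meet, i.e. at the corner, while pure edges score near zero.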
5.3.2 Normalised Cross Correlation

After the interest points are detected in the input image pair, correspondences are found using Normalised Cross Correlation (NCC) [77]. For each interest point in the left image, we look for its maximum correlation in the right image using the NCC cost function below,

\[
NCC = \frac{\sum_{(x,y)\in W} (f_1(x,y) - \bar{f}_1)(f_2(x,y) - \bar{f}_2)}
{\sqrt{\sum_{(x,y)\in W} (f_1(x,y) - \bar{f}_1)^{2} \sum_{(x,y)\in W} (f_2(x,y) - \bar{f}_2)^{2}}}
\qquad (5.12)
\]
where f_k(x, y) is the k-th image block, f̄_k is the average value of the block, and W is the size of the search window.

Figure 5.2: Corner detection. (a) Original image; (b) Gaussian smoothed image; (c) first x derivatives; (d) first y derivatives; (e) eigenvalue image; (f) detected corners.
To implement the algorithm, we first take a pixel from the left image and construct an N × N block centred at that pixel. We then calculate the NCC between the current block and all the corner points encountered in the right image, within the search range W. The corner point with the maximum NCC value is assigned as the corresponding pixel. This process is repeated for all the interest points in the left image.
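The matching loop just described can be sketched as follows; `ncc` implements eq 5.12 directly, and `best_match` with its parameters (block size N, search range W) is an illustrative name of mine, not the thesis implementation:

```python
import numpy as np

def ncc(block1, block2):
    """Normalised cross correlation of two equal-sized blocks (eq. 5.12)."""
    a = block1 - block1.mean()
    b = block2 - block2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_match(left, right, p, corners, N=7, W=64):
    """For interest point p in the left image, return the right-image corner
    with the highest NCC score inside the search range W."""
    r = N // 2
    y, x = p
    block = left[y - r : y + r + 1, x - r : x + r + 1]
    best, best_score = None, -2.0
    for (cy, cx) in corners:
        if abs(cy - y) > W or abs(cx - x) > W:
            continue                     # outside the search window
        cand = right[cy - r : cy + r + 1, cx - r : cx + r + 1]
        if cand.shape != block.shape:
            continue                     # too close to the image border
        s = ncc(block, cand)
        if s > best_score:
            best, best_score = (cy, cx), s
    return best, best_score
```

Restricting the candidates to detected corners keeps the cost linear in the number of interest points rather than in the number of pixels.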
Results are shown in figures 5.3 and 5.4. Choosing different sizes of the search window yields different results. Especially when periodic patterns are involved, the checkerboard for example, the result is far less accurate if an inappropriate search window size is chosen. Furthermore, at this stage the correspondences are not one-to-one: for a corner point in the right image, more than one point from the left image may find it as the best match. The details of mismatch removal are discussed in section 5.3.3.
5.3.3 Outlier Removal

With the given correspondences, we feed them into the correlation matrix (eq. 5.6) so that the rotation matrix and translation vector can be estimated. To do this reliably, outliers need to be removed. The RANdom SAmple Consensus (RANSAC) algorithm [38] is widely used for robust fitting of models in the presence of data outliers. The algorithm repeatedly selects random data items and uses them to estimate the data model until a good fit is found or the maximum number of iterations is reached. Only the data that satisfy certain criteria are considered meaningful. The choice of criteria depends on the data to be measured; for example, it can be the Euclidean distance of a point to the centroid of a cloud of points, the disparity in brightness of a group of windowed pixels, or another cost function.

Figure 5.3: NCC results. (a) Search window W = 64; (b) search window W = 256.
Figure 5.4: NCC results (periodic pattern). (a) W = 64; (b) W = 256.

In this work, since the transform between the two observed images can be encapsulated in a 3 × 3 homography matrix, the RANSAC algorithm is implemented with the following adaptations:
1. Start with putative correspondences computed from NCC (section 5.3.2).
2. Repeat steps 3-7 N times, with N updated using algorithm 4.5 from [47].
3. Select a random sample of 4 correspondences and check them for
collinearity. If the sample is degenerate, reselect.
4. Compute the homography H using the method presented in section 3.5.
5. Calculate the distance for each of the putative correspondences,
d = d(mᵢ, m̂ᵢ)² + d(dᵢ, d̂ᵢ)², where m̂ᵢ and d̂ᵢ are the transformed points
based on the estimated homography H.
6. Count the putative correspondences consistent with the current H, by
the criterion that the distance calculated in step 5 is no greater than an
empirical threshold. The qualifying correspondences are inliers.
7. If the number of inliers for the current H is the maximum so far, update
H and the set of inliers consistent with it.
8. After the loop terminates, choose the group of inliers associated with
the best H found.
9. Re-estimate H using all the remaining inliers.
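The adapted loop can be sketched as follows. This is a minimal illustration under stated assumptions: `fit_h` and `sym_dist` stand in for the section 3.5 homography fit and the step-5 distance, and the iteration count is fixed here rather than updated adaptively as in [47].

```python
import itertools
import random

def _degenerate(pts, eps=1e-9):
    """Step 3: true if any three of the sampled points are (nearly) collinear."""
    for (ax, ay), (bx, by), (cx, cy) in itertools.combinations(pts, 3):
        if abs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax)) < eps:
            return True
    return False

def ransac_select(pairs, fit_h, sym_dist, thresh, iters=500):
    """Steps 2-8 of the adapted RANSAC over putative correspondences."""
    best = []
    for _ in range(iters):
        sample = random.sample(pairs, 4)
        if _degenerate([p[0] for p in sample]):
            continue                                  # degenerate sample: reselect
        H = fit_h(sample)                             # step 4
        inliers = [p for p in pairs
                   if sym_dist(H, p) <= thresh]       # steps 5-6
        if len(inliers) > len(best):                  # step 7
            best = inliers
    return best                                       # step 8; step 9 omitted here
```

As in the text, the sketch returns only the selected inliers and skips the final re-estimation of H.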
In the general case, because the homography H is estimated from 4 randomly
selected correspondences in each iteration, the final homography still needs
to be refined by recomputing it with all the qualifying inliers from the
putative correspondences, even if it was estimated from the best set of 4
pairs. In this work, however, we are only interested in choosing the reliable
correspondences rather than in the 2D projective transform between them,
so step 9 is not necessary and can be omitted.
(a) T = 100, 235 putative correspondences after NCC, 142 inliers.
(b) Rectified image pair.
Figure 5.5: Robust estimation (inliers shown by red connecting lines).

(a) T = 50, 80 putative correspondences after NCC, 80 inliers.
(b) Rectified image pair.
Figure 5.6: Robust estimation (inliers shown by index numbers).
5.4 Fusion
5.4.1 Data structure of a point set
The data structure of a point set is depicted in figure 5.7. For each point,
the following information is stored: its index in the data array, its 3D world
coordinates (X, Y, Z), its 2D image coordinates (x, y), and its colour
information in RGB channels.
Figure 5.7: Data structure of a point set.
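As an illustration, the per-point record of figure 5.7 could be declared as below; the field names are assumptions for the sketch, not the thesis implementation.

```python
from dataclasses import dataclass

@dataclass
class Point:
    """One record of the point-set data structure in figure 5.7."""
    index: int     # index in the data array
    X: float       # 3D world coordinates
    Y: float
    Z: float
    x: float       # 2D image coordinates
    y: float
    r: int         # colour information in RGB channels
    g: int
    b: int
```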
5.4.2 Point set fusion with voxel quantisation
For each single view, a point set is obtained by estimating the 3D positions
of the foreground pixels in the captured image. Background parts, such as
the table top and the non-projected area, have non-positive depth and are
rejected. The size of the point set is therefore the total number of pixels
that have positive depth in the corresponding depth image.
The data size can sometimes be huge. Figure 5.8(a) shows the point set
of a fluffy doll measuring roughly 600 mm in height, width and depth. The
resulting point set contains 34056 points, many of which are very close to
their neighbours in 3D space. This causes redundancy and increases the
burden of rendering the point set or transforming it in 3D. A voxel
quantisation method is presented here to deal with this problem.

(a) the point set (b) voxel quantisation
Figure 5.8: Voxel quantisation of the large data set.
For each point set, two copies are kept in memory. One copy is the
original data set, in which all the points are saved as a backup so that no
information is lost. The other copy is a slimmed version for display or
other front-end purposes. Slimming begins by estimating the extent the
point set occupies in 3D space, from the centroid of the point set and the
furthest points along the X, Y and Z axes. A cube of the estimated size is
then constructed to contain the whole point set, and it is divided into
voxels, i.e. smaller cubes (figure 5.8(b)). All 3D points falling into the
same voxel are averaged into one point, and the voxels with no points
falling into them are not considered.
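The quantisation step described above can be sketched with NumPy as follows; this is a minimal version that assumes the point set is an (N, 3) array.

```python
import numpy as np

def voxel_quantise(points, voxel_size):
    """Average all 3D points falling into the same voxel; empty voxels
    are never considered. Returns one point per occupied voxel."""
    origin = points.min(axis=0)                       # corner of the bounding cube
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(idx, axis=0,
                                   return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)                  # accumulate per voxel
    return sums / counts[:, None]                     # averaged representative point
```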
Bigger (and hence fewer) voxels give coarser quantisation and less detail
(figure 5.9). A point set with an original size of 34056 points is slimmed
using voxel sizes of s = 1 mm and s = 10 mm respectively. As the voxel size
increases, the point set becomes increasingly sparse.

(a) s = 1mm, 32097 points (b) s = 10mm, 4373 points
Figure 5.9: Different quantisation levels obtained by choosing different voxel sizes.
Figure 5.12 shows that the choice of voxel size can be object independent.
The football, the fluffy owl, and the vase (captured in two different
orientations) all differ in object size and surface structure (figures 5.10
and 5.11). In figure 5.12(a), the total points curve for the owl starts very
high but drops dramatically. This is because the physical size of this object
is much bigger than that of the other three objects tested. By comparing
figures 5.12(a) and (b), it is not hard to see that the total size of the point
set has very little impact on the proportion of data lost through voxel
quantisation. In the graph with the percentage curves, all four objects drop
in a similar manner as the voxel size increases.
(a) football (b) point set of (a)
(c) owl (d) point set of (c)
Figure 5.10: The captured objects of figure 5.12.
From figure 5.12(b), a universal voxel size of 2 mm can be chosen to
conserve over 80% of the original data, while choosing a voxel size of 5 mm
throws half of the information away. This is particularly useful because the
voxel size can be decided by how much the data from different views
overlap. The redundancy can be reduced to a minimum if an appropriate
voxel size is chosen.
(a) vase (horizontal shot) (b) point set of (a)
(c) vase (vertical shot) (d) point set of (c)
Figure 5.11: The captured objects of figure 5.12.
5.4.3 User Assisted Tuning
As discussed earlier, the transform between the two point sets in 3D space
can be estimated using the SVD based fitting algorithm (section 5.2.2),
from the set of matching points computed in section 5.3. Before committing
to the estimated transform, the user is given the chance to tune the point
sets manually. This process is visualised and the tuning result is instantly
reflected on the desktop, as shown in figure 5.13.
Further discussion of this interactive tuning and the scenario of multiple
point set registration is presented in more detail in sections 6.4.4 and 6.4.5.

(a) total points
(b) percentage of the original data size
Figure 5.12: The quantisation effect of choosing different voxel sizes on the
total point set size.
Figure 5.13: Manual tuning of point set registration.
5.5 Rendering A Rotating Object
Rotating an object about the WCS origin risks moving the object out of
the camera's field of view, so the common way to visualise the 3D data is
to rotate it about its centroid, as if the object were placed on a turntable.
For each object point, the instantaneously changing world coordinates
X′, Y′, Z′ are projected onto the 2D camera space using the calibrated
camera pose,
K(R|T) (X′, Y′, Z′, 1)ᵀ ≈ (x′, y′, 1)ᵀ    (5.13)
(x′, y′) is the moving 2D coordinate in camera space. We then attach the
colour information associated with the current point (figure 5.7).
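This rendering step (turntable rotation about the centroid followed by the projection of eq. 5.13) can be sketched as below; the function name and argument layout are illustrative assumptions.

```python
import numpy as np

def render_point(K, R, T, Xw, centroid, R_spin):
    """Rotate a world point about the object centroid (turntable style),
    then project it with the calibrated pose: K(R|T) X' ~ (x', y', 1)."""
    Xr = R_spin @ (Xw - centroid) + centroid          # X', Y', Z'
    p = K @ (R @ Xr + T)                              # homogeneous image point
    return p[:2] / p[2]                               # (x', y') after division
```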
(a) (b) (c)
(d) (e) (f)
Figure 5.14: Different rendered views. (top: rendered range images;
bottom: rendered object with colour texture attached)
5.6 Conclusions
In this chapter a framework is presented for the fusion of two 3D point
sets, in other words, the registration between two views. It combines
conventional automated 2D image registration, 3D point set registration,
and user-guided human-computer collaboration in a VAE. The proposed
framework correlates two sets of 3D data captured from different views of
the same object, ideally with an overlapping part shared between the two
views. The registration framework can be iterated to perform the fusion
of multiple views.
The process begins with 2D image registration on the colour textures of
the two participating views, where interest points are first extracted by
corner detectors and then correlated using Normalised Cross-Correlation
(NCC). Once the 2D correspondences are built, the 3D coordinates of the
matched points are used to estimate the transform in 3D space between
these two sets of points using Singular Value Decomposition (SVD) and
Orthogonal Procrustes [88]. The estimated rotation and translation are
used as an initial guess to perform a trial merge, by warping one point set
onto the other in 3D space. The user makes the final decision of whether
to accept this trial given by the computer, or to improve the fusion
manually by tuning the point sets into different poses in a virtual
environment using the augmented tools.
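The SVD/Procrustes fitting step summarised above can be sketched as follows; this is the standard Kabsch-style formulation, not necessarily identical to the section 5.2.2 implementation.

```python
import numpy as np

def rigid_fit(P, Q):
    """Estimate the rotation R and translation t aligning matched point
    sets P to Q (both (N, 3)) by SVD of the 3x3 correlation matrix."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                          # correlation matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)]) # guard against reflection
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t                                        # q ~ R p + t
```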
In addition to the registration itself, a voxel quantisation mechanism is
proposed and implemented to reduce data redundancy and speed up
rendering. This quantisation is particularly desirable in the multiple point
set fusion scenario, where the data redundancy is relatively large because
of the overlapping areas between a number of point sets. Preliminary
results also show that the optimal quantisation level is affected only by the
choice of voxel size, and is object independent.
5.6.1 Future Work
Although reasonable results can be achieved using automated registration
followed by the user's manual tuning, the two participating views should
share a fair amount of overlapping area, otherwise the registration results
can become very poor. This is the main cause of the extra data storage,
and performance can suffer when measuring objects with clean-cut
surfaces such as a rectangular box. Feature-based image registration also
makes it hard to work with objects that have very little texture.
Future work includes possible improvements in several areas:
• First, during image registration we deliberately hide as many technical
details as possible from users, while still providing them with a means of
working towards optimal results by adjusting the parameter settings
randomly within a closed interval. The interface could nevertheless be
elaborated to give the user more targeted control over the parameter
settings. For example, offering the user a choice of 'fewer corner points' or
'more tolerant cross-correlation' would be a more presentable approach
than simple randomised repetition.
• Second, the visualisation used in tuning can be improved (figure 5.13).
The user could be provided with a means of inspecting the point sets being
merged from a variety of angles, to help with the merge. This is
particularly helpful when fusing two pieces which share little overlapping
area, for example, two halves of a sphere.
• Last but not least, there is the possibility of depth information being
used to establish corresponding points when there is a lack of texture
across the surface. This can be regarded as using the depth map as an
alternative feature to the texture. Although the prospect of using depth
information for image registration faces the challenge of depth
inaccuracies (e.g. those caused by depth discontinuities), it is expected
that an appropriately combined use of the depth information and the
texture information would yield positive results.
Chapter 6
System Design
6.1 Introduction
In chapters 4 and 5, we discussed the shape acquisition stage and the
post-processing of the scanned data. Both are computer vision tasks
performed separately. In this chapter we address the design of a system
that incorporates these two components into a complete and interactive
system. The system provides the following:
1. An automatically generated and maintained platform on which the
data are visualised.
2. A planar surface with real objects and video-augmented signals.
3. Widget tools enabling user-computer interaction, without the need for
traditional input devices such as a mouse, keyboard or laser pointer.
4. Accurate automated facilities that are easy to use and correctable, with
the user deciding when, where and how to utilise them.
The most important feature of the system presented is that the user plays
an active role in the interactions. The user makes the final call on what is
to be done next, by giving various instructions using the tools provided.
Typical functionality includes range map touch-up, rejection of a scan,
capturing a snapshot, and more. Apart from triggering various computer
vision tasks, the user also decides what part of the collected data is
displayed, since the central display area is limited and not all of the
scanned data will be used. More detailed discussions of the user interface
are presented in section 6.4.
On the other hand, the computer itself offers the user help information,
either in a visualised way or in the form of text messages. The help
information can be a brief summary of the current data, offering the user
different options for what the next move might be, or how to trigger these
events. But this is a user-guided, user-centred system, so the user still has
the final call under all circumstances.
The calibration stage introduced in chapter 3, however, has to be a
stand-alone step and cannot be carried out in this augmented
environment, because (a) it is normally performed prior to everything else
if the camera-projector system is uncalibrated; (b) the interpretation of
human gestures requires an accurate mapping between the augmented
projections and the observed images; and (c) once the calibration is done,
there is no need to perform it again unless the positioning of the
projector-camera system or the table setup has been changed.
The rest <strong>of</strong> this chapter is organised as follows. In section 6.2, two wid-<br />
gets are <strong>in</strong>troduced. They are implemented to simulate two <strong>of</strong> two most<br />
frequently used gestures <strong>in</strong> the user-mach<strong>in</strong>e <strong>in</strong>teractions, the button push<br />
and the touchpad slide. The background and some other practical issues<br />
dur<strong>in</strong>g implementation are discussed as well. In section 6.3 the ma<strong>in</strong> user<br />
<strong>in</strong>terface <strong>of</strong> the system is <strong>in</strong>troduced. Some <strong>of</strong> the ma<strong>in</strong> utilities and func-<br />
tionality are presented <strong>in</strong> section 6.4. Section 6.5 is the conclusions.<br />
6.2 Widgets Provided for Interaction
6.2.1 Introduction
Where a vision system is used as the interactive device in a man-machine
collaboration, it is desirable to have an efficient way for the user to give
orders without having to turn to traditional input devices. In this research,
tabletop interaction is normally concerned with the hands rather than
other parts of the human body or other pointing devices. Hand gestures
are therefore the most frequently used means for the user to give
instructions.
The most common gesture is the button push, used to trigger an event. In
a vision system, a button push does not necessarily require physical
contact with the desktop surface. Without a touch screen or other contact
sensors, it is hard to detect visually whether the user's hand has touched
the interface. The method discussed here monitors the area of interest
over consecutive frames to analyse whether the button has been pushed,
kept pressed, or released.
Pointing is realised as another widget in this system, equivalent to a
touchpad on a laptop. When the pointing device is engaged, a rectangle in
the control area is assigned as a touchpad, while a cursor is rendered in
the data area. The user can slide a finger across the touchpad as if
working on a laptop. The fingertip movement in the observed images is
analysed and the system responds by changing the display location of the
augmented cursor.
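The mapping from touchpad to cursor can be illustrated as below; the rectangle representation and function name are assumptions made for this sketch.

```python
def touchpad_to_cursor(fx, fy, pad, canvas):
    """Map a fingertip position (fx, fy) inside the touchpad rectangle to
    a cursor position in the data area. Rectangles are (x, y, w, h)."""
    px, py, pw, ph = pad
    cx, cy, cw, ch = canvas
    u = min(max((fx - px) / pw, 0.0), 1.0)  # normalise and clamp to the pad
    v = min(max((fy - py) / ph, 0.0), 1.0)
    return cx + u * cw, cy + v * ch         # cursor location in the data area
```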
Figure 6.1(a) shows an image to be projected. The green rectangle in the
middle bottom section of the interface is the touchpad. The bottom image
shows the user operating the touchpad with the left hand and pressing a
button with the right hand.

(a) A projected image.
(b) The observed image.
Figure 6.1: A snapshot with touchpad and buttons.
Figure 6.2 shows an object being scanned to obtain the 2.5D depth map.
While the scan is being performed, the projection image space (shown in
figure 6.1(a)) is replaced with a 1024 × 768 Gray-coded stripe image. After
the scan is finished, the menus and control buttons reappear in the
interactive interface.

Figure 6.2: A captured image showing an object being scanned.
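One bit-plane of such a Gray-coded stripe pattern can be generated as below; this is the standard binary-to-Gray construction, shown for illustration rather than as the exact pattern used by the system.

```python
def gray_code_stripes(width, bit):
    """Column pattern for one Gray-coded stripe image: column c is lit
    iff the given bit of the Gray code of c (c XOR c>>1) is set."""
    return [((c ^ (c >> 1)) >> bit) & 1 for c in range(width)]
```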
6.2.2 Background
Most of the current finger detection techniques can be classified into three
main categories.
The majority of these techniques rely on background differencing [69,
63, 72] for the initial stage of image processing. In [69], Malik and Laszlo
develop a vision-based input device which allows hand interaction with
desktop PCs. They use a pair of cameras to provide the 3D positions of a
user's fingertips, and locate the fingertip and its orientation by segmenting
the foreground hand regions from the background. Parnham [74] proposes
a technique combining plane calibration with shadow removal via analysis
of the invariance image. Letessier and Bérard [63] present a technique that
combines an image differencing method with a fingertip detection
algorithm named the Fast Rejection Filter (FRF). The FRF is a set of rules
for classifying hand and non-hand pixels; however, it is only concerned
with detecting fingertips, not hand shape, and is therefore unable to detect
fingers that are pressed together.
Others make use of skin colour detection. In [2], a colour detection
method is presented using a Bayesian classifier [36] plus a small set of
training data. A curvature analysis algorithm is then applied to the
detected contours to determine peaks which could correspond to
fingertips. Quek et al. [78] develop a system named FingerMouse which
allows finger pointing to replace the mouse in controlling a desktop PC.
Their method involves segmentation via a probabilistic colour table
look-up that requires training, and a Principal Component Analysis (PCA)
based fingertip detection algorithm.
Using a mask to perform template matching is another way to detect
fingertips. There are techniques where researchers use markers [34, 32]
and gloves [96, 19]. Some researchers use fiducials [56] as the pointing
device, which also falls into this category.
Apart from the aforementioned main categories, an alternative is to use
more expensive hardware, such as a thermographic or infra-red camera,
to provide a clean binary image for further processing [58, 85].
6.2.3 Practical Issues
A few practical issues have to be addressed before background differencing
based finger detection techniques can be used in this system. Finger
detection for use in a VAE application differs from that used in a
conventional vision system. First, it must be resilient to the effects of
various lighting conditions, especially the projections. Second, it has to be
efficient, so as to be responsive without adversely affecting the
performance of the rest of the system. Third, a user should be able to walk
up to the tabletop and begin interacting without the need for extra
equipment such as markers or gloves. Last, it must provide interaction
without conventional input devices such as the mouse and keyboard, and
without the need for more expensive tabletop touch-screens, which means
the move and click behaviours usually provided by the mouse need to be
addressed.
Given the factors stated above, template matching based methods, which
may require extra training, are not suitable for this application. Moreover,
although both move and click can be detected in a single paradigm of
fingertip detection by responding to the instantaneous fingertip location,
processing the whole image for each frame is not efficient. Robust
background segmentation techniques usually involve analysing pixel
classifications by modelling them as a Mixture of Gaussians [25, 94, 53]
over a few adjacent frames. This inevitably causes processing overhead
and affects the overall system performance.
In this research, click is the dominant interactive gesture, so we model it
as a button-push action, with a number of virtual buttons provided within
the interface (figure 6.1). move is realised by designating an area as a
touchpad and switching it on and off depending on whether the locating
device is required for the current function; only the designated touchpad
area is processed, instead of the whole frame.
6.2.4 Implementation of Pushbutton
Figure 6.3: Finger detection.
Our approach to realising the pushbutton widget is to divide the button
into two areas (figure 6.3). Area A is the inner area where fingers are most
likely to be placed, and it is roughly the same size as a human fingertip.
Area B is the outer area.
Let A_t0 be the average luminance over area A at time t0, and A_t1 that at
time t1; then the average luminance change over this period is

∆A = A_t0 − A_t1    (6.1)

Similarly, for area B we have

∆B = B_t0 − B_t1    (6.2)

We can define a button as being touched if |∆A| > w1 and |∆B| < w2,
where w1 and w2 are both positive thresholds.
To detect button press and release events, the sign of ∆A needs to be
considered. Because human skin absorbs a larger proportion of the
incident light than the desktop surface (in this case a more reflective
whiteboard), the finger appears significantly darker than the background
in the image observed by the camera. By taking into account the sign of
∆A rather than only its absolute value, we can distinguish button press
from button release events. The advantage of this appearance-based finger
detection is that it is robust to changes in lighting conditions and
accidental occlusions.
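As a minimal sketch, the dual-region test described above can be written as follows. The helper names are hypothetical, and the inner and outer thresholds are named generically here rather than tied to the w symbols in the text:

```python
def mean_luminance(pixels):
    """Average luminance over a list of pixel values (0-255)."""
    return sum(pixels) / len(pixels)

def button_event(inner_t0, inner_t1, outer_t0, outer_t1,
                 inner_threshold, outer_threshold):
    """Classify the change between frames t0 and t1 for one button.

    Returns 'press', 'release', or None. A touch requires a large
    luminance change in the inner area A and a small change in the
    outer area B. The sign of dA distinguishes press from release:
    the finger is darker than the board, so luminance drops when the
    finger arrives, giving dA = A_t0 - A_t1 > 0 on a press.
    """
    dA = mean_luminance(inner_t0) - mean_luminance(inner_t1)
    dB = mean_luminance(outer_t0) - mean_luminance(outer_t1)
    if abs(dA) > inner_threshold and abs(dB) < outer_threshold:
        return 'press' if dA > 0 else 'release'
    return None
```

A hand waved over the whole button changes both areas, so the outer-area test fails and no event is reported; this is exactly what makes the dual-region design resistant to accidental triggering.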
In an early version [64] of our finger detection system, the button area
was monitored as a whole. The dual-region approach is more reliable. We
have tested the new approach for a continuous period of more than 24
hours, during which it survived extreme changes in lighting conditions
such as sunrise, sunset, pulling the blinds up and down, and switching
lights on and off. The buttons were never mistriggered.
Button calibration
The two thresholds w1 and w2 introduced above are set in different ways.
w1, which controls the outer region, is set empirically to a small value
so that the outer region of the button is intolerant of noise, making
the button less likely to be triggered accidentally. The inner region is
where the finger is normally pressed.
Figure 6.4: Button calibration. (a) The projected button. (b) The
observed button (no finger). (c) The observed button (finger pressed).
To decide the threshold w2 for the inner region, a quick calibration
process is run at start-up. First, a button is projected onto the surface
(figure 6.4). The system takes an image of the projected button and works
out the average pixel value of the inner region, say v1. In practice, v1
can be averaged over a small time period ∆t. A help message is then
displayed advising the user to press the button; similarly, let v2 be the
average pixel value of the inner region over a small time period. Then
w2 = v1 − v2.
Although w1 and w2 are both values averaged over a period of time, that
period is still short compared with how long the system will be up and
running. Therefore, in practice a tolerance factor t is applied:
w′1 = w1·t and w′2 = w2·t are used as the final threshold values.
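The calibration step above can be sketched as follows. The helper names are hypothetical; the samples stand for the per-frame inner-region averages gathered over the short observation window ∆t:

```python
def average_over_window(samples):
    """Average pixel value over a small time period,
    given as a list of per-frame averages."""
    return sum(samples) / len(samples)

def calibrate_inner_threshold(no_finger_samples, finger_samples,
                              tolerance=0.5):
    """Compute the inner-region threshold w2 = v1 - v2 and apply the
    tolerance factor t, giving the final threshold w2' = w2 * t."""
    v1 = average_over_window(no_finger_samples)  # button projected, no finger
    v2 = average_over_window(finger_samples)     # user pressing the button
    w2 = v1 - v2
    return w2 * tolerance
```

The tolerance factor deliberately shrinks the calibrated threshold so that the brief calibration window generalises to the much longer, more variable running conditions.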
Figure 6.5: The TPR and FPR of button push detection.
Figure 6.5 shows the effect of choosing different tolerance factors t on
the button detection performance. We also study the improvement of the
dual-region method over the previous implementation, in which the average
pixel value across the whole button region is used.
The test framework is designed as follows. For each method, we first
evaluate its TPR by repeatedly pressing the button and recording the rate
of successful detection. Then a hand is waved randomly over the button
using various gestures, and the rate of mis-triggering is recorded as the
FPR. In both experiments, the same action is repeated 100 times.
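The evaluation protocol reduces to simple rate estimation over repeated Bernoulli trials; the outcome values below are illustrative, not the measured results:

```python
def rate(outcomes):
    """Fraction of positive outcomes over a list of True/False trials."""
    return sum(outcomes) / len(outcomes)

# TPR: fraction of deliberate presses that were detected.
tpr = rate([True] * 85 + [False] * 15)   # e.g. 85 of 100 presses detected

# FPR: fraction of random hand waves that falsely triggered the button.
fpr = rate([True] * 8 + [False] * 92)    # e.g. 8 of 100 waves mis-triggered
```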
The top graph shows that with the old method, although increasing the
tolerance factor decreases the FPR, it does so at the expense of the TPR.
As the tolerance factor increases, the TPR drops to near 60% while the
FPR remains far too high at 40%. The new method shows promising results,
thanks to its dual-region design (figure 6.3 on page 157), which
effectively reduces the chance of the button being accidentally hit. In
the bottom graph the FPR of the new method is kept below 10%, while the
TPR stays above 80% with the tolerance factor set below 0.6.
All curves in both the top and bottom graphs show a similar downward
trend as the tolerance factor increases. This is expected, because a
smaller tolerance factor decreases the threshold values for both inner
and outer region detection, which ultimately makes both positive
detections and mis-detections more likely.
Button observation
For each single button, its position and size are fixed in the projection
image. Once a button is defined, it is assigned a constant 2D position
and size (length and width). The position and size of the button in the
observed image depend on the camera and projector setup. Since there is a
plane-to-plane projective transform between the camera space and the
projector space, induced by the desktop as a third plane (section 3.5),
once a button is attached to the source image its appearance (position
and size) in the observed image is known. Figure 6.6 illustrates a
projection image and its observed camera image.
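The mapping from projector space to camera space can be sketched as follows. The homography H is an illustrative 3×3 matrix; in the real system it would come from the camera-projector calibration of section 3.5:

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography in homogeneous form."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def observed_button_corners(H, x, y, width, height):
    """Corners of a projector-space button rectangle as seen in the
    camera image (generally no longer an axis-aligned rectangle)."""
    corners = [(x, y), (x + width, y),
               (x + width, y + height), (x, y + height)]
    return [apply_homography(H, c) for c in corners]
```

Because the homography is fixed once the setup is calibrated, the observed region to monitor for each button can be computed once at button-definition time rather than per frame.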
Figure 6.6: (a) The projected buttons; (b) their observations in the
camera image. (The red blocks only indicate the area to be monitored.)
6.2.5 Implementation of Touchpad
Real-time segmentation of moving regions in image sequences is done by
background subtraction. The simplest way to do this is to threshold the
difference between the current image and an image taken earlier without
any moving objects. However, dealing with lighting conditions that change
over time requires more sophisticated processing.
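The simplest approach mentioned above can be sketched directly, with images represented as 2D lists of grey values:

```python
def difference_mask(background, frame, threshold):
    """Per-pixel background subtraction by differencing: 1 where the
    frame differs from the reference background by more than the
    threshold (foreground), 0 elsewhere (background)."""
    return [
        [1 if abs(f - b) > threshold else 0
         for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]
```

A fixed reference image and threshold fail as soon as the lighting drifts, which is why the adaptive Mixture of Gaussians model is used instead in the actual system.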
As discussed earlier in section 6.2.3, a separate rectangular area is
assigned and a constant pattern is projected onto it as the touchpad.
This area is monitored, and the background subtraction algorithm is
applied only to that area in the observed frames.
The Mixture of Gaussians based adaptive background modelling method [25]
is used to generate a foreground mask for each frame. In this application
the detected foreground regions are fingers, sometimes with part of the
palm also included. Unlike most vision systems, we do not explicitly
segment the foreground blobs, because the only information needed from
the foreground region is the fingertip, and it is assumed that the finger
is always pointing up.
Figure 6.7 shows the result of the background segmentation algorithm on
four different occasions. From left to right, column-wise, the images are
captured when: 1. only one finger is present; 2. two fingers are present;
3. part of the palm is included; 4. the whole upper hand is included. The
fingertip is finally located at the top-middle position of the most
dominant blob in the resultant foreground region.
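Under the finger-points-up assumption, fingertip localisation can be sketched as follows. This is a simplification of the dominant-blob rule in the text: it takes the middle of the first foreground run found scanning from the top, without explicit blob segmentation:

```python
def fingertip(mask):
    """Return (row, col) of the fingertip in a binary foreground mask
    (2D lists of 0/1), or None if no foreground pixel exists. Scans
    from the top row downwards and takes the middle column of the
    first row containing foreground pixels."""
    for r, row in enumerate(mask):
        cols = [c for c, v in enumerate(row) if v]
        if cols:
            return (r, cols[len(cols) // 2])
    return None
```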
Figure 6.7: Fingertip detection using the background segmentation
algorithm. Rows: (a) original image; (b) background region; (c)
foreground region; (d) detected fingertip.
6.3 User interface
Since the whole system is based on interactions, it is important to have
a well-designed interface through which the user can give instructions
and receive feedback from the computer. It must therefore be
understandable, streamlined, and easy to use. Two principles are followed
closely in the design of the user interface. First, the data area is
maximised so that all relevant information and data can be presented.
Second, the various controls are efficiently grouped into sections while
taking as little space as possible. We are also aware that not all
control units need to be revealed at the same time, which saves the
limited desktop space.
The user interface itself is a 1024 × 768 image projected onto the
desktop surface. Figure 6.8 shows a screen shot of the working
environment. It is divided into five areas.
Left column
The left column is the preview area where all the thumbnails are listed.
Only thumbnails of views that have already been scanned are displayed
here. The user can switch between different views by pressing the
corresponding thumbnails. The currently investigated view is highlighted
with a red frame.
Right column
Figure 6.8: A screen shot of the working environment.
The right column is the area for system controls. These are the most
important system-wide controls, so they stay on display throughout the
whole process. From the bottom up, they are Lock, Snapshot, Scan,
Re-Scan, and Exit. The user might want to Lock the current desktop when
the target object needs to be re-positioned manually, or when the
tabletop is going to be unattended for some time, so that the buttons are
not accidentally triggered. When the desktop is locked, all buttons
except the Lock button are unresponsive until it is unlocked by the user.
Pressing the Scan button starts a new structured light projection. When
it is done, all relevant information, such as the texture map and depth
map, is displayed in the central area and the system returns to idle. A
thumbnail of this scan is also displayed in the left column. Re-Scan is
similar to Scan, the only difference being that pressing Re-Scan first
erases the data from the previous shape input. This is useful when a
structured light process is disturbed, which can result in unexpectedly
large errors in the scanned data; the data are deleted prior to the next
scan to save memory. At the top of this column is an Exit button to quit
the whole system.
Bottom left panel
The bottom left area contains four mode buttons: Inspect, Touchup,
Correspondence, and Visualise. Once a mode button is pressed, it stays
highlighted and the system engages the corresponding mode. Relevant guide
messages appear above the control panel to briefly introduce what can be
done in this mode, or sometimes to advise the user of the next possible
steps. The user can hit the same mode button again to quit the current
mode, or simply press another mode button to switch to a different mode
directly. Detailed discussion of the individual modes is given in
section 6.4.
Bottom right panel
The content displayed in the bottom right section depends on the mode
currently engaged.
Central display area
The central area holds the main display. Normally, all data displayed in
the central area are from the same view. This area is composed of four
sub-pictures: the depth map, the texture, the colour texture, and a
rendered model with the texture map attached to the depth map.
6.4 Main Utilities
In this section the main utilities of the system are introduced. They not
only function individually but also work collectively as a whole to
perform the 3D input task under the user's instructions. Although some of
the utilities require certain steps to be done first, there is no fixed
order in which they must be used. The user can switch between these modes
at any time based on what needs to be done next. If an invalid operation
is invoked, a warning message appears to advise the user of the correct
options.
We now give a brief overview of how the system works, then discuss the
individual utilities via an example scenario to illustrate how they
perform their individual tasks.
6.4.1 Overview
Figure 6.9 shows a screen shot of the system start-up projection. On the
left-hand side, a few placeholders are attached, each representing one
view. This is where the thumbnails of the captured views are going to be
placed after the user runs the structured light scan. On the right-hand
side are the attached system buttons, which can be hit at any time during
the process. The Lock button is placed at the bottom for the user's
convenience, to lock up the screen so it is temporarily unresponsive to
the user's instructions. Four mode buttons are also shown at the bottom
left; however, at this stage they will not invoke any applications
because there is currently no captured data to be processed.
At the bottom centre, a button with a small red area is attached and
flashes. A help message is displayed above the button to inform the user
of the button calibration, with a five-second countdown. After the
countdown, the user is expected to put his or her finger in the
designated area to perform the button calibration, and the system chooses
an optimal value for the button push detection threshold based on the
current room lighting, the projection illumination level, and this
specific person's skin colour. A detailed discussion of this calibration
process is given in section 6.2.4.
A quick structured light scan is done right after the button calibration,
as a plane calibration step (section 4.4.2). The Scan button (third from
the bottom in the right column, the one with the black and white stripes)
flashes to remind the user to capture data before any processing can be
carried out.
Figure 6.9: Screen shot of the system start-up state.
Once a scanned view is captured, some contents of the screen are updated.
A thumbnail of the current view is attached to the appropriate place in
the left column. It serves as an identification of the view it
represents. The user can switch between different views to perform the
processing task by pressing the corresponding thumbnails. The captured
data is visualised in the central display area in different forms: the
depth map, a rendered 3D partial model, the texture map, and the colour
map.
Various tasks can be performed right after a view is captured. In
general, there are four main modes the user can switch into:
• The Inspect Mode for checking the captured data without changing the
data itself. The user can inspect the data not only on the depth map
itself but also through a manipulable rendered 3D model.
• The Touchup Mode for touching up the depth map if an obvious error is
believed to have occurred.
• The Correspondence Mode for finding matching points, estimating the
transform between two views, and fusing the two views together. At least
two captured views are required for this mode.
• The Visualisation Mode for visualising the built 3D model. The user can
visualise the final 3D model that has been built, check which view
contributes to a certain part of the object, and see how well the views
are fused together by switching any of the views on and off.
Sections 6.4.2 to 6.4.5 use an owl object experiment as an example
scenario to show the usage of these utilities, both individually and
collectively.
6.4.2 Mode 1: Inspect
In Inspect Mode, the user adjusts the orientation of the selected
rendered model for viewing or checking purposes. The first four arrow
buttons rotate the rendered model in 3D space (pan and tilt), while the
two rightmost buttons adjust the magnitude gain of the rendered model for
further inspection of the surface.
Normally the very first move after a scan is to switch to this mode, to
examine the accuracy of the estimated depth map and see if there are any
outstanding errors, which can be caused by surface discontinuities,
shadows, reflectance artifacts, or other disturbances occurring during
the scan. The Inspect Mode does not involve any processing of the
collected data, but works closely with the other modes. One can switch to
this mode at any time for inspection purposes. It is sometimes helpful to
switch to a different view, if available, to double-check an identified
error and gain more confidence.
Figure 6.10: Owl experiment, 3 views captured, currently on view 1.

Figure 6.11: Owl experiment, 3 views captured, currently on view 0, model
rotated.
Figure 6.10 shows the projected display after three views are captured,
with view 1 currently selected. In the depth map, two white spots are
observed and initially identified as an obvious error. The error is more
obvious in the top-right picture, where it is rendered in 3D with the
colour map attached. The two spikes seen in that picture correspond to
the two bright spots found in the depth map, and this can be further
confirmed by rotating the rendered model to a more suitable angle (figure
6.11), where it can be clearly seen that the two spikes come from the
side of the owl's left foot. These spikes come from two tiny spots on the
owl's right leg (the one underneath), where the projector fails to
illuminate that little area even though it is within the view of the
camera.
Once the error is identified and confirmed, the user can move on to
Touchup Mode to correct it, after which they can switch back to inspect
the results again, though this is entirely the user's choice.
6.4.3 Mode 2: Touchup
Touchup Mode gives the user the opportunity to manually touch up the
depth map and improve the view, without having to adjust the system
parameters or run the shape acquisition stage once more. Although this
mode doesn't provide a sophisticated, detailed correction mechanism for
the depth map, it does offer a tool for the user to alleviate or erase
the most obvious errors based on their own judgement. Once the capture
error is clearly visualised in the Inspect Mode, this correction tool is
simple to use, fast, and effective.
In this mode, different functional controls are provided: a touch pad for
locating the cursor and a push button to commit the change. A speed
control button is also provided to adjust the cursor speed. The cursor
can be positioned quickly near the error point using faster cursor
movement; once it is close, slower cursor movement can be used to
pinpoint the error spot. The cursor is restricted to the depth map
sub-window.
The same owl object is used as an example to illustrate the touchup
process. First, an error point in the depth map is identified in the
Inspect Mode, as shown in figures 6.10 and 6.11. The error actually
occurs in the codification stage, where the codewords of a group of
pixels are wrongly built, so the table look-up result for those pixels is
incorrect. Figure 6.12 shows a row index image, which is the result of
the codification table look-up. In the row index image, the pixel value
corresponds to the row of the projection image by which it is
illuminated, and brighter pixels correspond to higher rows. This image is
an off-line inspection used during debugging and is not shown to the
user.
Figure 6.12: The row index picture of the first view (the brighter pixel
values correspond to higher rows in the projection image).
The touch-up process executes a median filter on the area located by the
cursor once the commit button is hit. The median filter is very effective
against the salt-and-pepper type of noise in this example. The result of
the touchup is not only shown on the depth map; it is also instantly
reflected on the rendered model in the image to its right (figure 6.13),
as the two are synchronised throughout the process. It is clearly seen
that the spikes in the rendered image caused by the depth error are no
longer present, compared with figure 6.11. (Note: the big increase in the
brightness level of the depth maps between figures 6.13 and 6.11 is
caused by scaling, because all displayed depth maps are re-scaled to
0-255; otherwise all pixels exceeding 255 would appear as full white.)
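The cursor-local median filter can be sketched as follows, with the depth map as a 2D list; the window size and border handling are illustrative choices, not necessarily those of the actual system:

```python
def median_filter_at(depth, row, col, radius=1):
    """Apply a 3x3 median filter to each pixel in the window of
    half-width `radius` centred at (row, col), clamping at the image
    border. Reads only the original depth map and returns a new one."""
    h, w = len(depth), len(depth[0])
    out = [list(r) for r in depth]
    for r in range(max(0, row - radius), min(h, row + radius + 1)):
        for c in range(max(0, col - radius), min(w, col + radius + 1)):
            neigh = [depth[rr][cc]
                     for rr in range(max(0, r - 1), min(h, r + 2))
                     for cc in range(max(0, c - 1), min(w, c + 2))]
            neigh.sort()
            out[r][c] = neigh[len(neigh) // 2]
    return out
```

A single spike surrounded by correct depths is outvoted by its neighbourhood and replaced by the local median, which is why this filter suits the salt-and-pepper errors produced by wrong codewords.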
Once the touchup is done, the user is advised to switch into the Inspect
Mode to tune the 3D model into a better pose, double-check the questioned
part, and see whether any other part of the object needs to be corrected.

The changes made by the median filter to the depth map are also updated
in the corresponding 3D point set of the current view. On exiting the
touchup process, the user has a final yes-or-no choice of whether to
accept the change permanently. If No is selected, the modified part is
restored from the backup data. Otherwise, the updated data replaces the
old version and participates in further processing.
Figure 6.13: The touchup result of figure 6.10.
6.4.4 Mode 3: Correspondence
Correspondence Mode follows the workflow introduced in sections 5.3 and
5.4. It is so named because it starts with finding the matching points
between the image pair, and the correspondences hold the key to the
initial guess of the transform between the two views. This initial guess
provides the user with a trial fuse, which can be further adjusted. A
minimum of two views is required to perform this task.
While all the back-end image processing tasks were discussed earlier in
chapter 5, here we are concerned with the interface and with how to
incorporate the back-end process into a collaborative environment. The
main principle sustained here is that, throughout the point set fusion
process, the user acts as the decision maker while the computer serves
merely as a work force and a source of guidance.
During the process of image registration and point set fusion, a set of
parameters is used for each single step. Although a set of trial
parameters is provided that works for most scenarios, different objects
have different properties (e.g. size, texture, surface reflection) and it
is difficult to find the best set of parameters for individual objects.
For example, when registering a pair of images of a periodic pattern such
as a checkerboard (figure 5.4), choosing too big a search window
confounds the NCC with mismatches. On the other hand, if the search
window is not big enough, the right correspondence might not be found in
a largely displaced image pair. Therefore we provide a randomised
mechanism to let the user find those optimal parameters without being
exposed to too many technical details. The underlying idea is to keep it
simple, and keep it visualised.
The process begins with listing the views that have been scanned. The
user is advised to choose two views as the 'from' image and 'to' image
for image registration, in order to transfer the 'from' point set towards
the 'to' point set (figure 6.14). If any two views have already been
registered previously, a red connection line underneath indicates so. The
colour texture maps of the two selected views participate in the
registration.
Instead of taking the whole of the two selected images, the system crops
the images with a ROI (figure 6.15) based on the expected position and
size, both estimated from the object size derived from the point set in
3D space and from the camera imaging geometry (these are all available
because the camera-projector pair is calibrated, and the centroid of the
object and its minimum and maximum extents along the X, Y, Z axes can all
be worked out from the 3D point set). Giving the user the option of
choosing the ROI has another purpose: non-rigid objects can be partially
deformed when positioned in different poses, and these deformed parts are
ideally excluded from participating in the correspondence matching.
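A sketch of deriving the expected ROI from the 3D point set: project the eight corners of the points' axis-aligned bounding box through a 3×4 pinhole camera matrix P (assumed known from calibration) and take the 2D bounding rectangle plus a margin. The matrix and margin here are illustrative:

```python
def project(P, X):
    """Project a 3D point through a 3x4 camera matrix P."""
    x, y, z = X
    u = P[0][0]*x + P[0][1]*y + P[0][2]*z + P[0][3]
    v = P[1][0]*x + P[1][1]*y + P[1][2]*z + P[1][3]
    w = P[2][0]*x + P[2][1]*y + P[2][2]*z + P[2][3]
    return (u / w, v / w)

def roi_from_points(P, points, margin=10):
    """2D ROI (xmin, ymin, xmax, ymax) enclosing the projected
    bounding-box corners of a 3D point set, expanded by a margin."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    zs = [p[2] for p in points]
    corners = [(x, y, z) for x in (min(xs), max(xs))
                         for y in (min(ys), max(ys))
                         for z in (min(zs), max(zs))]
    uv = [project(P, c) for c in corners]
    us = [p[0] for p in uv]
    vs = [p[1] for p in uv]
    return (min(us) - margin, min(vs) - margin,
            max(us) + margin, max(vs) + margin)
```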
Figure 6.14: Correspondence Mode: two images are selected as 'from' and
'to'.

Figure 6.15: Correspondence Mode: ROIs are selected.
After the image pair with ROIs is chosen, the images are enlarged and displayed
at the centre of the desktop to show better detail. Three image
processing tasks, corner detection, cross-correlation and outlier exclusion,
are performed. The implementation details are discussed earlier in
sections 5.3.1-5.3.3. While these tasks are performed (figures
6.16-6.17), all system parameters are hidden from the user, but the user
still has the privilege of adjusting the parameters and re-doing the current step
with a new parameter set. At each of the aforementioned
image processing steps, a set of default parameters pre-set with
empirical values is loaded, with the instant result reflected on the desktop.
All parameters also come with a floating range, from which they can be
randomly selected. If the user is satisfied with the result yielded by the
current parameter set, he/she can hit the Proceed button (the one with a
tick) and move on to the next step. Otherwise, the user can use the Adjust
button (the one with two gears) to select a new combination of parameters,
each randomly drawn from its allowed closed interval. This
process is repeated until a satisfactory result is shown on the desktop before
moving on to the next step.
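As a sketch, the Proceed/Adjust behaviour amounts to keeping a table of parameters, each with an empirical default and a closed interval, and redrawing every parameter uniformly when Adjust is pressed. The parameter names and ranges below are purely illustrative; the thesis does not publish the actual values.

```python
import random

# Hypothetical parameter table: name -> (empirical default, closed interval).
# The names and ranges are illustrative, not the system's actual values.
PARAMS = {
    "window_size":  (7,    (3, 15)),        # integer-valued parameter
    "eigen_thresh": (0.01, (0.001, 0.05)),
    "search_range": (20,   (5, 40)),
}

def default_params():
    """The pre-set empirical values loaded at each step."""
    return {name: default for name, (default, _) in PARAMS.items()}

def adjust_params(rng=random):
    """Redraw every parameter uniformly from its allowed closed interval,
    as the Adjust button does; integer parameters stay integers."""
    drawn = {}
    for name, (default, (low, high)) in PARAMS.items():
        if isinstance(default, int):
            drawn[name] = rng.randint(low, high)   # inclusive on both ends
        else:
            drawn[name] = rng.uniform(low, high)
    return drawn
```

Pressing Proceed simply keeps whichever draw is currently on the desktop.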
Reasonable outcomes are often achieved on the first attempt. The user is
advised to repeat the process a few times using different settings to compare
the results, or sometimes to work towards the possibility of even
better ones. However, all the parameters can only be randomised, not
directly controlled, to comply with our principle of keeping the interface
simple by leaving all the technical details hidden.
Figure 6.16: Correspondence Mode: extracted corners.
Figure 6.17: Correspondence Mode: correlated and improved point correspondences.
The established correspondences may still not be good enough. This
is expected when the two participating images are compared: there are
parts in the left image that appear perspectively deformed in the other
image, or sometimes do not exist at all because of the viewpoint change.
Other challenges include the measured object's lack of texture, surface
reflections caused by the bright projection light, and deformed parts
of objects such as stuffed animals. Further discussion of how to tackle
these problems is given later in chapter 7.
In figure 6.17, where the correspondences are shown, pressing the Proceed
button commits the current point correspondences as control points
for estimating the rotation and translation vectors. The estimation is a
quick process which takes less than a second; the second point set is then
transformed towards the other using the estimated rotation and translation
vectors. This is a trial registration of the two point sets, suggested by
the system as an initial guess.
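The estimation step can be sketched with the standard SVD-based least-squares rigid alignment (the Kabsch method). The thesis does not name the exact algorithm used, so this is a generic stand-in:

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~= R @ src + t,
    via the SVD-based (Kabsch) method. src, dst: (N, 3) corresponding
    control points. A generic stand-in, not necessarily the system's solver."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t
```

With three or more non-collinear correspondences this recovers the transform exactly when the match is noise-free, and in the least-squares sense otherwise.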
The user can accept this registration by pressing the Proceed button
again, or further adjust the positions manually. By switching between
the R and T buttons, each of which is attached to a set of six buttons
for rotating a point set about its centroid around the X, Y, Z axes or
translating along them, the engaged point set can be manipulated rotation-wise
and translation-wise respectively (figure 6.18).
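The rotation buttons can be sketched as axis rotations applied about the point set's own centroid, so the set turns in place rather than swinging around the origin. A minimal sketch; the axis sign conventions are assumptions:

```python
import numpy as np

def axis_rotation(axis, angle):
    """Rotation matrix about the X (0), Y (1) or Z (2) axis.
    Sign conventions here are assumed, not taken from the thesis."""
    c, s = np.cos(angle), np.sin(angle)
    if axis == 0:
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == 1:
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rotate_about_centroid(points, axis, angle):
    """Rotate an (N, 3) point set about its own centroid, as the six
    R buttons do; the centroid itself stays fixed."""
    centroid = points.mean(axis=0)
    return (points - centroid) @ axis_rotation(axis, angle).T + centroid
```

Translation is simply `points + offset`, which is why the T buttons need no centroid handling.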
During the course of tuning, the first point set (on the left) is used as
a reference while the second one is transformed towards it. A
final solution is considered reached (figure 6.19) once the overlapping
areas of the two point sets coincide.
Figure 6.18: Correspondence Mode: visualised point-set tuning, with controllable rotation and translation.
Figure 6.19: Correspondence Mode: two point sets are fused.
6.4.5 Mode 4: Visualisation
Although the captured data can be visualised by different means in
any of the three modes introduced earlier, Visualisation Mode offers the
facility to visualise, through 360 degrees, the complete 3D model built
through the previous work. In this mode, the controls are not as sophisticated
as in the other modes: all the scanned views are listed in the bottom-centre
control panel area, represented by resized mini versions of their colour textures
(figure 6.20). The rendered object is displayed at the centre of the display
area, slowly rotating about its centroid as if it were placed on a turntable.
Note that in Correspondence Mode, two point sets are only registered
(i.e. the rotation and translation vectors between them are worked out),
but no point set data is changed. In this mode, all selected point sets
are merged together (i.e. one point set is transformed towards the other
so that they are in the same coordinate space and share the same centroid).
Figure 6.20: The Visualisation Mode.
Apart from viewing, the only other operation the user can perform in
the Visualisation Mode is turning different views on or off, by pressing the
corresponding buttons, to inspect the 3D model of the measured object. All
the views that are turned on are first fused, using the estimated transforms
between them previously worked out in the Correspondence Mode.
More than one view can be turned on at the same time, or even
all of the views (if all the necessary transform information is available);
this is possible only in this mode. If no view is selected, nothing is
displayed.
However, not all of the views can be selected at random and then fused
together. A few ground rules apply when choosing
views to be fused:
• If two views are to be selected, they must either be registered in
the Correspondence Mode (i.e. the transform vectors between them are
available), or both be registered with the same third view.
• Registration relay is also allowed (e.g. if views 1 and 2, 2 and 3, 3 and
4 are all registered, then views 1 and 4 are registered too).
• All inter-registered views are categorised into the same group, and
only views from the same group can be visualised at the same
time.
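These rules are exactly transitive connectivity over the registration graph, which can be sketched with a disjoint-set (union-find) structure. A minimal illustration, not the system's actual bookkeeping:

```python
class ViewGroups:
    """Track which scanned views can be fused: each registration merges
    two groups, and only same-group views may be visualised together.
    An illustrative union-find sketch, not the thesis's implementation."""

    def __init__(self):
        self.parent = {}

    def _find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def register(self, a, b):
        """Record that views a and b were registered in Correspondence Mode."""
        self.parent[self._find(a)] = self._find(b)

    def can_fuse(self, a, b):
        """True when a path of registrations connects the two views."""
        return self._find(a) == self._find(b)
```

Replaying the stages of table 6.1 against this structure reproduces the grouping behaviour described below.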
The reason behind the above rules is that any two registered views can
be regarded as having a path between them: the rotation and translation
vectors. Suppose the rotation vector from view A to view B is R_AB =
(θ, φ, ψ) and its translation vector is T_AB = (T_x, T_y, T_z); then the rotation
and translation vectors from view B to view A are R_BA = (-θ, -φ, -ψ) and
T_BA = (-T_x, -T_y, -T_z). This relationship propagates across multiple views:
as long as a view is not stand-alone (i.e. registered to none of the others),
there is always a path of transforms by which it can be brought to any
other view's orientation and position.
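The path propagation is easiest to make concrete with homogeneous 4x4 transforms, where reversing an edge is a matrix inverse and following a path is matrix composition. This is a sketch of the idea rather than the thesis's implementation, which stores rotation and translation vectors directly:

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a translation vector into a 4x4 transform."""
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, t
    return M

def chain(transforms):
    """Compose a path of view-to-view transforms: [T_12, T_23, T_34]
    yields the transform from view 1's frame to view 4's frame.
    An edge traversed backwards is np.linalg.inv of its transform."""
    M = np.eye(4)
    for T in transforms:
        M = T @ M                  # apply earlier edges first
    return M
```

Chaining the inverses in reverse order returns a point to its starting frame, which is the matrix form of the R_BA / T_BA relationship above.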
Table 6.1 gives an example of the propagation of this relationship.
Consider again the scenario used earlier in this chapter, in which
five different views of the owl are scanned while the sixth view is not yet
captured. It starts from stage 0, where none of the five views is related
to any other and no groups exist. At stage 1, view 1 and view 2 are
registered, so they are labelled as group 1; a red connection line is drawn
between them to indicate this relationship. At stage 2, view 3 and view
4 are registered as a new group, group 2, indicated by a green
connection line underneath. Up to this point, there are two separate
groups among the five scanned views, each indicated by a different colour
to advise the user that a view from the red group and a view from the
green group (or the stand-alone view 5) cannot be displayed together,
because there is no way to fuse them. After stage 3, a new registration
is completed between views 1 and 5, and the same grouping process
is carried out. The situation changes completely after stage 4, in which
view 2 and view 3 are registered. This registration brings the two groups
into one. In other words, a registration between any two views drawn
from two different groups, one from each, results in the same merged grouping.
Stage   View (from)   View (to)   Number of groups   Relationship lines
0       n/a           n/a         0                  (diagram)
1       1             2           1                  (diagram)
2       3             4           2                  (diagram)
3       1             5           2                  (diagram)
4       2             3           1                  (diagram)

Table 6.1: Grouping status of the point sets at different stages.
Figures 6.21, 6.22, and 6.23 show the process of a model of the owl being
built from three central views. By fusing views 2 and 3 together and
visualising the fused model, it can be seen from figure 6.21 that the
right-facing object in view 2 completes the left wing, which is partially
invisible in view 3, where the object faces straight up. However, as the same
model is rotated around its centroid until its right part is exposed, it is
clear that the right wing of the current model is missing data.
We notice that the object in view 4 faces left with its right part visible,
while still sharing a fair amount of overlapping area with view 3.
By fusing view 4 into the model previously built from views 2
and 3, another part of the object is filled in, as shown in figure 6.23.
Figure 6.21: View 2 and 3 fused together. View 2 completes the left wing of
the owl.
Figure 6.22: View 2 and 3 fused together.
Figure 6.23: Fusion of view 2, 3, and 4.
6.5 Conclusions
In this chapter we present a working, user-friendly interface for the VAE
system designed in this research. This interactive interface is a mixed
environment of real objects and projected signals, where the user's
interactions with these objects and projections are captured, interpreted,
and responded to through adjusted projections. Techniques introduced in
chapters 4 and 5 are both integral parts of the designed system, while
efficient monitoring of the interactive surface and accurate response to it
rely on the explicit calibration presented in chapter 3.
Two widgets are introduced and then implemented to simulate two of
the most frequently used gestures in human-computer interaction:
the button push for triggering events and the touchpad slide for
positioning.
Four major facilities are provided to accomplish the task of 3D input,
with which the user is allowed to inspect the captured data from different
view angles, point out and correct errors, manipulate the projection
signals, and finally build and visualise the complete 3D model. Other
tools, such as a desktop lock-down and a snapshot tool, are also provided
for practical use during the process.
6.5.1 Future Work
In an interactive user interface, an easy-to-use and efficient interactive tool
is always desired. Future implementation of fingertip detection
could benefit the system: provided robust finger detection is implemented
across the whole projection area, touch-up becomes much easier,
as the user can point a finger directly at the questionable area.
Drag and drop of the virtual elements on the desktop is another
possible extension of the finger detection. Previous work at York [74]
yields promising results and lays the foundation for future work in this
area.
As a final inspection of the built 3D model, the visualisation mode (section
6.4.5) could be further elaborated. A possible implementation of touch-up
in 3D space would be a big plus, as this is the stage where errors are likely to
be rediscovered. Efficient, quick corrections need to be made on the
rendered model straightaway, in a visualised way, rather than by
repeatedly going back to the 2D models.
Chapter 7
System Evaluation
Most of the techniques used in this research have already been evaluated
and justified at appropriate stages earlier in the thesis. In this chapter, we
present informal user tests to evaluate system performance. In particular,
performance with different test objects is evaluated, to
suggest how the best results can be achieved in the presence of technical
challenges and practical issues.
7.1 Test Objects
7.1.1 An Overview
An overview of the objects used for the experiments is given in table 7.1.
Each object is represented by a thumbnail, an object name, and a brief
description.
7.1.2 Object Descriptions
The objects chosen for the user tests cover a variety of
sizes, colours, and surface materials. For example, the
owl appeared in previous chapters as the example object because it presents
various challenges to the techniques presented in the early part of this
thesis. It has both convex and concave regions across its surface, which
easily cause shadows when it is illuminated from certain angles. The owl
itself does not lack texture, but its fluffy surface complicates the texture
mapping, because the same texture can appear totally different due to the
inter-reflections caused by the uneven surface. Furthermore, the back of
the owl completely lacks texture.
Other test objects present different technical challenges. The football
is an example of high specular reflectance. Although the system is not
designed for human body measurement, because of its top-down projector-
camera setup, we still ran a test to evaluate how well the system performs
Thumbnail   Object       Description
(image)     Cushion      A small soft cushion with bright colour texture. A small turtle is attached to the right side, but the tropical fish is just a 2D pattern.
(image)     Football     A small spherical object, slightly deflated so that it stands on the table by itself. Its surface has high specular reflection.
(image)     Stand        A mid-sized object made of cardboard and wrapped with brown packing paper, hardly reflecting any light.
(image)     Owl          A fairly big stuffed animal. It has a soft, fluffy surface, and part of its body deforms when its pose is changed.
(image)     Human Body   A user lying on the desktop. Rigidity is not guaranteed, as the relative position between the head and the upper body can change from one pose to another.

Table 7.1: An overview of the objects used for the tests.
on such an object and to see where it could be improved. During the human
body test, the table top is lowered. This is not a computer-vision-driven
move; it is purely to comply with health and safety regulations.
In the rest of this chapter, test frameworks are designed to test the main
individual techniques proposed and to evaluate their performance with
various types of objects. The system is then evaluated as a whole.
7.2 Shape Acquisition
In this section, the performance of shape acquisition using structured
light on different objects is evaluated. Most of the techniques involved in
the structured light scan are either discussed or experimentally tested in
chapter 4, but it is still unclear how these separate techniques work as
a whole. This section aims to address that issue.
Object       No. of views   Initial error (per view)   Error after touchup (per view)   Initial diagnosis
Cushion      2              5 (2.5)                    0 (0)                            black part of the object surface
Football     5              16 (3.2)                   0 (0)                            common field of view problem (regions that can only be seen from the camera)
Stand        5              36 (7.2)                   5 (1)                            surface reflection
Owl          5              6 (1.2)                    0 (0)                            concaveness on the surface fails to be illuminated by the projector because of occlusion
Human Body   3              4 (1.3)                    0 (0)                            distance from the object to the projector-camera pair

Table 7.2: Evaluation: depth capture errors and their corrections.
Table 7.2 lists the performance of the shape acquisition process using
objects of different size, shape, and surface. It also shows the amount of
effort required to touch up the most obvious errors until all captured depth
information is reasonably accurate upon visual inspection. The numbers
shown in the table are the numbers of parts (e.g. spikes, jumps, holes, etc.)
that are believed to be errors (the numbers in brackets are the average
number of errors per view). The third column in the table is the initial
error in the captured depth maps, and the fourth column is the number of
unerasable errors remaining after the user touch-up. Initial diagnoses of
the possible causes of the errors are listed in the last column, to be further
examined.
7.2.1 The Owl Experiment
Generally speaking, the best depth capture result comes from the Owl
experiment. Despite the owl being the second biggest of the five objects
tested, it has a more continuous surface, and the camera and the projector
share a close common viewing area of the surface (i.e. where the projector
can reach is where the camera can see, and vice versa). The only obvious
inaccurate measurement is at the concave part at the bottom of the owl's
feet. The erroneous part, seen as a bright dot in figure 7.1(a), is tiny and
can easily be erased by a single touch-up.
7.2.2 The Football and Stand Experiment
In this section, two objects are tested together for comparison. There are
a few dissimilarities between the Football and the Stand. The capture
results of three views are listed for each of the two experiments, in
figures 7.4 and 7.3.
(a) Depth map (b) Rendered model
(c) Depth map (d) Rendered model
Figure 7.1: Shape acquisition test: Owl. Top two: before touchup; bottom
two: after touchup.
• The difference in specular reflectance. The football is a rigid spherical
object with a high-gloss surface; for testing purposes, it is slightly
deflated so that it can be placed firmly on the desktop without a stand.
The brown stand is an object made of cardboard, wrapped in reflective
brown packing paper. Reflection in the Football experiment is more
severe than in the Stand; however, due to the spherical surface of the
football, the strong reflection is focused onto a single point. It is noticed
in figure 7.4 that the error caused by the high-gloss surface is already
filtered out by applying a smoothing filter to the scanned data, so the
user touchup can be spared.

Figure 7.2: The projector-camera pair setup. The shaded part is the 'dead'
area that cannot be illuminated by the projector but is in the viewing range
of the camera.
• The difference in the projection light required. As mentioned above,
the high-gloss surface of the football causes an overall increase in
pixel values across the image, so the projection brightness must be
adjusted to stop the white balance of the captured image being so
high that texture details are lost. The opposite needs to be done
for the Stand experiment. In the implementation, a projection
brightness of 100 is used for the Football, and 200 for the Stand.
• Both cases suffer from shadows and occlusions, but in slightly different
ways. In the Football experiment, it can be seen from figure 7.4
that the only obvious depth inaccuracy occurs near the lower rim
of the sphere. This is because a small part of the desktop always
stays outside the illumination but within the viewing range of the
camera (see figure 7.2). The Stand is a different case, where the object
is wrapped in material that hardly reflects any light. Figure 7.3
illustrates that all the planes nearly parallel to the projection rays
are severely affected, because not enough projection light gets
reflected from the surface to the camera plane. There are also two small
areas of depth inaccuracy caused by shadows, but they can easily be
corrected using the touchup tool provided.
(a) Depth map (b) Colour map
(c) Depth map (d) Colour map
(e) Depth map (f) Colour map
Figure 7.3: Shape acquisition test: Stand. Left column: depth maps; right
column: the corresponding textures.
(a) Depth map (b) Colour map
(c) Depth map (d) Colour map
(e) Depth map (f) Colour map
Figure 7.4: Shape acquisition test: Football. Left column: depth maps; right
column: the corresponding textures.
It is noticed in table 7.2 that the Stand is the only test with unerasable
capture errors. This is because the erroneous parts are too big for the
median filter to handle.
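The behaviour described, small spikes being removed automatically while larger error regions survive, is characteristic of a median filter. A minimal numpy sketch of a k x k median filter over a depth map (illustrative; the system's actual filter parameters are not published):

```python
import numpy as np

def median_filter(depth, k=3):
    """Median-filter a 2D depth map with a k x k window (k odd).
    Isolated spikes smaller than the window are removed, while error
    regions larger than the window survive and need manual touchup.
    Illustrative sketch, not the thesis's actual filter."""
    r = k // 2
    padded = np.pad(depth, r, mode="edge")
    # Stack the k*k shifted copies of the map, then take the per-pixel median.
    stacked = np.stack([padded[dy:dy + depth.shape[0], dx:dx + depth.shape[1]]
                        for dy in range(k) for dx in range(k)])
    return np.median(stacked, axis=0)
```

A single-pixel spike needs five or more corrupted samples in a 3 x 3 window to survive, which is why only large contiguous error regions, like those on the Stand, remain.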
7.2.3 The Cushion and Human Body Experiment
We compare the results of the Cushion and Human Body together because
of the similarities between them. In both experiments, fewer views are
used. For the cushion, front and back are the only two views captured, as
it is hard to place the cushion in other orientations. When the human body
is measured, we lower the table first (for the reason stated in section 7.1.2)
and the tester then lies on the table. Three views are captured: one facing
left, one facing right, and a third facing up.
In these two tests, the object surfaces are continuous and convex, hence
the problem we had in figures 7.4 and 7.3 does not occur here. However,
'holes' in the depth image are found at the eyes and tail of the fish, and at
part of the human's hair, which are all black areas. After studying the
captured Gray-coded stripe images, it is found that all those areas appear
black (precisely, with 0 pixel values) in the observed image whether they
are illuminated by the white or the black projection. As a result, they stay
0 in the subtraction image of the positive and negative images, and are
labelled as background pixels.
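The labelling just described can be sketched as follows: for each Gray-code pattern the stripe bit is the sign of the difference between the positive and inverted captures, and pixels whose difference magnitude stays below a contrast threshold (e.g. black surfaces reading 0 under both projections) are labelled background. The threshold value below is an assumption:

```python
import numpy as np

def decode_stripe(pos, neg, min_contrast=5):
    """Decode one Gray-code stripe from a positive and an inverted capture.

    Returns (bit, valid): bit is 1 where the pixel was lit by the positive
    pattern; valid is False where |pos - neg| falls below min_contrast,
    e.g. black surface regions that read 0 under both projections.
    min_contrast is an assumed value, not taken from the thesis."""
    pos = pos.astype(np.int32)   # avoid uint8 wrap-around in the subtraction
    neg = neg.astype(np.int32)
    diff = pos - neg
    bit = (diff > 0).astype(np.uint8)
    valid = np.abs(diff) >= min_contrast   # background / low-albedo mask
    return bit, valid
```

Pixels flagged invalid for any stripe are the 'holes' seen at the fish's eyes and tail and in the dark hair.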
(a) Depth map (front view, before touchup) (b) Colour map (front view)
(c) Depth map (back view) (d) Colour map (back view)
Figure 7.5: Shape acquisition test: Cushion. Left column: depth maps; right
column: the corresponding textures.
(a) Depth map (b) Colour map
(c) Depth map (d) Colour map
Figure 7.6: Shape acquisition test: Human Body. Left column: depth maps;
right column: the corresponding textures.
7.3 Correspondence Finding
The test framework for evaluating correspondence finding is set up as follows.
For each object test, we pick two adjacent views and run the correspondence
program on the image pair. Depth and point set data are touched up
if there are any obvious errors before we start finding the correspondences.
As introduced earlier, when doing the corner detection the user is provided
with a facility to randomise a parameter set and run the program, with
the instant results projected onto the desktop for inspection. The exact
values of the parameters, such as the search range, the eigenvalue threshold,
or the window size for local aggregation, are all hidden from the user.
When repeating the process by randomising the parameter set, the
parameter set which yields the most corners is not necessarily chosen
as the optimal one; the user is advised to use his own judgement by
looking at the result reflected on the desktop. This is similar to debugging
a C program on a local PC, the only difference being that in this application
the user does not have to know anything about the technical details, which
are hidden. Therefore, we apply the same rule of 'how to choose the
optimal parameter set' in the test framework, to simulate the user's behaviour.
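The corner detector can be sketched as a minimum-eigenvalue (Shi-Tomasi style) score, with exactly the parameters the text mentions hidden behind the interface: a window size for local aggregation and an eigenvalue threshold. This is a generic stand-in in plain numpy, not the system's actual detector:

```python
import numpy as np

def min_eig_corners(img, win=3, eig_thresh=0.01):
    """Shi-Tomasi style corner score: the smaller eigenvalue of the local
    structure tensor, aggregated over a win x win window. Returns the
    (row, col) positions whose score exceeds eig_thresh. A generic
    illustrative stand-in for the detector used by the system."""
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)            # image gradients

    def box(a):
        """Sum a over a win x win window around each pixel (edge-padded)."""
        r = win // 2
        p = np.pad(a, r, mode="edge")
        return sum(p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
                   for dy in range(win) for dx in range(win))

    # Structure tensor entries, aggregated with the box window.
    Ixx, Iyy, Ixy = box(gx * gx), box(gy * gy), box(gx * gy)
    # Smaller eigenvalue of [[Ixx, Ixy], [Ixy, Iyy]] in closed form.
    tr, det = Ixx + Iyy, Ixx * Iyy - Ixy * Ixy
    lam_min = tr / 2 - np.sqrt(np.maximum(tr * tr / 4 - det, 0.0))
    return np.argwhere(lam_min > eig_thresh)
```

Randomising `win` and `eig_thresh` within their allowed intervals, and letting the user judge the projected result, reproduces the workflow described above.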
Object       Corners (left image)   Corners (right image)   No. of correspondences   User adjustment required?   Total time spent (minutes)
Cushion      n/a                    n/a                     n/a                      Y                           1
Football     42                     51                      18                       Y                           1
Stand        87                     105                     29                       N                           2.5
Owl          206                    197                     30                       Y                           4
Human Body   102                    113                     39                       N                           2

Table 7.3: Evaluation: building correspondences.

Table 7.3 shows the test results.
It is noticed that the first test, Cushion, has no results for the number
of corners detected or the number of correspondences built. This is because
only two views were captured for the cushion: the top view and
the bottom view. Although these two captured views complete the
object model, they share no overlapping part, so it is meaningless
to run the correspondence search between the two images. In this test, we
skip the corner extraction and correlation steps and go straight into the
tuning. The tuning task is straightforward too, as all the user has to do is
turn the second view over (rotate it by 180°).
For the rest of the objects, more time is generally spent on bigger objects.
The correspondence search in the Stand and Human Body tests works very
well, so the initial trial rotation and translation vectors given by the
computer are accepted without further user adjustment. The Owl experiment
takes longer: many corner points are detected, but only a small portion of
them are found to match. As figure 7.7 shows, the Owl builds correspondences
from only about half the percentage of detected corners achieved by the
other objects.
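The quantity compared in figure 7.7 is just the ratio of matched correspondences to detected corners. A toy calculation over Table 7.3’s figures; averaging the two views’ corner counts is an assumption, as the thesis does not state the figure’s exact normalisation:

```cpp
#include <cassert>

// Matched correspondences as a fraction of detected corners, averaged
// over the left and right views (assumed normalisation).
double correspondenceYield(int cornersLeft, int cornersRight, int matches)
{
    return 2.0 * matches / (cornersLeft + cornersRight);
}
```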
Figure 7.7: Number of extracted corner points and matched correspondences (panels (a) and (b)).
7.4 Conclusions

In this chapter, five objects are used as test objects to evaluate the
system’s performance within a controlled test framework. Although many more
objects have been tested in this research, the five listed here are the most
representative ones, illustrating the impact of different object properties
on the results: surface reflectance, texture, convexity and concavity,
rigidity, and the level of depth continuity across the surface.

Two key components of the system, shape acquisition via structured light
scanning and point set registration from point correspondences, are tested.
Statistics and experimental results provide a diagnosis of, and possible
solutions to, the problems caused by the aforementioned challenges, and
provide the foundation on which future research can be built.
Chapter 8

Conclusions

8.1 Summary

All of the chapters presented in this thesis contain their own introductions
and conclusions. Apart from the Introduction, the Background and this
Conclusions chapter itself, the rest of the thesis is summarised as follows:
• Chapter 3 Calibration
Methods for complete calibration of the VAE system are presented. This
includes a full calibration of the projector-camera system for its intrinsic
and extrinsic parameters, and the calibration of a plane-to-plane
homography, induced by a third plane, between the rendered projector plane
and the captured image plane.
• Chapter 4 Shape Acquisition
A Gray-coded structured light scan is implemented for acquiring depth
information. It is then extended and adapted to tackle the practical issues
raised, before being incorporated into the whole VAE framework.
• Chapter 5 Registration of Point Sets
A framework for 3D point set registration is presented in this chapter. A
conventional image registration technique is used to find corresponding
points between a pair of 2D images, and the established correspondences are
propagated from 2D to register the point sets in 3D space. This framework is
shown to work, with the user’s assistance in a VAE system, not only on
planar surfaces but also on arbitrary objects, even though no ground truth
information is known a priori.
• Chapter 6 System Design
This is the core of this research. A new system design is presented in this
chapter for 3D input through collaborative work between the user and the PC
in a VAE. The proposed system is cheap to maintain thanks to off-the-shelf
hardware, and easy to deploy, requiring only minimal configuration of the
projector-camera pair. The system is not intended to be restricted to the
research laboratory environment.
• Chapter 7 User Experiments
Major components of the system are evaluated in this chapter, within
controlled test frameworks.
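The Gray coding underlying the chapter 4 scan has a compact standard form: successive code words differ in exactly one bit, so a mis-read stripe boundary costs at most one quantisation level. A minimal sketch of the standard encode/decode pair (not the thesis code):

```cpp
#include <cassert>
#include <cstdint>

// Binary-to-Gray: adjacent integers map to code words differing in one bit.
std::uint32_t binaryToGray(std::uint32_t n) { return n ^ (n >> 1); }

// Gray-to-binary: XOR-fold the higher bits back down to recover the index
// of the stripe a camera pixel saw across the projected pattern sequence.
std::uint32_t grayToBinary(std::uint32_t g)
{
    std::uint32_t n = 0;
    for (; g != 0; g >>= 1)
        n ^= g;
    return n;
}
```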
8.2 Discussions

System-wise, one of the most important design goals is to allow users to
bring in the objects they want to input, walk up to the VAE and start the
task without worrying about the technical details of computer vision or how
to write code to do it. We aim to create an environment where the computer
and its attached vision equipment work as an assistant to the user, while
the user always makes the final call on key decisions based on the feedback
from this interactive collaboration. Higher-cost equipment such as HMDs and
touch screens, and other customised tools such as markers and gloves, are
all avoided, as the system presented here is designed not only for
laboratory use but also for home and office environments and other open
public spaces such as schools and museums.
8.3 Future Work

Techniques employed in this research are evaluated in separate chapters.
Although the framework contains techniques that are already widely used in
the field, it brings them together in a new, practical and efficient way.
But as mentioned before, many of the system elements would benefit from
further improvement and optimisation.
There are planned improvements for the techniques used in the system. In
calibration, manual adjustment of the photometric settings of the camera and
the projector is not only inconvenient but also inefficient. Further
development of the calibration framework would include automatic photometric
calibration.

Once automated photometric calibration is feasible, it might be sensible to
exploit colour-based structured light techniques, which allow real-time
scanning of depth information. There are also other planned improvements for
the shape acquisition framework, as described in section 4.6.1.
Giving the user more power and initiative in the point set registration
stage would be another big step forward because, if appropriately designed
and implemented, it would steer the registration process more quickly and
efficiently towards the optimal results, while the user’s leading role is
still maintained.
As mentioned in section 6.5.1, robust fingertip detection and touchup in 3D
space are regarded as two major improvements for future work. Successful
fingertip detection would not only simplify the user interface by reducing
the number of interactive buttons required, but also offer a new dimension
of user interaction, as locating a point would be much easier whether on a
physical object or on a virtual element. 3D touchup could effectively follow
from the deployment of fingertip detection, and it would be a big boost if
the user were allowed to manipulate the rendered object model with his bare
hands as if he were touching the real object.
The user tests carried out in chapter 7 are still mainly at a descriptive
stage. The next task in measuring system performance would be to recruit
testers from a variety of backgrounds – from computer vision academics to
people with little experience of the field – to characterise the system both
behaviourally and experimentally.
Bibliography

[1] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2(3):283–310, 1989.

[2] A. Argyros and M.I.A. Lourakis. Vision-based interpretation of hand gestures for remote control of a computer mouse. In European Conference on Computer Vision, Workshop on Human Computer Interactions, pages 40–51, 2006.

[3] K.S. Arun, T.S. Huang, and S.D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(5):698–700, 1987.

[4] K.E. Atkinson. An Introduction to Numerical Analysis. John Wiley and Sons, 2nd edition, 1989.

[5] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. Int. J. Comput. Vision, 12(1):43–77, 1994.

[6] J. Batlle, E. Mouaddib, and J. Salvi. Recent progress in coded structured light as a technique to solve the correspondence problem: A survey. Pattern Recognition, 31(7):963–982, July 1998.
[7] S.S. Beauchemin and J.L. Barron. The computation of optical flow. ACM Comput. Surv., 27(3):433–466, 1995.

[8] J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. Computing two motions from three frames. ICCV, 90:27–32, 1990.

[9] D. Bergmann. New approach for automatic surface reconstruction with coded light. Remote Sensing and Reconstruction for Three-Dimensional Objects and Scenes, 2572(1):2–9, 1995.

[10] M.J. Black and P. Anandan. A framework for the robust estimation of optical flow. In ICCV93, pages 231–236, 1993.

[11] M.J. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications to early vision. 1996.

[12] A. Blake and R. Cipolla. Robust estimation of surface curvature from deformation of apparent contours. In Proceedings of the First European Conference on Computer Vision, pages 465–474, London, UK, 1990. Springer-Verlag.

[13] S. Borkowski, J. Letessier, and J.L. Crowley. Spatial control of interactive surfaces in an augmented environment. In EHCI/DS-VIS, pages 228–244, 2004.

[14] J.Y. Bouguet. Camera calibration toolbox for Matlab, 2006. (Last retrieved 30 November 2006).

[15] J.Y. Bouguet and P. Perona. 3D photography on your desk. In ICCV ’98, pages 43–50, 1998.

[16] K.L. Boyer and A.C. Kak. Color-encoded structured light for rapid active ranging. IEEE Trans. Pattern Anal. Mach. Intell., 9(1):14–28, 1987.
[17] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In ICCV (1), pages 377–384, 1999.

[18] D.C. Brown. Decentering distortion of lenses. Photometric Engineering, 32(3):444–462, 1966.

[19] V. Buchmann, S. Violich, M. Billinghurst, and A. Cockburn. FingARtips: gesture based direct manipulation in augmented reality. In GRAPHITE ’04: Proceedings of the 2nd international conference on Computer graphics and interactive techniques in Australasia and South East Asia, pages 212–221, New York, NY, USA, 2004. ACM.

[20] J.F. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6):679–698, 1986.

[21] B. Carrihill and R.A. Hummel. Experiments with the intensity ratio data sensor. 32(3):337–358, December 1985.

[22] D. Caspi, N. Kiryati, and J. Shamir. Range imaging with adaptive color structured light. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):470–480, 1998.

[23] G. Chazan and N. Kiryati. Pyramidal intensity ratio depth sensor. Technical Report, Center for Communication and Information Technologies, Dept. of Electrical Eng., Haifa, Israel, Oct 1995.

[24] C.S. Chen, Y.P. Hung, C.C. Chiang, and J.L. Wu. Range data acquisition using color structured lighting and stereo vision. 15(6):445–456, June 1997.

[25] C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In CVPR, pages 2246–2252, 1999.
[26] R. Cipolla, T. Drummond, and D. Robertson. Camera calibration from vanishing points in images of architectural scenes. BMVC, 1999.

[27] E. Costanza and J.A. Robinson. A region adjacency tree approach to the detection and design of fiducials. In Proc. Vision, Video and Graphics, Bath, UK, July 2003.

[28] E. Costanza, S.B. Shelley, and J.A. Robinson. d-touch: a consumer-grade tangible interface module and musical applications. In Proceedings of Designing for Society HCI2003, Bath, UK, September 2003.

[29] E. Costanza, S.B. Shelley, and J.A. Robinson. Introducing audio d-touch: A tangible user interface for music composition and performance. Digital Audio Effects (DAFx) 2003, September 2003.

[30] J. Coutaz, S. Borkowski, and N. Barralon. Coupling interaction resources: an analytical model. In sOc-EUSAI ’05: Proceedings of the 2005 joint conference on Smart objects and ambient intelligence, pages 183–188, New York, NY, USA, 2005. ACM.

[31] A. Criminisi, I.D. Reid, and A. Zisserman. Single view metrology. IJCV, 40(2):123–148, November 2000.

[32] J. Crowley, F. Berard, and J. Coutaz. Finger tracking as an input device for augmented reality, 1995.

[33] C.J. Davies and M.S. Nixon. A Hough transform for detecting the location and orientation of 3-dimensional surfaces via color encoded spots. SMC-B, 28(1):90–95, February 1998.

[34] J. Davis and M. Shah. Visual gesture recognition, 1994.
[35] R. Deriche and O.D. Faugeras. Tracking line segments. In ECCV 90: Proceedings of the First European Conference on Computer Vision, pages 259–268, New York, NY, USA, 1990. Springer-Verlag New York, Inc.

[36] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2):103–130, November 1997.

[37] O. Faugeras. Three-Dimensional Computer Vision. MIT Press, 1993.

[38] M.A. Fischler and R.C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.

[39] A.W. Fitzgibbon and A. Zisserman. Automatic 3D model acquisition and generation of new images from video sequences. In European Signal Processing Conference (EUSIPCO98), pages 1261–1269, Rhodes, Greece, 1998.

[40] D.J. Fleet and A.D. Jepson. Computation of component image velocity from local phase information. Int. J. Comput. Vision, 5(1):77–104, 1990.

[41] D.M. Frohlich, T. Clancy, J.A. Robinson, and E. Costanza. The audiophoto desk. 2ad, The Second International Conference on Appliance Design, May 2004.

[42] W.C. Graustein. Homogeneous Cartesian Coordinates. Linear Dependence of Points and Lines. New York: Macmillan, pp. 29–49, 1930.

[43] W.E.L. Grimson. Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7:17–34, 1985.

[44] D. Gruber. The mathematics of the 3D rotation matrix, 2000. (Last retrieved March 2007).
[45] J. Guehr<strong>in</strong>g. Dense 3d surface acquisition by structured light us<strong>in</strong>g <strong>of</strong>f-the-<br />
shelf components. In SPIE, <strong>Video</strong>metrics and Optical Methods for 3D Shape<br />
Measurement, volume 4309 <strong>of</strong> Presented at the Society <strong>of</strong> Photo-Optical Instru-<br />
mentation Eng<strong>in</strong>eers (SPIE) Conference, pages 220–231, December 2000.<br />
[46] C. Harris and M. Stephens. A comb<strong>in</strong>ed corner and edge detection. pages<br />
147–151, 1988.<br />
[47] R.I. Hartley and A. Zisserman. Multiple View Geometry <strong>in</strong> <strong>Computer</strong> Vision.<br />
Cambridge University Press, ISBN: 0521540518, second edition, 2004.<br />
[48] J. Heikkila and O. Silven. A four-step camera calibration procedure with<br />
implicit image correction. In IEEE <strong>Computer</strong> Vision and Pattern Recognition,<br />
pages 1106–1112, 1997.<br />
[49] B.K.P. Horn and B.G. Schunck. Determ<strong>in</strong><strong>in</strong>g optical flow. Artificial Intelli-<br />
gence, 17:185–203, 1981.<br />
[50] J. Hyde and D. Parnham. the openillusionist project, 2008. (Last retrieved<br />
May 2008).<br />
[51] Intel. Open source computer vision library. (Last retrieved 30 Nov 2006).<br />
[52] J.A. JRob<strong>in</strong>son and C. Robertson. The livepaper system: augment<strong>in</strong>g paper<br />
on an enhanced tabletop. <strong>Computer</strong>s & Graphics, 25(5):731–743, 2001.<br />
[53] P. KaewTraKulPong and R. Bowden. An improved adaptive background<br />
mixture model for real-time track<strong>in</strong>g with shadow detection, 2001.<br />
[54] D. Kalman. A s<strong>in</strong>gularly valuable decomposition: The svd <strong>of</strong> a matrix. The<br />
College Mathematics Journal, 27(1):2–23, 1996.<br />
[55] T. Kanade. Development <strong>of</strong> a <strong>Video</strong>-Rate Stereo Mach<strong>in</strong>e. In 1994 ARPA<br />
Image Understand<strong>in</strong>g Workshop, November 1994.<br />
[56] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In IWAR ’99: Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, page 85, Washington, DC, USA, 1999. IEEE Computer Society.

[57] R. Klette, K. Schluns, and A. Koschan. Computer Vision: Three-Dimensional Data from Images. Springer-Verlag Singapore Pte. Limited, 1998.

[58] H. Koike, Y. Sato, Y. Kobayashi, H. Tobita, and M. Kobayashi. Interactive textbook and interactive Venn diagram: natural and intuitive interfaces on augmented desk system. In CHI ’00: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 121–128, New York, NY, USA, 2000. ACM.

[59] M.W. Krueger. Artificial Reality. Addison-Wesley, Reading, MA, 1983.

[60] M.W. Krueger. Environmental technology: making the real world virtual. Commun. ACM, 36(7):36–37, 1993.

[61] D.T. Lawton and W.F. Gardner. Translational decomposition of flow fields. pages 697–705, 1993.

[62] D.C. Lay. Linear Algebra and its Applications. Addison Wesley Longman Inc., 1997.

[63] J. Letessier and F. Bérard. Visual tracking of bare fingers for interactive surfaces. In UIST ’04: Proceedings of the 17th annual ACM symposium on User interface software and technology, pages 119–122, New York, NY, USA, 2004. ACM.

[64] L. Li and J.A. Robinson. A semi-automatic human-computer collaborative system for 3D shapes inputting. IET Visual Information Engineering, July 2007.
[65] B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 674–679, 1981.

[66] F. Lv, T. Zhao, and R. Nevatia. Camera calibration from video of a walking human. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1513–1518, 2006.

[67] P. Maes. Artificial life meets entertainment: lifelike autonomous agents. Commun. ACM, 38(11):108–114, 1995.

[68] P. Maes, T. Darrell, B. Blumberg, and A. Pentland. The ALIVE system: Wireless, full-body interaction with autonomous agents. 1996.

[69] S. Malik and J. Laszlo. Visual touchpad: a two-handed gestural input device. In ICMI, pages 289–296, 2004.

[70] J.W. Mateer and J.A. Robinson. A vision-based postproduction tool for footage logging, analysis, and annotation. Graph. Models, 67(6):565–583, 2005.

[71] H.K. Nishihara. PRISM: A practical real-time imaging stereo matcher. Technical report, Cambridge, MA, USA, 1984.

[72] C. Nölker and H. Ritter. Detection of fingertips in human hand movement sequences. In I. Wachsmuth and M. Fröhlich, editors, Gesture and Sign Language in Human-Computer Interaction, Proceedings of the International Gesture Workshop 1997, pages 209–218. Springer, 1998.

[73] S. O’Mahony and J.A. Robinson. Penpets: a physical environment for virtual animals. In CHI ’03: CHI ’03 extended abstracts on Human factors in computing systems, pages 622–623, New York, NY, USA, 2003. ACM.
[74] D. Parnham. An Infrastructure for Video-Augmented Environments. PhD thesis, University of York, February 2007.

[75] D. Parnham, J.A. Robinson, and Y. Zhao. A compact fiducial for affine augmented reality. Second International Conference on Visual Information Engineering (VIE), pages 347–352, April 2005.

[76] J.L. Posdamer and M.D. Altschuler. Surface measurement by space-encoded projected beam system. CGIP, 18(1):1–17, January 1982.

[77] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, January 1993.

[78] F. Quek, T. Mysliwiec, and M. Zhao. FingerMouse: a freehand pointing interface. In International Workshop on Automatic Face and Gesture Recognition, pages 372–377, Zurich, Switzerland, June 1995.

[79] J. Renno, J. Orwell, and G. Jones. Learning surveillance tracking models for the self-calibrated ground plane, 2002.

[80] J.A. Robinson. Collaborative vision and interactive mosaicing. Vision, Video and Graphics (VVG), July 2003.

[81] C. Rocchini, P. Cignoni, C. Montani, P. Pingi, and R. Scopigno. A low cost 3D scanner based on structured light. EUROGRAPHICS, 20(3):299–308, 2001.

[82] S. Roy and I.J. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In ICCV, pages 492–502, 1998.

[83] G. Sansoni, S. Lazzari, S. Peli, and F. Docchio. 3-D imager for dimensional gauging of industrial workpieces: State-of-the-art of the development of a robust and versatile system. 3DIM, 0:19, 1997.
[84] K. Sato and S. Inokuchi. Three-dimensional surface measurement by space encoding range imaging. J. Robotic Systems, 2(1):27–39, 1985.

[85] Y. Sato, Y. Kobayashi, and H. Koike. Fast tracking of hands and fingertips in infrared images for augmented desk interface. In FG ’00: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000, page 462, Washington, DC, USA, 2000. IEEE Computer Society.

[86] D. Scharstein and R. Szeliski. Stereo matching with non-linear diffusion. Technical Report TR96-1575, 18, 1996.

[87] D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision, Kauai, HI, Dec. 2001.

[88] P.H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, March 1966.

[89] L. Shapiro and G. Stockman. Computer Vision. Prentice Hall, 2001.

[90] L.S. Shapiro, H. Wang, and J.M. Brady. A matching and tracking strategy for independently moving objects. Proc. 3rd British Machine Vision Conference, pages 306–315, September 1992.

[91] H. Ishikawa and D. Geiger. Occlusions, discontinuities, and epipolar lines in stereo. Lecture Notes in Computer Science, 1406:232–248, 1998.

[92] D. Sinclair, A. Blake, S. Smith, and S. Rothwell. Planar region detection and motion recovery. In 3rd British Machine Vision Conference, 1992.

[93] Q. Stafford-Fraser and P. Robinson. BrightBoard: A video-augmented environment. In CHI, pages 134–141, 1996.
[94] C. Stauffer and W.E.L. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):747–757, 2000.

[95] G. Strang. Introduction to Linear Algebra, Third Edition. Wellesley-Cambridge Press, March 2003.

[96] B. Thomas and W. Piekarski. Glove based user interaction techniques for augmented reality in an outdoor environment, 2002.

[97] M. Trobina. Error model of a coded-light range sensor, 1995.

[98] E. Trucco, R.B. Fisher, A.W. Fitzgibbon, and D.K. Naidu. Calibration, data consistency and model acquisition with a 3-D laser striper. RobCIM, 11(4):292–310, 1998.

[99] R.Y. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 364–374, Miami Beach, FL, 1986.

[100] Vision3dWeb. (Last retrieved April 2008).

[101] The National Museum of Scotland website, 2008. (Last retrieved May 2008).

[102] P. Wellner. The DigitalDesk calculator: tangible manipulation on a desk top display, 1991.

[103] P. Wellner. Adaptive thresholding on the DigitalDesk. EuroPARC Technical Report EPC-93-110, 1993.

[104] P. Wellner. Interacting with paper on the DigitalDesk. Communications of the ACM, 36(7):86–97, 1993.
[105] G. Wiora. High-resolution measurement of phase-shift amplitude and numeric object phase calculation. Vision Geometry IX, 4117(1):289–299, 2000.

[106] R.J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19:139–144, 1980.

[107] R.J. Woodham. Determining surface curvature with photometric stereo. IEEE Conf. Robotics & Automation, pages 36–42, 1989.

[108] R.J. Woodham. Gradient and curvature from photometric stereo including local confidence estimation. Journal of the Optical Society of America, 11(11):3050–3068, 1994.

[109] L. Zhang, B. Curless, and S.M. Seitz. Rapid shape acquisition using color structured light and multi-pass dynamic programming. The 1st IEEE International Symposium on 3D Data Processing, Visualization, and Transmission, pages 24–36, June 2002.

[110] L. Zhang, B. Curless, and S. Seitz. Spacetime stereo: Shape recovery for dynamic scenes. International Conference on Computer Vision and Pattern Recognition, Madison, WI, pages 367–374, June 2003.

[111] Z.Y. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.
229
Appendix A

Declarations for class CButton

#pragma once


// number of buttons
#define NUM_BUTTONS 59

// default threshold used for the inner region, if button calibration is skipped
#define BUTTON_TH_INNER 10.00

// default threshold used for the outer region, if button calibration is skipped
#define BUTTON_TH_OUTER 5.0

// the time period a button stays highlighted for, in milliseconds
#define BUTTON_INT 500


// Top-left corners of all buttons
int button_pos[NUM_BUTTONS*2] =
{
    954, 618, // 0: lock // 60x40
    954, 538, // 1: save
    954, 458, // 2: SL
    954, 378, // 3: SL_repeat
    954, 298, // 4: exit

    10, 588,  // 5: thumbnail (80 x 60)
    10, 508,  // 6: thumbnail (80 x 60)
    10, 428,  // 7: thumbnail (80 x 60)
    10, 348,  // 8: thumbnail (80 x 60)
    10, 268,  // 9: thumbnail (80 x 60)
    10, 188,  // 10: thumbnail (80 x 60)

    80, 698,  // 11: INSPECT MODE
    150, 698, // 12: TOUCHUP MODE
    220, 698, // 13: CORRESPONDENCE MODE
    290, 698, // 14: visualization mode

    380, 698, // 15: up
    460, 698, // 16: down
    540, 698, // 17: left
    620, 698, // 18: right
    700, 698, // 19: in
    780, 698, // 20: out

    370, 698, // 21: v_expand
    440, 698, // 22: v_shrink
    510, 698, // 23: h_expand
    580, 698, // 24: h_shrink
    660, 698, // 25: roi_up
    730, 698, // 26: roi_down
    800, 698, // 27: roi_left
    870, 698, // 28: roi_right

    370, 673, // 29: touchpad (150 x 90)
    530, 698, // 30: double cursor speed
    600, 698, // 31: push button

    884, 698, // 32: manual search
    954, 698, // 33: mouse assisted

    880, 698, // 34: param
    954, 698, // 35: proceed

    370, 673, // 36: R
    370, 723, // 37: T

    450, 698, // 38: R_x+
    520, 698, // 39: R_x-
    590, 698, // 40: R_y+
    660, 698, // 41: R_y-
    730, 698, // 42: R_z+
    800, 698, // 43: R_z-

    450, 698, // 44: T_x+
    520, 698, // 45: T_x-
    590, 698, // 46: T_y+
    660, 698, // 47: T_y-
    730, 698, // 48: T_z+
    800, 698, // 49: T_z-

    870, 698, // 50: x1, x2, x4, x8

    450, 698, // 51: pointset0
    520, 698, // 52: pointset1
    590, 698, // 53: pointset2
    660, 698, // 54: pointset3
    730, 698, // 55: pointset4
    800, 698, // 56: pointset5

    880, 698, // 57: no
    870, 618  // 58: tuning pose
};


// Button IDs
enum BUTTON_ID
{
    SYS_LOCK,
    SYS_SAVE,
    SYS_SL,
    SYS_SL2,
    SYS_EXT,

    THUMB_0,
    THUMB_1,
    THUMB_2,
    THUMB_3,
    THUMB_4,
    THUMB_5,

    MODE_INSPECT,
    MODE_TOUCHUP,
    MODE_CORRESP,
    MODE_VISUAL,

    CTRL_UP,
    CTRL_DOWN,
    CTRL_LEFT,
    CTRL_RIGHT,
    CTRL_IN,
    CTRL_OUT,

    CTRL_ROI_VEXPAND,
    CTRL_ROI_VSHRINK,
    CTRL_ROI_HEXPAND,
    CTRL_ROI_HSHRINK,
    CTRL_ROI_UP,
    CTRL_ROI_DOWN,
    CTRL_ROI_LEFT,
    CTRL_ROI_RIGHT,

    CTRL_TOUCHPAD,
    CTRL_DOUBLE_SPEED,
    CTRL_PUSHBUTTON,

    CTRL_MANUAL,
    CTRL_MOUSE,

    CTRL_PARAM,
    CTRL_PROCEED,

    CTRL_R,
    CTRL_T,

    CTRL_R_XP,
    CTRL_R_XM,
    CTRL_R_YP,
    CTRL_R_YM,
    CTRL_R_ZP,
    CTRL_R_ZM,

    CTRL_T_XP,
    CTRL_T_XM,
    CTRL_T_YP,
    CTRL_T_YM,
    CTRL_T_ZP,
    CTRL_T_ZM,

    CTRL_CHANGE_SPEED,

    CTRL_SELECT_0,
    CTRL_SELECT_1,
    CTRL_SELECT_2,
    CTRL_SELECT_3,
    CTRL_SELECT_4,
    CTRL_SELECT_5,

    CTRL_NO,
    CTRL_TUNING_POSE
};


//--------------------------------------------
// CButton class declaration
//--------------------------------------------

class CButton
{
private:
    CvRect mProRect;      // button position/size in projector image
    CvRect mCamRect;      // button position/size in camera image
    CvRect mCamInnerRect; // inner region for button push detection

    char *mpImageName; // name of the image to be loaded for the button
    char *mpHelpText1; // help text, 1st line
    char *mpHelpText2; // help text, 2nd line

    bool mFlagActive;      // flag indicating whether the current button is engaged
    bool mFlagHighlighted; // flag indicating whether the current button is highlighted

    // Constructor
    CButton();

    // Destructor
    ~CButton();

public:

    // Based on the size in the projector image, calculate the buttons' expected positions in the camera image
    void SetSize(int px, int py, int pxsize, int pysize);

    // Initialise the nth button
    void Initialise(int n);

    // On a given image, get the inner region average
    double GetInnerAvg(picture_of_int *inpic);

    // On a given image, get the outer region average
    double GetOuterAvg(picture_of_int *inpic);

    bool Pressed();
    bool Released();
    void Flash();
    void Highlight();
    void Dehighlight();

    void Attach();
    void Detach();
    void AttachText();
    void DetachText();
    void AttachNewText(char *inText1, char *inText2);

    // Black text with white background, as opposed to normal text
    void AttachInverseText();

    void DrawButtonBoundary(colour_picture &inpic);
};
Listing A.1: Header: Button.h
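Listing A.1 only declares the push-detection interface: `GetInnerAvg`, `GetOuterAvg`, and the calibrated thresholds `BUTTON_TH_INNER` / `BUTTON_TH_OUTER`. The following is a minimal, self-contained sketch of how such thresholded region averaging can detect a fingertip covering a projected button; the function names and the exact decision rule are illustrative assumptions, not taken from the thesis implementation.

```cpp
#include <cassert>

// Average the pixel intensities inside a rectangular region of a
// row-major greyscale image (cf. CButton::GetInnerAvg / GetOuterAvg).
double regionAverage(const int *pixels, int imageWidth,
                     int rx, int ry, int rw, int rh)
{
    long sum = 0;
    for (int y = ry; y < ry + rh; ++y)
        for (int x = rx; x < rx + rw; ++x)
            sum += pixels[y * imageWidth + x];
    return static_cast<double>(sum) / (rw * rh);
}

// Hypothetical press test: a fingertip covering the button darkens the
// inner region relative to its immediate surround, so report a press
// when the inner average falls below the outer average by more than
// the calibrated threshold.
bool innerRegionPressed(double innerAvg, double outerAvg, double threshold)
{
    return (outerAvg - innerAvg) > threshold;
}
```

In the real system the two averages would come from the camera image regions `mCamInnerRect` and its surround, with the threshold set per button during calibration.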
Appendix B

Declarations for class CPointSet

#pragma once

#include <vector>
#include "XMLParser.h"

using namespace std;

typedef std::vector<CvMat *> CvMat_vector;
typedef std::vector<CvScalar> CvScalar_vector;


//--------------------------------------------
// CPointSet class declaration
//--------------------------------------------
class CPointSet
{
private:

    //------------------------
    // Main data
    //------------------------
    int mLength;           // total number of points
    CvMat *mpObjectPoints; // 3D coordinates
    CvMat *mpImagePoints;  // 2D positions
    CvScalar *mpColour;    // colour information

    int mLength_bk;
    CvMat *mpObjectPoints_bk;
    CvMat *mpImagePoints_bk;
    CvScalar *mpColour_bk;

    //------------------------
    // Matrices
    //------------------------
    CvMat *mpCentroid;
    CvMat *mpRvec; // 3x1 instant rotation vector
    CvMat *mpTvec; // 3x1 instant translation vector
    CvMat *mpRvecInter[NUM_VIEWS]; // 3x1 inter-pointset rotation vectors
    CvMat *mpTvecInter[NUM_VIEWS]; // same as above, but vectors for translation
    int mMergedGroup; // which group this point set is merged to: -1 for non-merge, 0 for group0, 1 for group1, and so on...

    //------------------------
    // Rendered images
    //------------------------
    picture_of_int *mpImageBwPic;    // black and white model
    colour_picture *mpImageColorPic; // model attached with colour information


    //-------------------------------------------------
    // Constructor, Destructor
    //-------------------------------------------------
    CPointSet();
    ~CPointSet();

    // Overloaded assignment operator, for point set replication
    CPointSet& operator=(CPointSet& param);

public:
    //-------------------------------------------------
    // Primary functions
    //-------------------------------------------------

    // Load point set from XML
    void LoadXML(char *fileName);

    // Save point set to XML
    void SaveXML(char *fileName);

    // Reallocate both front data and backup data
    void ReallocateAllMemory(int len, int len_bk);

    // Reallocate memory for front data with size of len
    void ReallocateFrontMemory(int len);

    // Reallocate memory for backup data with size of len
    void ReallocateBackMemory(int len);

    // Replace front data with backup
    void ResetFromBackup();

    // Save front data into backup
    void SaveToBackup();

    // Default -1 means list all data; otherwise list the nth element
    void List(int index=-1);

    // Given a 2D image coordinate, find the point in the point set, and return its index
    int GetIndex(int xin, int yin);

    // Cut off out-of-boundary points and zero-depth points
    void RestrictSize(int size);

    // Slim with voxel quantisation
    void Slim(int objSize, int voxSize);


    //-------------------------------------------------
    // Point set transform in 3D
    //-------------------------------------------------
    void UpdateCentroid();
    void Rotate();
    void Translate();

    // Rotate + Translate + UpdateCentroid
    void FullTransform();

    // Rotate about the WCS origin
    void RotateAboutOrigin();

    // Theta rotation about unit vector (x, y, z)
    void RotateThetaAboutVector();

    // Manually fine-tune rotation or translation. flag: -1, do nothing; flag: 1~6 for rotation; flag: 7~12 for translation
    void StepAdjustRorT(int flag=-1);


    //-------------------------------------------------
    // Plotting and display
    //-------------------------------------------------

    // Draw rendered point set into an image for display
    void DrawBw(int flagTopHalf=0, int flagInterp=0, int interpStep=1);

    // Draw rendered point set into an image for display (with colour info attached)
    void DrawColor(int flagTopHalf=0, int flagInterp=0, int interpStep=1);
};
Listing B.1: Header: PointSet.h
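Listing B.1 stores the point set's pose as 3x1 vectors `mpRvec` and `mpTvec`, the axis-angle convention used by OpenCV's `cvRodrigues2`: the vector's direction is the rotation axis and its norm is the angle in radians. A minimal sketch of applying such a rotation vector to a single 3D point via the Rodrigues formula follows; the helper name is hypothetical, and the real `Rotate()` operates on whole `CvMat` point arrays rather than one point at a time.

```cpp
#include <cassert>
#include <cmath>

// Rotate point p by the axis-angle vector rvec (Rodrigues formula):
//   p' = p cos(theta) + (k x p) sin(theta) + k (k . p)(1 - cos(theta))
// where theta = |rvec| and k = rvec / theta.
void rotateByRvec(const double rvec[3], const double p[3], double out[3])
{
    double theta = std::sqrt(rvec[0]*rvec[0] + rvec[1]*rvec[1] + rvec[2]*rvec[2]);
    if (theta < 1e-12) { out[0] = p[0]; out[1] = p[1]; out[2] = p[2]; return; }

    double k[3] = { rvec[0]/theta, rvec[1]/theta, rvec[2]/theta }; // unit axis
    double c = std::cos(theta), s = std::sin(theta);

    double cross[3] = { k[1]*p[2] - k[2]*p[1],    // k x p
                        k[2]*p[0] - k[0]*p[2],
                        k[0]*p[1] - k[1]*p[0] };
    double dot = k[0]*p[0] + k[1]*p[1] + k[2]*p[2]; // k . p

    for (int i = 0; i < 3; ++i)
        out[i] = p[i]*c + cross[i]*s + k[i]*dot*(1.0 - c);
}
```

`FullTransform()` would then amount to applying this rotation to every point, adding `mpTvec`, and recomputing the centroid.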
Appendix C

Declarations for class CView

#pragma once

// number of views (maximum allowed); defined before PointSet.h is
// included, since CPointSet uses NUM_VIEWS
#define NUM_VIEWS 6

// number of views to be tested, debug mode
#define TESTING_VIEWS 5

#include "PointSet.h"
#include "Cursor.h"


//--------------------------------------------
// CView class declaration
//--------------------------------------------

class CView
{
private:
    int mViewIndex; // index of the current view

    // Four sub-images for display
    picture_of_int *mpDepthPic;  // depth map
    picture_of_int *mpTextPic;   // texture map
    colour_picture *mpModelPic;  // rendered model
    colour_picture *mpColourPic; // colour map

    // 4 sub-rects, each half the size of 640x480
    CvRect mDepthRect, mTextRect, mModelRect, mColourRect;

    // Thumbnail position
    CvRect mThumbRect;

    // Buffer image for fast push and pop of the central display area
    colour_picture *mpCentralDisplayPic;

    // Cursor member, for cursor rendering and positioning
    CCursor mCursor;

    // Point set member
    CPointSet *mPointset;
    // Flag indicating current tuning mode (rotation or translation)
    bool mFlagFineTuneRorT;
    // If the pointset of the current view is merged away to other views, set it true.
    bool flag_PointSetMergedAway;

    // ROI for image registration (left image)
    CvRect mCorrespROIRect1;
    // ROI for image registration (right image)
    CvRect mCorrespROIRect2;

    // Constructor
    CView(int);
    // Destructor
    ~CView();

public:

    //--------------------------------------------
    // Primary functions
    //--------------------------------------------

    // Initialise current view, allocate memory, assign positions
    void Initialise();

    // Get ROI based on object dimension and pointset centroid, then work out the estimated area the object is going to appear in within the observed image, and crop it.
    void PrepareThumbnail(char* fname);

    // Attach four sub-images
    void AttachDisplay();

    void PushCentralDisplay();
    void PopCentralDisplay();
    void ClearCentralDisplay();
    void FadeCentralDisplay();
    void UnfadeCentralDisplay();

    // Attach small box on thumbnail and big box on central display, draw all connection lines
    void AttachBox(int Rval, int Gval, int Bval);
    void DetachBox();


    //--------------------------------------------
    // Touchup mode
    //--------------------------------------------

    // Adjust rendered model picture, based on incoming flag n=0~5: up, down, left, right, in, out
    void AdjustFijipic(int n);

    // Do TouchUp on depth image, based on current cursor location. This will change contents of depth data, point set data, and colour map, all with backup. Once done, set flagTouchUpModified = true
    void TouchUp();

    bool flagTouchUpModified;


    //--------------------------------------------
    // Correspondence mode
    //--------------------------------------------

    // Called when user selects 'from' and 'to' image for registration
    void UpdateCorrespSelectionDisplay(int selection);

    // Same as above, just remove everything completely (without any repairs)
    void RemoveCorrespThumbMainDisplayCompletely();

    // System gives trial ROI selections
    void AutoSelectRODisplay();

    // Select ROI of the chosen image, slide them into centre for better view
    void SlideImages();


    //--------------------------------------------
    // Visualize mode
    //--------------------------------------------

    // Default is -1: do nothing; flag 0~5: for rotations; flag 6~11: for translations
    void UpdateVisualPanelArea(int flag = -1);
};
Listing C.1: Header: View.h
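The comment on `PrepareThumbnail()` in Listing C.1 describes cropping the "estimated area the object is going to appear in" around the projected point-set centroid. A hedged sketch of that cropping step is given below: centre a w x h region on the projected centroid (cx, cy) and clamp it to the image bounds. The function name, the `Roi` struct, and the clamping policy are all assumptions for illustration, not taken from View.h.

```cpp
#include <cassert>
#include <algorithm>

// A stand-in for CvRect, so the sketch is self-contained.
struct Roi { int x, y, w, h; };

// Centre a w x h ROI on (cx, cy) and clamp it so it lies entirely
// inside an imgW x imgH image.
Roi roiAroundCentroid(int cx, int cy, int w, int h, int imgW, int imgH)
{
    Roi r;
    r.w = std::min(w, imgW);
    r.h = std::min(h, imgH);
    // Shift the ROI back inside the image if the centroid is near an edge.
    r.x = std::max(0, std::min(cx - r.w / 2, imgW - r.w));
    r.y = std::max(0, std::min(cy - r.h / 2, imgH - r.h));
    return r;
}
```

With an ROI computed this way, the thumbnail is simply the cropped sub-image, matching the 80 x 60 thumbnail slots declared in Listing A.1.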