Graz University of Technology
Institute for Computer Graphics and Vision

Dissertation

High-Performance Modeling From Multiple Views Using Graphics Hardware

Christopher Zach

Graz, Austria, February 2007

Thesis supervisors
Prof. Dr. Franz Leberl, Graz University of Technology
Prof. Dr. Horst Bischof, Graz University of Technology
Abstract

Generating 3-dimensional virtual representations of real-world environments is still a challenging scientific and technological objective. Photogrammetric computer vision methods enable the creation of virtual copies from a set of acquired images. These methods are usually based on either off-the-shelf digital cameras or large-scale sensors. High-quality image-based models with minimal human assistance are achieved by ensuring sufficient redundancy in the image content. As a consequence, a large amount of image data needs to be captured and subsequently processed. Recent advances in the computational performance of graphics processing units (GPUs) and in their programmable features make these devices a natural platform for generic high-performance parallel processing. In particular, several fundamental computer vision methods can be successfully accelerated by graphics hardware due to their intrinsic parallelism and the highly efficient filtered pixel access.

The contribution of this thesis is the development of several new 3D vision algorithms intended for efficient execution on current-generation GPUs. All proposed methods address the fully automated creation of dense 2.5D and 3D geometry of objects and environments captured in a sequence of images. The range of presented methods starts with simple, purely local approaches that have very efficient implementations. Furthermore, a novel formulation of a semi-global depth estimation approach suitable for fast execution on the GPU is presented. In addition, it is shown that variational methods for depth estimation can benefit significantly from GPU acceleration as well. Finally, highly efficient methods are presented which generate 3D models from the input image set, either directly from the images or indirectly via intermediate 2.5D geometry. The performance of the developed methods and their respective implementations is evaluated on artificial datasets to obtain quantitative results, and demonstrated in real-world applications as well. The proposed methods are incorporated into a complete 3D vision pipeline, which was successfully applied in several research projects.

Keywords. Multiple view reconstruction, depth estimation, dynamic programming, variational depth map evolution, space carving, volumetric range image integration, general purpose programming on graphics processing units (GPGPU), GPU acceleration
Acknowledgments

Writing a PhD thesis is a large-scale project. Everybody with a PhD degree knows this simple fact from their own experience. Although the primary responsibility for making progress with the thesis lies with the candidate, the support of many other people is essential for a successful completion. This section is the place to mention and thank those people who helped me directly or indirectly in preparing this thesis.
First I need to thank my thesis supervisors, Prof. Franz Leberl and Prof. Horst Bischof from the Institute for Computer Graphics and Vision, for their advice during my time as a PhD student. In those times when Prof. Leberl was engaged with highly ambitious projects, Prof. Bischof provided significant guidance for my scientific work.
During my PhD time I was a researcher at the VRVis Research Center for Virtual Reality and Visualization, and this thesis was largely funded by this research company. I would like to thank my current and former colleagues from VRVis Graz and Vienna for the opportunity of this position and for their collaboration.
In particular, the full reconstruction pipeline creating virtual copies from a set of images contains many more steps than those developed by me during this thesis. Several stages in the pipeline are the work of my colleagues in the “Virtual Habitat” group at VRVis. First I would like to thank Mario, who acquired many of the source images and is mainly responsible for the first steps in the modeling pipeline. The textures for the final 3D models displayed in this thesis were generated by Lukas as part of his master’s thesis.
I would like to thank Dr. Ivana Kolingerova and her PhD students from Plzen, who invited me to work for several weeks in this really nice town. I spent almost two months there (including the annual WSCG conference).
During my time as a PhD student I advised three master’s students: Mario, Lukas and Manni, who all did valuable work on their respective projects. Mario and Lukas started working at VRVis after finishing their master’s theses. Manni began working at the associated computer vision institute, hence I guess I didn’t discourage those students too much.
Having the office located directly at the Institute for Computer Graphics and Vision proved highly beneficial. Several new ideas were developed during personal talks with the institute members. In particular, I would like to thank the current and former attendees of the espresso club, namely Bernhard, Horst, Martina, Mike, Tom (2x), Pierre and last
but not least Roli, whose legendary parties will be remembered for a long, long time. Additionally, I had fruitful and interesting discussions with Peter, Matthias, Suri, Markus, and especially with Martin, who has shared the office with me for so many years now, as well as with Alex.
Finishing this thesis would not have been possible without some additional activities to free the mind and relax the body. First I would like to thank all Aikido teachers and fellows on the tatami from Graz, who have worked hard for the last seven years to make my body less stiff.
Furthermore, I would like to thank Vera for persuading me to start dancing lessons with her. She is not only a clever and ambitious person, but she also turned out to be a gifted partner in the dance hall.
Graz, January 2007
Christopher Zach
The problem is not that people will steal your ideas. On the contrary, your job as an academic is to ensure that they do.

Tom’s advice, according to Frank Dellaert
Contents

1 Introduction
  1.1 Introduction
  1.2 Using Graphics Processing Units for Computer Vision
  1.3 3D Models from Multiple Images
  1.4 Overview of this Thesis and Contributions

2 Related Work
  2.1 Dense Depth and Model Estimation
    2.1.1 Computational Stereo on Rectified Images
    2.1.2 Multi-View Depth Estimation
    2.1.3 Direct 3D Model Reconstruction
  2.2 GPU-based 3D Model Computation
    2.2.1 General Purpose Computations on the GPU
    2.2.2 Real-time and GPU-Accelerated Dense Reconstruction from Multiple Images

3 Mesh-based Stereo Reconstruction Using Graphics Hardware
  3.1 Introduction
  3.2 Overview of Our Method
    3.2.1 Image Warping and Difference Image Computation
    3.2.2 Local Error Summation
    3.2.3 Determining the Best Local Modification
    3.2.4 Hierarchical Matching
  3.3 Implementation
    3.3.1 Mesh Rendering and Image Warping
    3.3.2 Local Error Aggregation
    3.3.3 Encoding of Integers in RGB Channels
  3.4 Performance Enhancements
    3.4.1 Amortized Difference Image Generation
    3.4.2 Parallel Image Transforms
    3.4.3 Minimum Determination Using the Depth Test
  3.5 Results
  3.6 Discussion

4 GPU-based Depth Map Estimation using Plane Sweeping
  4.1 Introduction
  4.2 Plane Sweep Depth Estimation
    4.2.1 Image Warping
    4.2.2 Image Correlation Functions
      4.2.2.1 Efficient Summation over Rectangular Regions
      4.2.2.2 Normalized Correlation Coefficient
    4.2.3 Sum of Absolute Differences and Variants
    4.2.4 Depth Extraction
  4.3 Sparse Belief Propagation
    4.3.1 Sparse Data Structures
      4.3.1.1 Sparse Data Cost Volume During Plane-Sweep
      4.3.1.2 Sparse Data Cost Volume for Message Passing
    4.3.2 Sparse Message Update
      4.3.2.1 Sparse 1D Distance Transform
  4.4 Depth Map Smoothing
  4.5 Timing Results
  4.6 Visual Results
  4.7 Discussion

5 Space Carving on 3D Graphics Hardware
  5.1 Introduction
  5.2 Volumetric Scene Reconstruction and Space Carving
  5.3 Single Sweep Voxel Coloring in 3D Hardware
    5.3.1 Initialization
    5.3.2 Voxel Layer Generation
    5.3.3 Updating the Depth Maps
    5.3.4 Immediate Visualization
  5.4 Extensions to Multi Sweep Space Carving
  5.5 Experimental Results
    5.5.1 Performance Results
    5.5.2 Visual Results
  5.6 Discussion

6 PDE-based Depth Estimation on the GPU
  6.1 Introduction
  6.2 Variational Techniques for Multi-View Depth Estimation
    6.2.1 Basic Model
    6.2.2 Regularization
    6.2.3 Extensions and Variations
      6.2.3.1 Back-Matching
      6.2.3.2 Local Changes in Illumination
      6.2.3.3 Other Variations
  6.3 GPU-based Implementation
    6.3.1 Image Warping
    6.3.2 Regularization Pass
    6.3.3 Depth Update Equation
      6.3.3.1 Jacobi Iterations
      6.3.3.2 Conjugate Gradient Solver
    6.3.4 Coarse-to-Fine Approach
  6.4 Results
    6.4.1 Facade Datasets
    6.4.2 Small Statue Dataset
    6.4.3 Mirabellstatue Dataset
  6.5 Discussion

7 Scanline Optimization for Stereo On Graphics Hardware
  7.1 Introduction
  7.2 Scanline Optimization on the GPU for 2-Frame Stereo
    7.2.1 Scanline Optimization and Min-Convolution
    7.2.2 Overall Procedure
    7.2.3 GPU Implementation Enhancements
      7.2.3.1 Fewer Passes Through Bidirectional Approach
      7.2.3.2 Disparity Tracking and Improved Parallelism
      7.2.3.3 Readback of Tracked Disparities
    7.2.4 Results
  7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware
    7.3.1 Input Data and General Setting
    7.3.2 Similarity Scores based on Incremental Summation
    7.3.3 Sensor Image Warping
    7.3.4 Slice Management
    7.3.5 SAD Calculation
    7.3.6 Normalized Cross Correlation
    7.3.7 Depth Extraction by Scanline Optimization
    7.3.8 Memory Requirements
    7.3.9 Results
  7.4 Discussion

8 Volumetric 3D Model Generation
  8.1 Introduction
  8.2 Selecting the Volume of Interest
  8.3 Depth Map Conversion
  8.4 Isosurface Determination and Extraction
  8.5 Implementation Remarks
  8.6 Results
  8.7 Discussion

9 Results
  9.1 Introduction
  9.2 Synthetic Sphere Dataset
  9.3 Synthetic House Dataset
  9.4 Middlebury Multi-View Stereo Temple Dataset
  9.5 Statue of Emperor Charles VI
  9.6 Bodhisattva Figure

10 Concluding Remarks

A Selected Publications
  A.1 Publications Related to this Thesis
  A.2 Other Selected Scientific Contributions

Bibliography
List of Figures

1.1 Several reconstructed statue models
1.2 A possible pipeline to create virtual models from images
1.3 The reconstruction pipeline in an example
2.1 The stream computation model of a GPU
3.1 Mesh reconstruction from a pair of stereo images
3.2 The regular grid as seen from the key camera
3.3 The neighborhood of a currently evaluated vertex
3.4 The correspondence between vertex indices and grid positions
3.5 The basic workflow of the matching procedure
3.6 The modified pipeline to minimize P-buffer switches
3.7 Fragment program to write the depth component
3.8 Results for the artificial earth dataset
3.9 Results for a dataset showing the yard inside a historic building
3.10 Results for a dataset showing an apartment house
3.11 Visual results for the Merton college dataset
4.1 Plane sweeping principle
4.2 NCC images calculated on the CPU (left) and on the GPU (right)
4.3 Determining the lower envelope using a sparse 1D distance transform
4.4 Sparse belief propagation timing results wrt. the number of heap entries K
4.5 Depth images with and without belief propagation
4.6 Point models with and without belief propagation
4.7 Point models with and without belief propagation
4.8 Depth images with and without belief propagation
5.1 A possible configuration for plane sweeping through the voxel space
5.2 Perspective texture mapping using visibility information
5.3 Evolution of depth maps for two views during the sweep process
5.4 Plane sweep with partial knowledge from the preceding sweeps
5.5 Timing results for the Bowl dataset
5.6 Space carving results for the synthetic Dino dataset
5.7 Space carving results for the synthetic Bowl dataset
5.8 Space carving results for a statue dataset
5.9 Voxel coloring results for a statue dataset
6.1 Sparse structure of the linear system obtained from the semi-implicit approach
6.2 A reconstructed historical statue displayed as colored point set
6.3 The depth maps of the embedded statue reconstructed with the numerical schemes
6.4 The effect of bidirectional matching on the embedded statue scene
6.5 Two views on the colored point set showing the front facade of a church
6.6 The three source images and the resulting unsuccessful reconstruction of the statue
6.7 Two of the successfully reconstructed point sets using image segmentation to omit the background scenery
6.8 An enhanced depth map and 3D point set obtained using the truncated error model
6.9 The effect of image-driven anisotropic diffusion
7.1 Graphical illustration of the forward pass using a recursive doubling approach
7.2 Parallel processing of vertical scanlines using the bidirectional approach for optimal utilization of the four available color channels
7.3 Disparity images for the Tsukuba dataset for several horizontal resolutions generated by the GPU-based scanline approach
7.4 Disparity images for the Cones and Teddy image pairs from the Middlebury stereo evaluation datasets
7.5 Plane-sweep approach to multiple view matching
7.6 Plane sweep from left to right
7.7 Spatial aggregation for the correlation window using sliding sums
7.8 The three input views of the synthetic dataset
7.9 The obtained depth maps and timing results for the synthetic dataset using multiview scanline optimization on the GPU
7.10 The three input views of a wooden Bodhisattva statue and the corresponding depth maps
8.1 Classification of the voxel according to the depth map and camera parameters
8.2 Visual results for a small statue dataset generated from a sequence of 47 images
8.3 Source views and isosurfaces for two real-world datasets
9.1 Three source views of the synthetic sphere dataset
9.2 Depth estimation results for a view triplet of the sphere dataset
9.3 Fused 3D models for the sphere dataset wrt. the depth estimation method
9.4 Three source views of the synthetic house dataset
9.5 Fused 3D models for the synthetic house dataset wrt. the depth estimation method
9.6 Three generated depth maps of the synthetic house dataset
9.7 Three (out of 47) source images of the temple model dataset
9.8 Front and back view of the fused 3D model of the temple dataset based on the original camera matrices
9.9 Front and back view of the fused 3D model of the temple dataset based on newly calculated camera matrices
9.10 Two views of the statue showing Emperor Charles VI inside the state hall of the Austrian National Library
9.11 Medium resolution mesh for the Charles VI dataset
9.12 High resolution mesh for the Charles VI dataset
9.13 Two depth maps for the same reference view of the Charles dataset generated by the WTA and the SO approach
9.14 Every other of the 13 source images of the Bodhisattva statue dataset
9.15 Several depth images for the Bodhisattva statue
9.16 Medium and high resolution results for the Bodhisattva statue images
List of Tables

3.1 Timing results for the sphere dataset on two different graphics cards
4.1 Timing results for the plane-sweeping approach on the GPU with winner-takes-all depth extraction at different parameter settings and image resolutions
6.1 Regularization terms induced by diffusion processes
7.1 Average timing result for various dataset sizes in seconds/frame
7.2 Runtimes of GPU-scanline optimization using a 9 × 9 NCC at different resolutions using three views
9.1 Quantitative evaluation of the reconstructed spheres
9.2 Quantitative evaluation of the reconstructed synthetic house
9.3 Timing results for the Emperor Charles dataset
Chapter 1

Introduction

Contents
  1.1 Introduction
  1.2 Using Graphics Processing Units for Computer Vision
  1.3 3D Models from Multiple Images
  1.4 Overview of this Thesis and Contributions
1.1 Introduction

Creating a 3D virtual representation of a real object or scenery from images or other sensory data has many important real-world applications, ranging from city planning tasks performed by surveying offices, to the virtual conservation of historic buildings and objects, to entertainment and gaming applications creating virtual models of real and well-known locations. Consequently, the development of automated and reliable 3D model generation work-flows for data acquired by active and passive sensors is still an active research topic. In particular, creating 3D representations of real objects solely from multiple images is a challenging task, since the completely automated work-flow is based only on passive sensory data.
The development of suitable algorithms <strong>and</strong> methods <strong>for</strong> a multi-view reconstruction<br />
pipeline depends substantially on the objects of interest <strong>and</strong> on the number <strong>and</strong> quality<br />
of the acquired images. In order to enable a fully automated work-flow the images must<br />
contain substantial redundancy, i.e. the same 3D features must appear in several images.<br />
Furthermore, static <strong>and</strong> rigid objects are assumed in this work to make the traditional<br />
multiple view approaches <strong>for</strong> image registration applicable. A further question addresses<br />
the intended accuracy of the obtained models. As it is explained later in more detail, the<br />
major objectives of the methods developed in this <strong>thesis</strong> are achieving high per<strong>for</strong>mance <strong>for</strong><br />
immediate visual feedback to the user <strong>and</strong> attaining sufficient accuracy <strong>for</strong> photorealistic<br />
visualization of the virtual models. Dense meshes <strong>and</strong> depth maps generated from multiple<br />
views are usually not directly suitable for accurate 3D measurements, since the achievable accuracy, especially in weakly textured regions, is limited. Nevertheless, additional knowledge about the object of interest enables, for example, fitting geometric primitives into the dense mesh, potentially yielding higher accuracy.
The methods proposed in our work-flow are mainly designed for typical close-range imagery, but they are not strictly limited to such settings. To illustrate the kind of datasets to be reconstructed with our modeling pipeline, we first give a few examples of virtual models generated by the proposed work-flow. Figure 1.1 displays three 3D models generated solely from multiple images using the methods proposed in this thesis at several stages. In particular, efficient dense depth estimation methods (Chapters 4 and 7) were applied to obtain 2.5D height-fields, which were subsequently fused into a final 3D model using a volumetric approach (Chapter 8). All procedures in the 3D reconstruction pipeline to create 3D models solely from images are outlined briefly in Section 1.3.
The models displayed in Figure 1.1 are partially used for a historical documentation system∗. The generated models are high-resolution 3D meshes, which are intended for visualization when combined with a photorealistic texture.
1.2 Using Graphics Processing Units for Computer Vision
In this thesis we propose employing the computing power of modern programmable graphics processing units (GPUs) for several essential stages of the 3D reconstruction pipeline. One goal of this work is fast visual feedback for the human operator, who can immediately judge the quality of the results and, if necessary, adjust suitable parameters. Further, a substantial amount of redundancy in the image content is unavoidable when applying current methods for reconstruction from multiple views in order to achieve high-quality models. This implies that full 3D modeling of even a single object typically requires at least tens of images to be processed. Fast processing of these image sets is desirable, since obtaining the final model after two or after 20 minutes makes a substantial difference.† If special-purpose hardware, mainly graphics processing units but also digital signal processors (DSPs) and field programmable gate arrays (FPGAs), is employed in computer vision methods, several types of application can be distinguished:
1. The first scenario enforces real-time response within specified temporal limits, and special-purpose hardware provides the required processing power. Much of the initial research on accelerating computer vision methods was driven by the real-time needs of a particular application.

2. The main objective in the second setting is faster (but not necessarily real-time) processing by using special hardware intensively. Since the computational accuracy and

∗ www.josefsplatz.info
† Especially if the outcome is unsatisfying.
Figure 1.1: Several reconstructed statue models generated by our high-performance modeling pipeline. (a) The model of a small statue depicting St. Barbara. (b) The model of an outdoor statue of Emperor Joseph at the Josephsplatz. (c) The virtual model of the Emperor Karl statue inside the Austrian National Library. The displayed models are not post-processed (e.g. smoothed or geometrically simplified). In (a) and (c) some noise and clutter can be seen, which can be removed by incorporating silhouette data.
the programming model of special-purpose hardware are often limited, the quality of the result may be decreased compared with the outcome of CPU implementations. Finding an appropriate trade-off between higher performance and limited quality degradation is the challenge in this setting. Most methods proposed in this thesis fall into this category.
3. Finally, special-purpose hardware can be used purely as an auxiliary processing unit executing only fractions of the overall method. In this case there is typically no degradation in the quality of the result, but the achievable performance gain can be limited. Special-purpose hardware usually performs its computations asynchronously to the main CPU, hence a load-balanced implementation employing both processing units concurrently gives the largest gain. Most computer vision methods must be redesigned in order to benefit from this combined processing power.
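The load-balanced pattern of the third category can be sketched as follows. The worker functions and the workload are purely illustrative stand-ins, and a Python thread plays the role of the asynchronous device; only the overlap-then-synchronize structure is the point.

```python
# Sketch of the "auxiliary processing unit" pattern: the special-purpose
# device runs asynchronously, so CPU work is overlapped with it. A background
# thread stands in for the GPU/DSP part; function names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def device_part(data):            # stand-in for an asynchronous GPU kernel
    return [x * x for x in data]  # e.g. a per-pixel operation

def cpu_part(data):               # work the CPU performs concurrently
    return sum(data)

def process(data):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(device_part, data)  # "upload" + launch, non-blocking
        partial = cpu_part(data)                 # CPU proceeds immediately
        squares = future.result()                # synchronize ("read back")
    return partial, squares

print(process([1, 2, 3]))  # → (6, [1, 4, 9])
```

The gain of such a design is bounded by the slower of the two concurrent parts, which is why the text above stresses that methods usually have to be redesigned around this split.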
With the general availability of programmable graphics processing units and their large processing power it is natural that modern graphics hardware attracts many researchers aiming to accelerate their non-graphical applications as well. We focus on programmable graphics hardware as a computing device for the following reasons:
• Driven by the needs of the gaming industry, graphics hardware currently evolves much faster than traditional CPUs or other processing devices. Selected numerical operations perform almost 10 times faster on high-end graphics hardware than on high-end CPUs.

• A reasonably fast graphics processing unit is nowadays built into many consumer personal computers. Hence, the necessary hardware equipment is available to virtually everyone.

• Standardized programming interfaces that work with hardware from different vendors have recently become available. This allows our procedures to execute on a wide range of hardware not limited to a specific vendor. Additionally, the development cycle is eased by multi-vendor programming interfaces and tools.

• While performing non-graphical computations, the GPU can be directly used to display intermediate and final results to the operator, since the necessary data is already stored in GPU memory.
Due to these factors, modern graphics hardware is currently an ideal target platform for high-performance parallel computing.
Note that the rapid introduction of new features in every upcoming generation of graphics hardware requires constant adaptation of GPU-based methods to obtain maximal performance. Consequently, a continuous redesign of GPU-based implementations is still necessary, since new features may enable significant performance improvements, and various techniques to increase the speed on current hardware may become obsolete in next-generation graphics hardware. Nevertheless, we assume a stabilizing feature set for GPUs in the medium term.
Using the GPU as a major processing unit for non-graphical problems allows direct visualization of intermediate and final results without an additional performance penalty. We employ this feature in most of our proposed reconstruction methods to give the user direct visual feedback on the progress of the procedure. Whether immediate visual feedback (i.e. after a few seconds at most) is available depends on the reconstruction pipeline as well. Relatively simple methods, e.g. those developed for small-baseline image sets yielding a depth map, allow sequential processing of the whole dataset, and the first depth images are available with little delay. In these cases the provided intermediate results have full resolution, but refer only to a fraction of the final model. Sophisticated multiple-view methods incorporating all images simultaneously often do not have this fine granularity and generally provide no intermediate result at full resolution to the human operator. Typically, a coarse-to-fine scheme forms the basis of these methods, and intermediate results at coarser resolutions can be shown to the operator.
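A minimal sketch of such a coarse-to-fine preview scheme, assuming a plain 2x2 averaging pyramid; the per-level processing is a placeholder, and only the yielding of previews at increasing resolution reflects the behaviour described above.

```python
# Coarse-to-fine sketch: build an image pyramid, then emit every level from
# coarsest to finest as an intermediate preview for the operator.

def downsample(img):
    """Halve a 2D list in both dimensions by 2x2 averaging."""
    return [[(img[2*r][2*c] + img[2*r][2*c+1] +
              img[2*r+1][2*c] + img[2*r+1][2*c+1]) / 4.0
             for c in range(len(img[0]) // 2)]
            for r in range(len(img) // 2)]

def coarse_to_fine(img, levels):
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    # process from the coarsest level upwards, yielding every intermediate
    for level in reversed(pyramid):
        yield level   # an operator could inspect this preview

img = [[1.0] * 4 for _ in range(4)]
previews = list(coarse_to_fine(img, 3))
print([len(p) for p in previews])  # resolutions grow: [1, 2, 4]
```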
In any case, when processing larger datasets with different characteristics and from different sources, the opportunity to evaluate the outcome of the whole modeling pipeline visually at early processing stages proves very useful.
Although graphics processing units have very high computing power, the programming model of graphics hardware is limited. Consequently, the set of computer vision methods suitable for full acceleration by GPUs is restricted. For example, several highly sophisticated dense depth estimation methods are currently beyond the capabilities of programmable graphics hardware, or in the best case allow acceleration of only fractions of the whole procedure. Hence, only relatively simple (but still nontrivial) computer vision methods can fully benefit from graphics processing units so far.
Nevertheless, in many cases the 3D models created by our high-performance work-flow have sufficient quality for further processing and photorealistic display of the virtual models. The main contribution of this thesis is the adaptation of several multi-view reconstruction methods to enable an efficient implementation on graphics hardware in the first place. Further, the actual efficiency and the quality of the obtained 3D models are demonstrated on multiple real-world datasets.
1.3 3D Models from Multiple Images
The creation of virtual 3D models of real objects from a set of digital images requires a pipeline of several stages. The set of procedures applied in this pipeline depends on the actual setup and on the intended use of the generated model. The steps performed to create many of the virtual models shown in this thesis are illustrated in Figure 1.2.
Figure 1.2: A possible pipeline to create virtual models from images. The stages and intermediate products shown in the figure are: digital images → feature extraction → features/POIs → correspondence estimation (multi-view geometry) → sparse model → dense depth estimation → depth images → multi-view depth integration → raw 3D geometry → geometry processing → refined 3D geometry → texturing → textured 3D model.
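The data flow of Figure 1.2 can also be read as a chain of stage functions. In the sketch below every stage body is a stub with made-up placeholder values; only the composition order mirrors the pipeline.

```python
# The stages of Figure 1.2 expressed as composable functions. All stage
# bodies are stubs; only the data flow (images -> ... -> textured model)
# reflects the pipeline structure.

def feature_extraction(images):       return {"features": images}
def correspondence_estimation(d):     return {**d, "sparse_model": "poses"}
def dense_depth_estimation(d):        return {**d, "depth_maps": "per-view"}
def depth_integration(d):             return {**d, "raw_geometry": "mesh"}
def geometry_processing(d):           return {**d, "refined_geometry": "mesh"}
def texturing(d):                     return {**d, "textured_model": "done"}

PIPELINE = [feature_extraction, correspondence_estimation,
            dense_depth_estimation, depth_integration,
            geometry_processing, texturing]

def run(images):
    state = images
    for stage in PIPELINE:
        state = stage(state)
    return state

result = run(["img0.jpg", "img1.jpg"])
print("textured_model" in result)  # → True
```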
The steps in this pipeline are suitable for reconstructing a 3D object from many small-baseline images taken with a high-quality, already calibrated digital single lens reflex camera. If the images are recorded with a digital video camera or a cheap digital consumer camera, several (especially early) stages of the pipeline will be substantially different. We describe the individual processing steps in this pipeline briefly and outline the necessary adaptations in case of different source material.
Camera Calibration and Self-Calibration The term camera calibration often refers to two related, but nevertheless distinct, steps to obtain several parameters of the employed digital camera and its lens system: the first procedure determines lens distortion parameters to remove the deviations in the image induced by the optical lenses. Knowledge of the lens distortion and subsequent resampling of the source images allow the application of the simple pinhole camera model in the successive processing stages. The second part of the camera calibration step addresses the determination of the main parameters of the now applicable idealized pinhole camera model. These parameters are typically collected in a 3-by-3 upper triangular matrix
in a 3-by-3 upper triangular matrix<br />
K =<br />
⎛<br />
⎜<br />
⎝<br />
f s x0<br />
0 a f x1<br />
0 0 1<br />
Knowledge of this matrix allows the obtained 3D reconstructions to reside in a metric space, i.e. the obtained angles and length ratios correspond to those of the true model. Without additional knowledge it is not possible to determine the overall scale (or object size) solely from images.
The most important parameter in this matrix is the focal length f. If the focal length is incorrectly estimated, the resulting 3D model is severely distorted. The skew parameter s is determined by the x- and y-axes of the sensor pixels and is very close to zero for all practical cameras. Many calibration and especially self-calibration techniques assume orthogonal sensor axes and consequently s = 0. The aspect ratio parameter a is one for square-shaped sensor pixels, which is a very common assumption. The intersection of the optical axis with the image plane is called the principal point (x0, y0) and is usually close to the image center. Accurate estimation of the principal point is difficult (since moving the principal point can be largely compensated by a translation in world space), but the quality of the 3D model is only weakly affected by an incorrect principal point.
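As a small worked example, projecting a 3D point with the pinhole model and a matrix K of the form above. The parameter values are made up, and for brevity the camera frame is assumed to coincide with the world frame (identity rotation, zero translation).

```python
# Pinhole projection x = K X followed by dehomogenization, with
# K = [[f, s, x0], [0, a*f, y0], [0, 0, 1]] and X in camera coordinates.

def project(point3d, f, s, a, x0, y0):
    X, Y, Z = point3d                 # point in camera coordinates, Z > 0
    u = f * X + s * Y + x0 * Z        # first row of K
    v = a * f * Y + y0 * Z            # second row of K
    w = Z                             # third row of K
    return (u / w, v / w)             # dehomogenize

# Hypothetical intrinsics: f = 1000 px, zero skew, unit aspect ratio,
# principal point at the center of a 640x480 image.
print(project((0.1, -0.2, 2.0), f=1000.0, s=0.0, a=1.0, x0=320.0, y0=240.0))
# → (370.0, 140.0)
```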
Since we focus mainly on generating 3D models from images taken with precalibrated cameras, a standard camera calibration procedure [Heikkilä, 2000] using predefined targets is typically employed in our work-flow. Several images of a planar target with known circular control points are taken, and the camera matrices and lens distortion parameters are determined using a nonlinear optimization approach. The advantage of using precalibrated cameras is the high accuracy of the estimated intrinsic parameters of the camera. Hence, the subsequently calculated relative orientation and the dense depth estimation are based on reliable camera parameters and yield high-quality results.
On the other hand, good calibration results are mainly available for high-quality cameras, and usually fixed lenses set to infinite focus are required. A work-flow based on target calibration is only partially applicable to cheap consumer cameras with zooming and automatic focusing, and it typically fails for video sequences. Self-calibration methods attempt to recover the intrinsic camera parameters solely from image information like
correspondences between multiple views. Radial distortion parameters can be determined even from single images using extracted 2D lines [Devernay and Faugeras, 2001], but for real datasets some manual intervention is often necessary in order to connect short line segments belonging to the same object line [Schmidegg, 2005]. Of course, this approach requires that e.g. a building with dominant feature lines or a printed page with straight lines is captured by the camera.
During self-calibration the parameters of the pinhole camera model are determined by utilizing certain analytic properties of the epipolar geometry. Several self-calibration methods start with a projective reconstruction based on point correspondences and the induced fundamental matrices between the images. The inherent projective ambiguity can be resolved using algebraic invariants and reasonable assumptions on the camera model (like zero skew and square pixels) [Pollefeys et al., 1999, Nistér, 2001, Nistér, 2004b]. The main difficulty of these approaches is the creation of an initial accurate and outlier-free projective reconstruction, since the self-calibration procedures are very sensitive to incorrect input data. A simple self-calibration method not requiring a projective 3D reconstruction is proposed in [Mendonça and Cipolla, 1999]. This approach refines the intrinsic camera parameters to upgrade the supplied fundamental matrices to essential matrices, which have stronger algebraic properties. The essential matrix encodes the relative pose between two views and has fewer degrees of freedom than the fundamental matrix. In particular, the two non-zero singular values of an essential matrix are equal. This property is utilized in [Mendonça and Cipolla, 1999] to adjust initially provided camera intrinsics, such that the non-zero singular values of the upgraded fundamental matrices are as close as possible. We optionally employ this method even in the calibrated case to refine the camera intrinsic parameters for highest accuracy.
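The singular-value property exploited by [Mendonça and Cipolla, 1999] can be verified numerically. The relative pose below is an arbitrary illustration (not from the cited work), and numpy is assumed to be available for the SVD.

```python
# For a true essential matrix E = [t]x R the two non-zero singular values
# are equal and the third is zero. A self-calibration cost like
# (sigma_1 - sigma_2) / sigma_2 would be minimized over the intrinsics.
import numpy as np

def skew(t):
    """Cross-product matrix [t]x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

t = np.array([1.0, 2.0, 0.5])        # arbitrary translation direction
E = skew(t) @ rot_z(0.3)             # arbitrary relative rotation
sv = np.linalg.svd(E, compute_uv=False)
print(np.isclose(sv[0], sv[1]), np.isclose(sv[2], 0.0))  # → True True
```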
Feature Extraction Feature extraction selects image points or regions which carry significant structural information and can be identified in other images showing the same objects of interest. Commonly used point features are Harris corners [Harris and Stephens, 1988] and Förstner points [Förstner and Gülch, 1987]. Point features are well suited for sparse correspondence search, but extracting lines may be beneficial for images showing man-made structures. Instead of extracting isolated corner points, a set of edge elements (edgels for short) is determined [Canny, 1986] and subsequently grouped to obtain geometric line segments.
If the provided images are taken from rather different positions, more advanced features and local image descriptors are required. In particular, the projected size and shape of objects varies substantially in wide-baseline setups, which is addressed by scale- and affine-invariant feature detectors and descriptors, including the scale invariant feature transform [Lowe, 1999], intensity profiles [Tell and Carlsson, 2000], maximally stable extremal regions [Matas et al., 2002] and scale- and affine-invariant Harris points [Mikolajczyk and Schmid, 2004].
In our current work-flow we utilize Harris corners as primary point features, which are extended with either local image patches or intensity profiles as feature descriptors.
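A bare-bones sketch of the Harris corner response may make the detector concrete. It uses a 3x3 box window for the structure tensor instead of Gaussian weighting, performs no non-maximum suppression, and the synthetic test image and constant k = 0.04 are our own choices.

```python
# Harris response: accumulate the structure tensor M over a window of
# central-difference gradients, then score R = det(M) - k * trace(M)^2.

def harris_response(img, k=0.04):
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    R = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            a = b = c = 0.0
            for dy in (-1, 0, 1):          # 3x3 box window for the tensor
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    a += gx * gx; b += gx * gy; c += gy * gy
            R[y][x] = (a * c - b * b) - k * (a + c) ** 2
    return R

# White square on black background: the square's corner should score highest.
img = [[1.0 if (y >= 3 and x >= 3) else 0.0 for x in range(7)] for y in range(7)]
R = harris_response(img)
best = max((R[y][x], y, x) for y in range(7) for x in range(7))
print(best[1], best[2])  # → 3 3  (the corner pixel)
```

Along an edge the tensor is rank-deficient (det near zero), so R is small or negative there, which is exactly why the response isolates corners.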
Correspondence and Pose Estimation In order to relate a set of images geometrically, it is necessary to find correspondences, i.e. the images of identical scene objects. For the task of calculating the relative orientation between images it is suitable to extract features with good point localization, as provided by the feature extraction step. In a calibrated setting the relative orientation between two views can be calculated from five point correspondences. Hence, a RANSAC-based approach is used for robust initial estimation of the relative pose between two adjacent views. In order to test many samples, an efficient procedure for relative pose estimation is utilized [Nistér, 2004a]. With the knowledge of the relative poses between all consecutive views and corresponding point features visible in at least three images, the orientations of all views in the sequence can be upgraded to a common coordinate system. The camera poses and the sparse reconstruction, consisting of 3D points triangulated from point correspondences, are refined using a simple but efficient implementation of sparse bundle adjustment [Lourakis and Argyros, 2004]. This step concludes the pipeline establishing the 3D relationship for a sequence of images. The essential data generated by this pipeline are distortion-free images and the camera matrices relating positions in 3D space with 2D image locations.
In the case of video sequences it is sufficient to track simple point features over time and to apply a RANSAC scheme to obtain the relative poses of the images, which can optionally be accomplished in real time [Nistér et al., 2004]. In our setting, targeted at off-line reconstructions using high-resolution images, real-time determination of the geometric relationship between the views is not necessary. Nevertheless, high processing performance in these early reconstruction stages is relevant due to the number of captured images. Even reconstructing a small, isolated object like a statue easily results in 50 images of that object, which must be integrated into a common coordinate system.
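The RANSAC loop itself is generic: draw a minimal sample, hypothesize a model, count inliers, keep the best. The sketch below uses a toy one-parameter model (a 1D translation from a single correspondence) as a stand-in for the five-point relative pose solver, so the sample size, threshold and data are purely illustrative.

```python
# Generic RANSAC skeleton. In the real pipeline `solve` would be the
# five-point solver of [Nistér, 2004a] and `residual` an epipolar error.
import random

def ransac(data, solve, residual, threshold, iterations=100, seed=0):
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.choice(data)             # minimal sample (size 1 here)
        model = solve(sample)
        inliers = [d for d in data if residual(model, d) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy correspondences (x, x'): inliers obey x' = x + 5, plus two outliers.
pairs = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 60), (5, -3)]
solve = lambda p: p[1] - p[0]                 # translation hypothesis
residual = lambda t, p: abs((p[1] - p[0]) - t)
model, inliers = ransac(pairs, solve, residual, threshold=0.5)
print(model, len(inliers))  # → 5 4
```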
Foreground Segmentation If the 3D reconstruction of individual or free-standing objects is desired, an image segmentation procedure separating foreground objects from the unwanted background is suitable. If the result of this segmentation step accurately represents the silhouette of the object of interest, any shape-from-silhouette technique [Laurentini, 1995, Lok, 2001, Matusik et al., 2001, Li et al., 2003] can be applied to obtain a first coarse 3D model called the visual hull. When faced with many small-baseline images, manual segmentation of foreground pixels against a complex background is a tedious task. Hence, an automated or semi-automatic approach to generate the object silhouettes is reasonable in these cases. Specifying an initial object silhouette and propagating it through the image sequence is described in [Sormann et al., 2005, Sormann et al., 2006]. Silhouette information is partially used in the successive dense matching procedures to suppress unintended fragments in the final model.
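The carving idea behind shape-from-silhouette can be illustrated in two dimensions with orthographic "views". Real systems such as the cited ones use calibrated perspective projections and 3D voxel grids, so everything below is a deliberately simplified toy.

```python
# Toy 2D visual hull: every grid cell is kept only if it projects inside
# the silhouette of every view. Here the two views are axis-aligned
# orthographic projections (row and column silhouettes).

def visual_hull_2d(size, views):
    """views: list of (project, mask); project maps a cell (x, y) to an
    index into that view's 1D silhouette mask."""
    hull = set()
    for y in range(size):
        for x in range(size):
            if all(mask[project(x, y)] for project, mask in views):
                hull.add((x, y))
    return hull

size = 4
# Object occupies x in {1, 2}, y = 2; its two orthographic silhouettes:
sil_x = [0, 1, 1, 0]          # view along the y axis (depends on x only)
sil_y = [0, 0, 1, 0]          # view along the x axis (depends on y only)
views = [(lambda x, y: x, sil_x), (lambda x, y: y, sil_y)]
hull = visual_hull_2d(size, views)
print(sorted(hull))  # → [(1, 2), (2, 2)]
```

With only two views the hull can over-estimate the object (carving never removes cells consistent with all silhouettes), which is why the text calls it a first coarse model.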
Dense Depth Estimation With the knowledge of the camera parameters and the relative poses between the source views, dense correspondences for all pixels of a particular key view can be estimated. Since the epipolar geometry is already known, this procedure is basically a one-dimensional search along the epipolar line for every pixel. Triangulation of these correspondences results in a dense 3D model, which in ideal settings reflects the true surface geometry of the captured object.

In order to simplify the depth estimation task and to make it more robust, almost all dense depth estimation methods assume that opaque surfaces with diffuse reflection properties are to be reconstructed. In some approaches the lighting conditions and the exposure settings of the camera may change between the captured views to some extent. The depth map for a particular key view is usually estimated from a set of nearby views having a large overlap in their image content.
The major part of this thesis addresses the generation of dense depth maps, in particular Chapters 3, 4, 6 and 7. The main differences between dense depth estimation approaches in general are the utilized image dissimilarity function, which ranks potential correspondences on the epipolar line, and the handling of textureless regions, where the dissimilarity score is ambiguous and unreliable. Both factors influence the range of potential applications for a method and its performance in terms of time and 3D model quality. The main contribution of the chapters discussing dense depth estimation is the efficient generation of depth maps by utilizing the computational power and programming model of modern graphics hardware. The presented methods and implementations include several dissimilarity scores and different approaches to cope with regions containing indiscriminative surface texture.
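Reduced to a rectified image pair, the one-dimensional epipolar search becomes simple block matching: slide a window along the corresponding row and keep the disparity with the lowest dissimilarity score. Window size, search range and the 1D signals below are illustrative only; note how the flat tail is matched arbitrarily, echoing the textureless-region ambiguity discussed above.

```python
# 1D block matching with an SAD dissimilarity score along the epipolar line.

def sad(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def disparity_1d(left, right, radius=1, max_disp=4):
    disps = []
    for x in range(radius, len(left) - radius):
        win = left[x - radius:x + radius + 1]
        best, best_d = float("inf"), 0
        for d in range(0, max_disp + 1):          # candidate correspondences
            if x - d - radius < 0:
                break
            cand = right[x - d - radius:x - d + radius + 1]
            score = sad(win, cand)
            if score < best:
                best, best_d = score, d
        disps.append(best_d)
    return disps

left = [0, 0, 9, 5, 0, 0, 0, 0]    # textured blip as seen in the key view
right = [9, 5, 0, 0, 0, 0, 0, 0]   # same signal shifted by 2 pixels
print(disparity_1d(left, right))   # → [0, 1, 2, 2, 0, 0]
```

The correct disparity 2 is recovered only where the window covers texture; the zeros at the borders and in the flat region are exactly the unreliable estimates that smoothness constraints and better scores are meant to fix.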
Multiview Depth Integration The set of depth images obtained from dense depth estimation needs to be combined in order to obtain a consistent final geometric model of the captured scene or object. If we assume redundancy in the depth information, potential outliers generated by the previous depth estimation procedure can be detected and removed at this point. A successful method for multiple depth map fusion is the volumetric range image integration approach [Curless and Levoy, 1996, Wheeler et al., 1998]. Chapter 8 describes our fast depth integration procedure. Alternatively, proper 3D models can be generated directly using voxel coloring methods (see Chapter 5).
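The volumetric idea can be sketched in one dimension: each depth measurement induces a truncated signed distance along the viewing ray, averaging fuses the measurements, and the zero crossing of the fused field recovers the surface. Grid size and truncation below are arbitrary choices, not the parameters of the cited methods.

```python
# 1D volumetric fusion in the spirit of [Curless and Levoy, 1996].

def fuse(depths, n_voxels=10, trunc=2.0):
    fused = []
    for v in range(n_voxels):          # voxel center at coordinate v + 0.5
        d = [max(-trunc, min(trunc, depth - (v + 0.5))) for depth in depths]
        fused.append(sum(d) / len(d))  # averaged truncated signed distance
    return fused

def zero_crossing(sdf):
    for v in range(len(sdf) - 1):
        if sdf[v] > 0 >= sdf[v + 1]:   # surface lies between these voxels
            # linear interpolation gives a sub-voxel surface estimate
            return v + 0.5 + sdf[v] / (sdf[v] - sdf[v + 1])
    return None

# Three noisy depth measurements of the same surface at depth ~5.0:
surface = zero_crossing(fuse([4.9, 5.0, 5.1]))
print(round(surface, 3))  # → 5.0
```

Averaging is what cancels the per-view noise, and a gross outlier would simply be outvoted by the truncated contributions of the consistent measurements.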
Geometry Processing Depending on the actual depth image integration method, the obtained 3D mesh may contain holes and may still appear somewhat noisy. Furthermore, the generated mesh is almost always over-tessellated and is not directly appropriate for further processing or visualization. Consequently, a final geometry processing step may include mesh simplification techniques and other mesh refinement and cleaning procedures. In particular, we apply a mesh simplification tool [Garland and Heckbert, 1997] to reduce the geometric complexity of the model.
Photorealistic Texturing The simplified and enhanced geometry of the imaged object still lacks an appropriate texture for photorealistic display within virtual scenes. Texture map generation for arbitrary 3D shapes requires cutting the original polygonal representation into several disk-like patches. Each of these patches has an associated texture coordinate mapping. In order to obtain few distortions and better visual quality, these patches should preferably be flat. Our implementation [Zebedin, 2005] combines the texture atlas generation procedure described in [Lévy et al., 2002] with robust multi-view texturing techniques in the presence of occlusions [Mayer et al., 2001, Bornik et al., 2001]. If a surface element is visible in several images (which is usually the case), unmodeled occlusions can be detected and removed using a robust color averaging method. Additionally, the orientation of a surface patch with respect to the source images and its projected footprint provide reliability information, which can be used to weight the color contributions from the source images.
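A miniature version of such robust, weighted color averaging is sketched below; the trimming rule, weights and sample values are our own illustration, not the procedure of the cited methods.

```python
# Robust multi-view texturing in miniature: each source view contributes a
# color sample with a reliability weight (e.g. from viewing angle and
# footprint); samples far from the median, typically caused by unmodeled
# occlusions, are discarded before the weighted mean.

def robust_color(samples, tol=30):
    """samples: list of (color, weight); drop samples far from the median."""
    colors = sorted(c for c, _ in samples)
    median = colors[len(colors) // 2]
    kept = [(c, w) for c, w in samples if abs(c - median) < tol]
    total = sum(w for _, w in kept)
    return sum(c * w for c, w in kept) / total

# Four views see gray ~100; one view is occluded and sees a dark object (10).
samples = [(98, 1.0), (101, 0.8), (100, 0.5), (103, 0.7), (10, 0.9)]
print(round(robust_color(samples), 2))  # → 100.3
```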
An Illustrative Example We illustrate various stages of this pipeline with a statue example in Figure 1.3. In addition to two (out of 47) input images, we show two dense depth estimation results based on a GPU-accelerated plane-sweep ((c) and (d)). These small-baseline reconstructions are still noisy and contain outliers. Volumetric depth image integration uses all available depth images to remove the artifacts and creates a suitable geometry representing the statue (images (e) and (f)). Finally, the decimated and textured mesh is illustrated ((g) and (h)).
After this coarse presentation of the modeling pipeline, we provide a more in-depth description of the various stages in the work-flow that are not directly related to this thesis.
1.4 Overview of this Thesis and Contributions
Chapter 2 presents work and publications related to this thesis. It is divided into two major sections: Section 2.1 presents important approaches and work focusing on dense depth estimation and computational stereo in general. From the vast number of publications in this field only a few seminal ones are briefly presented. Some of these form the basis of our procedures and are described in more detail in the appropriate chapters. Section 2.2 gives a general overview of GPU-accelerated approaches and algorithms that have appeared in recent years. Further, several research lines for real-time and GPU-based approaches to computational stereo and multi-view reconstruction are presented.
Our first computational stereo method accelerated by graphics hardware is described in Chapter 3. This dense stereo reconstruction procedure is essentially an iterative local mesh refinement method that generates a surface consistent with the given views. The main motivation for this approach is the fast projective texturing capability provided by graphics hardware since its beginnings. With the emergence of programmable GPUs, it became possible to calculate simple image dissimilarity functions on the GPU as well. CPU intervention
is necessary to update the current mesh hypothesis according to the determined best local modifications and to occasionally smooth the mesh. Since this approach works on meshes, it is the only method presented in this thesis making extensive use of vertex programs. The obtained software performs reconstructions at interactive or near real-time rates.
This chapter contains material from two publications ([Zach et al., 2003a] and [Zach et al., 2003b]).
Note that all other procedures presented in the following chapters are performed purely on the graphics hardware, with the CPU only executing the flow control for the GPU routines. Given the source images and the camera parameters and poses, the full reconstruction pipeline up to the final 3D model visualization runs entirely on the graphics hardware, and no expensive data transfer from GPU memory to main memory is necessary. Consequently, these methods are perfectly suited for fast visual feedback to a human operator.
Plane-sweep methods for depth estimation are still the most suitable approaches for efficient implementation on the GPU. So far, most algorithms presented in the literature require images with exactly the same lighting conditions, since very simple correlation measures like the sum of absolute differences (SAD) or the sum of squared differences (SSD) are utilized. In Chapter 4 we propose an approximated zero-mean normalized sum of absolute differences correlation function, which produces results similar to the widely used NCC function and can be calculated more efficiently on current-generation graphics hardware. Using GPU-based summed area tables (also known as integral images), the computation time for this image correlation measure is independent of the template window size. Furthermore, a sparse belief propagation method is proposed to obtain depth maps incorporating smoothness constraints. Material from this chapter can be found in [Zach et al., 2006a].
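The summed-area-table idea behind the window-size-independent correlation cost can be sketched in a few lines. The following CPU-side Python sketch is illustrative only (not the GPU implementation of Chapter 4); it shows why a window sum costs exactly four table lookups regardless of the window radius.

```python
def integral_image(img):
    """Summed area table: I[y][x] = sum of img[0..y-1][0..x-1]."""
    h, w = len(img), len(img[0])
    I = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            I[y + 1][x + 1] = I[y][x + 1] + row_sum
    return I

def window_sum(I, x, y, r):
    """Sum over the (2r+1)x(2r+1) window centred at (x, y), clipped to the
    image; four lookups, hence O(1) per query independent of r."""
    h, w = len(I) - 1, len(I[0]) - 1
    x0, y0 = max(x - r, 0), max(y - r, 0)
    x1, y1 = min(x + r + 1, w), min(y + r + 1, h)
    return I[y1][x1] - I[y0][x1] - I[y1][x0] + I[y0][x0]
```

Aggregating a per-pixel dissimilarity (e.g. absolute differences) then amounts to building one integral image per disparity hypothesis and querying `window_sum` at every pixel.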
Chapter 5 describes how a voxel-coloring technique can be executed entirely on graphics hardware by combining plane-sweep approaches with correct visibility handling. Thus, 3D volumetric models from many images can be obtained at interactive rates. Additionally, several voxel-coloring passes can be applied in orthogonal directions to obtain true 3D models from a complete sequence around the object of interest. However, this particular space carving technique on the GPU requires a 3D volume texture to be stored in video memory, thereby limiting the resolution of the voxel space.
A very fast variational approach to depth estimation is presented in Chapter 6. At first glance it seems unlikely that graphics hardware can accelerate the numerical calculations required to solve the partial differential equations derived from variational formulations of depth estimation. However, it turns out that the current programming features of GPUs substantially decrease the run-time of iterative PDE solvers on regular grids. Variational depth estimation methods can provide very high quality models, but they are very sensitive to parameter settings and to the initial depth hypothesis in general, hence immediate feedback is very useful to a human operator.
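To illustrate why such iterative solvers map well to GPUs, the following Python sketch shows one Jacobi relaxation step for a heavily simplified quadratic (membrane-like) regularization of a depth map. The functional and discretization actually used in Chapter 6 differ; the point here is that every grid cell is updated independently of all others, so each update can run as one fragment per cell.

```python
def jacobi_step(u, f, lam):
    """One Jacobi step for the discretised Euler-Lagrange equation of
    E(u) = sum (u - f)^2 + lam * sum over 4-neighbour edges (u_p - u_q)^2,
    i.e. the fixed point  u = (f + lam * sum(nbrs)) / (1 + lam * n).
    All cell updates are independent, hence data-parallel."""
    h, w = len(u), len(u[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nbrs = []
            if y > 0: nbrs.append(u[y - 1][x])
            if y < h - 1: nbrs.append(u[y + 1][x])
            if x > 0: nbrs.append(u[y][x - 1])
            if x < w - 1: nbrs.append(u[y][x + 1])
            out[y][x] = (f[y][x] + lam * sum(nbrs)) / (1.0 + lam * len(nbrs))
    return out
```

Repeated application drives the grid towards the minimizer of the quadratic energy; a constant field equal to the data term is already a fixed point.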
The most versatile method for dense depth estimation that can be performed entirely by the GPU is scanline optimization, as described in Chapter 7. Conceptually, the technique described in this chapter extends the plane-sweep method from Chapter 4 with a semi-global depth extraction technique. The key innovation in this chapter is the formulation of a specific dynamic programming approach to depth estimation in a manner suitable for the programming model of GPUs. Although the time complexity after the transformation is O(N log N) instead of O(N), the observed timing results are promising. The core method from this chapter is presented in [Zach et al., 2006b].
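The extra logarithmic factor can be seen on a toy recurrence: a sequential prefix minimum needs O(N) work but is inherently serial, while a recursive-doubling version needs log2(N) data-parallel passes of O(N) work each, an access pattern a fragment program can express. The sketch below is illustrative only; the actual recurrence of Chapter 7 carries matching and smoothness costs rather than plain minima.

```python
def prefix_min_sequential(costs):
    """O(N) sequential scan: out[i] = min(costs[0..i])."""
    out, best = [], float('inf')
    for c in costs:
        best = min(best, c)
        out.append(best)
    return out

def prefix_min_doubling(costs):
    """Recursive doubling: pass k combines element i with element i - 2**k;
    log2(N) passes of O(N) work, each pass fully data-parallel."""
    out = list(costs)
    step = 1
    while step < len(out):
        out = [min(out[i], out[i - step]) if i >= step else out[i]
               for i in range(len(out))]
        step *= 2
    return out
```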
The final algorithmic contribution of this thesis, discussed in Chapter 8, is a volumetric approach to generate proper 3D models from multiple depth maps at interactive rates. The final 3D model is represented implicitly as an isosurface in a scalar volume dataset, and the corresponding mesh geometry can be extracted using marching cubes or marching tetrahedra methods. Alternatively, the isosurface can be visualized directly from the volume data using recent methods of volume visualization. A condensed version of this chapter appeared in [Zach et al., 2006a].
Chapter 9 presents several multi-view datasets and the associated depth maps and models generated with the proposed methods. In the few cases where ground truth is available, a quantitative accuracy evaluation is provided as well.
Figure 1.3: Several steps in the reconstruction pipeline illustrated with a statue example. (a) and (b) are two source images out of 47 images in total. The result of GPU-based dense depth estimation for two views is shown in (c) and (d). Two views of the resulting mesh after volumetric depth image integration are given in (e) and (f). The final simplified and textured 3D geometry of the statue is displayed in (g) and (h).
Chapter 2
Related Work
Contents
2.1 Dense Depth and Model Estimation
2.2 GPU-based 3D Model Computation
2.1 Dense Depth and Model Estimation
There is a huge bibliography on the generation of depth images and dense geometry from multiple views, hence we focus on seminal work in this field. We divide the approaches to computational stereo into three subtopics for better structure: first, important publications dealing with the classical stereo setup consisting of two images with vertically aligned epipolar geometry are discussed. Subsequently, major approaches to depth estimation from multiple, not necessarily rectified, images are presented. Finally, true multi-view methods generating a 3D model (and not just depth images) directly are briefly sketched.
Note that computational stereo and depth estimation can be seen as a subtopic of the more general optical flow computation between images. The main difference between the former and optical flow is the reduced (one-dimensional) search space for stereo methods, since knowledge of the epipolar geometry is assumed. In order to obtain metric models the internal camera parameters are required to be known, too.
2.1.1 Computational Stereo on Rectified Images
The minimal requirement to obtain a depth map, or equivalently a 2.5D height field, solely from images is a pair of input images with a typically convergent view of the scene to be reconstructed. Many methods generating depth maps from such input data work on rectified images with aligned epipolar geometry, mostly for efficiency reasons, since vertically aligned epipolar lines allow efficient image dissimilarity calculations and the reuse of already computed values. Recent surveys of computational stereo methods are given in [Scharstein and Szeliski, 2002], [Faugeras et al., 2002] and [Brown et al., 2003]. Additionally, in [Scharstein and Szeliski, 2002] an evaluation framework is proposed, which is still widely used to compare stereo methods in terms of their ability to recover the true geometry.
Many depth estimation methods typically perform the following four subsequent steps to constitute a depth map (after [Scharstein and Szeliski, 2002]):
1. matching cost (i.e. image dissimilarity score) computation;
2. an aggregation procedure to accumulate the matching costs within some region;
3. depth map extraction;
4. an optional refinement of the depth map.
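As a concrete (if minimal) instance of steps 1-3, the following Python sketch runs a purely local method on a single rectified scanline pair. The SAD cost, box aggregation, and window radius are illustrative choices, and the refinement step is omitted.

```python
def wta_disparity(left, right, max_disp, radius=1):
    """Minimal local stereo on one rectified scanline pair:
    1. matching cost: absolute difference per (disparity, pixel);
    2. aggregation: box sum over a window along the scanline;
    3. extraction: winner-takes-all (lowest aggregated cost)."""
    n = len(left)
    big = float('inf')
    # step 1: cost[d][x] = |left[x] - right[x - d]|, invalid samples -> inf
    cost = [[abs(left[x] - right[x - d]) if x - d >= 0 else big
             for x in range(n)] for d in range(max_disp + 1)]
    # step 2: aggregate costs over a window of width 2 * radius + 1
    agg = [[sum(cost[d][max(x - radius, 0):x + radius + 1])
            for x in range(n)] for d in range(max_disp + 1)]
    # step 3: winner-takes-all per pixel
    return [min(range(max_disp + 1), key=lambda d: agg[d][x])
            for x in range(n)]
```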
Often, the first two steps cannot be separated, e.g. if the utilized matching score is already based on some measure involving pixel neighborhoods. The major difference between the various computational stereo approaches lies in the method of depth map extraction given the matching cost data structure. Purely local methods apply a greedy winner-takes-all approach, which assigns the depth value with the lowest matching cost to each pixel. Global methods for depth map extraction apply an optimization procedure, which takes matching scores and spatial smoothness of the depth map into account. Smoothness is typically modeled by a regularization function, which takes the depth values assigned to adjacent pixels as input and yields a (positive) penalty value for unequal depths. If smoothness of the depth map is enforced only along vertical scanlines (which coincide with the epipolar lines), very efficient and elegant algorithms based on the dynamic programming principle can be devised. Earlier work includes [Baker and Binford, 1981, Ohta and Kanade, 1985, Geiger et al., 1995, Birchfield and Tomasi, 1998]. Although dynamic programming approaches to stereo have been known for a long time, there is still ongoing research on this topic [Veksler, 2003, Criminisi et al., 2005, Hirschmüller, 2005, Hirschmüller, 2006, Lei et al., 2006]. A more detailed discussion of one employed dynamic programming approach to stereo and its GPU-based implementation is provided in Chapter 7.
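The principle behind these scanline methods can be condensed into a short dynamic program. The sketch below uses a generic pairwise smoothness penalty and is not the particular formulation of Chapter 7; it also takes O(n·m²) time for n pixels and m disparities, where specialized recurrences are faster.

```python
def dp_scanline(cost, penalty):
    """Dynamic programming along one scanline.
    cost[x][d]: matching cost of disparity d at pixel x.
    penalty(d1, d2): smoothness cost for disparities at adjacent pixels.
    Returns the disparity sequence minimising the total energy."""
    n, m = len(cost), len(cost[0])
    acc = [list(cost[0])]   # accumulated cost table
    back = [[0] * m]        # backtracking pointers
    for x in range(1, n):
        acc_x, back_x = [], []
        for d in range(m):
            prev = min(range(m), key=lambda p: acc[-1][p] + penalty(p, d))
            acc_x.append(cost[x][d] + acc[-1][prev] + penalty(prev, d))
            back_x.append(prev)
        acc.append(acc_x)
        back.append(back_x)
    # backtrack from the best final disparity
    d = min(range(m), key=lambda dd: acc[-1][dd])
    path = [d]
    for x in range(n - 1, 0, -1):
        d = back[x][d]
        path.append(d)
    path.reverse()
    return path
```

With a non-zero penalty the optimizer trades a slightly worse matching cost for a smooth disparity profile, which is exactly the effect the regularization function is meant to produce.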
More recently, many proposed global methods for stereo focus on enforcing smoothness in both directions, not just within the same scanline. Since finding the true global optimum is not feasible, various approximation schemes have been presented in the literature. Broadly, two lines of global optimization procedures have been applied successfully to stereo problems: maximum network flow methods (usually called graph-cut approaches in the computer vision literature [Boykov et al., 2001, Kolmogorov and Zabih, 2001, Kolmogorov and Zabih, 2002]), and Markov random field methods based on iterative belief updating (belief propagation [Sun et al., 2003, Felzenszwalb and Huttenlocher, 2004, Sun et al., 2005]). Although the depth maps obtained from these advanced procedures are generally better than those generated by dynamic programming methods, their time and space complexities are substantially higher than those of 1-dimensional optimization procedures.
Graph-cut methods are iterative procedures that update the current labeling (i.e. depth values in the stereo case) of pixels to obtain a lower total energy value. The initial depth labeling can be computed e.g. by purely local stereo methods. In every iteration a greedy, but large∗ relabeling of pixels is determined, which yields the lowest total energy. A suitable graph network is built in every iteration, and the maximum flow solution corresponds to an optimal greedy relabeling. These iterations are repeated until a (strong) local minimum is reached.
While dynamic programming, belief propagation and graph-cut approaches to computational stereo treat the underlying energy minimization problem as a combinatorial problem with a discrete set of pixels and disparity labels, it is nevertheless possible to employ variational methods, developed to solve problems on a continuous domain, for stereo vision. Since many of the proposed variational approaches for multi-view reconstruction are typically formulated for a general multiple view setup, these methods are discussed below in Section 2.1.2.
The depth maps returned by any of the above-mentioned methods may still contain wrong depth values for certain pixels, e.g. due to occlusions, specular reflections, etc. These mismatches can potentially be detected by a very simple left-right consistency check [Fua, 1993] (also called bidirectional matching or back-matching). This technique reverses the roles of the input images and generates two depth maps (one with respect to the first image and one with respect to the second image). Only depth values for pixels which agree in both depth maps (according to some metric) are retained.
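The check itself is trivial to state. The sketch below applies it to two one-dimensional disparity maps; the sign convention (left pixel x maps to right pixel x - d) and the tolerance are illustrative choices.

```python
def lr_consistency(disp_left, disp_right, tol=1):
    """Left-right check: pixel x with disparity d in the left map should
    land on right pixel x - d, whose disparity should agree within tol.
    Disagreeing pixels (occlusions, mismatches) are set to None."""
    out = []
    for x, d in enumerate(disp_left):
        xr = x - d
        if 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= tol:
            out.append(d)
        else:
            out.append(None)
    return out
```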
2.1.2 Multi-View Depth Estimation
In this section we summarize work on dense depth estimation from multiple, but usually still small-baseline, views. In general, more than two views cannot be rectified in order to simplify and accelerate the depth estimation procedure. Since small baselines between the images are assumed, explicit or implicit occlusion detection and handling strategies are possible. Implicit occlusion handling approaches typically use truncated matching scores or multiple scores between pairs of images to reduce the influence of occluded pixels in the estimation procedure (e.g. [Woetzel and Koch, 2004] and Chapter 4).
Several approaches developed for a multi-view setup utilize variational methods to search for a 3D surface or depth map color-consistent with the provided input images. A hypothetical surface or depth map (together with the known epipolar geometry between the views) induces a (nonlinear) 2D transfer between the images. If the correct depth map is found, all warped source images are very similar according to a provided image similarity metric. Additionally, surface smoothness is assumed if the image data is ambiguous (i.e. lacking sufficient texture). Variational approaches to multi-view stereo formulate the reconstruction problem as a continuous energy optimization task and apply methods from the variational calculus (most notably the Euler-Lagrange equation) to determine a suitable gradient descent direction in function space. The current mesh (or depth map hypothesis) is updated according to this direction until convergence. All variational methods for stereo employ a coarse-to-fine strategy to avoid reaching a weak local minimum in early stages of the procedure.
∗ Meaning that the subset of pixels with a newly assigned label is as large as possible.
If a surface is evolved within a variational framework to obtain a final mesh consistent with the images, an implicit level-set representation of the current mesh hypothesis allows simple handling of topological changes of the mesh [Faugeras and Keriven, 1998, Yezzi and Soatto, 2003, Pons et al., 2005]. Generating depth images instead of meshes from multiple views within a continuous framework yields a set of partial differential equations, which are numerically solved to obtain the final depth map [Strecha and Van Gool, 2002, Strecha et al., 2003, Slesareva et al., 2005]. Chapter 6 describes depth estimation using variational principles in more detail and presents an efficient GPU-based implementation of one particular approach.
Combinatorial and graph optimization methods can be applied in the multi-view stereo case as well: Kolmogorov et al. [Kolmogorov and Zabih, 2002, Kolmogorov et al., 2003] employ graph-cut optimization to obtain a depth map from multiple views. In addition to image similarity and smoothness terms, the energy function is augmented with an explicit visibility term derived from the current depth map.
2.1.3 Direct 3D Model Reconstruction
This section outlines several approaches for multi-view reconstruction targeted at using all available images from different viewpoints simultaneously. Early methods include space carving and its variants, which project 3D voxels into the available images according to the current visibility, and an image consistency score is calculated from the sampled pixels. If a voxel is declared inconsistent, it is classified as empty and the current model and visibility information are updated. The variants of the basic space carving principle mostly differ in their employed consistency function and the voxel traversal order [Seitz and Dyer, 1997, Prock and Dyer, 1998, Seitz and Dyer, 1999, Culbertson et al., 1999, Kutulakos and Seitz, 2000, Slabaugh et al., 2001, Sainz et al., 2002, Stevens et al., 2002] (see also Chapter 5). All space carving methods compute the so-called photo hull (the set of image-consistent voxels), which typically contains the true geometry, but in practice the photo hull can be a substantial over-estimate of the true model. Textureless regions in particular yield poor photo hulls because of the absence of a smoothing force.
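The consistency decision at the heart of these methods can be sketched as follows, with projection and visibility abstracted into a sampling function and a simple variance threshold standing in for the various consistency measures in the literature.

```python
def photo_consistent(samples, threshold):
    """Photo-consistency test for one voxel: 'samples' are the colour
    values the voxel projects to in the images where it is visible.
    The voxel is kept if the colour variance is below the threshold,
    otherwise it is carved away."""
    if not samples:
        return True  # unobserved voxels cannot be carved
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return var <= threshold

def carve(voxels, sample_fn, threshold):
    """One carving sweep: keep only photo-consistent voxels.
    sample_fn(v) returns the projected colour samples of voxel v
    (projection and visibility handling are abstracted away here)."""
    return {v for v in voxels if photo_consistent(sample_fn(v), threshold)}
```

In a full implementation, carving a voxel changes the visibility of the voxels behind it, so the sweep is repeated (in a suitable traversal order) until no voxel is removed.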
In order to address the shortcomings of pure space carving methods, with their instantaneous classification of voxels, volumetric graph-cut extraction of surface voxels incorporating image consistency and smoothness constraints was recently proposed [Vogiatzis et al., 2005, Tran and Davis, 2006, Hornung and Kobbelt, 2006b, Hornung and Kobbelt, 2006a]. Since individual voxels essentially correspond to nodes in the network graph used to determine the maximum flow, these methods still rely on existing object silhouettes in order to consider only voxels close to the visual hull. Additionally, approximate visibility is inferred from the visual hull to determine occluded views for each voxel.
Instead of a direct, one-pass reconstruction approach from multiple views, one can utilize a two-pass method, which at first generates a set of depth images from small-baseline subsets of the provided source views, and subsequently creates a full 3D model by merging the depth maps. Goesele et al. [Goesele et al., 2006] employ a simple plane-sweep based depth estimation approach followed by a volumetric range image integration procedure [Curless and Levoy, 1996] to obtain the final 3D model. Only relatively confident depth values are retained in the depth maps, hence the final model may still contain holes, e.g. in textureless regions. Additionally, the range image integration is based on weighted depth values with the weights induced from the corresponding matching score. This approach is very similar to our purely GPU-based reconstruction pipeline comprising the methods presented in Chapter 4 and Chapter 8 (see also [Zach et al., 2006a]). In contrast to volumetric graph-cut methods, which generate watertight surfaces, the result of the purely locally working volumetric range image method may contain holes, which can be filled geometrically, e.g. using volumetric diffusion processes [Davis et al., 2002].
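The merging step can be sketched as a cumulative weighted average of signed distances along a line of sight, in the spirit of [Curless and Levoy, 1996]; the one-dimensional layout and the incremental update rule below are simplifications for illustration.

```python
def integrate_depth_maps(n_voxels, observations):
    """Weighted volumetric integration of several range images along one
    line of sight.  observations: list of (signed_distance_per_voxel,
    weight_per_voxel) pairs, one per depth map.  The merged surface lies
    at the zero crossing of the averaged signed distance field."""
    dist = [0.0] * n_voxels
    weight = [0.0] * n_voxels
    for d, w in observations:
        for i in range(n_voxels):
            if w[i] > 0:
                # running weighted mean:  D <- (W*D + w*d) / (W + w)
                dist[i] = (weight[i] * dist[i] + w[i] * d[i]) \
                          / (weight[i] + w[i])
                weight[i] += w[i]
    return dist, weight
```

Per-view confidences enter directly as the weights, so uncertain depth samples pull the averaged distance field (and hence the extracted isosurface) only weakly.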
2.2 GPU-based 3D Model Computation
2.2.1 General Purpose Computations on the GPU
Because of the rapid development and performance increase of current 3D graphics hardware, the goal of using graphics processing units for non-graphical purposes became appealing. The SIMD design of graphics hardware allows much higher peak performance in certain applications than is achievable with a general purpose CPU. Whereas a traditional CPU like a 3 GHz Pentium 4 achieves a theoretical performance of 6 GFlops and a memory bandwidth of about 6 GByte/sec, a high-end graphics card such as an NVidia GeForce 6800 achieves 53 GFlops at 34 GByte/sec [Harris and Luebke, 2005]. Furthermore, the annual increase in performance for graphics processing units is significantly higher than for CPUs. In contrast to the MIMD programming model of traditional processing units, the computational model for GPUs is a stream processing approach applying the same instructions to multiple data items. Consequently, existing CPU-based algorithms must be mapped onto this computational model, and not every algorithm can benefit from the processing power of the GPU.
Since the emergence of programmable graphics hardware in the year 2001, a huge number of research papers has addressed the acceleration of known algorithms and numerical methods using the GPU as a specialized, but fast coprocessor. In this section we only refer to seminal work in this area.
At first we give a brief overview of the computational model of GPU-based computations (Figure 2.1). The incoming vertex stream with several attributes per vertex (vertex position, color, texture coordinates) is processed by a vertex program and transformed into normalized screen space. A set of three vertices constitutes a triangle, which is prepared for the rasterization step. The rasterizer generates fragments and interpolates vertex attributes. An optional fragment program takes the incoming fragments and may perform additional calculations, thereby modifying the outgoing fragment color and depth. The blending stage performs optional alpha blending and combines several fragment samples into one pixel if multi-sampling based antialiasing is enabled. Fragment programs, and recently vertex programs as well, can perform texture lookups to retrieve arbitrary image data.
[Diagram: vertex stream → vertex program → triangle assembly/clipping → rasterization → fragment program → blending → framebuffer; the vertex and fragment programs have access to texture data.]
Figure 2.1: The stream computation model of a GPU (adapted from [Harris and Luebke, 2005]).
Most applications using the GPU as a general purpose SIMD processor employ the fragment shaders to perform computational tasks, since most of the processing power of modern graphics hardware is concentrated in the fragment units. Additionally, direct and dependent texture lookups provided by fragment shaders constitute a powerful instrument for data array access. Consequently, general purpose computing on the GPU focuses on the second row of the pipeline depicted in Figure 2.1 (notably fragment programs and blending). Textures act as data array sources, on which the same set of instructions is applied. The resulting fragments represent the calculated outcome of these computations. Hence, in most applications a screen-aligned quadrilateral with appropriate texture coordinates is drawn and the requested computation is performed entirely in the fragment processing units.
Vertex and fragment programs are specified in an assembly-like language in the first instance. Several higher-level specification languages for vertex and fragment programs were developed to ease the development of GPU programs. A commonly used language for visual effects and general purpose programming on the GPU is Cg [NVidia Corporation, 2002a, Mark et al., 2003], which provides a C-like specification language for GPU programs and a compiler for translation to the native instruction set of graphics hardware. Brook is a language designed specifically for parallel numerical algorithms [Dally et al., 2003], and an implementation is now available for current programmable graphics hardware [Buck et al., 2004]. The two main concepts of Brook (and parallel numerical approaches in general) are kernels and reductions. A kernel is a procedure applied to a large set of data items and represents a more powerful version of a SIMD instruction. Since the computation of a kernel depends only on the incoming data and a kernel has no additional side-effects, a kernel can be executed for many data values in parallel. Application of a kernel is similar to the higher-order map function found in most functional programming languages. A reduction operation combines the elements of a data array to generate a single result. In functional programming this operation corresponds to the (again higher-order) fold function. On graphics hardware kernels correspond mainly to fragment programs and can be applied in a straightforward manner. Reductions require a rather expensive multipass procedure based on recursive doubling with a logarithmic number of passes.
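A CPU sketch of the two concepts (a kernel as a map, a reduction by repeated halving) might look as follows; on the GPU each iteration of the while-loop would be one render pass over a half-sized buffer, which is the source of the logarithmic pass count.

```python
def apply_kernel(data, kernel):
    """A kernel is applied independently to every data item (a 'map');
    on the GPU this is one pass of a fragment program over a texture."""
    return [kernel(x) for x in data]

def gpu_style_reduce(data, combine):
    """Reduction ('fold' with an associative operator): each pass combines
    element i with element i + half, halving the active range, so only
    log2(N) passes are needed in total."""
    buf = list(data)
    n = len(buf)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):
            buf[i] = combine(buf[i], buf[i + half])
        n = half
    return buf[0]
```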
Because of the close relationship between the computational model of modern GPUs and general stream processing concepts, similar benefits and limitations for algorithm implementations can be found in both models. Nevertheless, there are significant differences between general stream processors and graphics hardware: in contrast to general parallel programming and stream computation models, a GPU provides only very limited support for scatter operations (i.e. indexed array updates) and other general purpose operations (e.g. bit-wise integer manipulation). On the other hand, linearly filtered data access is performed very efficiently by the GPU, since this is an intrinsic feature of texture units. In spite of these (and many other) differences between stream processing models and modern GPUs, essentially the same set of algorithms can be accelerated by both architectures.
Even before programmable graphics hardware was available, the fixed-function pipeline of 3D graphics processors was utilized to accelerate several numerical [Hopf and Ertl, 1999a, Hopf and Ertl, 1999b] and geometric calculations [Hoff III et al., 1999, Krishnan et al., 2002] and even to emulate programmable shading not available at that time [Peercy et al., 2000]. The introduction of a quite general programming model for vertex and pixel processing [Lindholm et al., 2001, Proudfoot et al., 2001] opened a very active research area. The primary application of programmable vertex and fragment processing is the enhancement of photorealism and visual quality in interactive visualization systems (e.g. [Engel et al., 2001, Hadwiger et al., 2001]) and entertainment applications ([Mitchell, 2002, NVidia Corporation, 2002b]). Additionally, several non-photorealistic rendering techniques can be implemented effectively on modern graphics hardware [Lu et al., 2002, Mitchell et al., 2002, Weiskopf et al., 2002, Dominé et al., 2002].
Thompson et al. [Thompson et al., 2002] implemented several non-graphical algorithms to run on programmable graphics hardware and profiled their execution times against CPU-based implementations. They concluded that an efficient memory interface (especially when transferring data from graphics memory into main memory) is still an unsolved issue. For the same reason our implementations are designed to minimize the memory traffic between graphics hardware and main memory.
Naturally, the texture handling capability, and especially the free bilinear and accelerated anisotropic texture fetch operations, make graphics hardware suitable for image processing tasks, e.g. filtering with linear kernels. Sugita et al. [Sugita et al., 2003] and Colantoni et al. [Colantoni et al., 2003] compared the performance of CPU-based and GPU-based implementations of several image filters and image transforms, and observed substantial performance gains using the GPU over optimized CPU implementations.
Numerical methods and simulations became feasible on the GPU with the emergence of floating point texture capabilities, which enable the specification and handling of floating point values on the GPU (instead of the 8 bit fixed point precision provided before). Numerical solvers for sparse matrix equations were proposed by Bolz et al. [Bolz et al., 2003] and by Krüger and Westermann [Krüger and Westermann, 2003]. Note that the system matrices appearing in variational methods for optical flow and depth estimation are huge, but sparse matrices with usually 4 or 8 off-diagonal bands. Consequently, variational methods exploiting the computational power of modern GPUs are now feasible and outperform CPU-based implementations substantially. Of course, the limited floating point precision of current GPUs (essentially an IEEE 32 bit float format) is an obstacle to high precision numerical computations. Actual numerical or physical simulations are described in [Harris et al., 2002, Kim and Lin, 2003, Lefohn et al., 2003, Goodnight et al., 2003, Moreland and Angel, 2003].
2.2.2 Real-time and GPU-Accelerated Dense Reconstruction from Multiple Images
In this section we focus on multi-view reconstruction methods that are either aimed at real-time execution or use programmable 3D graphics hardware to accelerate the depth estimation procedure.
Vision-based dense depth estimation methods performing at interactive rates or even in real-time were initially implemented using special hardware and digital signal processors [Faugeras et al., 1996, Kanade et al., 1996, Konolige, 1997, Woodfill and Herzen, 1997, Jia et al., 2003, Darabiha et al., 2003]. With the appearance of SIMD instruction sets like MMX and SSE, primarily intended for multimedia applications on general purpose CPUs, several implementations targeted the efficient use of these extensions for computational stereo [Mühlmann et al., 2002, Mulligan et al., 2002, Forstmann et al., 2004]. The basic ideas of high-performance CPU depth estimation methods include a cache-friendly design of the algorithm to minimize CPU pipeline stalls, and exploiting the SIMD functionality, e.g. by rating four disparity values simultaneously. All these approaches usually work with very simple image similarity measures like the SSD or SAD.
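To make the two similarity measures concrete, here is a minimal sketch (not taken from any of the cited implementations) of the sum of absolute differences (SAD) and sum of squared differences (SSD) over two equally sized gray-scale windows, flattened to lists:

```python
def sad(window_a, window_b):
    """Sum of absolute differences between two equally sized intensity windows."""
    return sum(abs(a - b) for a, b in zip(window_a, window_b))

def ssd(window_a, window_b):
    """Sum of squared differences between two equally sized intensity windows."""
    return sum((a - b) ** 2 for a, b in zip(window_a, window_b))

# Example: two 2x2 windows, flattened row by row.
left  = [10, 12, 11, 13]
right = [11, 12, 13, 10]
print(sad(left, right))  # 1 + 0 + 2 + 3 = 6
print(ssd(left, right))  # 1 + 0 + 4 + 9 = 14
```

SIMD implementations evaluate several such costs in parallel, e.g. four disparities per instruction with MMX/SSE.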
The Triclops vision system [Point Grey Research Inc., 2005] is a commercially available real-time stereo implementation. Typically the setup consists of two or three cameras and appropriate software for real-time stereo matching. Depending on the image resolution and the disparity range, the system is able to generate depth images at a rate of about 30 Hz for images of 320×240 pixels on current PC hardware. The software exploits the particular L-shaped arrangement of the cameras and the MMX/SSE instructions available on current CPUs.
Probably the first multi-view depth estimation approach executed on programmable graphics hardware was presented by Yang et al. [Yang et al., 2002], who developed a fast stereo reconstruction method performed on 3D hardware utilizing a plane-sweep approach to find correct depth values. The proposed method uses the projective texturing capabilities of 3D graphics hardware to project the given images onto the reference plane. Further, single-pixel error accumulation for all given views is performed on the GPU as well. The number of iterations is linear in the requested resolution of depth values; therefore this method is limited to rather coarse depth estimation in order to fulfill the real-time requirements of their video conferencing application. Further, their approach requires a true multi-camera setup to be robust, since the error function is only aggregated in single-pixel windows. Since the application behind this method is a multi-camera teleconferencing system, accuracy is less important than real-time behavior. In later work the method was made more robust using trilinear texture access to accumulate error differences within a window [Yang and Pollefeys, 2003]. The developed ideas were reused and improved to obtain a GPU-based dense matching procedure for a rectified stereo setup [Yang et al., 2004].
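The winner-takes-all principle behind such sweep methods can be illustrated with a toy CPU sketch on 1-D image rows (this is only a stand-in for the actual GPU plane sweep; the shift convention `x + d` and all names are assumptions of this example):

```python
def sweep_disparity(key, sensor, disparities):
    """Winner-takes-all disparity per pixel using a single-pixel absolute
    difference: for each candidate disparity d, sensor pixel x + d is
    compared against key pixel x, and the cheapest d wins at each pixel."""
    best = [0] * len(key)
    best_cost = [float("inf")] * len(key)
    for d in disparities:
        for x in range(len(key)):
            if 0 <= x + d < len(sensor):
                cost = abs(key[x] - sensor[x + d])
                if cost < best_cost[x]:
                    best_cost[x], best[x] = cost, d
    return best

# The sensor row equals the key row shifted by two pixels:
key    = [0, 0, 9, 0, 0, 0]
sensor = [0, 0, 0, 0, 9, 0]
print(sweep_disparity(key, sensor, range(4))[2])  # → 2
```

Aggregating the cost over a window instead of a single pixel, as in [Yang and Pollefeys, 2003], makes the selection far more robust.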
The basic GPU-based plane-sweep technique for depth estimation can be enhanced with implicit occlusion handling and smoothness constraints to obtain depth maps of higher quality. Woetzel and Koch [Woetzel and Koch, 2004] addressed occlusions occurring in the source images by a best n out of m and by a best half-sequence multi-view selection policy to limit the impact of occlusions on the resulting depth map. In order to obtain sharper depth discontinuities a shiftable correlation window approach was utilized. The employed image similarity measure is a truncated sum of squared differences, which is sensitive to changing lighting conditions.
Cornelis and Van Gool [Cornelis and Van Gool, 2005] proposed several refinement steps performed after a plane-sweep procedure used to obtain an initial depth map with a single-pixel truncated SSD correlation measure. Outliers in the initially obtained depth map are removed by a modified median filtering procedure, which may destroy fine 3D structures. These fine details are recovered by a subsequent depth refinement pass. Since this approach is based on single-pixel similarity instead of a window-based one, slanted surfaces and depth discontinuities are reconstructed more accurately compared with window-based approaches.
Typically, the correlation windows used in real-time dense matching have a fixed size, which causes inaccuracies close to depth discontinuities. Since large depth changes are often accompanied by color or intensity changes in the corresponding image, adapting the correlation window to extracted edges is a reasonable approach. Gong and Yang [Gong and Yang, 2005a] investigated a GPU-based computational stereo procedure with an additional color segmentation step to increase the quality of the depth map near object borders.
A GPU-based plane-sweeping technique suitable for sparse 3D reconstructions was presented by Rodrigues and Fernandes [Rodrigues and Ramires Fernandes, 2004]. They used projective texturing hardware to map rays going through interest points into the other views according to the epipolar geometry. In contrast to the dense depth plane-sweeping methods, a true multi-view configuration of the cameras can be used. The result of the procedure is a sparse 3D point cloud corresponding to 2D interest points seen in several input images.
For several applications, e.g. video teleconferencing and mixed reality, it is sufficient to reconstruct the visual hull, which is the intersection of the generalized cones generated by the silhouette of the object and the optical center of each camera. Even with the non-programmable traditional graphics pipeline, real-time generation and rendering of visual hulls can be accelerated by 3D graphics hardware. Lok [Lok, 2001], Matusik et al. [Matusik et al., 2001] and Li et al. [Li et al., 2003] present on-line visual hull reconstruction systems mostly aimed at video conferencing and mixed reality applications. In order to improve the visual quality of the reconstructed models, the visual hull can be augmented with depth information generated by computational stereo algorithms [Slabaugh et al., 2002, Li et al., 2002].
Li et al. [Li et al., 2004] present a method for GPU-based photo hull generation used for viewpoint interpolation, which is in some aspects similar to the material presented in Chapter 5. Essentially, their work combines the plane-sweep approach proposed by Yang [Yang et al., 2002] with the visibility handling used in the space carving framework [Seitz and Dyer, 1997, Kutulakos and Seitz, 2000]. In contrast to our approach, only depth maps suitable for view interpolation are generated, whereas our approach creates proper 3D models as obtained by other voxel coloring and space carving techniques.
Recently, Gong and Yang [Gong and Yang, 2005b] implemented a dynamic programming approach to computational stereo with a simple discontinuity cost model on the GPU and achieved at least interactive rates. In contrast to the other GPU-based depth estimation methods, this approach belongs to the category of global matching procedures (as opposed to the winner-takes-all local methods). Although their framework can be implemented entirely on the GPU, they report higher performance using a hybrid CPU/GPU approach, in which the dynamic programming step is performed on the CPU. Currently, GPU-based global methods for disparity assignment are slowly emerging in the literature. Dixit et al. [Dixit et al., 2005] present a GPU implementation of a graph cut optimization method called GPU-cut used for image segmentation. Since graph cut based approaches to computational stereo are highly successful, further investigations of GPU-cut for dense stereo are expected.
Mairal and Keriven [Mairal and Keriven, 2006] propose a GPU-based variational stereo framework, which iteratively refines a 3D mesh hypothesis until convergence. The basic framework and goals are similar to our system presented in Chapter 3. A variational multi-view approach for 3D reconstruction using graphics hardware is proposed by Labatut et al. [Labatut et al., 2006], which uses a level-set approach to deform an initial mesh to match the image similarity constraint. The authors reported a performance speedup by a factor of approximately four compared with their CPU implementation. The overall time required to obtain the final model using a 128³ volumetric grid is about 5 to 7 minutes depending on the data set.
Loopy belief propagation with its inherently parallel message update scheme is ostensibly an ideal candidate for GPU-based methods: Brunton and Shu [Brunton and Shu, 2006] and Yang et al. [Yang et al., 2006] describe implementations utilizing the GPU. A first disadvantage of belief propagation is its huge memory consumption for large images and depth resolutions, requiring either a limited depth range [Brunton and Shu, 2006] or a limited image resolution [Yang et al., 2006]. Additionally, the purely parallel (synchronous) message update feasible on the GPU converges more slowly than the sequential update available on the CPU [Tappen and Freeman, 2003].
Chapter 3

Mesh-based Stereo Reconstruction Using Graphics Hardware

3.1 Introduction
This chapter describes a computational stereo method generating a 2.5D height-field, represented as a triangular mesh, from a pair of images with known relative pose. The key idea is a generate-and-test approach, which successively modifies a mesh hypothesis and evaluates an image correlation measure to rate the refined hypothesis. The current 3D mesh geometry and the relative pose between the images can be used to generate virtual views of the source images with respect to one particular view. The generated images of the virtual views should match closely if the correct 3D geometry is found.

The procedure works iteratively: mesh modifications resulting in better image correlation are kept, whereas mesh variations lowering the image similarity are discarded. These iterations are embedded in a coarse-to-fine framework to avoid convergence to purely local minima. This procedure can be seen as a simple and discrete formulation of a variational, mesh-based dense stereo approach.
The virtual view generation and the subsequent image similarity calculation are performed by programmable graphics processing units. In contrast to several GPU-based 3D reconstruction methods described in the following chapters, the feature set required from the GPU for this method is very small. Consequently, the stereo approach described in this chapter works on early generations of programmable graphics hardware.
Unlike the approaches proposed in later chapters, this approach still uses a mixed computation model, employing the GPU for many portions of the procedure but nevertheless relying on CPU-based computations in some aspects. Essentially, only those parts of the method which can be efficiently implemented on DirectX 8.1 class GPUs are accelerated by graphics hardware.∗ The approach proposed in this chapter substantially exploits the main capabilities of graphics hardware by repeated rendering of multi-textured mesh geometry for virtual view generation. Virtual view creation induces a non-linear deformation of the source image, hence we refer to this operation as the image warping procedure.

∗ DirectX 8.1 class GPUs provide relatively powerful vertex shaders, but only very limited pixel shaders.
3.2 Overview of Our Method

The input for our procedure consists of two gray-scale images with known relative pose and camera calibration suitable for stereo reconstruction, and a coarse initial mesh to start with. This mesh can be based on a sparse reconstruction obtained by the relative orientation procedure (e.g. a mesh generated from a sparse set of corresponding points by some triangulation). In our experiments we use a planar mesh as the starting point for dense reconstruction. One image of the stereo pair is referred to as the key image, whereas the other one is denoted as the sensor image.† Consequently the cameras (resp. their positions) are designated as the key camera and the sensor camera.
The overall idea of the dense stereo procedure is that, if the current mesh hypothesis corresponds to the true model, the appropriately warped sensor image virtually created for the key camera position resembles the original key image. This similarity is quantified by a suitable error metric on images, which is the sum of absolute difference values in our current implementation. Modifying the current mesh results in different warped sensor images with potentially higher similarity to the key image (see Figure 3.1). The current mesh hypothesis is iteratively refined to generate and evaluate improved hypotheses. The huge space of possible mesh hypotheses can be explored efficiently, since local mesh refinements have only local impact on the warped image; therefore many local modifications can be applied and evaluated in parallel.
The matching procedure consists of three nested loops:

1. The outermost loop determines the mesh and image resolutions. In every iteration the mesh and image resolutions are doubled. The refined mesh is obtained by linear (and optionally median) filtering of the coarser one. This loop adds the hierarchical strategy to our method.

2. The inner loop chooses the set of vertices to be modified and updates the depth values of these vertices after performing the innermost loop.

3. The innermost loop evaluates depth variations for candidate vertices selected in the enclosing loop. The best depth value is determined by repeated image warping and error calculation with respect to the tested depth hypothesis. The body of this loop runs entirely on 3D graphics hardware.
∗ (continued) Only pixel shaders with a small number of instructions are available. Additionally, floating point accuracy for textures and pixel shaders is not supported.
† There is no unique fixed convention to denote the roles of the two views. Sometimes the images are called master and slave views to indicate the key resp. the sensor view. In medical image processing the notions of template and moving image are very common.
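The three nested loops above can be sketched as a skeleton that merely records the order of operations instead of performing real warping and scoring; the names `num_levels`, `depth_candidates` and `blocks` are illustrative, not taken from the implementation:

```python
def matching_schedule(num_levels, depth_candidates, blocks=4):
    """Skeleton of the three nested matching loops (illustrative only)."""
    trace = []
    for level in range(num_levels):          # 1. coarse-to-fine resolution
        trace.append(("refine_mesh_and_images", level))
        for block in range(blocks):          # 2. choose the vertex block to modify
            for d in depth_candidates:       # 3. evaluate depth variations (on the GPU)
                trace.append(("warp_and_score", level, block, d))
            trace.append(("update_block", level, block))
    return trace

trace = matching_schedule(num_levels=2, depth_candidates=[-1.0, 0.0, 1.0])
print(len(trace))  # 2 levels * (1 + 4 blocks * (3 evaluations + 1 update)) = 34
```

Only the innermost `warp_and_score` steps run on graphics hardware; the loop control stays on the CPU.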
Figure 3.1: Mesh reconstruction from a pair of stereo images. Vertices of the current mesh hypothesis are translated along the back-projected ray of the key camera. The image obtained from the sensor camera is warped onto the mesh and the effect in the local neighborhood of the modified vertex is evaluated.
To perform image warping, the current mesh hypothesis is rendered like a regular height-field as illustrated in Figure 3.2. As can be seen in Figure 3.3, a change of the depth value of one vertex influences only a few adjacent triangles. Therefore one fourth of the vertices can be modified simultaneously without affecting each other. The optimization procedure minimizing the error between the key image and the warped image is a sequence of steps determining the best depth values for alternating fractions of the mesh vertices. Since the vertices of the grid are numbered such that vertices which are modified and evaluated in the same pass comprise a connected block (Figure 3.4), we denote the fraction of vertices to change as a block.

In every step the depth values of one fourth of the vertices are modified and the local error between the key image and the warped image in the affected neighborhood of each vertex is evaluated. For every modified vertex the best depth value is determined and the mesh is updated accordingly. The procedure to calculate and update error values for modified vertices is outlined in Figure 3.5.
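A minimal sketch of a four-block vertex partition with the required independence property; the exact block numbering used here (a simple parity scheme) is an assumption for illustration and may differ from the numbering of Figure 3.4:

```python
def block_of(row, col):
    """Assign each grid vertex to one of four blocks such that no two vertices
    of the same block are grid neighbours, so they can be tested in parallel."""
    return 2 * (row % 2) + (col % 2)

# Vertices of the same block are at least two grid cells apart:
same_block = [(r, c) for r in range(4) for c in range(4) if block_of(r, c) == 0]
print(same_block)  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Because the neighborhoods of same-block vertices are disjoint (Figure 3.3), all their depth variations can be scored within a single rendering pass.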
3.2.1 Image Warping and Difference Image Computation

Since the vertices of the mesh are moved along the back-projected rays of the key camera, the mesh as seen from the first camera is always a regular grid, and mesh modifications do not distort the key image. The appearance of the sensor image as seen from the key camera depends on the mesh geometry.
Figure 3.2: The regular grid as seen from the key camera. This grid structure allows fast rendering of the mesh using triangle strips with only one call. The marked vertices comprise one block. These vertices are shifted on the back-projected ray and evaluated simultaneously in every iteration.
Figure 3.3: The neighborhood of a currently evaluated vertex. Moving this vertex on the back-projected ray affects only the 6 shaded triangles. The actual error for this vertex is calculated over the enclosing rectangle, which is still disjoint from the neighborhoods of all other tested vertices.
From the 3D positions of the current mesh vertices and the known relative orientation between the cameras, it is easy to use automatic texture coordinate generation with appropriate coefficients to perform the image warping step. To minimize updates of the mesh geometry we use our own vertex program to calculate texture coordinates for the sensor image. This vertex shader is described in more detail in Section 3.3.1.
3.2.2 Local Error Summation

After the difference between the key image and the warped image is computed and stored in a pixel buffer, we need to accumulate the error in the neighborhoods of the modified vertices.

Figure 3.4: The correspondence between vertex indices and grid positions (vertices are grouped into blocks 0 to 3).

In order to sum the values within a rectangular window, we employ a variant of a recursive doubling scheme. The required modification of the recursive approach concerns the encoding and accumulation of larger integer values when only traditional 8 bit color channels are available (see Section 3.3.3). Essentially, we perform a repeated downsampling procedure, which sums up four adjacent pixels into one resulting pixel. The target pixel buffer has half the resolution of the source buffer in every dimension. If one vertex is located every four pixels, the downsampling is performed three times to sum the error in an 8 by 8 pixel window.
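The recursive doubling summation amounts to repeated 2×2 summing; the following is an illustrative CPU sketch (the helper name `downsample_sum` is made up, and the 8 bit encoding issue mentioned above is ignored here):

```python
def downsample_sum(img):
    """One recursive-doubling pass: each output pixel is the sum of a 2x2 block."""
    h, w = len(img), len(img[0])
    return [[img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]
             for c in range(w // 2)] for r in range(h // 2)]

# Three passes over an 8x8 image of ones sum the full 8x8 window into one value.
img = [[1] * 8 for _ in range(8)]
for _ in range(3):
    img = downsample_sum(img)
print(img)  # [[64]]
```

Each pass halves the resolution, so n passes aggregate a 2^n by 2^n window per output pixel.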
We need to mention that only 2^n × 2^n error values are computed for a mesh with (2^n + 1) × (2^n + 1) vertices. Vertices at the right and lower edges of the grid do not have an associated error value. For these vertices we propagate the depth values from their left resp. upper neighbors.
3.2.3 Determining the Best Local Modification

If δ denotes the largest allowed depth change, then the tested depth variations are sampled regularly from the interval [−δ, δ]. To minimize the amount of data that needs to be copied from graphics memory to main memory, we do not directly read back the local errors to determine the best local modification in software. Instead, we store the currently best local error and the corresponding index in a texture and update these values within an additional pass. These values are read back only after all depth variations for one block of vertices are evaluated.
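The per-pixel minimum update can be sketched on the CPU as follows (illustrative only; on the GPU this is a fragment pass writing the error and the variation index into one texture):

```python
def update_best(best_err, best_idx, new_err, idx):
    """Keep the smaller error per entry and remember which depth variation
    (idx) produced it, mimicking the additional GPU update pass."""
    for p in range(len(best_err)):
        if new_err[p] < best_err[p]:
            best_err[p], best_idx[p] = new_err[p], idx

best_err, best_idx = [5.0, 2.0, 7.0], [0, 0, 0]
update_best(best_err, best_idx, [3.0, 4.0, 1.0], idx=1)
print(best_err, best_idx)  # [3.0, 2.0, 1.0] [1, 0, 1]
```

Only the final (error, index) texture is read back, which keeps the GPU-to-CPU transfer small.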
3.2.4 Hierarchical Matching

In order to avoid local optima during dense matching we utilize a hierarchical approach. The coarsest level consists of a mesh with 9 by 9 vertices and an image resolution of 32 by 32 pixels. The initial model comprises a planar mesh with approximately correct depth values known from the points of interest generated by the relative pose estimation procedure. After a fixed number of iterations the mesh calculated at the coarser level is upsampled (using a bilinear filter) and used as input to the next level. A median filter is optionally applied to the mesh to remove potential outliers, which are especially found in homogeneous image regions.
The largest allowed displacement for mesh vertices is decreased at higher levels to enable higher precision. It is assumed that the model generated at the previous level is already a sufficiently accurate approximation of the true model, and only local refinements to the mesh are required at the next level. In the current implementation we halve the largest evaluated depth variation when entering the next hierarchy level. The coarsest level starts with a maximum depth variation roughly equal to the distance of the object to the key camera.

Figure 3.5: The basic workflow of the matching procedure. For the current mesh hypothesis a difference image between the key image and the warped sensor image is calculated in hardware. The errors in the local neighborhoods of the modified vertices are accumulated and compared with the previous minimal error value. The results of these calculations are the minimal error values (stored in the red, green and blue channels) and the index of the best vertex modification so far (stored in the alpha channel). All these steps are executed in graphics hardware and do not require transfer of large datasets between main memory and video memory.
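The halving of the search range per level is a simple geometric schedule; `search_ranges` is a hypothetical helper for illustration:

```python
def search_ranges(initial_delta, levels):
    """Largest tested depth variation per pyramid level, halved at each level."""
    return [initial_delta / (2 ** level) for level in range(levels)]

# Coarsest range roughly equal to the object-to-camera distance, e.g. 4 units:
print(search_ranges(4.0, 4))  # [4.0, 2.0, 1.0, 0.5]
```

Together with the doubling of mesh and image resolution, this concentrates the search on ever finer local refinements.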
3.3 Implementation

In this section we describe some aspects of our approach in more detail. Our implementation is based on OpenGL extensions available for the ATI Radeon 9700Pro, namely VERTEX_OBJECT_ATI, ELEMENT_ARRAY_ATI, VERTEX_SHADER_EXT and FRAGMENT_SHADER_ATI [Hart and Mitchell, 2002]. These extensions are available on the Radeon 8500 and 9000 as well; therefore our method can be applied with these older (and cheaper) cards, too. For better readability we sketch the vertex program in Cg notation [NVidia Corporation, 2002a].

The major design criterion is to minimize the amount of data transferred between CPU memory and GPU memory. In particular, reading back data from the graphics card is very slow; therefore only absolutely necessary information is copied from video memory.
3.3.1 Mesh Rendering and Image Warping

For maximum performance we employ the VERTEX_OBJECT_ATI and ELEMENT_ARRAY_ATI OpenGL extensions to store mesh vertices and connectivity information directly in graphics memory. In every iteration one fourth of the vertices needs to be updated to test mesh modifications. In order to reduce memory traffic we update the mesh only after all modifications are evaluated and the best one is determined. The currently tested offset is a parameter to a vertex program, which moves vertices along the camera ray as indicated by the given offset.

Additionally, the mesh vertices are ordered such that vertices modified in the same pass comprise a single connected block; therefore only one fourth of the vertex array object stored in video memory needs to be updated.
We sketch the vertex program that calculates the appropriate texture coordinates for the sensor image in Algorithm 1. The vertex attributes consist of the position and the block mask encoded in the primary color attribute. Program parameters common to all vertices are

1. the currently tested depth displacement for the active block,

2. a matrix M1 transforming pixel positions into back-projected rays of the key camera,

3. and a matrix M2 representing the transformation from the key camera into image positions of the sensor camera.

If a vertex belongs to block i, then the i-th component of the block mask attribute of this vertex is set to one. The other components are set to zero. If all vertices of block j are currently evaluated, the displacement, represented as a 4-component vector, has the current offset value at position j and zeros otherwise. Therefore a four-component dot product between the mask and the given displacement is either the displacement or zero, depending on whether the block numbers match.
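The masking trick reduces to a four-component dot product; a small numeric check (the offset value 0.25 is made up for this example):

```python
def dot4(a, b):
    """Four-component dot product, as computed in one vertex shader instruction."""
    return sum(x * y for x, y in zip(a, b))

# Block 2 is currently evaluated with offset 0.25; a block-2 vertex moves:
displacement = (0.0, 0.0, 0.25, 0.0)
print(dot4((0.0, 0.0, 1.0, 0.0), displacement))  # 0.25

# A vertex of block 0 stays put under the same displacement vector:
print(dot4((1.0, 0.0, 0.0, 0.0), displacement))  # 0.0
```

This lets one shader constant control all vertices uniformly while affecting only the active block.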
Algorithm 1 The vertex program responsible for warping the sensor image. This vertex shader calculates appropriate texture coordinates for the second image based on the relative orientation of the cameras and the currently evaluated offset.

Procedure: Vertex program for sensor image warping
Input: Constant parameters: matrices M1 and M2, displacement (a 4-vector)
Input: Vertex attributes: position (homogeneous 3D position), mask (a 4-vector, provided in the associated vertex color)

  depth_old ← position.z
  {Inner product to determine the actual depth displacement}
  delta ← displacement · mask
  depth_new ← depth_old + delta
  {Back-project the pixel to obtain the corresponding ray of the key camera}
  ray ← M1 · position
  position_new ← depth_new · ray
  {Position on the 2D screen, to be transformed by the modelview-projection matrix}
  windowPosition ← (position.x, position.y, 0, 1)
  {Project the perturbed 3D position to obtain the final texture coordinate to sample the sensor image}
  texcoord ← M2 · position_new
If K_1 and K_2 are the internal parameters of the key resp. the sensor camera (arranged in an upper-triangular matrix) and O is the relative orientation between the cameras, with O being the 4 × 4 matrix

\[ O = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}, \]

then M1 and M2 are calculated as follows:

\[ M_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix} \times K_1^{-1} \quad\text{and}\quad M_2 = \begin{pmatrix} 1/w & 0 & 0 & 0 \\ 0 & 1/h & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \times K_2 \times O, \]

where w and h represent the image width and height in pixels. If M1 is applied to a vector (x, y, ·, 1), the result is the direction (∆x, ∆y, 1, 1) of the camera ray going through the pixel at (x, y). This direction is scaled by the target depth value to obtain the vertex in key camera space. Consequently, the vertex data for mesh points consists of vectors (x, y, z, 1), where (x, y) are the pixel coordinates in the key image and z is the current depth value. The obtained texture coordinates (s, t, q, q) for the sensor image are subject
to perspective division prior to the texture lookup. On current hardware the perspective texture lookup is performed for every texel, hence the correct perspective projection (and warping) is achieved.

Additionally, we remark that the texture coordinate transformation from one image to another cannot be accomplished by a single transformation matrix: in that case the depth changes would be applied in screen space, which maps world coordinates non-linearly due to perspective division.

The described image warping transformation can result in texture coordinates lying outside the sensor image. It is possible to ignore mesh regions outside the sensor image explicitly, but in our experience simple clamping of the texture coordinates is sufficient in those cases.
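The per-vertex computation of Algorithm 1 can be transcribed as a CPU reference, which is convenient for checking the matrices M1 and M2 against a software implementation. The following is a minimal NumPy sketch; the function and variable names are ours, not part of the original shader.

```python
import numpy as np

def warp_vertex(position, mask, displacement, M1, M2):
    """CPU reference for the warping vertex program of Algorithm 1.

    position: (x, y, z, 1) with pixel coordinates (x, y) and current depth z.
    mask, displacement: 4-vectors; their inner product selects the depth
    offset applied to this vertex.
    M1 back-projects a pixel to its viewing ray in the key camera,
    M2 projects a perturbed 3D point to homogeneous sensor texture coords.
    """
    depth_old = position[2]
    delta = float(np.dot(displacement, mask))   # inner product
    depth_new = depth_old + delta
    ray = M1 @ position                         # back-projected camera ray
    position_new = depth_new * ray              # point in key camera space
    window_position = np.array([position[0], position[1], 0.0, 1.0])
    texcoord = M2 @ position_new                # homogeneous (s, t, q, q)
    return window_position, texcoord

def tex2d(texcoord):
    """Perspective division performed by the texture unit before lookup."""
    s, t, q = texcoord[0], texcoord[1], texcoord[2]
    return np.array([s / q, t / q])
```

With identity matrices for M1 and M2 the sketch reduces to scaling the pixel position by the perturbed depth, which makes the data flow of the shader easy to trace.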
3.3.2 Local Error Aggregation

Aggregating the intensity difference values between the key image and the warped sensor image is performed by a recursive doubling approach, which is basically a successive downsampling procedure.
One iteration of the downsampling procedure is quite simple: the input texture is bound to four texture units and a quadrilateral covering the whole viewport is rendered. The texture coordinates for the four texturing units are jittered slightly, such that the correct adjacent pixels are accessed for each final fragment. The filtering mode for the source textures is set to GL_NEAREST. Since the aggregation window is fixed to an 8 × 8 rectangle, three iterations are applied.
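The recursive doubling pass can be sketched on the CPU as follows: each pass replaces every 2 × 2 block of input pixels by its sum, so three passes aggregate over fixed 8 × 8 windows. This is a NumPy analogy of the GPU procedure (the names are ours); on the GPU each output pixel gathers its four sources via the jittered texture coordinates.

```python
import numpy as np

def downsample_sum(img):
    """One recursive-doubling pass: each output pixel is the sum of a
    2x2 block of input pixels (on the GPU: four jittered texture reads)."""
    h, w = img.shape
    return (img[0:h:2, 0:w:2] + img[1:h:2, 0:w:2] +
            img[0:h:2, 1:w:2] + img[1:h:2, 1:w:2])

def aggregate_8x8(diff):
    """Three passes reduce an n x n difference image to an (n/8) x (n/8)
    image of sums over fixed 8x8 windows."""
    for _ in range(3):
        diff = downsample_sum(diff)
    return diff
```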
3.3.3 Encoding of Integers in RGB Channels

Although the input images are grayscale images and one 8 bit gray channel is sufficient to represent the absolute difference image, summation of local errors is likely to generate overflows. Current generations of graphics cards support float textures, but at the time of our first attempts to employ the GPU for computer vision applications no pixel buffer format allowed color channels with floating point precision. Therefore we decided to utilize a slightly more complex method to perform error summation with 8 bit RGB channels. In the proposed implementation floating point textures are not required.
Our integer encoding assigns the least significant 6 bits of a larger integer value to the red channel, the middle 6 bits to the green channel and the remaining bits to the blue channel. The two most significant bits of the red and green channel are always zero. This encoding allows summation of four error values without loss of precision using a fragment program utilizing a dependent texture lookup. After (component-wise) summation of 4 input values the most significant bits of the red and green component of the register storing the sum are possibly set, hence this register requires an additional conversion to obtain the final error value with the desired encoding. This conversion is performed using a 256 by 256 texture map.
If more than four values are summed in one step, the number of spare bits needs to be adjusted, e.g. if 8 values are summed in one pass, the three most significant bits of the red and green channel must be reserved to avoid overflows.
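The encoding and the carry-propagating conversion can be sketched in a few lines. This is a CPU model of the scheme; on the GPU the renormalization is realized by the 256 × 256 dependent-lookup texture, and the channel registers are 8 bit wide.

```python
def encode(v):
    """Split an integer: 6 LSBs into red, next 6 bits into green, rest
    into blue. The two MSBs of red and green stay zero, leaving headroom
    to add four encoded values without overflowing an 8-bit channel."""
    return (v & 0x3F, (v >> 6) & 0x3F, v >> 12)

def decode(rgb):
    """Recover the integer from its channel representation."""
    r, g, b = rgb
    return r + (g << 6) + (b << 12)

def renormalize(rgb):
    """Carry propagation after summing up to four encoded values
    (performed on the GPU via the 256x256 lookup texture)."""
    return encode(decode(rgb))
```

Because the decoding is linear, four encoded values can be added channel by channel; a single renormalization afterwards restores the canonical 6-6-bit layout.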
3.4 Performance Enhancements
As it turns out, the implementation described above still has performance bottlenecks that can be avoided by a careful design of the particular implementation.
3.4.1 Amortized Difference Image Generation

For larger image resolutions (e.g. 1024 × 1024) rendering of the corresponding mesh generated by the sampling points takes a considerable amount of time. In the 1-megapixel case the mesh consists of approximately 131 000 triangles, which must be rendered for every depth value (several hundred times in total). Especially on mobile graphics boards, mesh processing implies a severe performance penalty: stereo matching of two 256 × 256 pixel images shows similar performance on the evaluated desktop GPU and on the employed mobile GPU of a laptop, but matching 1-megapixel images takes twice as long on the mobile GPU.
In order to reduce the number of mesh drawings, up to four depth values are evaluated in one pass. We use multitexturing facilities to generate four texture coordinates for different depth values within the vertex program. The fragment shader calculates the absolute differences for these deformations simultaneously and stores the results in the four color channels (red, green, blue and alpha). Note that the mesh hypothesis is updated infrequently and the actually evaluated mesh is generated within the vertex shader by deforming the incoming vertices according to the current displacement.
The vertex program now has more work to perform, since four transformations (matrix-vector multiplications) are executed to generate texture coordinates for the right image for each vertex. Nevertheless, the obtained timing results (see Section 3.5) indicate a significant performance improvement by utilizing this approach. Several operations are executed only once for up to 4 mesh hypotheses: transferring vertices and transforming them into window coordinates, triangle rasterization setup and texture access to the left image.
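On the CPU, packing four depth hypotheses into the color channels corresponds to stacking four warped sensor images along a trailing axis of size four and computing all absolute differences in one vectorized operation. A hedged NumPy analogy (names are ours, and the stacking axis simply models the R, G, B and A channels of the render target):

```python
import numpy as np

def four_depth_differences(key, warped4):
    """Evaluate four depth hypotheses in a single 'pass': warped4 holds
    the sensor image warped for four different depth offsets, stacked
    along the last axis like the RGBA channels of the render target."""
    return np.abs(warped4 - key[..., None])
```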
3.4.2 Parallel Image Transforms
In contrast to Yang and Pollefeys [Yang and Pollefeys, 2003] we calculate the error within a window explicitly using multiple passes. In every pass four adjacent pixels are accumulated and the result is written to a temporary off-screen frame buffer (usually called pixel buffer or P-buffer for short). It is possible to set pixel buffers as destination for rendering operations (write access) or to bind a pixel buffer as a texture (read access), but combined read and write access is not available. In the default setting the window size is 8 × 8,
therefore 3 passes are required. Note that we use a specific encoding of summed values to avoid overflow due to the limited accuracy of one color channel.

Executing this multipass pipeline to obtain the sum of absolute differences within a window requires several P-buffer activations to select the correct target buffer for writing. These switches turned out to be relatively expensive (about 0.15 ms per switch). In combination with the large number of switches, the total time spent within these operations comprises a significant fraction of the overall matching time (about 50% for 256 × 256 images). If the number of these operations can be reduced, one can expect a substantial increase in the performance of the matching procedure.
Instead of directly executing the pipeline in the innermost loop (requiring 5 P-buffer switches) we reorganize the loops to accumulate several intermediate results in one larger buffer with temporary results arranged in tiles (see Figure 3.6). Thereby P-buffer switches are amortized over several iterations of the innermost loop. This flexibility in the control flow is completely transparent and need not be coded explicitly within the software. Those stages in the pipeline waiting for the input buffer to become ready are skipped automatically.
3.4.3 Minimum Determination Using the Depth Test

We have two procedures available to update the minimal error and optimal depth value: the first approach utilizes a separate pass employing a simple fragment program for the conditional update. This method works on a wider range of graphics cards (on some mobile GPUs in particular), but it is rather slow due to the necessary P-buffer activations (since the minimum computation cannot be done in-place). The alternative implementation employs Z-buffer tests for the conditional updates of the frame buffer in-place, but the range of supported graphics hardware is more limited. In order to utilize this simpler (and faster) method, the GPU must support user-defined assignment of z-values within the fragment shader (e.g. by using the ARB_fragment_program OpenGL extension). Older hardware always interpolates z-values from the given geometry (vertices).
We use the rather simple fragment program shown in Figure 3.7 to obtain one scalar error value from the color coded error and to move this value into the depth register used by the graphics hardware to test the incoming depth against the z-buffer. Using the depth test provided by 3D graphics hardware, the index of the currently evaluated depth variation and the corresponding sum of absolute differences are written into the destination buffer if the incoming error is smaller than the minimum already stored in the buffer. Therefore a point-wise optimum over the evaluated depth values for the mesh vertices can be computed easily and efficiently.
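The z-buffer based update can be modeled on the CPU as a masked winner-takes-all step, where the error plays the role of the fragment depth and the depth-variation index the role of the fragment color. A minimal sketch (names are ours):

```python
import numpy as np

def depth_test_update(best_error, best_index, error, index):
    """Emulate the z-buffer based conditional update: wherever the
    incoming error is smaller than the stored minimum, both the error
    ('depth') and the depth-variation index ('color') are overwritten."""
    better = error < best_error
    best_error = np.where(better, error, best_error)
    best_index = np.where(better, index, best_index)
    return best_error, best_index
```

Repeating this update over all evaluated depth variations leaves, per pixel, the smallest aggregated error and the index of the depth hypothesis that produced it.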
Figure 3.6: The modified pipeline to minimize P-buffer switches. Several temporary results are accumulated in larger pixel buffers arranged like tiles. Later passes operate on all those intermediate results and are therefore executed less frequently. (The diagram shows four n × n difference images, with pixel summation performed for every iteration, every four iterations, and once per block.)
3.5 Results

We tested our hardware-based matching procedure on artificial and on real datasets. In all test cases the source images are grayscale images with a resolution of 1024 by 1024 pixels. For the real datasets the relative orientations between stereo images are determined using the method described by Klaus et al. [Klaus et al., 2002].

We ran the timing experiments on a desktop PC with an Athlon XP 2700 and an ATI Radeon 9700 and on a laptop PC with a mobile Athlon XP 2200 and an ATI Radeon 9000 Mobility.

The artificial dataset comprises two images of a sphere mapped with an earth texture rendered by the Inventor scene viewer (Figure 3.8). The meshes obtained by our reconstruction method are displayed as point sets for easier visual evaluation. Timing
PARAM depth_index = program.env[0];
PARAM coeffs = { 1/256, 1/16, 1, 0 };
TEMP error, col;
TEX col, fragment.texcoord[0], texture[0], 2D;
DP3 error, coeffs, col;
MOV result.color, depth_index;
MOV result.depth, error;

Figure 3.7: Fragment program to transfer the incoming, color coded error value to the depth component of the fragment. The dot product (DP3) between the texture element and the coefficient vector restores the scalar error value encoded in the color channels.
statistics for this dataset reconstructed at different resolutions are given in Table 3.1. The matching procedure performs 8 iterations with 7 tested depth variations for each hierarchy level. These values result in high quality reconstructions in reasonable time. Therefore the pipeline shown in Figure 3.5 is executed 56 times for each level. The number of levels varies from 4 to 6 depending on the given image resolution. The total number of evaluated mesh hypotheses is 224 (256 × 256), 280 (512 × 512) and 336 (1024 × 1024). At the highest resolution (1024 × 1024) each vertex is actually tested with 84 depth values out of a range of approximately 600 possible values. Because of limitations in graphics hardware we are currently restricted to images with power-of-two dimensions.
Figure 3.8: Results for the artificial earth dataset. (a) The key image; (b) the second image; (c) the reconstructed model.
In addition to the timing experiments we applied the proposed procedure to several real-world datasets consisting of stereo image pairs showing various buildings. The source images of these datasets are grayscale images resampled to 1024 × 1024 pixels to meet the power-of-two graphics hardware requirement. The source images and the reconstructed models are visualized in Figures 3.9–3.11. In Figure 3.10 the homogeneously textured regions showing the sky lead to particularly poor reconstructions in these areas. The
Hardware               Resolution    Matching time
Radeon 9700 Pro        256 × 256     0.106 s
                       512 × 512     0.198 s
                       1024 × 1024   0.501 s
Radeon 9000 Mobility   256 × 256     0.095 s
                       512 × 512     0.31 s
                       1024 × 1024   1.05 s

Table 3.1: Timing results for the sphere dataset on two different graphics cards.
same holds for the repetitive pattern on the foreground lawn in Figure 3.11. Since the number of iterations is equal to the one chosen for the artificial dataset, the times required for dense reconstruction are similar.
Figure 3.9: Results for a dataset showing the yard inside a historic building. (a) The key image; (b) the second image; (c) the reconstructed model.
3.6 Discussion

This chapter presents a method to reconstruct dense meshes from stereo images with known relative pose, which is almost completely performed in programmable graphics hardware. Dense reconstructions can be generated for pairs of images with one megapixel resolution in less than one second on the evaluated hardware platforms.

With the emergence of additional features provided by the GPU, the approach proposed in this chapter is extended and enhanced as described in the following chapters. The simple sum of absolute differences image similarity measure can be replaced by a more robust correlation function to achieve better results for real-world datasets. Additionally, the presented method can be easily extended to a multi-view setup at the cost of higher execution times. A true variational multi-view dense depth estimation framework performed by the GPU is presented in Chapter 6.
Figure 3.10: Results for a dataset showing an apartment house. Unstructured regions showing the sky are poorly reconstructed due to the ambiguity in the local image similarity. (a) The key image; (b) the second image; (c) the reconstructed model.
Another straightforward extension of the method described in this chapter addresses the generation of an optical flow field between two views. If no epipolar geometry is known or the static scene assumption is violated, the one-dimensional search along back-projected pixels is replaced by a 2D disparity search space. Since a 3D reconstruction from a sole disparity field is not possible, we focused on the setting with known epipolar geometry, which allows 3D models to be generated.
Figure 3.11: Visual results for the Merton College dataset. The source images have a resolution of 1024 × 1024 pixels. (a) Left image; (b) right image; (c) the depth image; (d) the reconstructed model as a 3D point cloud.
Chapter 4
GPU-based Depth Map Estimation using Plane Sweeping
Contents
4.1 Introduction . . . . . . . . . . . . . . 43
4.2 Plane Sweep Depth Estimation . . . . . . 43
4.3 Sparse Belief Propagation . . . . . . . . 50
4.4 Depth Map Smoothing . . . . . . . . . . . 54
4.5 Timing Results . . . . . . . . . . . . . 55
4.6 Visual Results . . . . . . . . . . . . . 58
4.7 Discussion . . . . . . . . . . . . . . . 58
4.1 Introduction

This chapter describes the implementation of a multiview depth estimation method based on a plane-sweeping approach, which is accelerated by 3D graphics hardware. The goal of our approach is the generation of depth maps with suitable quality at interactive rates. The final depth extraction can be performed using a fast and simple winner-takes-all approach, or alternatively a time- and memory-efficient variant of belief propagation can be employed to obtain higher quality depth images.
4.2 Plane Sweep Depth Estimation

Plane sweep techniques in computer vision are simple and elegant approaches to image-based reconstruction from multiple views, since a rectification procedure, as needed by many traditional computational stereo methods, is not required. The 3D space is iteratively traversed by parallel planes, which are usually aligned with a particular key view
(Figure 4.1). The plane at a certain depth from the key view induces homographies for all other views, thus the sensor images can be mapped onto this plane easily.
Figure 4.1: Plane sweeping principle (key view and sensor view). For different depths the homography between the reference plane and the sensor view varies. Consequently, the projected image of the sensor view changes with the depth according to the epipolar geometry.
If the plane at a certain depth passes exactly through the surface of the object to be reconstructed, the color values from the key image and from the mapped sensor images should coincide at appropriate positions (assuming constant brightness conditions). Hence, it is reasonable to assign the best matching depth value (according to some image correlation measure) to the pixels of the key view. By sweeping the plane through 3D space (i.e. varying the plane depth with respect to the key view) a 3D volume can be filled with image correlation values similar to the disparity space image (DSI) in traditional stereo. Therefore the dense depth map can be extracted using global optimization methods, if depth continuity or any other constraint on the depth map is required.
Note that a plane sweep technique in a two-frame rectified stereo setup coincides with traditional stereo methods for disparity estimation. In this case the homography between the plane and the sensor view is solely a translation along the X-axis.
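The sweep itself can be sketched as a loop over plane depths that fills a DSI-like cost volume. The homography formula is the one given in Section 4.2.1; the `warp` callback (an inverse-mapping image resampler) is a placeholder we introduce for illustration. In the rectified two-frame case the computed homography reduces to the pure X-translation mentioned above.

```python
import numpy as np

def plane_homography(K, R, t, n, d):
    """Homography induced by the plane with normal n and depth d (key
    view in the canonical frame), mapping key pixels to sensor pixels."""
    return K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)

def sweep(key, sensor, K, R, t, depths, warp):
    """Fill a DSI-like cost volume: one absolute-difference slice per
    swept plane. 'warp' applies a homography to an image (placeholder
    for an inverse-mapping resampler)."""
    n = np.array([0.0, 0.0, 1.0])          # fronto-parallel sweep planes
    cost = np.empty((len(depths),) + key.shape)
    for i, d in enumerate(depths):
        H = plane_homography(K, R, t, n, d)
        cost[i] = np.abs(key - warp(sensor, H))
    return cost
```

With K = R = I and a purely horizontal translation t = (b, 0, 0), the homography becomes a shift by −b/d along X, i.e. the classical inverse-depth disparity.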
There are several techniques to make dense reconstruction approaches more robust in the case of occlusions in a multi-view setup. Typically, occlusions are only modeled implicitly, in contrast to e.g. space carving methods, where the model generated so far directly influences the visibility information. Here we briefly discuss two approaches to implicit occlusion handling:
• Truncated scores: The image correlation measure is calculated between the key view and the sensor view, and the final score for the current depth hypothesis is the accumulated sum of the truncated individual similarities. The reasoning behind this approach is that the effect of occlusions between a pair of views on the total score should be limited, in order to favor good depth hypotheses supported by other image pairs.

• Best half-sequence selection: In many cases the set of images comprises a logical sequence of views, which can be totally ordered (e.g. if the camera positions are approximately on a line). Hence the set of images used to determine the score in terms of the key view can be split into two half-sequences, and the final score is the better score of these subsets. The motivation behind this approach is that occlusions with respect to the key view appear either in the left or in the right half-sequence.
Dense depth estimation using plane sweeping as described in this chapter is restricted to small-baseline setups, since for larger baselines occlusions should be modeled explicitly. Additionally, the inherent fronto-parallel surface assumption of correlation windows yields inferior results in wide-baseline cases.
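The two occlusion handling policies amount to different reductions over a stack of per-sensor-view dissimilarity scores. A minimal NumPy sketch; the truncation ceiling and the even split of the ordered sequence at the key view are illustrative assumptions:

```python
import numpy as np

def truncated_score(pair_scores, ceiling):
    """Truncated-scores policy: clip each per-sensor-view dissimilarity
    before summing, so a single occluded pair cannot dominate."""
    return np.minimum(pair_scores, ceiling).sum(axis=0)

def best_half_sequence(pair_scores):
    """Best half-sequence policy: split the (ordered) sensor views into
    two halves and keep the better accumulated half."""
    half = len(pair_scores) // 2
    left = pair_scores[:half].sum(axis=0)
    right = pair_scores[half:].sum(axis=0)
    return np.minimum(left, right)
```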
4.2.1 Image Warping

In the first step, the sensor images are warped onto the current 3D key plane π = (n⊤, d) using the projective texturing capability of graphics hardware. If we assume the canonical coordinate frame for the key view, the sensor images are transformed by the appropriate homography H with

\[ H = K \left( R - t\, n^{\top} / d \right) K^{-1}. \]

K denotes the intrinsic matrix of the camera and (R|t) is the relative pose of the sensor view.
In order to utilize the vector processing capabilities of the fragment pipeline in an optimal manner, the (grayscale) sensor images are warped with respect to four plane offset values d simultaneously. All further processing works on a packed representation, where the four values in the color and alpha channels correspond to four depth hypotheses.
4.2.2 Image Correlation Functions

After a sensor image is projected onto the current plane hypothesis, a correlation score for the current sensor view is calculated, and the scores for all sensor views are integrated into a final correlation score of the current plane hypothesis. The accumulation of the single image correlation scores depends on the selected occlusion handling policy: simple additive blending operations are sufficient if no implicit occlusion handling is desired. If the best half-sequence policy is employed, additive blending is performed for each individual subsequence and a final minimum-selection blending operation is applied.
To our knowledge, all published GPU-based dense depth estimation methods use the simple sum of absolute differences (SAD) or sum of squared differences (SSD) for image dissimilarity computation (usually for performance reasons). By contrast, we have a set of GPU-based image correlation functions available, including the SAD, the normalized cross correlation (NCC) and the zero-mean NCC (ZNCC) similarity functions. The NCC and ZNCC implementations optionally use sum tables for an efficient implementation [Tsai and Lin, 2003]. Small row and column sums can be generated directly by sampling multiple texture elements within the fragment shader. Summation over larger regions can be performed using a recursive doubling approach similar to the GPU-based generation of integral images [Hensley et al., 2005]. Full integral image generation is also possible, but precision loss is observed for the NCC and ZNCC similarity functions in this case (see Section 4.2.2.2).
For longer image sequences one cannot presume constant brightness conditions across all images, hence an optional prenormalization step is performed, which subtracts the box-filtered image from the original one to compensate for changes in illumination conditions. If this prenormalization is applied, the depth maps obtained using the different correlation functions have similar quality.
4.2.2.1 Efficient Summation over Rectangular Regions

The image similarity functions described in the following section can be efficiently implemented by utilizing integral images (also known as summed-area tables in computer graphics). Integral images allow constant-time box filtering regardless of the window size [Crow, 1984]. Given the integral image of a source image, any box filtering can be performed in constant time using four image accesses (resp. texture lookups). This efficient box filtering approach can be extended to more complex higher-order filtering operations [Heckbert, 1986].
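A summed-area table and the four-lookup box sum can be sketched in a few lines of NumPy (the names are ours); each window sum is recovered from at most four table entries, independent of the window size:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: sat[y, x] = sum of img[:y+1, :x+1]."""
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def box_sum(sat, y0, x0, y1, x1):
    """Sum over img[y0:y1, x0:x1] from four table lookups, i.e. in
    constant time regardless of the window size."""
    s = sat[y1 - 1, x1 - 1]
    if y0 > 0:
        s -= sat[y0 - 1, x1 - 1]
    if x0 > 0:
        s -= sat[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += sat[y0 - 1, x0 - 1]
    return s
```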
The single-pass procedure to calculate the integral image efficiently on a general purpose processor is slow when mapped onto SIMD architectures. Consequently, a different approach using a logarithmic number of passes to generate the integral image on the GPU is much more efficient [Hensley et al., 2005]. Note that the integral image requires a much higher precision of the color channels than the source image precision. Calculating and using integral images on the GPU has only been feasible since the emergence of floating point support on current graphics hardware.
Note that for very small window sizes the utilization of bilinear texture fetches, which are available on current graphics hardware essentially for free, is usually more efficient than the computation and application of integral images. Bilinear texturing allows the summation of four adjacent pixels with just one texture access, e.g. summing the values inside a 4 × 4 window can be done using 4 bilinear texture lookups (instead of 16 individual accesses). Consequently, in order to obtain the highest performance, suitably customized procedures are best for very small correlation windows.
4.2.2.2 Normalized Correlation Coefficient

The widely used (zero-mean) normalized correlation coefficient for window-based local matching of two images X and Y is (where \(\bar X\) and \(\bar Y\) denote the means inside the rectangular region W)

\[ \mathrm{ZNCC} = \frac{\sum_{i \in W} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i \in W} (X_i - \bar X)^2}\,\sqrt{\sum_{i \in W} (Y_i - \bar Y)^2}}, \]

which is invariant under (affine linear) changes of luminance between images, but relatively costly to calculate. Using integral images the ZNCC can be calculated in constant time regardless of the correlation window size [Tsai and Lin, 2003], since

\[ \mathrm{ZNCC} = \frac{\sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)/N}{\sqrt{\sum X_i^2 - \left(\sum X_i\right)^2/N}\,\sqrt{\sum Y_i^2 - \left(\sum Y_i\right)^2/N}}. \]
From the above formula it can be seen that five integral images are required to calculate the ZNCC: the integral images for \(\sum X_i\), \(\sum Y_i\), \(\sum X_i^2\), \(\sum Y_i^2\) and finally \(\sum X_i Y_i\). The precision requirement for the higher-order sums is \(8 + 8 + \log_2 512 + \log_2 512 = 34\) bit for 512 × 512 source images. The 32 bit floating point format of current GPUs has a mantissa of 23 bit, hence artefacts due to precision loss may occur. Figure 4.2 illustrates the reduced precision by depicting a ZNCC error image generated in software on a CPU and another one computed on the GPU. An increasing loss of precision can be seen towards the lower right corner of the image. Since the integral image generation starts from the upper left corner, the lower right portion has the highest precision requirements within the integral image.
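Both forms of the ZNCC can be checked against each other in floating point. The second function uses only the five window sums that the integral images provide; this is a CPU sketch for a single window, not the GPU shader:

```python
import numpy as np

def zncc_direct(X, Y):
    """ZNCC from the definition (zero-mean, normalized)."""
    Xc, Yc = X - X.mean(), Y - Y.mean()
    return (Xc * Yc).sum() / np.sqrt((Xc**2).sum() * (Yc**2).sum())

def zncc_sums(X, Y):
    """ZNCC from the five window sums provided by the integral images."""
    N = X.size
    sx, sy = X.sum(), Y.sum()
    sxx, syy, sxy = (X * X).sum(), (Y * Y).sum(), (X * Y).sum()
    num = sxy - sx * sy / N
    den = np.sqrt((sxx - sx * sx / N) * (syy - sy * sy / N))
    return num / den
```

The sketch also makes the affine luminance invariance easy to verify: replacing Y by aX + b (with a > 0) yields a score of 1.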
Note that the precision requirements for the simple sums \(\sum X_i\) and \(\sum Y_i\) are 26 bit for 8 bit images with 512 × 512 pixels resolution. By subtracting the image mean in advance from the source image two additional precision bits can be saved: one by halving the magnitude of the source values and another one by exploiting the sign bit in the integral image.
Instead of creating full integral images, which allow box filtering with arbitrary window sizes, it is usually sufficient to sum the values within a given, fixed window, since we do not vary the aggregation window size during similarity score computation. Accumulation over larger windows can be performed using a recursive doubling scheme similar to the one used for integral image generation. Consequently, the precision requirements on the target buffer storing the aggregated values depend on the window size, and these are substantially lower than the requirements for integral images.
Figure 4.2: NCC images calculated (a) on the CPU and (b) on the GPU using integral images. Darker pixels indicate smaller similarity values. The image computed on the GPU has significant deviations, especially in the lower right regions.
4.2.3 Sum of Absolute Differences <strong>and</strong> Variants<br />
The sum of absolute differences (SAD) is a widely used image similarity function because of its simple computation, minimal precision requirements and high performance:

SAD = Σ_{i∈W} |Xi − Yi|,

where W denotes the aggregation window. However, the SAD is sensitive to illumination changes, which limits its use in real-world applications.
Lighting changes in the scene can be accounted for by subtracting the local mean from the original image values, yielding the zero-mean sum of absolute differences (ZSAD):

ZSAD = Σ_{i∈W} |(Xi − X̄) − (Yi − Ȳ)|.

In contrast to the correlation coefficient, the subtracted local means cannot be moved outside the absolute value bars. Hence a technique similar to the shifting theorem for the correlation coefficient is not applicable, and the ZSAD in this form is not suitable for efficient computation. In a first step we replace the true zero-mean intensity values Xi − X̄ and Yi − Ȳ by the differences Xi − Xσi, where Xσ denotes a smoothed version of the image X, typically generated by box-filtering the original image. The same applies to Y. The
4.2. Plane Sweep Depth Estimation 49<br />
net effect of this approximation is that the normalization of the images can be performed once, on the input images.
Hence, the first step is to calculate images X̃ and Ỹ, the difference images between the original image and the smoothed one (i.e. X̃ = X − Xσ and Ỹ = Y − Yσ). The approximated zero-mean sum of absolute differences then reads as a simple SAD operating on the transformed images:

ZSAD ≈ Σ_{i∈W} |X̃i − Ỹi|.
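A minimal CPU sketch of this approximation (illustrative names `box_filter` and `zsad_approx`; the thesis computes the same quantities in fragment programs) shows that a constant intensity offset between the images no longer affects the score:

```python
def box_filter(img, radius):
    """Naive box filter: mean over a (2r+1)^2 window, clamped at borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[yy][xx] for yy in ys for xx in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def zsad_approx(X, Y, radius):
    """Approximate ZSAD: plain SAD on the mean-normalized images
    X~ = X - X^sigma and Y~ = Y - Y^sigma. For simplicity the aggregation
    window W is taken to be the whole passed array."""
    Xs, Ys = box_filter(X, radius), box_filter(Y, radius)
    h, w = len(X), len(X[0])
    return sum(abs((X[y][x] - Xs[y][x]) - (Y[y][x] - Ys[y][x]))
               for y in range(h) for x in range(w))
```

Here Y = X + 1 yields a score of zero, whereas the plain SAD would not.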
The SAD (and the approximated ZSAD) can be normalized to the range [0, 1] by an appropriate division:

SAD = (1/|W|) Σ_{i∈W} |Xi − Yi|,

assuming Xi ∈ [0, 1] and Yi ∈ [0, 1]. An alternative normalized variant of the SAD is known as the Bray-Curtis (respectively Sørensen) distance:
NSAD = ( Σ_{i∈W} |Xi − Yi| ) / ( Σ_{i∈W} |Xi| + Σ_{i∈W} |Yi| )

and

ZNSAD = ( Σ_{i∈W} |X̃i − Ỹi| ) / ( Σ_{i∈W} |X̃i| + Σ_{i∈W} |Ỹi| ).
These similarity scores lie between 0 and 1, where 0 indicates a perfect match between the two local windows.
Computing the NSAD (and the ZNSAD) between two images requires three integral images II(·) to be generated for every depth value:
• II(|Xi − Yi|) to calculate the numerator of the NSAD efficiently,
• II(|Xi|) and II(|Yi|) to compute the denominator of the NSAD formula.
For the ZNSAD, the integral images are computed for X̃i and Ỹi.
If the plane sweep is performed normal to an input view, II(|Xi|) must be calculated only once before the sweep. In the case of a rectified stereo setup, the integral images (respectively the box-filtered images) of the mean-normalized inputs can be precomputed entirely before the sweep. For every depth (respectively disparity) value, only the integral image of the absolute difference image |Xi − Yi| between the two views must be calculated.
Of course, the required sums over rectangular regions can also be obtained by direct summation, but such an approach is only suitable and efficient for small support window sizes.
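For reference, the NSAD for one aggregation window can be computed from the three integral images as follows (plain Python sketch with our own helper names; on the GPU the window sums come from integral-image textures):

```python
def integral_image(img):
    """II[y][x] = sum of img over the rectangle [0..y-1] x [0..x-1]."""
    h, w = len(img), len(img[0])
    II = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0.0
        for x in range(w):
            row += img[y][x]
            II[y + 1][x + 1] = II[y][x + 1] + row
    return II

def window_sum(II, y0, x0, y1, x1):
    """Sum over the inclusive window [y0..y1] x [x0..x1] in O(1)."""
    return II[y1 + 1][x1 + 1] - II[y0][x1 + 1] - II[y1 + 1][x0] + II[y0][x0]

def nsad(X, Y, y0, x0, y1, x1):
    """Bray-Curtis style normalized SAD over one aggregation window."""
    D  = integral_image([[abs(a - b) for a, b in zip(rx, ry)]
                         for rx, ry in zip(X, Y)])
    AX = integral_image([[abs(v) for v in row] for row in X])
    AY = integral_image([[abs(v) for v in row] for row in Y])
    num = window_sum(D, y0, x0, y1, x1)
    den = window_sum(AX, y0, x0, y1, x1) + window_sum(AY, y0, x0, y1, x1)
    return num / den if den else 0.0
```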
4.2.4 Depth Extraction<br />
In order to achieve high performance for depth estimation, we primarily employ a simple winner-takes-all strategy to assign the final depth values. This approach can be implemented easily and efficiently on the GPU using the depth test for a conditional update of the current depth image hypothesis (see [Yang et al., 2002] and Section 3.4.3).
Unreliable depth values can be masked by a subsequent thresholding pass that removes pixels with low image correlation from the obtained depth map.
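A CPU sketch of this strategy might look as follows (all names are ours; on the GPU the conditional update in the inner loop is realized by the depth test, and the thresholding is a separate pass):

```python
INVALID = -1  # marker for masked pixels

def wta_depth(cost_slices, depths, max_cost):
    """Winner-takes-all: for every pixel keep the depth with the lowest
    dissimilarity, then mask pixels whose best cost exceeds `max_cost`
    (i.e. low image correlation). cost_slices[k][y][x] holds the
    dissimilarity of depth hypothesis depths[k] at pixel (y, x)."""
    h, w = len(cost_slices[0]), len(cost_slices[0][0])
    best_cost = [[float('inf')] * w for _ in range(h)]
    depth_map = [[INVALID] * w for _ in range(h)]
    for k, plane in enumerate(cost_slices):        # one pass per swept plane
        for y in range(h):
            for x in range(w):
                if plane[y][x] < best_cost[y][x]:  # GPU: depth-test update
                    best_cost[y][x] = plane[y][x]
                    depth_map[y][x] = depths[k]
    for y in range(h):                             # thresholding pass
        for x in range(w):
            if best_cost[y][x] > max_cost:
                depth_map[y][x] = INVALID
    return depth_map
```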
If the resulting depth map is converted to 3D geometry, staircasing artefacts are typically visible in the obtained model. In order to reduce these artefacts, an optional selective, diffusion-based depth image smoothing step is performed, which preserves true depth discontinuities larger than the steps induced by the discrete set of depth hypotheses (see Section 4.4).
4.3 Sparse Belief Propagation<br />
Belief propagation (e.g. [Weiss and Freeman, 2001]) is an approximation technique for global optimization on graphs, which is based on passing messages along the arcs of the underlying graph structure. The algorithm iteratively refines the estimated probabilities of the hypotheses within the graph by updating the probability weighting of neighboring nodes. These updates are referred to as message passing between adjacent nodes. The belief propagation method maintains an array of probabilities called messages for every arc in the graph; hence this method requires substantial memory for larger graphs and hypothesis spaces. We denote the value of the message passed from node p to node q for hypothesis d at time t by m^(t)_{p→q}(d), where d ranges over the possible hypotheses at node q. After the belief propagation procedure has converged to a stable solution, the final hypothesis assignment for every node is typically extracted by taking the hypothesis with the maximum estimated a posteriori probability. We refer to Section 4.3.2 for details on message passing and hypothesis extraction.
In image processing and computer vision applications this graph is usually induced by the rectangular image grid, with nodes representing pixels and arcs connecting adjacent pixels. Depth estimation integrating smoothness weights and occlusion handling can be formulated as a global optimization problem and solved with belief propagation methods [Sun et al., 2003]. Basic belief propagation methods are computationally demanding, but the special structure of the regularization function typically used in computer vision to enforce smooth depth maps can be exploited to obtain more efficient implementations [Felzenszwalb and Huttenlocher, 2004]. In particular, the Potts discontinuity cost function and the (optionally truncated) linear cost model allow an efficient linear-time message passing method. In the Potts model, equal depth values assigned to adjacent pixels imply no smoothness penalty, whereas any differing adjacent depth values result in a constant regularization penalty. More formally, the smoothness cost V (dp, dq)
is zero if dp = dq, and a constant λ otherwise. In the linear smoothness model we have V (dp, dq) = λ |dp − dq|.
Our implementation of belief propagation to extract the depth map from image correlation values is based on the work of [Felzenszwalb and Huttenlocher, 2004]. In contrast to previously proposed depth estimation techniques based on belief propagation, we apply the message passing procedure only to a promising subset of depth/disparity values. Consequently, the consumed memory and time is a fraction of that of the original method.
Consider the following concrete example: a depth map with 512 × 512 pixels resolution is to be extracted from 200 potential depth values. Traditional (dense) belief propagation requires about 4 × 512 × 512 × 200 message components to be stored (the factor 4 results from the utilized 4-neighborhood of pixels), which gives 800MB for 32 bit floating point components. But most of the 200 depth hypotheses per pixel can be rejected immediately because of low image similarities. If on average only 10 tentative depth hypotheses survive for every pixel, only 4 × 512 × 512 × 10 message components need to be stored, which results in 40MB of memory consumption. The actual memory footprint is somewhat larger, since additional data structures are required for sparse belief propagation.
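The quoted memory figures follow directly from the product of the involved quantities; a small sanity check (our own helper, with MB read as 2^20 bytes):

```python
def bp_message_memory(width, height, labels, bytes_per_value=4, neighbors=4):
    """Bytes of message storage for belief propagation on a grid with the
    given neighborhood size and number of depth labels per pixel."""
    return neighbors * width * height * labels * bytes_per_value

dense_mb  = bp_message_memory(512, 512, 200) // 2**20  # dense BP, 200 labels
sparse_mb = bp_message_memory(512, 512, 10) // 2**20   # sparse, ~10 surviving labels
```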
We can adopt two of the three ideas proposed in [Felzenszwalb <strong>and</strong> Huttenlocher, 2004]<br />
<strong>for</strong> sparse belief propagation:<br />
• The checker-board update pattern <strong>for</strong> messages can be used directly to halve the<br />
memory requirements.<br />
• The two pass method to compute the message updates in linear time <strong>for</strong> the Potts<br />
<strong>and</strong> the linear regularization can be modified to work <strong>for</strong> sparse representations as<br />
well (see Section 4.3.2).<br />
Additionally, a coarse-to-fine approach to belief propagation to accelerate the convergence is proposed in [Felzenszwalb and Huttenlocher, 2004]. The basic idea is to group pixels hierarchically into coarser levels and to perform message passing on the reduced graphs. The results from coarser levels are used as initialization values for the next finer level. Since the hypothesis space (i.e. the range of admissible depth values) for a group of pixels in a coarser level consists of the union of all depth hypotheses valid for the individual pixels, the data structures become less sparse. In the example above, starting with 10 tentative depth values for every pixel, the next coarser level comprises 2 × 2 pixel blocks associated with up to 40 possible depth values. Hence, there is no direct improvement in time complexity using a hierarchical approach for our proposed sparse belief propagation method.
4.3.1 Sparse Data Structures<br />
4.3.1.1 Sparse Data Cost Volume During Plane-Sweep<br />
Since belief propagation is a global optimization framework, a data structure similar to<br />
the disparity space image must be maintained, which stores the correlation value <strong>for</strong>
every depth hypothesis and pixel. We propose a sparse representation to store tentative depth/correlation value pairs. One simple implementation stores exactly K depth/correlation pairs for every pixel, which is an appropriate approach in practice. In certain situations this uniform choice for the number of hypotheses stored per pixel is not appropriate: in highly textured regions there may be very few tentative depth hypotheses, whereas in low-textured areas the similarity measure is not discriminative and the choice of K may be too low to include all potential depth candidates. Consequently, we choose a more dynamic data structure, which stores at least K depth hypotheses (together with the corresponding correlation values) and additionally allocates a pool of user-defined size storing the globally next best depth hypotheses.
For efficient updates of this data structure after computing the image similarity for a certain depth plane, the K entries associated with every pixel comprise a heap sorted with respect to the correlation value. Maintaining the heaps for every pixel is relatively cheap, since every heap has exactly K elements. The dynamically assigned depth hypotheses are maintained in a heap structure as well. Updating this pool is more costly due to its relatively large size.
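The fixed-K part of this data structure can be mirrored with a binary heap as sketched below (an illustrative Python class using `heapq`, not the thesis implementation; the heap root always holds the weakest stored hypothesis):

```python
import heapq

class KBestHypotheses:
    """Per-pixel store of the K depth hypotheses with the highest
    correlation. The heap root is the weakest stored hypothesis, so a new
    plane-sweep sample is tested in O(1) and inserted in O(log K)."""
    def __init__(self, k):
        self.k = k
        self.heap = []  # entries are (correlation, depth) tuples

    def offer(self, correlation, depth):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (correlation, depth))
        elif correlation > self.heap[0][0]:
            # New sample beats the weakest stored hypothesis: replace it.
            heapq.heapreplace(self.heap, (correlation, depth))

    def sorted_by_depth(self):
        """Depth-sorted (depth, correlation) list, as required later by
        the sparse 1D distance transform."""
        return sorted((d, c) for c, d in self.heap)
```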
4.3.1.2 Sparse Data Cost Volume <strong>for</strong> Message Passing<br />
After the plane-sweep procedure has generated the data costs associated with every pixel and every tentative depth value, the gathered sparse data cost volume is restructured for efficient access during message passing. Whereas during the plane sweep the image similarity value serves as primary key for efficient incremental updates, the sparse 1D distance transform performed during message updates requires a depth-sorted list of items. Consequently, the sparse data cost volume used in the message passing stage consists of an array of depth value/similarity value pairs for every pixel. In order to avoid memory fragmentation, a scheme similar to the compressed row storage format for sparse matrix representations is employed.
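The layout can be illustrated as follows (hypothetical helper names; as in compressed row storage, one flat entry array plus a per-pixel offset array avoids fragmentation):

```python
def build_sparse_cost_volume(per_pixel_hypotheses):
    """Pack per-pixel (depth, similarity) lists into one flat array plus a
    start-offset per pixel, analogous to compressed row storage (CRS)."""
    offsets, entries = [0], []
    for hyps in per_pixel_hypotheses:  # pixels in scanline order
        entries.extend(sorted(hyps))   # depth-sorted within each pixel
        offsets.append(len(entries))
    return offsets, entries

def pixel_hypotheses(offsets, entries, p):
    """Depth-sorted hypotheses of pixel p, read via the offset array."""
    return entries[offsets[p]:offsets[p + 1]]
```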
4.3.2 Sparse Message Update<br />
Belief propagation uses repeated communication between adjacent pixels to strengthen or weaken the support of depth hypotheses. The iterative procedure updates the value of the message going from pixel p to its neighbor q at iteration t, m^(t)_{p→q}, according to the following rule:

m^(t)_{p→q}(dq) := min_{dp} ( V(|dp − dq|) + D(dp) + Σ_{s∈N(p)\q} m^(t−1)_{s→p}(dp) ),   (4.1)
where dp and dq are tentative depth values at pixels p and q, respectively. V(·) is the regularization term and D(dp) is the image similarity value for the depth dp. The sum Σ_{s∈N(p)\q} m^(t−1)_{s→p}(dp) accumulates the incoming messages from the neighborhood of p excluding q, taken from the previous iteration (as denoted by the superscript (t − 1)).
We utilize a linear regularization model, i.e.

V(d) = λ d,

or a truncated linear approach with

V(d) = min {Vmax, λ d},

with a regularization weight λ.
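For reference, the update rule (4.1) for a single arc can be evaluated naively in quadratic time (plain Python with illustrative names; `incoming[i]` is assumed to hold the already summed messages from N(p)\q):

```python
def message_update(D_p, depths_p, depths_q, incoming, lam, v_max=float('inf')):
    """Naive O(n*m) evaluation of message update (4.1) for one arc p -> q,
    using the (optionally truncated) linear model V(d) = min(v_max, lam*d).
    D_p[i] is the data cost of hypothesis depths_p[i] at pixel p, and
    incoming[i] the summed messages from the neighbors of p excluding q."""
    return [min(min(v_max, lam * abs(dp - dq)) + D_p[i] + incoming[i]
                for i, dp in enumerate(depths_p))
            for dq in depths_q]
```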
After a user-specified number of iterations T, for each pixel p the depth hypothesis with the highest support (belief) is chosen as the final depth:

d_p^result = argmin_{dp} ( D(dp) + Σ_{s∈N(p)} m^(T)_{s→p}(dp) ).

4.3.2.1 Sparse 1D Distance Transform
For the linear regularization model the quadratic time complexity of message updates<br />
can be reduced to linear complexity using a two-pass scheme to calculate the<br />
min-convolution [Felzenszwalb and Huttenlocher, 2004]. Computing the min-convolution can be easily extended for sparse belief propagation. The procedure for the sparse 1D distance transform is illustrated in Figure 4.3 and outlined in Algorithm 2.
Figure 4.3: Determining the lower envelope using a sparse 1D distance transform. Solid lines represent given values h[pi] = D[pi] + Σ_{s≠q} m_{s→p}[pi] and dashed lines indicate values h[qi] inferred by the distance transform.
The algorithm applies a forward and a backward pass to calculate the lower envelope in essentially the same manner as in the basic belief propagation framework. The main observation for the distance transform in the sparse setting is that only the potential depth hypotheses of the two nodes forming the arc of interest, p → q, need to be considered.
Consequently, the lower envelope is derived solely from the potential depth hypotheses associated with pixels p and q. In order to apply the forward and backward passes, these two sets of selected depth values need to be sorted into a common sequence. This is the first step in Algorithm 2.
Subsequently, the procedure embeds the given samples stored in the array h at the corresponding positions in the combined sequence f. The subsequent forward and backward passes propagate the distance values through the sequence. In the forward pass, the successive element f[i + 1] is updated to

min(f[i + 1], f[i] + λ |mergeddepths[i + 1] − mergeddepths[i]|).

The backward pass is analogous.
Algorithm 2 Sparse variant of the 1D distance transform
Procedure Sparse-DT-1D
Input: h[], depthsp[], sizep, depthsq[], sizeq; result: mp→q[]
Merge-sort depthsp and depthsq into mergeddepths with at most sizep + sizeq entries
Simultaneously, fill a temporary array f such that
  f[j] := h[i], if mergeddepths[j] = depthsp[i]
  f[j] := ∞, otherwise
Perform forward pass on f
Perform backward pass on f
Fill in the result array mp→q:
  mp→q[i] := f[j], if mergeddepths[j] = depthsq[i]
The merge sort step in Algorithm 2 can be avoided by precomputing suitable arrays, but this approach is only slightly faster than using the inlined merge sort step and requires additional memory.
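Algorithm 2 translates into the following executable sketch for the linear model V(d) = λd (plain Python; the names and the use of `math.inf` for positions that carry no sample of p are our choices):

```python
import math

def sparse_dt_1d(h, depths_p, depths_q, lam):
    """Sparse 1D distance transform (min-convolution) for the linear
    smoothness model V(d) = lam * |d|, following Algorithm 2.
    h[i] is the given value at depth hypothesis depths_p[i]; the result
    is the message evaluated at the depth hypotheses of q."""
    # Merge-sort step: combine the sorted hypothesis lists of p and q.
    merged = sorted(set(depths_p) | set(depths_q))
    # Embed the given samples into f; unknown positions start at infinity.
    f = [math.inf] * len(merged)
    for i, d in enumerate(depths_p):
        f[merged.index(d)] = h[i]
    # Forward pass: propagate lower-envelope values towards larger depths.
    for j in range(1, len(merged)):
        f[j] = min(f[j], f[j - 1] + lam * (merged[j] - merged[j - 1]))
    # Backward pass: propagate towards smaller depths.
    for j in range(len(merged) - 2, -1, -1):
        f[j] = min(f[j], f[j + 1] + lam * (merged[j + 1] - merged[j]))
    # Read out the message at the depth hypotheses of q.
    return [f[merged.index(d)] for d in depths_q]
```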
4.4 Depth Map Smoothing<br />
If the 3D models generated by the plane-sweep procedure are visualized directly, staircase artefacts induced by the discrete set of depth hypotheses are often clearly visible. If several individual depth maps (respectively the induced 3D meshes) are combined into one final model (e.g. as described in Chapter 8), these artefacts are typically removed by suitable averaging of the single models, and the smoothing procedure proposed in this section is not necessary. Otherwise, the depth smoothing approach described in this section, which selectively removes the staircase effects without filtering larger depth discontinuities, can be applied.
In the following we assume that the tentative depth values of every pixel are evenly spaced in a user-specified interval, and that successive depth values differ by a constant depth difference T. Hence, depth variations between neighboring pixels in the magnitude of T (or a small multiple of T) indicate potential regions for depth map smoothing. We perform this selective filtering by applying a diffusion procedure to minimize
min_d ∫_p (d − d0)² + µ ‖W(p) · ∇d‖² dp.
In this term, d0(·) denotes the depth map (a function of the pixel position p) generated by the plane-sweeping method in the first place, d(·) is the final smoothed depth map, and W(·) is a weighting vector described below. µ is a user-specified weight balancing the data term (d − d0)² and the regularization term ‖W(p) · ∇d‖².
In order to define the weight W(p) at pixel position p, the original depth map d0 is sampled at position p and at its four neighbors, yielding a vector N = (d0^E, d0^W, d0^N, d0^S). If the depth difference |d0 − d0^(·)| is smaller than T (or another user-given threshold), the diffusion process is allowed in the corresponding direction and the appropriate component of W(p) is set to one. All other components are set to zero to inhibit the diffusion.
In addition to the directional gradient (i.e. the finite differences) in the source depth map, confidence information can be incorporated into W as well. Depth values for pixels with low confidence (e.g. detected by low image similarity) result in directional diffusion from confident pixels to unconfident ones by an appropriate update of W. We build a confidence map by assigning one to pixels with confident depths and zero otherwise; this map is based on hard-thresholding the employed image similarity measure at the extracted depth value. The corresponding components of W are multiplied by the confidence map entries sampled at the neighboring pixels.
This diffusion procedure can again be executed on graphics hardware to increase performance. Since Chapter 6 is entirely dedicated to variational methods for multi-view vision, we postpone the detailed description of the GPU-based implementation of diffusion processes and variational approaches in general to that chapter.
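One possible CPU discretization of this selective diffusion is an explicit gradient-descent iteration as sketched below (step size `tau`, iteration count and all names are our assumptions, not taken from the thesis):

```python
def smooth_depth_map(d0, step, mu, iters=100, tau=0.2):
    """Selective diffusion sketch: gradient descent on
    sum_p (d - d0)^2 + mu * ||W * grad d||^2, where diffusion towards a
    neighbor is enabled (weight 1) only if the original depth difference
    is at most `step` (the spacing T of the depth hypotheses)."""
    h, w = len(d0), len(d0[0])
    d = [row[:] for row in d0]
    for _ in range(iters):
        nd = [row[:] for row in d]
        for y in range(h):
            for x in range(w):
                diff = 0.0
                for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                    ny, nx = y + dy, x + dx
                    # Diffusion allowed only across small *original* steps.
                    if 0 <= ny < h and 0 <= nx < w and \
                            abs(d0[ny][nx] - d0[y][x]) <= step:
                        diff += d[ny][nx] - d[y][x]
                # Descent step: data term pulls back to d0, enabled
                # neighbors diffuse towards each other.
                nd[y][x] = d[y][x] + tau * (mu * diff - (d[y][x] - d0[y][x]))
        d = nd
    return d
```

In a 1D toy example with depths (0, 1, 10) and T = 1, the large step to the third pixel is preserved while the small step between the first two is smoothed.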
4.5 Timing Results<br />
In this section we provide more detailed timing results for GPU-based depth estimation using the plane-sweeping approach. The benchmarking platform is a 3 GHz Pentium 4 CPU and an NVidia GeForce 6800GTO GPU. Since the adjustable parameters of our implementation have many degrees of freedom (image similarity score, aggregation window dimensions, number of used source images, etc.), a tabular representation of the obtained timing results (Table 4.1) is preferred over a graphical one. The input for the depth estimation method is three grayscale source images at the resolution specified in the appropriate column (512 × 512 or 1024 × 1024). The use of power-of-two image dimensions is caused by the only partial support of graphics hardware for non-power-of-two
textures. The timing results given in this table essentially reflect the performance of applying the homography to the sensor images and calculating the stated dissimilarity score, since the time used for the actual depth extraction is negligible. Note that these timings are largely insensitive to the provided image content.
Resolution #depth planes Aggr. window Dissimilarity score Time<br />
512 × 512 200 5 × 5 SAD 0.918s<br />
ZNSAD 1.573s<br />
NCC 1.647s<br />
ZNCC 2.344s<br />
9 × 9 SAD 1.362s<br />
ZNSAD 2.426s<br />
NCC 2.481s<br />
ZNCC 3.591s<br />
400 5 × 5 SAD 1.699s<br />
ZNSAD 3.058s<br />
NCC 3.188s<br />
ZNCC 4.611s<br />
9 × 9 SAD 2.579s<br />
ZNSAD 4.774s<br />
NCC 4.855s<br />
ZNCC 7.103s<br />
1024 × 1024 200 5 × 5 SAD 3.772s<br />
ZNSAD 7.096s<br />
NCC 7.402s<br />
ZNCC 10.861s<br />
9 × 9 SAD 6.059s<br />
ZNSAD 11.446s<br />
NCC 11.656s<br />
ZNCC 17.206s<br />
400 5 × 5 SAD 7.540s<br />
ZNSAD 14.146s<br />
NCC 14.842s<br />
ZNCC 21.684s<br />
9 × 9 SAD 11.973s<br />
ZNSAD 22.863s<br />
NCC 23.281s<br />
ZNCC 34.379s<br />
Table 4.1: Timing results for the plane-sweeping approach on the GPU with winner-takes-all depth extraction at different parameter settings and image resolutions.
At higher resolutions, the expected theoretical ratios between the run-times of the various similarity scores are attained. Every score uses one or several accumulation passes to calculate Σ_{i∈W} op(Xi, Yi), which comprises the dominant fraction of the total run-time. The SAD requires only one accumulation pass (Σ_{i∈W} |Xi − Yi|), whereas the NSAD respectively the NCC need two passes, and the ZNCC performs three invocations of the accumulation procedure.∗ This explains the observed run-time ratios of approximately 1:2:2:3 for the evaluated correlation scores.
Sparse belief propagation for the final depth extraction is much more costly in terms of computation time, as illustrated in Figure 4.4. The solid graph displays the required total run-time against the number of maintained heap entries for sparse belief propagation. The graph shows essentially linear behavior, since the linear-time message passing dominates the heap construction with its O(K log K) time complexity. For comparison, the dashed line depicts the run-time of the pure winner-takes-all approach. Sparse belief propagation with just one heap entry requires about 5.8s, whereas the equivalent winner-takes-all method needs approximately 3s for these settings. The corresponding depth images obtained for the utilized dataset are shown later in Section 4.6.
Figure 4.4: Sparse belief propagation timing results with respect to the number of heap entries K (run-time in seconds vs. number of sparse BP entries; solid line: BP times, dashed line: WTA time). The image and depth map resolution is 512 × 512 pixels and 200 depth hypotheses are evaluated using a 7 × 7 ZNCC image similarity score.
∗ Recall Section 4.2.2. Additionally, the summations involving only the key image can be precomputed.
4.6 Visual Results<br />
In this section we provide depth maps and 3D models for real datasets in order to demonstrate the performance of our GPU-based depth estimation procedure and to illustrate the differences between the winner-takes-all (WTA) depth extraction approach and the sparse belief propagation method. All source images are resampled to a resolution of 512 × 512 pixels, since images with power-of-two dimensions are still better supported on graphics hardware.
The Landhaus dataset shown in Figures 4.5 and 4.6 depicts a historical statue embedded in a building facade. Three grayscale images with small baselines are used for depth estimation. Figure 4.5 shows depth images generated by the winner-takes-all approach and by the sparse belief propagation approach at different numbers of maintained heap entries K; 200 potential depth values are examined in all cases. The reported timings correspond to the values displayed in Figure 4.4. Most notably, belief propagation enhances the depth maps in the textureless wall regions on either side of the statue. Additionally, Figure 4.6 shows two 3D models represented as colored point sets, obtained by a WTA depth extraction step and by a sparse belief propagation procedure using 20 surviving depth entries. Both models look relatively similar and only a closer inspection reveals the outliers. If the models are rendered as shaded triangular meshes as in Figure 4.7, the noisy structure of the WTA result becomes clearly visible. Note that many outliers found in the initial depth maps can be removed by the subsequent depth image fusion procedure, which generates a proper 3D model from a set of depth maps.
Three source images of another statue dataset and the respective depth results are shown in Figure 4.8. Here, 400 tentative depth planes are evaluated on three adjacent images with a small baseline. Since the dark background scenery to the left and right of the statue is outside the plane-sweep range, the depth image has poor quality in these regions. Belief propagation significantly smooths the depth map, especially near depth discontinuities.
4.7 Discussion<br />
GPU-based plane-sweeping procedures allow the efficient generation of depth images from multiple small-baseline images. Several image dissimilarity measures are available in our implementation; they are efficiently calculated on graphics hardware and give good results even under varying lighting conditions.
In the case of highly textured scenes, a final winner-takes-all depth extraction method is sufficient and fast enough to allow almost interactive feedback to the user. Optionally, a sparse belief propagation method is proposed, which significantly enhances the depth map in ambiguous regions.
Future work needs to address a qualitative and quantitative comparison of traditional belief propagation and our proposed sparse counterpart. The question of whether the early rejection of unpromising depth values can have a negative impact on the extracted depth maps is still unresolved. Additionally, even sparse belief propagation is 5 to 10 times slower than the (fully hardware-accelerated) winner-takes-all strategy, which raises the question of whether further performance enhancements are possible for sparse BP.
In Chapter 7 a GPU-based one-dimensional energy minimization approach based on<br />
the dynamic programming principle is presented.
(a) Sensor image (b) Without BP (WTA); 3s<br />
(c) BP, K = 10; 16.5s (d) BP, K = 20; 29.5s<br />
(e) BP, K = 30; 40.3s (f) BP, K = 40; 50.1s<br />
Figure 4.5: Depth images with <strong>and</strong> without belief propagation <strong>for</strong> the L<strong>and</strong>haus dataset.<br />
With more allowed heap entries K, the amount of noisy pixels in textureless regions is<br />
reduced, but the runtime increases accordingly.
(a) Without BP (WTA) (b) With BP (K = 20)<br />
Figure 4.6: Point models with <strong>and</strong> without belief propagation<br />
(a) Without BP (WTA) (b) With BP (K = 20)
Figure 4.7: Shaded mesh models with and without belief propagation
(a) Left image (b) Middle (sensor) image (c) Right image
(d) Without BP (WTA), 6.7 s (e) With BP, 37 s
Figure 4.8: Depth images with and without belief propagation
Chapter 5
Space Carving on 3D Graphics Hardware
Contents
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Volumetric Scene Reconstruction and Space Carving . . . . . . 64
5.3 Single Sweep Voxel Coloring in 3D Hardware . . . . . . . . . . 66
5.4 Extensions to Multi Sweep Space Carving . . . . . . . . . . . . 70
5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Introduction
This chapter presents a direct scene reconstruction approach fully accelerated by graphics
hardware. Like the method discussed in the previous chapter, it uses the plane-sweep
principle to obtain a model from multiple images. In contrast to the plane sweep based
depth estimation approach, however, the voxel coloring and space carving implementations
proposed in this chapter directly generate a true 3D model from a large set of input views.
Voxel coloring [Seitz and Dyer, 1997] and its derivatives incorporate multiple, optionally
wide-baseline views simultaneously and directly produce volumetric 3D models. Methods
derived from the voxel coloring approach test a large number of voxels for photo-consistency
and are therefore rather slow. Reported calculation times for voxel coloring range from
several seconds for low resolutions up to hours for high quality models.
In this chapter we address efficient implementations of voxel coloring and space carving
exploiting commodity 3D graphics cards. Our current implementation is based on OpenGL
using fragment shader extensions (ATI_fragment_shader in particular). The hardware
requirements are rather modest; in particular, any ATI Radeon 8500 or better is supported
by our implementation. Medium resolution models are generated at interactive rates on
present-day graphics hardware, whereas high resolution models are typically obtained
after a few seconds. There are at least two application scenarios which can benefit from
a fast voxel coloring implementation: first, our implementation provides a fast preview
for more sophisticated algorithms. The second scenario addresses improved functionality
of plenoptic image editing: modifications in one or several images can be used to update
the 3D model instantly. After recalculating the new model, these changes are propagated
to the remaining images as well. Thus, specular highlights on surfaces and similar flaws
can be removed interactively to improve the quality of the generated 3D model.
5.2 Volumetric Scene Reconstruction and Space Carving
Voxel coloring [Seitz and Dyer, 1997] generates a volumetric model by analyzing the
consistency of scene voxels. As the voxel space is traversed using a plane sweeping
approach, the state of each voxel is determined. For scenes without translucent objects a
voxel can be classified as either empty or opaque. During the voxel coloring procedure,
voxels are projected into the input images and the distribution of the corresponding pixel
values is used to determine the state of each voxel. A so-called photo-consistency (or
color-consistency) measure decides whether a voxel lies on the surface of a scene object,
i.e. whether the voxel is opaque. This method is conservative in the sense that only
assuredly inconsistent voxels are labeled as empty. Therefore, already processed voxels
can be used to determine the visibility of voxels with respect to the input views.
In order to traverse the voxels in correct depth order by a simple plane sweep, the
placement of cameras is restricted by the so-called ordinal visibility constraint. This
constraint ensures that voxels are visited prior to the voxels they occlude. In
[Seitz and Dyer, 1999] it is shown that this visibility constraint is satisfied if the scene to
be reconstructed lies outside the convex hull of the camera centers. One typical camera
configuration suitable for voxel coloring and possible slices used for reconstruction are
shown in Figure 5.1.
Several extensions of voxel coloring were proposed to allow more general camera
placements. Space carving [Kutulakos and Seitz, 2000], generalized voxel coloring
[Culbertson et al., 1999] and multi-hypothesis voxel coloring [Eisert et al., 1999] remove
the limitations on camera positions. Space carving performs multiple iterations of voxel
coloring for different sweep directions; only a suitable subset of all input views is used
for each sweep.
A crucial question is how to measure color consistency: the original voxel coloring
approach utilized the variance of the colors of projected voxels to determine consistency.
Stevens et al. [Stevens et al., 2002] propose a histogram-based consistency metric: in
their approach the footprint of a voxel in an image contains several pixels, which are
organized in a histogram, and a voxel is consistent if the histograms of its footprints are
not pairwise disjoint. The consistency measure presented by Yang et al.
[Yang et al., 2003] handles non-Lambertian, specular surfaces explicitly.
Figure 5.1: A possible configuration for plane sweeping through the voxel space. The
camera positions are restricted such that voxels in subsequent layers can only be occluded
by already processed voxels.
Voxel coloring is a computationally expensive procedure, which typically requires at
least tens of seconds and up to tens of minutes to compute the reconstruction. Several
researchers proposed improved implementations of voxel coloring: e.g. Prock and Dyer
[Prock and Dyer, 1998] primarily utilize a hierarchical octree representation to speed up
voxel coloring, and additionally use graphics hardware to accelerate certain calculations.
Their multi-resolution voxel coloring method needs about 15 s to generate a reconstruction
with 256³ voxels. However, a hierarchical, multi-resolution approach to volumetric 3D
reconstruction can potentially miss scene details. Sainz et al. [Sainz et al., 2002] use the
texture mapping features of 3D graphics hardware to accelerate the computations.
Nevertheless, a 256³ voxel model requires several minutes to compute even on recent
hardware.
Seitz and Kutulakos [Seitz and Kutulakos, 2002] present an image editing approach for
multiple images of a 3D scene. Changes in one image are propagated to the other images
by using an initially generated voxel model of the scene, therefore direct manipulation of
surface textures and other image editing operations are possible. Image editing is limited
to methods that do not require a complete volumetric reconstruction step to propagate
the modifications. With our efficient space carving implementation, more general editing
methods become possible, useful for a user-driven interactive refinement of voxel models,
since the volumetric reconstruction can be regenerated almost instantly from the altered
input images.
5.3 Single Sweep Voxel Coloring in 3D Hardware
In this section we describe the hardware based implementation of voxel coloring. This
description applies to the case of a single sweep for camera configurations satisfying the
ordinal visibility constraint [Seitz and Dyer, 1997]; we discuss the extensions required for
the multi sweep case in Section 5.4.
The input for our method consists of N resampled color images, the corresponding
projection matrices, and a bounding box denoting the space volume to be reconstructed.
This bounding box is organized as a stack of parallel planes, which are traversed in a
front-to-back ordering during the reconstruction procedure. The algorithm maintains a
depth map for every camera, which stores the depth (with respect to the camera position)
of the model reconstructed so far. For each plane the algorithm executes the following
steps:
1. The images of the camera views are projected onto the current plane and a
consistency measure is evaluated.
2. Surface pixels (voxels) are determined by thresholding the consistency map.
3. For each camera view the associated depth map is updated by rendering the
currently reconstructed voxel layer according to the input views.
At the end of each iteration a layer of voxels is obtained and can be used for further
processing.
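As a compact illustration (not the GPU code itself), the per-plane loop above can be sketched in Python; the three callables are placeholders for the hardware passes described in the following sections:

```python
import numpy as np

def plane_sweep(num_planes, project, threshold_consistency, update_depth_maps):
    """Front-to-back plane sweep; the callables stand in for the GPU passes."""
    model = []
    for layer in range(num_planes):
        hypotheses = project(layer)                  # step 1: project views onto the plane
        opacity = threshold_consistency(hypotheses)  # step 2: keep photo-consistent voxels
        update_depth_maps(opacity, layer)            # step 3: occlusion bookkeeping per view
        model.append(opacity)
    return np.stack(model)                           # one binary slice per plane
```

The front-to-back ordering is essential: step 3 records the voxels of earlier (closer) layers so that later layers are tested against the correct occlusions.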
Figure 5.2 illustrates the first step in the procedure to obtain the color of a voxel with
respect to a particular input view. Perspective texture mapping is combined with a depth
test against the depth map available so far to select unoccluded voxels. This procedure
is repeated for every input view to accumulate the necessary information for the color
consistency calculation.
The following sections describe the steps performed in our implementation in more
detail.
5.3.1 Initialization
In addition to the currently calculated voxel layer, the algorithm maintains a depth map
for every input view to test the visibility of voxels. Since voxel layers are processed in a
front-to-back ordering, it would be sufficient to use bitmaps for the depth maps (pixels
with value 1 indicate empty space along the line-of-sight, whereas value 0 denotes rays
with already processed opaque voxels); here we use range images with gray levels
indicating the depth of the voxel layer, for better visual feedback.
At the beginning of the sweep these depth maps are cleared with the value indicating
empty voxels (i.e. 1). Additionally, we need to handle voxels that are outside the viewing
volume of a camera as well (since other cameras can possibly see these voxels). We set
the texture coordinate wrapping mode to GL_CLAMP to handle voxels outside the frustum
correctly: whenever a depth outside the frustum is accessed, a minimal depth value (0)
is returned. Note that only voxels in front of the camera can be culled against the viewing
frustum, therefore all camera positions must lie entirely outside the reconstructed volume.
Figure 5.2: Perspective texture mapping using visibility information. The original input
image (depicted on the leftmost quad) is filtered using the depth map (in the middle), and
only unoccluded pixels are rendered onto the current voxel layer.
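A CPU sketch of this out-of-frustum handling (the function name and conventions are illustrative, not taken from the implementation):

```python
def depth_lookup(depth_map, x, y):
    """Emulates the clamped texture access: coordinates outside the camera's
    frustum return the minimal depth 0, so such voxels never pass the
    visibility test for this view and are left to the other cameras."""
    h = len(depth_map)
    w = len(depth_map[0])
    if 0 <= x < w and 0 <= y < h:
        return depth_map[y][x]
    return 0.0  # outside the frustum: minimal depth value
```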
5.3.2 Voxel Layer Generation
With the knowledge of the depth maps generated for every view so far, an estimate of
photo-consistency can be calculated. We accumulate the consistency value in a manner
very similar to the method proposed by Yang et al. [Yang et al., 2002]. In order to obtain
the color of a voxel as seen from a particular input view, projective texture mapping is
applied to determine the color hypothesis for every voxel in the current layer. The color
hypotheses for all visible views are accumulated to obtain a consistency score for each
voxel.
Using the color variance as the consistency function is suboptimal on graphics hardware:
a significant number of passes is needed to calculate the variance∗, and the squaring
operation causes numerical problems due to the limited precision available on the GPU.
A simple consistency measure is the length of the interval spanned by the color hypotheses
for a voxel, which can be easily computed on graphics hardware and turned out to yield
reasonable reconstructions. More formally, a voxel projected to pixels with colors
c_i = (c_i.r, c_i.g, c_i.b) in the input views i is assigned the consistency value
c = max_{j ∈ {r,g,b}} (max_i c_i.j − min_i c_i.j).
∗One sweep over all input views is required to count the number of visible views for every voxel,
a second sweep is required to calculate the mean, and a third sweep to obtain the variance.
If the color hypotheses have a significant spread, the interval is too large and the voxel
is labeled as inconsistent. Calculation of the interval length can be done with two complete
sweeps over the input views: the first sweep uses a blending equation set to GL_MIN and
the second sweep sets the blending equation to GL_MAX. A final pass calculates the length
of the interval, but this step can be integrated into the thresholding step that determines
consistent voxels.
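As a NumPy reference of this measure (the array shapes and the threshold value are illustrative assumptions, not part of the GPU implementation):

```python
import numpy as np

def interval_consistency(hypotheses):
    """hypotheses: (N, H, W, 3) color hypotheses of the current voxel layer,
    one per visible view.  The GL_MIN and GL_MAX sweeps correspond to the
    per-view minimum and maximum below."""
    interval = hypotheses.max(axis=0) - hypotheses.min(axis=0)
    return interval.max(axis=-1)  # the worst of the r, g, b channels decides

def opacity_mask(hypotheses, threshold=0.1):
    """Final thresholding pass: small intervals mean photo-consistent voxels."""
    return interval_consistency(hypotheses) <= threshold
```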
The final result of this step is an opacity bitmap (stored in an off-screen pixel buffer)
indicating the consistent voxels of the currently processed layer. This binary image
constitutes one slice of the final volumetric model and is used to update the visibility
information (Section 5.3.3). In our implementation the opacity of a voxel is stored in the
alpha channel and the mean color of the voxel is stored in the remaining channels.
In order to achieve high performance we exploit several features of graphics hardware:
Visibility Determination Only views that are actually able to see a voxel contribute
to the consistency value, and image pixels from occluded cameras should be ignored. We
employ the alpha test functionality for the visibility calculation: the depth index of the
current voxel layer is compared with the value stored in the depth map for the appropriate
view. Pixels that fail the alpha test are discarded and are therefore ignored during the
consistency calculation.
Note that it is possible to count the number of cameras seeing a voxel efficiently using
the stencil buffer. With this count it is easy to extract only the surface voxels of the
model.
Selection of Consistent Voxels Voxels of the current layer are labeled as opaque if
they are photo-consistent and if they are not part of the background. In our
implementation, dark pixels with an intensity value below some user-defined threshold
are treated as background pixels and the corresponding voxels are set to empty.
Additional Processing At this stage of the procedure, additional processing of the
voxel bit-plane can be applied. In particular, prior knowledge from previous sweeps (see
Section 5.4) can be used to refine the generated slice. Furthermore, the generated voxel
slice can be copied into a 3D texture used for direct visualization of the obtained
volumetric model.
5.3.3 Updating the Depth Maps
After determining the filled voxels in the current layer, the depth maps must be updated
to reflect occlusions by the additional solid voxels. For each input view the depth map is
selected as rendering target and the corresponding camera matrix is used for projection.
The blending mode is set to GL_MIN to achieve a conditional update of the depth values.
We apply a small fragment program to filter empty voxels by assigning a maximum depth
value to these pixels; consequently, transparent voxels do not affect the depth map.
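A CPU analogue of this conditional update (identity projection assumed for brevity; in the real implementation each layer is rendered with the view's camera matrix):

```python
import numpy as np

def update_depth_map(depth_map, layer_opacity, layer_index):
    """GL_MIN-style blending: opaque voxels write their depth index, empty
    voxels are mapped to a maximum depth (here infinity) and therefore leave
    the depth map unchanged."""
    rendered = np.where(layer_opacity, float(layer_index), np.inf)
    return np.minimum(depth_map, rendered)
```

Because layers are processed front-to-back, the minimum automatically keeps the depth of the closest opaque voxel along each ray.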
Figure 5.3 shows the successive update of the depth maps for two input views. Snapshots
of the depth maps were taken after 25%, 50% and 100% of the reconstruction process.
(a) view 1: 25% (b) 50% (c) 100%
(d) view 2: 25% (e) 50% (f) 100%
Figure 5.3: Evolution of depth maps for two views during the sweep process. Darker
regions are closer to the camera. The images show depth maps obtained after processing
25%, 50% and 100% of the reconstructed volume.
5.3.4 Immediate Visualization
Immediate visual feedback is necessary to rapidly evaluate the quality of the reconstructed
model. Reading back the voxel model from graphics memory into main memory to
generate a surface representation is time-consuming, therefore direct volume rendering
methods [Engel and Ertl, 2002] are more appropriate. The individual slices obtained by
voxel coloring can be copied into a 3D texture and visualized immediately. Alternatively,
the depth images generated for the input views can be displayed as displacement maps
[Kautz and Seidel, 2001], which allows the height-field stored in a texture to be rendered
from novel views for visual inspection.
5.4 Extensions to Multi Sweep Space Carving
The procedure described in Section 5.3 is limited to cameras fulfilling the ordinal visibility
constraint. In order to obtain reconstructions for more general camera setups, the plane
sweep procedure is repeated several times for different sweep directions, and only a
compatible set of cameras is used in each iteration. The difference from the single sweep
approach lies in the amount of knowledge from prior sweeps used in the current sweep.
We have tested three alternatives:
Independent Sweeps All sweeps are performed independently and no prior
information is used in the current sweep. The reconstructed volumetric model is the
intersection of the models generated by the independent sweeps, which is performed by
the main CPU. This approach places no restriction on the resolution of the voxel space,
but the frequent transfer of voxel data from graphics memory imposes a severe
performance penalty. In our experiments we observed significantly longer running times
when voxel data is read back into main memory: copying image data from the frame
buffer or texture memory into main memory is a rather slow operation (in contrast to
the reverse direction). This performance penalty depends on the resolution, and results
in a more than doubled execution time e.g. at 256³ scene resolution.
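The CPU-side intersection of the per-sweep models is a simple voxel-wise AND; a sketch with boolean occupancy arrays (names are illustrative):

```python
import numpy as np

def intersect_models(models):
    """A voxel of the final model is opaque only if no sweep carved it away."""
    return np.logical_and.reduce(models)
```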
Complete Prior Knowledge The opacity values of the voxels generated in the
previous sweep are stored in a 3D texture, which is used in the subsequent sweep to
determine already carved voxels. The need for a 3D texture residing in graphics memory
limits the maximum resolution of the voxel space; on consumer level graphics hardware
the resolution is typically bounded by 256³. Two 3D textures are required simultaneously:
one texture represents the previous model and the other one serves as the destination for
the model generated in the current sweep. Additionally, the continuous access to a 3D
texture lowers the runtime performance of the implementation. A significant advantage
of this approach is the opportunity to visualize the generated model immediately using
direct volume rendering methods.
Partial Prior Knowledge In order to avoid the expensive 3D texture representing
complete prior knowledge, a height field can be used as a trade-off between the former
two alternatives. In the following we assume orthogonal sweep directions along the major
axes of the voxel space. In addition to the depth maps for the input views, the preceding
sweep maintains a depth map in the sweep direction. This height-field is used to inhibit
already carved voxels from being classified as opaque in the current sweep, which is
achieved by comparing the appropriate component of the voxel position with the value
stored in the height field (see Figure 5.4).
[Figure 5.4 sketch: the current sweep proceeds along depth indices 1…8, while a height
field recorded along the previous, orthogonal sweep direction marks the already carved
voxels.]
Figure 5.4: Plane sweep with partial knowledge from the preceding sweeps. Carved voxels
remain unfilled by using a depth image. The shaded region is known to be empty from
the previous sweep, therefore filling voxels inside this region is prohibited.
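This inhibition test can be sketched as follows (the front-to-back depth convention and all names are assumptions for illustration):

```python
import numpy as np

def inhibit_carved(opacity, prev_axis_coord, height_field):
    """A voxel whose coordinate along the previous sweep direction lies in
    front of the recorded height field is known to be empty and must not be
    classified as opaque in the current sweep."""
    carved = prev_axis_coord < height_field
    return opacity & ~carved
```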
The final model is again the intersection of the volumetric models generated by the
sweeps, since the incoming knowledge for each sweep is only a partial model. In order to
avoid the expensive transfer of data from graphics memory to perform this intersection
in software, we display the result of the final sweep to the user. Additionally, we use the
height-fields of all prior sweeps to approximate the volumetric model.
In this approach the available graphics memory does not limit the voxel space resolution,
but the depth of the color channel is a restricting factor if high precision depth buffers
are not available.
5.5 Experimental Results
5.5.1 Performance Results
We have implemented voxel coloring and space carving as described in Sections 5.3
and 5.4. Our implementation is based on the fragment shader features exposed by the
ATI_fragment_shader OpenGL extension; hence it is possible to perform hardware
accelerated voxel coloring and space carving on low-end or mobile graphics hardware as
well.
First, we give performance results obtained with our implementation. The benchmarking
system is equipped with an AMD Athlon XP2000 CPU and an ATI Radeon 9700 Pro
graphics card. The performance plots are created for the synthetic “Bowl” dataset (see
Figure 5.7): 36 views of the model were captured using virtual turntable software, and
each sweep uses 9 views for the reconstruction. Figure 5.5(a) presents timing results for
the voxel coloring implementation at different resolutions. The required time for voxel
coloring is approximately linear in the depth resolution (i.e. the number of generated
slices). Surprisingly, the times needed for resolutions from 32 × 32 × d up to 128 × 128 × d
are close to the time required for 256 × 256 × d. The runtime for lower resolutions is
dominated by the expensive pixel buffer switches (whose cost is linear in the number of
slices, but independent of the resolution); at higher resolutions the fill rate of the graphics
hardware becomes more dominant. For 256 × 256 × d scene resolutions our
implementation of the voxel coloring approach generates 3D models at interactive rates.
Figure 5.5(b) compares the observed timings for the proposed space carving methods.
The final 3D model was generated using four sweeps in order to utilize all 36 captured
views; the timings for single sweep voxel coloring are displayed for comparison. For
resolutions up to 128³ space carving is slightly more expensive than performing four
voxel coloring sweeps, since some time is required to merge the individual sweeps. At
256³ resolution, space carving maintaining the full voxel model in graphics memory runs
out of memory and requires substantially more time.
5.5.2 Visual Results
In this section we illustrate the visual quality of the obtained reconstructions. At first we
demonstrate our implementation on a synthetic dataset obtained by off-screen rendering
and capturing of a 3D dinosaur model. The resolution of the input images is 256 × 256.
Several input images are shown in Figure 5.6(a)–(c). The volumetric texture directly
obtained by the space carving procedure is shown in Figure 5.6(d); in order to reduce the
size of the 3D texture, only luminance values instead of colors are stored. Figure 5.6(e)
and (f) are snapshots showing the 3D model as a point cloud within a VRML viewer.
Another synthetic dataset, the “Bowl” dataset, is shown in Figure 5.7. The images were
obtained under the same conditions as the Dino dataset. In Figure 5.7(d) complete prior
knowledge stored in a 3D texture is used, whereas in Figure 5.7(e) the already carved
model is approximated by height-fields. The latter model contains more outliers and
noise, but the memory requirement is substantially reduced.
The real dataset consists of images showing a historic statue (Figure 5.8(a)–(c)). In
Figure 5.8(d) the surface voxels of the reconstructed model generated from 7 input views
are shown as a point cloud. The number of voxels is 1024 × 1024 × 250 and the pure
voxel coloring took about 4.8 s. Reading the voxels back into main memory and
generating the VRML file requires an additional 40 s. A lower resolution version (256³)
of the same dataset, generated in 0.77 s, is shown in Figure 5.9.
5.6 Discussion
This chapter described a hardware accelerated approach to the voxel coloring and space
carving scene reconstruction methods. Voxel coloring can be performed at interactive
rates for medium scene resolutions, and volumetric models can be obtained with space
carving very quickly (on the order of seconds). Despite the simple consistency measure
used in our implementation, the obtained 3D models are suitable for visual feedback to
the user to estimate the parameters used for the final high-quality, software-based
reconstruction. With the new features provided by modern graphics processors, more
sophisticated consistency measures can be implemented; in particular, a histogram-based
consistency measure [Stevens et al., 2002] is a potential candidate for an efficient
implementation in graphics hardware.
At low resolutions the performance of our implementation is dominated by the multi-pass
rendering overhead. Consequently, reducing the number of passes, especially at coarse
resolutions, may yield near real-time generation of volumetric models. Such improvements
need further investigation.
[Figure 5.5 plots: (a) runtime in milliseconds versus depth resolution (50–250) for image
resolutions 256×256×d, 512×512×d and 1024×1024×d; (b) runtime in milliseconds versus
voxel space resolution (32³ to 256³) for partial knowledge, independent sweeps, complete
knowledge, and single sweep voxel coloring.]
Figure 5.5: Timing results for the Bowl dataset. Each sweep used 9 views to calculate
the consistency of voxels. (a) shows timing results for voxel coloring using a single plane
sweep at different resolutions. (b) illustrates timing results for space carving using
multiple sweeps at various voxel space resolutions. With the exception of voxel coloring,
which is depicted for comparison, four sweeps are performed to obtain the final model.
Space carving with complete prior knowledge requires almost 33 s at 256³ resolution;
this behavior is caused by a shortage of graphics memory.
Figure 5.6: (a)–(c) Three input views (of 36) from the synthetic Dino dataset. (d) The
obtained volumetric model visualized with a 3D texture; we use only luminance and
alpha channels to reduce the memory footprint of the 3D texture. (e) and (f) show the
3D model rendered as a point cloud. In our current implementation, colors for surface
voxels are assigned in the final sweep, hence surface voxels not seen in the final sweep
have a default color.
Figure 5.7: (a)–(c) Three input views (of 36) from the synthetic Bowl dataset. (d) The
obtained volumetric model visualized with a 3D texture; the model was generated in
1.4 s. (e) is generated by approximating the result of the previous sweeps with
height-fields instead of a full 3D texture.
Figure 5.8: (a)–(c) Three input views from an image sequence showing a statue. (d)
shows a high resolution reconstruction generated by carving 250 million initial voxels;
the pure voxel coloring done in graphics hardware required less than 5 s. Only surface
voxels are shown as a point cloud.
Figure 5.9: (a) A 3D reconstruction generated by single sweep voxel coloring using a space of 256 × 256 × 250 voxels. 7 input views are used for the reconstruction. Voxel coloring and VRML generation required about 3s. The displayed geometry consists of surface voxels rendered as points, hence several holes are apparent. (b) A depth image for the same dataset generated in 0.77s.
Chapter 6
PDE-based Depth Estimation on the GPU
Contents
6.1 Introduction
6.2 Variational Techniques for Multi-View Depth Estimation
6.3 GPU-based Implementation
6.4 Results
6.5 Discussion
6.1 Introduction<br />
This chapter describes a variational approach to multi-view depth estimation, which is accelerated by 3D graphics hardware. Variational methods for multi-view depth estimation have their foundations in variational calculus and numerical analysis. The result of these procedures is a depth image minimizing an energy functional that incorporates image similarity and smoothness regularization terms. In contrast to many window-based dense matching approaches, which favor fronto-parallel surfaces, the utilized variational depth estimation method is based on per-pixel image similarities and works well for slanted surfaces. Depth values interact with the surrounding depth hypotheses through the regularization term.
Energy-based approaches to dense correspondence estimation incorporate image similarity and smoothness constraints into the objective function and search for an appropriate minimum. Consequently, these methods allow the propagation of depth values into textureless regions, where no robust correspondences are available. Variational techniques express the discrete energy function in continuous terms and solve the corresponding Euler-Lagrange partial differential equation numerically.
In contrast to energy-based methods for image restoration and segmentation, variational techniques for multi-view depth require successive deformation (warping) of the sensor images according to the current depth map hypothesis. In particular, this step can be significantly accelerated by the texture units of graphics hardware, which offer the necessary image interpolation virtually for free. Furthermore, the numerical procedures to solve variational problems are typically algorithms with high parallelism and can be transferred to current generation graphics hardware for optimal performance.
This chapter outlines our implementation of the hardware-accelerated approach to variational depth estimation and presents positive results. We demonstrate that a substantial performance gain is obtained by our approach. Additionally, difficult settings for variational stereo methods resulting in incorrect 3D models are discussed and possible solutions proposed. Notice that very fast numerical solvers allow the convenient investigation of potentially more complex and robust image similarity measures and other extensions to the basic model of variational depth estimation.
6.2 Variational Techniques for Multi-View Depth Estimation
6.2.1 Basic Model<br />
This section describes a variational approach to depth estimation following mostly [Strecha and Van Gool, 2002, Strecha et al., 2003]. In order to allow a one-dimensional search for a depth value at every pixel, the camera calibration matrices and the external orientations are assumed to be known. In order to utilize a true multi-view setup, pixels in one image are transferred by the epipolar geometry (as described below), and an image rectification procedure is not required. In the set of employed images one image Ii represents the key image, for which the depth map is generated. The other images, Ij, j ≠ i, are sensor images. The camera imaging Ii is assumed to be in canonical position (Pi = Ki [I|0]), the external orientation for Ij is [Rj|tj], and the camera calibration matrix of Ij is Kj. The depth map is calculated with respect to Ii, and depth values assigned to pixels in Ii transfer to the other images as follows: the corresponding pixel qij for a pixel pi in Ii with associated depth di is given by
\[ q_{ij}(p_i) = H_{ij}\, p_i + T_j / d_i, \]
where $H_{ij} = K_j R_j^t K_i^{-1}$ and $T_j = K_j t_j$. Note that $p_i$ and $q_{ij}$ refer to homogeneous pixel positions and $q_{ij}$ must be normalized by its third component.
The primary goal of depth estimation is the assignment of depth values to every pixel of Ii, such that a cost function incorporating image similarity terms and smoothness terms is minimized. In particular, the following objective function is often used in variational stereo methods:
\[ E(d_i) = \sum_{p_i} \Big( \sum_j \big( I_j(q_{ij}(d_i(p_i))) - I_i(p_i) \big)^2 + \lambda \| \nabla d_i(p_i) \|^2 \Big) \;\to\; \min \qquad (6.1) \]
Since the depth map di is defined on a grid, ∇di refers to a suitable finite difference scheme to calculate the gradient. We omit the explicit dependence of di on the pixel pi and abbreviate Ij(qij(di(pi))) as Ij(di). Minimizing Eq. 6.1 using discrete (non-continuous) methods can be achieved using e.g. graph cut methods [Boykov et al., 2001, Kolmogorov and Zabih, 2001, Kolmogorov and Zabih, 2002, Kolmogorov and Zabih, 2004]. Alternatively, Eq. 6.1 can be seen as a discrete approximation to a continuous minimization problem, and techniques from variational calculus can be applied. The continuous formulation of Eq. 6.1 is
\[ S(d_i) = \int_p \Big( \sum_j \big( I_j(d_i) - I_i \big)^2 + \lambda \| \nabla d_i \|^2 \Big)\, dp \;\to\; \min \qquad (6.2) \]
The Euler-Lagrange equation states a necessary condition for the function di to be a stationary point of S [Lanczos, 1986]:
\[ \frac{\delta S}{\delta d_i} = \sum_j \frac{\partial I_j}{\partial d_i} \big( I_j(d_i) - I_i \big) - \lambda \nabla^2 d_i \overset{!}{=} 0 \qquad (6.3) \]
Note that this equation holds for every pixel p in Ii. The spatial derivative $\partial I_j / \partial d_i$ is the intensity change along the epipolar line in image Ij. By discretizing Eq. 6.3 one can solve the associated partial differential equation using a numerical scheme on the grid of pixels. We describe a particular approach, which is very well suited for a GPU-based implementation.
At first, the image intensities Ij(di) are locally linearized around d⁰i using the first-order Taylor expansion:
\[ I_j(d_i) = I_j(d_i^0 + \Delta d_i) \approx I_j(d_i^0) + \frac{\partial I_j(d_i^0)}{\partial d_i}\, \Delta d_i. \]
Applying this expansion to the Euler-Lagrange equation yields
\[ \sum_j \frac{\partial I_j}{\partial d_i} \Big( I_j(d_i^0) + \frac{\partial I_j(d_i^0)}{\partial d_i}\, \Delta d_i - I_i \Big) - \lambda \nabla^2 d_i = 0. \qquad (6.4) \]
In combination with a (linear) finite differencing scheme for $\nabla^2 d_i$, the equation above results in a huge but sparse linear system to solve for di. This scheme iteratively refines the estimate of the depth map di given its previous estimate. In order to prevent the scheme from converging to a suboptimal local minimum, a coarse-to-fine approach is mandatory.
Diffusion type | Term in S | Derivative
Homogeneous diffusion | ∇ᵗd ∇d = ‖∇d‖² | ∇²d
Image-driven isotropic diffusion | ∇ᵗd g(‖∇I‖²) ∇d | div(g(‖∇I‖²) ∇d)
Image-driven anisotropic diffusion | ∇ᵗd D(∇I) ∇d | div(D(∇I) ∇d)
Flow-driven isotropic diffusion | ∇ᵗd g(‖∇d‖²) ∇d | div(g(‖∇d‖²) ∇d)
Flow-driven anisotropic diffusion | ∇ᵗd D(∇d) ∇d | div(D(∇d) ∇d)

Table 6.1: Regularization terms induced by diffusion processes
6.2.2 Regularization<br />
Taking the Laplacian of the depth map, ∇²di, to guide the regularization usually gives overly smooth results, and the obtained depth maps lack sharp depth discontinuities. Table 6.1 lists several regularization functions based on diffusion processes, mostly in accordance with the taxonomy of Weickert et al. [Weickert et al., 2004]. In this table the function g(s²) is a decreasing scalar function solely based on the magnitude of the gradient, e.g. g(s²) = exp(−K s²) for a user-specified K. D(∇c) denotes the diffusion tensor
\[ D(\nabla c) = \frac{1}{\|\nabla c\|^2 + 2\nu^2} \left( \begin{pmatrix} \frac{\partial c}{\partial y} \\ -\frac{\partial c}{\partial x} \end{pmatrix} \begin{pmatrix} \frac{\partial c}{\partial y} \\ -\frac{\partial c}{\partial x} \end{pmatrix}^{\!t} + \nu^2 I \right). \]
ν is a small constant to prevent singularities in perfectly homogeneous regions; setting ν to 0.001 is a common choice. Note that D(∇c) is very similar to the structure tensor used to detect image corners. If for example |∂c/∂x| ≫ |∂c/∂y| (a vertical edge in the image), the diffusion is inhibited in the x-direction.
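The edge-stopping behavior of this tensor can be checked directly. The sketch below builds D(∇c) for a single gradient vector as a plain 2×2 matrix; the gradient values are hypothetical.

```python
# Sketch of the Nagel-Enkelmann-style diffusion tensor from the text:
# D = ((cy, -cx)(cy, -cx)^t + nu^2 I) / (|grad c|^2 + 2 nu^2).

def diffusion_tensor(cx, cy, nu=0.001):
    """2x2 diffusion tensor for gradient (cx, cy) as nested lists."""
    denom = cx * cx + cy * cy + 2.0 * nu * nu
    v = (cy, -cx)  # direction perpendicular to the image gradient
    return [[(v[r] * v[c] + (nu * nu if r == c else 0.0)) / denom
             for c in range(2)] for r in range(2)]

# Vertical image edge: |dc/dx| >> |dc/dy|. Diffusion across the edge
# (x-direction) is inhibited, diffusion along the edge (y) stays near 1.
D = diffusion_tensor(cx=1.0, cy=0.0)
```

With this gradient, `D[0][0]` (the x-diffusion weight) is of order ν² while `D[1][1]` stays close to 1, which matches the anisotropic smoothing described above.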
Isotropic diffusion inhibits diffusion at discontinuities regardless of the direction of the gradient, whereas anisotropic regularization allows diffusion parallel to edge discontinuities. Image-driven regularization is based solely on the gradients calculated in the source data (images), and the numerical scheme results in linear expressions. Hence, image-driven diffusion is also called linear diffusion [Weickert and Brox, 2002]. In flow-based regularization the diffusion stops at discontinuities of the current flow or depth map, respectively. Consequently, the equation system derived from finite differencing is nonlinear and requires e.g. fixed-point iterations to be solved.
Note that the terminology is not uniform in the literature: flow-driven isotropic diffusion is often referred to as nonlinear anisotropic diffusion [Perona and Malik, 1990]. In addition to homogeneous diffusion we employ an image-driven (linear) anisotropic regularization approach [Nagel and Enkelmann, 1986] for the following reasons:
• The anisotropy of this regularization adapts very well to homogeneous image region<br />
boundaries <strong>and</strong> allows smoothing along image edges.<br />
• The linear nature of the numerical scheme allows efficient sparse matrix solvers to<br />
be utilized.
Pure image-driven diffusion employed <strong>for</strong> image smoothing <strong>and</strong> denoising will fail in highly<br />
textured regions, but in this case the discriminative image data will result in correct<br />
determination of the final depth map.<br />
6.2.3 Extensions <strong>and</strong> Variations<br />
In the literature several extensions and enhancements have been proposed to increase the quality and reliability of variational approaches to depth estimation. We summarize a few important concepts in this section.
6.2.3.1 Back-Matching<br />
In order to increase the robustness of the variational depth estimation method and to detect mismatches, a back-matching scheme can be utilized to assign confidence values to the depth values. Confident depth estimates should have a higher influence in the regularization term for adjacent pixels with lower confidence.
In a back-matching setting, every image Ii takes the role of a key image and a dense depth map is computed with several Ij, j ≠ i, as sensor images. If we denote the depth map computed for Ii by di, and qij(p, di) represents the transfer of a pixel p in image Ii with its associated depth into Ij, then the forward-backward error is
\[ e_{ij} = \| p - q_{ji}( q_{ij}(p, d_i), d_j ) \|. \]
The confidence cij is now a function of eij, e.g.
\[ c_{ij} = \frac{1}{1 + k\, e_{ij}} \qquad \text{or} \qquad c_{ij} = \exp\!\Big( -\frac{e_{ij}^2}{k} \Big). \]
If cij is close to 1, the depth value is highly confident; values of cij close to zero indicate unreliable depth values. In [Strecha et al., 2003] the following energy functional is proposed:
\[ S(d_i) = \int_p \Big( \sum_j c_{ij} \big( I_j(d_i) - I_i \big)^2 + \lambda\, \nabla^t d_i\, D(\nabla C_i)\, \nabla d_i \Big)\, dp \;\to\; \min, \]
where Ci = maxj(cij) and D(∇Ci) is an anisotropic diffusion operator. The corresponding Euler-Lagrange equation reads
\[ \frac{\delta S}{\delta d_i} = \sum_j c_{ij}\, \frac{\partial I_j}{\partial d_i} \big( I_j(d_i) - I_i \big) - \lambda\, \mathrm{div}\big( D(\nabla C_i)\, \nabla d_i \big) \overset{!}{=} 0. \]
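The forward-backward consistency test underlying these confidence values can be sketched in a few lines. The 1D "transfer" functions below stand in for $q_{ij}$ and $q_{ji}$; their shifts and the pixel value are hypothetical.

```python
# Sketch of forward-backward (back-matching) consistency checking.

def confidence(e_ij, k=1.0):
    """c_ij = 1 / (1 + k * e_ij): 1 for a perfect round trip, -> 0 otherwise."""
    return 1.0 / (1.0 + k * e_ij)

def forward_backward_error(p, transfer_ij, transfer_ji):
    """e_ij = |p - q_ji(q_ij(p))| for a 1D pixel coordinate p."""
    return abs(p - transfer_ji(transfer_ij(p)))

# Consistent depth maps map the pixel back onto itself ...
e_good = forward_backward_error(40.0, lambda p: p + 5.0, lambda q: q - 5.0)
# ... while a mismatch leaves a residual round-trip error.
e_bad = forward_backward_error(40.0, lambda p: p + 5.0, lambda q: q - 2.0)
```

Here `e_good` is 0 (confidence 1), while `e_bad` is 3 pixels, which the rational weighting maps to a much lower confidence of 0.25.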
6.2.3.2 Local Changes in Illumination<br />
If the scene to be reconstructed contains surfaces that are not purely Lambertian with diffuse reflection behavior, illumination changes appear between the images. These local lighting changes can be modeled by an additional intensity scaling function κij, which scales the intensity values of Ii to match the intensities in Ij. The extended energy function is
\[ S(d_i, \kappa_{ij}) = \int_p \Big( \sum_j \big( I_j(d_i) - \kappa_{ij} I_i \big)^2 + \lambda \|\nabla d_i\|^2 + \lambda_2 \|\nabla \kappa_{ij}\|^2 \Big)\, dp \;\to\; \min, \]
since both di and κij are assumed to change smoothly over the image domain. The corresponding Euler-Lagrange equations for di and κij are now:
\[ \frac{\delta S}{\delta d_i} = \sum_j \frac{\partial I_j}{\partial d_i} \big( I_j(d_i) - \kappa_{ij} I_i \big) - \lambda \nabla^2 d_i \]
\[ \frac{\delta S}{\delta \kappa_{ij}} = -I_i \big( I_j(d_i) - \kappa_{ij} I_i \big) - \lambda_2 \nabla^2 \kappa_{ij}. \]
Of course, confidence evaluation using back-matching and the estimation of local lighting changes can be combined into one framework. In case of local illumination changes both the intensity scaling and the depth map will be affected. It is impossible to correctly estimate the depth from the available local information only, since both the depth and the intensity scaling processes will adapt to match the pixel intensity values.
6.2.3.3 Other Variations<br />
The energy functional presented in Eq. 6.2 and used in the previous sections can be modified in various ways. At first, the L² data term (Ij(di) − Ii)² can be replaced by a suitable function Ψ of the intensity differences, e.g.
\[ \Psi\big( I_j(d_i) - I_i \big) = \sqrt{ \big( I_j(d_i) - I_i \big)^2 + \varepsilon^2 } \]
for small ε [Brox et al., 2004, Slesareva et al., 2005]. This choice of Ψ is a smooth, differentiable approximation of the L¹ norm. Additionally, the data term may incorporate the intensity gradient and other higher-order information as well [Papenberg et al., 2005].
If the L¹ image data term is utilized, it is common to employ total variation regularization [Rudin et al., 1992], ‖∇d‖, instead of the quadratic one. In general, the choice of the regularization significantly affects the results, especially close to discontinuities.
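A minimal numeric sketch of this smoothed L¹ data term and its derivative follows; the value of ε and the sample arguments are arbitrary choices for illustration.

```python
import math

# Sketch of the differentiable L1 data term Psi(x) = sqrt(x^2 + eps^2)
# and its derivative, for a small user-chosen eps.

def psi(x, eps=1e-3):
    return math.sqrt(x * x + eps * eps)

def psi_prime(x, eps=1e-3):
    return x / math.sqrt(x * x + eps * eps)

# For |x| >> eps, Psi behaves like |x| and its derivative saturates at
# +/- 1, so large intensity differences no longer dominate the data term,
# in contrast to the quadratic term whose derivative grows linearly.
```

Usage: `psi(1.0)` is within ε of 1.0, and `psi_prime(10.0)` is essentially 1, illustrating the bounded influence of outliers.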
6.3 GPU-based Implementation<br />
This section describes our implementation of the variational depth estimation technique on a GPU. Depth estimation in our application is performed on a set of three images (one key image plus two sensor images). In general, three passes are performed in every iteration of the depth refinement:
1. In the first pass the sensor images Ij are warped according to the current depth map hypothesis and the spatial derivatives ∂Ij/∂di are calculated.
2. Expressions used in the regularization term are precomputed, e.g. the Laplacian or<br />
the anisotropic flow used in the subsequent semi-implicit solvers.<br />
3. Finally, the depth estimates are updated using some semi-implicit strategy derived<br />
from Eq. 6.4.<br />
The next sections describe each pass in more detail. These iterations are embedded in a<br />
coarse-to-fine framework using a Gaussian image pyramid to avoid immediate convergence<br />
to a local minimum. The depth map acquired after convergence at the coarser level is used<br />
as initial depth map at the next finer level.<br />
6.3.1 Image Warping<br />
The first pass of the GPU-based depth estimation implementation consists of warping the sensor images, Ij, according to the depth map di. The lookup in image Ij is performed using the epipolar parametrization
\[ q_{ij} = (x, y, 1)^t = H_{ij}\, p_i + T_{ij} / d_i. \]
Consequently, the warped image according to the current depth hypothesis can be obtained by dependent texture lookups. The required spatial derivative ∂Ij(di)/∂di can be efficiently calculated by the chain rule:
\[ \frac{\partial I_j(d_i)}{\partial d_i} = \frac{\partial I_j(q_{ij})}{\partial q_{ij}}\, \frac{\partial q_{ij}}{\partial d_i} = \begin{pmatrix} \partial I_j / \partial x \\ \partial I_j / \partial y \end{pmatrix}^{\!t} \begin{pmatrix} \partial x / \partial d_i \\ \partial y / \partial d_i \end{pmatrix}. \]
If we define $X = (X^{(1)}, X^{(2)}, X^{(3)})^t = H_{ij}\, p_i + T_{ij}/d_i$ and $T_{ij} = (T_{ij}^{(1)}, T_{ij}^{(2)}, T_{ij}^{(3)})^t$, then we have
\[ \frac{\partial x}{\partial d_i} = -\frac{T_{ij}^{(1)} X^{(3)} - T_{ij}^{(3)} X^{(1)}}{(X^{(3)})^2\, d_i^2}, \qquad \frac{\partial y}{\partial d_i} = -\frac{T_{ij}^{(2)} X^{(3)} - T_{ij}^{(3)} X^{(2)}}{(X^{(3)})^2\, d_i^2}, \]
since $\partial(1/d_i)/\partial d_i = -1/d_i^2$.
The advantage of this scheme is that with precomputed gradient images ∇Ij the spatial derivative along the epipolar line, ∂Ij(di)/∂di, can be easily calculated, and the computation of X = Hij pi + Tij/di can be shared if Ij(di) and its derivative are calculated in the same fragment program. In our implementation, a texture representing Ij holds the intensity value, the horizontal gradient, and the vertical gradient in its three channels. Image warping assigns Ij(di) and its derivative to the two channels of the target buffer.
Note that Hij pi need not be calculated for every pixel, but can be linearly interpolated by the GPU rasterizer like any other texture coordinate. On our hardware the performance gain was rather minimal, since the matrix-vector multiplication in the fragment program is mostly hidden by the required texture fetches.
The GPU version of this step performs approximately 100 times faster than a straightforward, but otherwise completely equivalent, software implementation.
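The chain rule used in this pass can be verified numerically. The 1D toy setup below (identity H, x-translation $T_1$, a linear toy image with slope 0.5) is entirely hypothetical; it checks the analytic derivative along the epipolar line against a central finite difference.

```python
# Numeric check of dI_j(d)/dd = (dI_j/dx) * (dx/dd) for a 1D toy camera:
# x(d) = x0 + T1/d, so dx/dd = -T1/d^2.

def transfer_x(d, T1=10.0, x0=100.0):
    """x-coordinate of q_ij = H p + T/d for identity H, T = (T1, 0, 0)^t."""
    return x0 + T1 / d

def image(x):
    """Toy sensor image with intensity 0.5 * x, so dI/dx = 0.5 everywhere."""
    return 0.5 * x

def dI_dd(d, T1=10.0):
    """Chain rule: dI/dd = (dI/dx) * (dx/dd) = 0.5 * (-T1 / d^2)."""
    return 0.5 * (-T1 / d ** 2)

# Central finite difference of the composed function at d = 2:
h = 1e-6
numeric = (image(transfer_x(2.0 + h)) - image(transfer_x(2.0 - h))) / (2 * h)
analytic = dI_dd(2.0)  # 0.5 * (-10 / 4) = -1.25
```

The agreement between `numeric` and `analytic` mirrors what the fragment program computes per pixel from the precomputed gradient texture.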
6.3.2 Regularization Pass<br />
If Laplacian regularization is employed, a simple fragment program is sufficient to calculate ∇²di. The more interesting case is the utilization of image-based or confidence-based anisotropic diffusion to control the depth map regularization. Both regularization approaches yield linear numerical schemes, since the diffusion weights remain constant for the current level in the image pyramid.
Confidence images are created as follows: after determining the depth maps at the next-coarser resolution, a confidence map cij between views i and j is generated with cij = 1/(1 + k eij), where eij = ‖p − qji(qij(p, di), dj)‖ is the back-matching error. This confidence map remains constant for the current resolution level. The confidence values cij adjacent to a pixel are normalized such that their sum is one. For every pixel this results in a weight vector W with four components. The regularization term is calculated as
\[ \begin{pmatrix} W^{[x-1]} \\ W^{[x+1]} \\ W^{[y-1]} \\ W^{[y+1]} \end{pmatrix}^{\!t} \begin{pmatrix} d_i^{[x-1]} - d_i \\ d_i^{[x+1]} - d_i \\ d_i^{[y-1]} - d_i \\ d_i^{[y+1]} - d_i \end{pmatrix}. \]
This is proportional to the standard Laplacian if W is set to (1/4, 1/4, 1/4, 1/4)ᵗ.
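The weighted neighbor sum above is easy to sketch per pixel; the depth values and weights below are hypothetical toy numbers.

```python
# Sketch of the confidence-weighted regularization term: a weighted sum
# of the four neighbor differences of the depth map.

def reg_term(d_center, d_neighbors, weights):
    """sum_k W[k] * (d[k] - d_center) over the 4-neighborhood."""
    return sum(w * (dn - d_center) for w, dn in zip(weights, d_neighbors))

neighbors = [1.0, 3.0, 2.0, 2.0]   # depth at x-1, x+1, y-1, y+1
center = 2.0

# Uniform weights W = (1/4, 1/4, 1/4, 1/4) reduce this to the normalized
# standard 4-star Laplacian: 0.25 * (1 + 3 + 2 + 2 - 4 * 2) = 0.
uniform = reg_term(center, neighbors, [0.25] * 4)
# Down-weighting an unreliable neighbor (here x-1) biases the smoothing
# toward the confident neighbors instead.
weighted = reg_term(center, neighbors, [0.1, 0.4, 0.25, 0.25])
```

With uniform weights the term vanishes for this locally linear depth profile, while the confidence-weighted variant pulls the center toward the trusted neighbors.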
6.3.3 Depth Update Equation<br />
The finite difference scheme of equation 6.4 (respectively one of its extensions) is a large system of equations in the unknowns ∆di for every pixel:
\[ \sum_j \frac{\partial I_j}{\partial d_i} \Big( I_j(d_i) + \frac{\partial I_j(d_i)}{\partial d_i}\, \Delta d_i - I_i \Big) - \lambda \nabla^2 (d_i + \Delta d_i) = 0 \qquad (6.5) \]
Approximating the Laplacian (respectively the employed diffusion term) by a linear operator, the system becomes sparse: the unknowns ∆di are coupled only between adjacent pixels through the regularization term, yielding a sparse system matrix.
Using the standard 4-star scheme to calculate the Laplacian, the matrix of the sparse linear system obtained from the above equation has a special structure containing five diagonal bands (Figure 6.1). Two iterative numerical schemes to solve sparse linear systems are currently applicable on the GPU: the Jacobi method and the conjugate gradient method.
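Both solvers iterate on the same banded system. As a generic illustration, the sketch below runs the Jacobi form (detailed in the next subsection) on a tiny diagonally dominant system; the 3-band toy matrix stands in for the 5-band pixel system and all values are hypothetical.

```python
# Generic sketch of the Jacobi iteration x <- D^-1 ((D - A) x + b) on a
# small diagonally dominant system; the GPU applies the same update per pixel.

def jacobi_step(A, b, x):
    n = len(b)
    return [(b[r] - sum(A[r][c] * x[c] for c in range(n) if c != r)) / A[r][r]
            for r in range(n)]

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]      # 3-band toy analogue of the 5-band system
b = [5.0, 6.0, 5.0]        # chosen so that the exact solution is (1, 1, 1)
x = [0.0, 0.0, 0.0]
for _ in range(50):
    x = jacobi_step(A, b, x)
```

Since each new component depends only on the previous iterate, every pixel can be updated independently, which is exactly what makes the scheme attractive for a fragment-program implementation.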
6.3.3.1 Jacobi Iterations<br />
In order to solve a linear system Ax = b with a diagonally dominant matrix A, the Jacobi method performs the following iteration:
\[ x^{(n+1)} = D^{-1} \big( (D - A)\, x^{(n)} + b \big), \]
where D is the diagonal part of A. Consequently, the new components of $x^{(n+1)}$ depend only on the old values of $x^{(n)}$. The update procedure for every pixel according to Eq. 6.5 is now
\[ \Delta d_i^{(n+1)} = \frac{ \lambda \Big( \nabla^2 d_i + \tfrac{1}{4} \sum_{p \in N} \Delta d_p^{(n)} \Big) - \sum_j \frac{\partial I_j(d_i)}{\partial d_i} \big( I_j(d_i) - I_i \big) }{ \lambda + \sum_j \Big( \frac{\partial I_j(d_i)}{\partial d_i} \Big)^2 }, \]
where p ∈ N runs over the four pixels adjacent to the current pixel. After several iterations of this inner loop to obtain a converged $\Delta d_i^{\mathrm{final}}$, the depth map is updated as $d_i^{(k+1)} = d_i^{(k)} + \Delta d_i^{\mathrm{final}}$.
6.3.3.2 Conjugate Gradient Solver
In addition to the Jacobi method we implemented a conjugate gradient procedure on the GPU to solve the sparse linear system. This implementation is based on the ideas presented by Krüger and Westermann [Krüger and Westermann, 2003]. On the GPU the system matrix with five diagonal bands is stored in two textures: the off-diagonal bands are stored in a four-component texture image, which remains constant. The main diagonal is represented as a single-component render target, since it must be updated after every warping pass. Analogous to the Jacobi method, the result of the conjugate gradient approach is a stabilized depth update ∆di.
Figure 6.1: The sparse structure of the linear system obtained from the semi-implicit<br />
approach. Dark pixels indicate non-zero entries.<br />
6.3.4 Coarse-to-Fine Approach<br />
In order to avoid reaching a local minimum immediately, we utilize a coarse-to-fine scheme. We chose a standard image pyramid, which halves the image dimensions at every level. Each pyramid level is obtained by downsampling the image of the next finer level and additionally smoothing the result. When going to the next coarser level, the regularization weight λ should be halved as well, but in practice scaling λ by a factor of $\sqrt{1/2}$ gave better results.
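The resulting level schedule can be sketched as follows; the base resolution, λ, and the number of levels are hypothetical example values (six levels matches the facade experiments below).

```python
import math

# Sketch of the coarse-to-fine schedule: image dimensions are halved per
# level and lambda is scaled by sqrt(1/2) per level toward the coarse end.

def pyramid_schedule(width, height, lam, levels):
    """Return (width, height, lambda) per level, coarsest first."""
    sched = []
    for lvl in reversed(range(levels)):           # level 0 = finest
        s = 2 ** lvl
        sched.append((width // s, height // s, lam * math.sqrt(0.5) ** lvl))
    return sched

sched = pyramid_schedule(512, 512, lam=1.0, levels=6)
# Coarsest level: 16 x 16 image; finest level: full 512 x 512 with lambda = 1.
```

Solving first on the 16 × 16 level and propagating the converged depth map upward gives each finer level a good initialization, which is what keeps the semi-implicit solver away from poor local minima.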
6.4 Results<br />
This section presents several depth maps and 3D models to illustrate the benefits and possible shortcomings of the variational depth estimation method.
6.4.1 Facade Datasets<br />
The first dataset depicts a historical statue embedded in a facade. The resolution of the grayscale source images and the resulting depth map is 512 × 512 pixels. Figure 6.2 illustrates the range map obtained from three small-baseline source images as a colored 3D point set. Figure 6.3 shows the corresponding depth images computed with the implemented numerical solvers and gives timing information. Six pyramid levels are generated for the coarse-to-fine approach. The Jacobi and the CG solvers execute 50 iterations in the outer loop (image warping) and 3 iterations in the inner loop to calculate the actual depth update. The Jacobi solver runs fastest with 1.15s, whereas the conjugate gradient solver requires significantly more time. The obtained depth maps are almost identical with both approaches.
Figure 6.2: A reconstructed historical statue displayed as a colored point set with a resolution of 512 × 512 points. Three small-baseline images are used to generate the model.
Figure 6.4 shows the consequences of back-matching. Without back-matching a severe<br />
mismatch appears near the feet of the statue (Figure 6.4(a)). Back-matching uses a larger<br />
sequence of images to mutually verify the depth maps as described in Section 6.2.3.1.<br />
Figure 6.4(b) shows the same close-up view of the feet with a significantly better geometry.<br />
Another result of the variational depth estimation approach is shown in Figure 6.5.<br />
The resolution of the depth map <strong>for</strong> this dataset is 1024 × 640.<br />
6.4.2 Small Statue Dataset<br />
This section addresses the reconstruction of another dataset, which requires additional methods to be applied to obtain a suitable model. The object to be reconstructed is a small statue, for which more than 40 images were taken in a circular path around the statue.
Using the source images directly to generate the depth maps is not successful, as can be seen in Figure 6.6. Even including the back-matching approach does not improve the result. The reason for this failure is the very large depth discontinuities between the foreground statue and the background scenery. Consequently, the smoothness and ordering constraints are violated in these images (see Figure 6.6(a–c)).
The first approach to obtain better reconstructions is to perform an image segmentation procedure to separate foreground and background regions. The initial manual segmentation for one image is propagated through the complete sequence, such that only
(a) Jacobi (n=3), 1.15s (b) CG (n=3), 3.15s
Figure 6.3: The depth maps of the embedded statue reconstructed with the two numerical schemes. Both numerical solvers yield almost identical results, with the Jacobi solver being faster.
little further manual interaction is necessary [Sormann et al., 2005]. Background pixels are set to a uniform color before applying the depth estimation procedure. Two of the obtained point sets are shown in Figure 6.7.
Alternatively, we introduced a more robust image intensity error term in order to handle the changing background and occlusions. The energy function to be optimized includes a
truncated intensity difference:
\[ S(d_i) = \int_p \Big( \sum_j \min\big( T, (I_j(d_i) - I_i)^2 \big) + \lambda \|\nabla d_i\|^2 \Big)\, dp \;\to\; \min, \qquad (6.6) \]
with a thresholding parameter T. Instead of replacing the thresholding operator by a differentiable soft-min function, we chose a very different approach: since we have two sensor images, Ij1 and Ij2, zero, one, or both data terms may be saturated, and in the Euler-Lagrange equation the corresponding term is missing. Consequently, the new depth
(a) Without back-matching (b) With back-matching<br />
Figure 6.4: The effect of bidirectional matching on the embedded statue scene.<br />
is taken from this set of decoupled solutions:
\[ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \sum_j \frac{\partial I_j(d_i^{(k)})}{\partial d_i} \big( I_j(d_i^{(k)}) - I_i \big) }{ \lambda + \sum_j \Big( \frac{\partial I_j(d_i^{(k)})}{\partial d_i} \Big)^2 } \]
\[ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \frac{\partial I_{j_1}(d_i^{(k)})}{\partial d_i} \big( I_{j_1}(d_i^{(k)}) - I_i \big) }{ \lambda + \Big( \frac{\partial I_{j_1}(d_i^{(k)})}{\partial d_i} \Big)^2 } \]
\[ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \frac{\partial I_{j_2}(d_i^{(k)})}{\partial d_i} \big( I_{j_2}(d_i^{(k)}) - I_i \big) }{ \lambda + \Big( \frac{\partial I_{j_2}(d_i^{(k)})}{\partial d_i} \Big)^2 } \]
\[ \Delta d_i^{(k+1)} = \nabla^2 d_i^{(k)} \]
Note that the three lower equations are obtained by removing one or both image data terms from the first equation; in case of truncation of the intensity error, the derivative of the constant threshold is zero. The depth value with the lowest actual error term is selected as the result for this iteration. In Figure 6.8 the resulting enhanced depth map and 3D model are illustrated. Although the depth image and the reconstructed model are far superior to the original model depicted in Figure 6.6, the obtained statue model still has some flaws, and a more refined approach requires further investigation.
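The selection step among the decoupled candidate updates can be sketched compactly. The snippet below picks the candidate with the lowest truncated error of Eq. 6.6; the residual model and all numbers are hypothetical (a "saturated" second sensor image standing in for an occlusion).

```python
# Sketch of selecting among decoupled candidate updates by the truncated
# data term of Eq. 6.6; residuals and candidates are toy values.

def truncated_error(residuals, T):
    """Data term of Eq. 6.6 at one pixel: sum_j min(T, r_j^2)."""
    return sum(min(T, r * r) for r in residuals)

def select_update(candidates, residuals_for, T):
    """Pick the candidate depth update with the lowest truncated error."""
    return min(candidates, key=lambda dd: truncated_error(residuals_for(dd), T))

def residuals_for(dd):
    # Sensor image 1 favors dd = 0.5; sensor image 2 is occluded and its
    # residual is saturated (> sqrt(T)) for every candidate update.
    return [dd - 0.5, 9.0]

best = select_update([0.0, 0.25, 0.5], residuals_for, T=1.0)  # -> 0.5
```

Because the occluded term is clamped at T for every candidate, it no longer influences the ranking, and the selection is driven by the unoccluded sensor image alone, which is the intended robustness of the truncated data term.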
Figure 6.5: Two views of the colored point set showing the front facade of a church.
6.4.3 Mirabellstatue Dataset<br />
The source images of this dataset display an outdoor statue (see Figure 6.9(a)). Depth map generation is restricted to the statue using silhouette masks to separate the foreground statue object from the background scenery. Three images with 512 × 512 pixels resolution are used to compute the depth maps illustrated in Figure 6.9(b)–(d). The differences between the displayed meshes come from the employed regularization approaches. The first two meshes are acquired using homogeneous regularization with different values for the weight λ. The third mesh is obtained utilizing image-driven anisotropic diffusion for a selective regularization in textureless image regions, as discussed in Section 6.2.2.
The mesh shown in Figure 6.9(b) uses a small value for λ, which results in noisy mesh geometry, especially in textureless regions. The mesh displayed in Figure 6.9(c) is obtained with a larger value for λ and appears clearly smoother, but sharp creases at depth discontinuities are missing. Image-driven anisotropic diffusion yields a generally smooth mesh that still includes sharp edges at depth discontinuities.
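The image-driven weighting behind such selective regularization can be sketched as follows. This is a generic Perona–Malik-style edge-stopping weight and stands in for the exact formulation of Section 6.2.2, which is not reproduced here; the parameter `beta` is hypothetical.

```python
import numpy as np

def diffusion_weights(img, beta=10.0):
    """Per-pixel regularization weight g = exp(-beta * |grad I| / max |grad I|):
    close to 1 in textureless regions (strong smoothing), small near image
    edges, so depth discontinuities are preserved. `beta` is a hypothetical
    tuning parameter; the exact weighting of Section 6.2.2 may differ."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                     # gradient magnitude
    return np.exp(-beta * mag / (mag.max() + 1e-12))
```

Multiplying the smoothness term by such a weight reduces regularization exactly where the image suggests a depth discontinuity.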
6.5 Discussion<br />
Variational approaches to depth estimation provide a mathematically sound tool for generating 3D models from multiple images. These methods work best for images with constant lighting conditions and if only few occlusions and depth discontinuities are present in the imaged scene. Under these conditions high-quality depth maps can be generated at interactive rates.
6.5. Discussion 93<br />
Nevertheless, several issues must be addressed. First, scenes with large depth discontinuities and violated ordering constraints must be handled in a more robust manner. The approach presented in Section 6.4.2 is only a first step in this direction, since the results are still not completely satisfying. Incorporating segmentation information to detect piecewise connected objects can be based on color clustering, as is partially employed in Section 6.4.2. Alternatively, combining a segmentation procedure based on initial, coarser depth hypotheses with the described variational approach appears promising. Variational multi-phase approaches (e.g. [Chan and Vese, 2002, Shen, 2006, Jung et al., 2006]) are potential candidates to generate the combined initial depth and segmentation hypothesis.
Incorporating lighting changes into a variational framework for optical flow and depth estimation can be accomplished using techniques proposed by Hermosillo et al. [Hermosillo et al., 2001, Chefd'Hotel et al., 2001]. Whether such approaches are suitable for 3D modeling at interactive rates is an open question.
Another item that needs to be addressed is the image smoothing used in the coarse-to-fine hierarchy. In a multi-view setup the epipolar lines run arbitrarily through the source images, and the usual Gaussian smoothing may move corresponding features away from the appropriate epipolar line. Consequently, the geometry recovered at a coarser scale is not a smoothed version of the true geometry, but only loosely coupled with the true underlying model. In a rectified stereo setup a purely horizontal blurring has the advantage that features are smoothed along the epipolar lines, but not in their orthogonal direction. Extending this approach to a multi-view setting is a topic for future research.
Figure 6.6: The three source images and the resulting unsuccessful reconstruction of the statue.
Figure 6.7: Two of the successfully reconstructed point sets using image segmentation to omit the background scenery.

Figure 6.8: An enhanced depth map and 3D point set obtained using the truncated error model.
Figure 6.9: The effect of image-driven anisotropic diffusion: (a) one source view; (b) homogeneous regularization, λ = 3; (c) homogeneous regularization, λ = 10; (d) image-driven anisotropic diffusion, λ = 10. Two meshes generated using homogeneous regularization with different values of λ are shown in (b) and (c). The choice of λ = 3 in (b) yields a noisy result, whereas setting λ = 10 in (c) gives significantly better geometry. Employing image-driven anisotropic diffusion yields the visually most appealing mesh, with sharp creases but without noise in textureless regions (d).
Chapter 7<br />
Scanline Optimization <strong>for</strong> Stereo<br />
On <strong>Graphics</strong> Hardware<br />
Contents<br />
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97<br />
7.2 Scanline Optimization on the GPU <strong>for</strong> 2-Frame Stereo . . . . . 98<br />
7.3 Cross-Correlation based Multiview Scanline Optimization on<br />
<strong>Graphics</strong> Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . 105<br />
7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
7.1 Introduction<br />
In this chapter we propose a GPU-based computational stereo approach using scanline optimization to achieve optimal intra-scanline disparity maps. Since we employ a linear discontinuity cost model, the central part of the procedure is the calculation of the appropriate min-convolution, which is usually implemented as a two-pass method using destructive array updates. We replace these in-place updates by a recursive doubling scheme better suited for stream programming models. Consequently, the entire dense estimation pipeline, from matching cost computation to global optimization to obtain the disparity or depth map, is performed by the GPU, and only the control flow is maintained by the CPU.
Since the material of this chapter is rather technical, it is divided into two parts: the first part (Section 7.2) focuses on the details of a GPU-based scanline optimization procedure for the rectified stereo setup employing very simple image matching scores. The second part (Section 7.3) addresses the incorporation of the GPU-based scanline optimization implementation in a multiview setup. The focus of that section lies in particular on the efficient utilization of 'sliding' sums to calculate the zero-mean normalized cross-correlation score.
98 Chapter 7. Scanline Optimization <strong>for</strong> Stereo On <strong>Graphics</strong> Hardware<br />
7.2 Scanline Optimization on the GPU for 2-Frame Stereo

This section describes the core of the GPU implementation of scanline optimization. The main idea is the transformation of the main dynamic programming step (which has linear time complexity on sequential processors) into an equivalent procedure suitable for parallel computing (with O(N log N) time complexity). Additionally, several techniques to exploit the parallelism within the fragment processor to its full extent are presented. Not all of these methods are applicable to high-resolution depth maps (see Section 7.3.7 for one approach to overcome this limitation).
7.2.1 Scanline Optimization and Min-Convolution

Scanline optimization [Scharstein and Szeliski, 2002] searches for a globally optimal assignment of disparity values to pixels in the current (horizontal) scanline, i.e. it finds

$$\arg\min_{d_x} \sum_{x=1}^{W} \bigl( D(x, d_x) + \lambda V(d_x, d_{x-1}) \bigr),$$
where D(x, d) is the image dissimilarity cost and V(d, d′) is the regularization cost. As in all dynamic programming approaches to stereo, different scanlines are treated independently of their neighbors (which may result in vertical streaks visible in the disparity image).

The optimal assignment can be found efficiently using a dynamic programming approach that maintains the minimal accumulated costs C̄(x, d) up to the current position x:
$$\bar{C}(x+1, d) = D(x+1, d) + \min_{d_1} \bigl[ \bar{C}(x, d_1) + V(d, d_1) \bigr].$$
In a linear discontinuity cost model we have V(d, d₁) = λ|d − d₁|, and the calculation of

$$\min_{d_1} \bigl[ \bar{C}(x, d_1) + \lambda |d - d_1| \bigr]$$

for every d can be performed in linear time using a forward and a backward pass to compute the lower envelope [Felzenszwalb and Huttenlocher, 2004]. The linear-time procedure to calculate the min-convolution is given in Algorithm 3.
This procedure is not directly suitable for a GPU implementation, since it firstly relies on in-place array updates and secondly requires a linear number of passes to update the entire array h.∗

∗ Using the depth test with the same depth buffer as texture source and target buffer would allow a direct implementation, but this approach results in undefined behavior according to the specifications. Such an approach would have additional disadvantages, mainly the reduced ability to utilize the parallelism of the GPU.
7.2. Scanline Optimization on the GPU <strong>for</strong> 2-Frame Stereo 99<br />
Algorithm 3 Procedure to calculate the lower envelope efficiently

Procedure Min-Convolution
Input: C̄(x, ·); Output: h[]
for d = 1 … k do
    h[d] ← C̄(x, d)
end for
{Forward pass}
for d = 2 … k do
    h[d] ← min(h[d], h[d − 1] + λ)
end for
{Backward pass}
for d = k − 1 … 1 do
    h[d] ← min(h[d], h[d + 1] + λ)
end for
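The two passes of Algorithm 3 translate directly into code. The following minimal Python sketch (with `cost` standing for one column C̄(x, ·)) illustrates the O(k) lower-envelope computation; it is an illustration, not the thesis implementation:

```python
def min_convolution(cost, lam):
    """Lower envelope h[d] = min_d1 (cost[d1] + lam * |d - d1|) in O(k)."""
    h = list(cost)
    k = len(h)
    # Forward pass: propagate minima from smaller disparities.
    for d in range(1, k):
        h[d] = min(h[d], h[d - 1] + lam)
    # Backward pass: propagate minima from larger disparities.
    for d in range(k - 2, -1, -1):
        h[d] = min(h[d], h[d + 1] + lam)
    return h
```

Each pass adds λ per step of disparity distance, so after both passes h[d] equals the minimum over all d₁ of cost[d₁] + λ|d − d₁|.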
The basic idea to enable a GPU implementation of the min-convolution is to utilize a recursive doubling approach, which is outlined in Algorithm 4. Recursive doubling [Dubois and Rodrigue, 1977] is a common technique in high-performance computing to enable parallelized implementations of sequential algorithms. This technique is frequently used in GPU-based applications to perform stream reduction operations like accumulating all values of a texture image [Hensley et al., 2005].
If we focus on the forward pass in Algorithm 4, the procedure calculates the result of the forward pass for successively longer sequences ending in d. Initially, h⁺₀[d] contains the min-convolution of the single-element sequence [d, d]. In every outer iteration with index L the handled sequence is extended to [d − 2ᴸ, d], i.e. its length is doubled. Note that h⁺[d] is defined to be ∞ (i.e. a large constant) if d is outside the valid range [1 … k]. After all iterations, h⁺[d] contains the correct result of the forward pass, which can easily be shown by induction. The same argument applies to the backward pass, hence this procedure yields the desired result. In addition to the lower envelope h, the disparity values for which the minimum is attained are tracked in the array disp[].
Note that the updates in the loops over d are independent and can be performed as a parallel loop. In GPGPU terminology, the bodies of these loops are computational kernels [Buck et al., 2004]. Additionally, the scanlines of the images are treated independently, therefore the min-convolution can be performed for all scanlines in parallel.

Figure 7.1 gives an illustration of the first few iterations in the forward pass of Algorithm 4. Since the next iteration of the outer loops in the min-convolution algorithm refers only to values generated in the previous iteration, only two arrays must be maintained (instead of a logarithmic number of arrays). The roles of these two arrays are swapped after every iteration: the destination array becomes the new source and vice versa. In GPU terminology, these arrays correspond to render-to-texture targets, and alternating the roles of these textures is referred to as ping-pong rendering.
Algorithm 4 Procedure to calculate the lower envelope using recursive doubling

Procedure Min-Convolution using Recursive Doubling
{Forward pass}
for d = 1 … k do
    h⁺₀[d] ← C̄(x, d)
    disp[d] ← d
end for
for L = 0 … ⌈log₂(k − 1)⌉ do
    for d = 1 … k do
        d₁ ← d − 2ᴸ
        h⁺_{L+1}[d] ← min(h⁺_L[d], h⁺_L[d₁] + λ 2ᴸ)
        if h⁺_L[d₁] + λ 2ᴸ attains the minimum then disp[d] ← disp[d₁]
    end for
end for
{Backward pass}
for d = 1 … k do
    h⁻₀[d] ← h⁺_final[d]
end for
for L = 0 … ⌈log₂(k − 1)⌉ do
    for d = 1 … k do
        d₁ ← d + 2ᴸ
        h⁻_{L+1}[d] ← min(h⁻_L[d], h⁻_L[d₁] + λ 2ᴸ)
        if h⁻_L[d₁] + λ 2ᴸ attains the minimum then disp[d] ← disp[d₁]
    end for
end for
Return h⁻_final and disp

(Entries h[d₁] with d₁ outside [1 … k] are treated as ∞.)
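A CPU-side sketch of the recursive doubling scheme is useful for checking it against the sequential two-pass version. This is plain Python for illustration, not the GPU implementation; each list comprehension corresponds to one parallel rendering pass, since all k updates within it are independent.

```python
def min_convolution_doubling(cost, lam):
    """Lower envelope h[d] = min_d1 (cost[d1] + lam * |d - d1|) via
    recursive doubling: O(k log k) total work, but only O(log k) passes."""
    k = len(cost)
    INF = float("inf")

    def one_direction(h, sign):
        # Doubling passes with shifts 1, 2, 4, ... replace the sequential
        # forward (sign = +1) or backward (sign = -1) sweep.
        step = 1
        while step <= k - 1:
            h = [min(h[d],
                     h[d - sign * step] + lam * step
                     if 0 <= d - sign * step < k else INF)
                 for d in range(k)]     # all k updates are independent
            step *= 2
        return h

    h = one_direction(list(cost), +1)   # forward pass
    h = one_direction(h, -1)            # backward pass
    return h
```

Because an offset of j disparity steps is composed from its binary decomposition into distinct shifts, the accumulated penalty is exactly λ·j, matching the linear cost model.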
The full linear discontinuity cost model is often not appropriate, and a truncated linear cost model with V(d, d₁) = λ min(T, |d − d₁|) is preferable. If T is chosen to be a power of two, the truncated cost model can be incorporated without an additional performance penalty into Algorithm 4 by replacing the λ 2ᴸ smoothness cost term in the
Figure 7.1: Graphical illustration of the forward pass using a recursive doubling approach.
min-convolution algorithm by λ min(T, 2ᴸ). For other values of T an additional pass over the C̄(x, ·) array is required [Felzenszwalb and Huttenlocher, 2004]. For optimal performance we restrict our implementation to the pure linear model or to the truncated model with power-of-two thresholds.
7.2.2 Overall Procedure<br />
This section describes the basic procedure for scanline optimization on the GPU, which consists of several steps. The outline of the overall procedure is presented in Algorithm 5. The input consists of two rectified images with resolution W × H. The range of potential disparity values is [dmin, dmax] with k elements.

The procedure traverses vertical scanlines positioned at x from left to right. First, the dissimilarity of the current scanline at x in the left image with the set of vertical scanlines [x + dmin, x + dmax] is calculated, resulting in a texture image with dimensions H × k. The dissimilarity is either a sum of absolute differences aggregated in a rectangular window or the sampling-insensitive pixel dissimilarity score proposed in [Birchfield and Tomasi, 1998].
If the first scanline is processed, the texture storing C̄ is initialized with the dissimilarity score. For all subsequent scanlines the lower envelope of C̄ is computed using Algorithm 4 to obtain min_{d₁} [C̄(x − 1, d₁) + λ|d − d₁|] for every row y and disparity value d. The computation of the lower envelope keeps track of the disparity value where the minimum is attained (we refer to Section 7.2.3.2 for a detailed description of the efficient disparity tracking). These tracked disparities are read back into main memory for the subsequent optimal disparity map extraction. Afterwards, the C̄ array is incremented by the dissimilarity score of the current vertical scanline.
If the final scanline is reached, the total accumulated C̄ is read back in order to determine the optimal disparities for the last column, given by arg min_d C̄(W, d). With the knowledge of the disparities for the final column, the disparities for previous columns can be assigned by a backtracking procedure.
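The interplay of per-column envelope computation, disparity tracking, and final backtracking can be illustrated with a compact CPU reference for one scanline. This is a sketch, not the GPU code: the sequential two-pass envelope stands in for Algorithm 4, and `D[x, d]` plays the role of the dissimilarity texture.

```python
import numpy as np

def scanline_optimize(D, lam):
    """Optimal disparities for one scanline under the linear cost model.
    D has shape (W, k): dissimilarity of column x at disparity d.
    Minimizes sum_x D[x, d_x] + lam * |d_x - d_{x-1}|."""
    W, k = D.shape
    C = D[0].astype(float)               # accumulated cost C(1, .)
    track = np.zeros((W, k), dtype=int)  # d1 attaining the min per transition
    track[0] = np.arange(k)
    for x in range(1, W):
        # lower envelope of C under the linear cost (two-pass version),
        # tracking the disparity where the minimum is attained
        h, arg = C.copy(), np.arange(k)
        for d in range(1, k):
            if h[d - 1] + lam < h[d]:
                h[d], arg[d] = h[d - 1] + lam, arg[d - 1]
        for d in range(k - 2, -1, -1):
            if h[d + 1] + lam < h[d]:
                h[d], arg[d] = h[d + 1] + lam, arg[d + 1]
        track[x] = arg
        C = h + D[x]
    # backtracking from the optimal disparity of the last column
    disp = np.empty(W, dtype=int)
    disp[-1] = int(np.argmin(C))
    for x in range(W - 1, 0, -1):
        disp[x - 1] = track[x][disp[x]]
    return disp, float(C.min())
```

On the GPU, `track[x]` corresponds to the tracked disparities read back per vertical scanline, and only the final backtracking runs on the CPU.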
7.2.3 GPU Implementation Enhancements<br />
The basic method outlined in the last section does not utilize the free parallelism of fragment program operations, which work on four-component vectors simultaneously. Consequently, the performance of the method can be substantially improved if this inherent parallelism is taken into account.

7.2.3.1 Fewer Passes Through a Bidirectional Approach

Essentially, W passes of the min-convolution procedure are required to obtain the final C̄ values and the corresponding disparity map. This number can be effectively halved if scanline optimization is applied at two opposing horizontal positions simultaneously,
Algorithm 5 Outline of the scanline optimization procedure on the GPU

Procedure Scanline optimization on the GPU
for x = 1 … W do
    Compute the image dissimilarity for the vertical scanline at x and all possible disparities, resulting in scoreTex
    if x = 1 then
        sumCostTex := scoreTex
    else
        Calculate the lower envelope h of sumCostTex, resulting in lowerEnvTex.
        Read back tracked disparities from lowerEnvTex.
        sumCostTex := lowerEnvTex + scoreTex
    end if
    if x = W then
        Read back the accumulated cost for the final column from sumCostTex.
    end if
end for
Extract final disparity map by backtracking
finally meeting in the central position. More formally, let C̄fw(x, d) be the accumulated cost starting from x = 1 and C̄bw(x, d) the cost beginning at x = W, which are computed simultaneously using parallel fragment operations. If we assume W to be even, in every iteration the values for C̄fw(x, d) and C̄bw(W − x + 1, d) are determined. The iterations stop at x½ := W/2 + 1, and the total cost for optimal paths with disparity d at position x½ is

C̄fw(x½, d) + C̄bw(x½, d) − D(x½, d).

Hence the initial disparity assigned to x½ is the disparity attaining the minimum of this sum, and the complete disparity map can be extracted by the backtracking procedure as already outlined. This approach better utilizes the essentially free vector processing capabilities, and this modification reduces the total runtime by approximately 45% for 384 × 288 images.
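The identity above — forward cost plus backward cost, with the doubly counted data term subtracted once, equals the cost of the best full path forced through (x½, d) — can be checked with a small CPU sketch. This is illustrative only; `accumulate` is a hypothetical reference implementation of the accumulation recurrence, not the GPU code.

```python
import numpy as np

def accumulate(D, lam):
    """Accumulated costs C(x, d) along one scanline (forward direction)."""
    W, k = D.shape
    C = np.empty((W, k))
    C[0] = D[0]
    dist = np.abs(np.arange(k)[:, None] - np.arange(k)[None, :])  # |d - d1|
    for x in range(1, W):
        # C(x, d) = D(x, d) + min_d1 [ C(x-1, d1) + lam * |d - d1| ]
        C[x] = D[x] + (C[x - 1][None, :] + lam * dist).min(axis=1)
    return C

def bidirectional_cost(D, lam):
    """Cost of the best full path forced through disparity d at the meeting
    column: Cfw + Cbw - D (the data term there is counted twice)."""
    W = D.shape[0]
    Cfw = accumulate(D, lam)                # from the left border
    Cbw = accumulate(D[::-1], lam)[::-1]    # from the right border
    x_half = W // 2                         # meeting column (0-based)
    return Cfw[x_half] + Cbw[x_half] - D[x_half]
```

Minimizing the combined cost over d at the meeting column recovers the globally optimal path cost, which is what makes the bidirectional halving of passes exact rather than approximate.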
7.2.3.2 Disparity Tracking and Improved Parallelism

Using a bidirectional approach not only reduces the number of passes, but also employs the parallelism of the fragment processor to some extent: two C̄ values are handled in parallel (C̄fw and C̄bw). Since GPUs are designed to operate on vector values with four components, an additional performance gain can be expected if four C̄ values are stored in the color channels of every pixel.

Note that the calculation of the lower envelope for C̄ is not enough, since the disparity values attaining the minimum must be stored as well in order to enable an efficient backtracking phase. If one assumes integral disparity values, image dissimilarity
scores and an integral smoothness weight λ, then C̄ and h are integer numbers as well. Hence, the associated disparity can be encoded in the fractional part of h. Furthermore, no additional operations are needed to track the disparities attaining the minimal accumulated costs. Of course, in case of ties in the min-convolution procedure, disparities with smaller encoded fractions are preferred (which is as good as any other strategy).

Encoding the disparity value in the fractional part of floating point numbers limits the image resolution in order to avoid precision loss. If the dissimilarity score is an integer from the interval [0, T], then the total accumulated cost is at most (W/2 + 1) × T, where W is the source image width. If the dissimilarity score is discretized into the range [0, 255], 16 bits of the mantissa are required to encode C̄ for half PAL resolution (W = 384), which leaves enough accuracy to encode the disparities in the fractional part. The sign bit of the floating point representation can additionally be exploited by centering the range of dissimilarity scores around 0.
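The packing scheme can be mimicked on the CPU as a sketch: the integer accumulated cost occupies the integer part, and the disparity, scaled by 1/256 (assuming at most 256 disparity levels, an assumption of this illustration), occupies the fractional part. Python floats are doubles, so this sidesteps the 32-bit mantissa budget discussed above.

```python
# Illustrative packing of (integer cost, disparity) into one float.
def pack(cost, disp):
    return cost + disp / 256.0           # disp in [0, 255]

def unpack(value):
    cost = int(value)                    # integer part: accumulated cost
    disp = round((value - cost) * 256)   # fractional part: disparity
    return cost, disp
```

A plain `min()` on packed values compares costs first and, on ties, prefers the smaller encoded disparity — exactly the tie-breaking behavior described above, with no extra operations.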
Utilizing this compact representation for accumulated cost/disparity pairs allows us to handle two horizontal scanlines in parallel, thereby halving the effective image height for the min-convolution. Figure 7.2 illustrates the parallel processing of two vertical scanlines in the bidirectional approach and the assignment of the RGBA channels to pixel positions.
Figure 7.2: Parallel processing of vertical scanlines using the bidirectional approach for optimal utilization of the four available color channels. The arrows indicate the progression of the processed scanlines in consecutive passes.
7.2.3.3 Readback of Tracked Disparities

After the lower envelope is computed, the encoded tracked disparities are read back into main memory to be available for the final backtracking procedure. The tracked disparity
values encoded in the fractional part of the lower envelope are extracted directly on the GPU into an 8-bit framebuffer (which is efficient, since fragment programs on NVidia hardware support native instructions to obtain the fractional part of a floating point number). The tracked disparities are then read back as byte channels. We found this approach to be the fastest, since the usually expensive conversion from floating point numbers to integers is performed on the GPU without a performance penalty, and the amount of data to be read back is substantially reduced.
7.2.4 Results<br />
First we give timing results for the CPU and GPU implementations of scanline optimization. The CPU version is a straightforward C++ implementation using the min-convolution as described in Algorithm 3. The disparity map is determined for successive scanlines. Code optimization is left to the compiler. The GPU implementation is based on OpenGL using the frame buffer extension and the Cg language.

The timing tests are performed on two hardware platforms: the first platform is a PC with a 3 GHz Pentium 4 CPU (CPUA) and an NVidia GeForce 6800 graphics board (GPUA) running Linux. The C++ source is compiled with gcc 3.4.3 and -O2 optimization. The second system is a PC with an AMD Athlon64 X2 4400+ CPU (CPUB) and GeForce 7800 GT graphics hardware (GPUB). The employed compiler is gcc 4.0.1, again with -O2 optimization.
Table 7.1 displays the obtained timing results. Tsukuba 1x denotes the original well-known dataset with 384 × 288 image resolution and 15 possible disparity values. Tsukuba 2x and 4x denote the same dataset resized to 768 × 288 and 1536 × 288 pixels, respectively. The possible disparity range consists of 30 and 60 values, respectively. We select horizontal stretching of the image to simulate sub-pixel disparity estimation.

The Pentagon dataset is another common stereo dataset with 512 × 512 pixel resolution and 16 potential disparity values (Pentagon 1x). Resizing the images to 1024 × 1024 resolution yields the Pentagon 2x dataset (32 disparities). The image similarity function in all datasets is the SAD using a 3 × 1 window calculated on grayscale images. In order to avoid the memory-consuming 3D disparity space image, the image dissimilarity is calculated on demand for the current vertical scanline.
             CPUA    GPUA    CPUB    GPUB
Tsukuba 1x   0.0462  0.1180  0.0373  0.0678
Tsukuba 2x   0.1891  0.2911  0.1387  0.1565
Tsukuba 4x   0.7257  1.0082  0.5655  0.4566
Pentagon 1x  0.1261  0.1877  0.0953  0.1165
Pentagon 2x  0.9458  1.0381  0.7065  0.4930

Table 7.1: Average timing results for various dataset sizes in seconds/frame.
7.3. Cross-Correlation based Multiview Scanline Optimization on <strong>Graphics</strong> Hardware 105<br />
The results in Table 7.1 clearly indicate that the multi-pass GPU method is significantly slower than the CPU version for small image resolutions. For higher resolutions the speed is roughly equal, or the GPU version shows better performance, depending on the hardware. Note that most time is actually spent in the scanline optimization procedure itself; only about 15–20% of the frame time is spent calculating this particularly simple image dissimilarity. Additionally, we observed that the CPU-based backtracking part to extract the optimal disparities has a negligible impact on the total runtime.

The required time grows almost linearly with increasing resolution on the CPU, which is in contrast to the GPU curve. In theory, the 4-times-stretched Tsukuba dataset should require 16-fold runtime (fourfold number of disparities and of horizontal pixels). The CPU version largely matches this expectation (15.1- and 15.7-fold runtime), whereas the GPU shows sublinear behavior (8.5- and 6.7-fold runtime, respectively). At low resolutions the setup times for frame buffers etc. become a more dominant fraction of the total runtime.
In order to provide a visual proof of the correctness of the proposed GPU implementation, the disparity maps for several standard stereo datasets are shown in Figures 7.3 and 7.4. Additionally, the obtained depth maps using subpixel disparity estimation for the Tsukuba images are displayed in Figure 7.3(b) and (c).

Figure 7.3: Disparity images for the Tsukuba dataset at several horizontal resolutions, generated by the GPU-based scanline approach: (a) 1x, (b) 2x, (c) 4x.
7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware

This section extends and modifies the approach to depth estimation using scanline optimization on the GPU presented in Section 7.2. The value of the former method is increased by enabling multiple views to be handled. Additionally, the SAD matching cost function can be replaced by the usually more robust cross-correlation similarity score.
Figure 7.4: Disparity images for the (a) Cones and (b) Teddy image pairs from the Middlebury stereo evaluation datasets. These disparity images illustrate only the correctness of the GPU implementation; the images are not intended to indicate superior matching performance.
7.3.1 Input Data and General Setting

The input data for this method consists of n ≥ 2 grayscale source images of dimension w × h with lens distortion already removed. Additionally, the camera intrinsic parameters and the relative poses between the views are known. One source image plays the particular role of a key view, for which the depth map is calculated. The other views are used to evaluate the depth hypotheses and are called sensor images. The depth image assigns one depth value from the range [znear, zfar], discretized into D possible values. In our implementation the potential depth values are taken equally spaced from this interval.

The viewing frustum induced by the key view, limited to the depth range [znear, zfar], comprises a 3D volume which encloses the feasible surface to be reconstructed. Plane-sweep methods and our approach traverse this volume using a sequence of 3D planes and warp the sensor images onto each plane (or rather the corresponding quadrilateral formed by intersection with the view frustum). Plane-sweep methods typically use 3D planes parallel to the key image plane, whereas our method uses planes induced by vertical scanlines in the key image.
In the later sections we describe the implementation of several image dissimilarity functions, which are calculated for a user-specified aggregation (support) window of W × H pixels. The sum of absolute differences (SAD) between two rectangular sets of pixels is defined as

$$\mathrm{SAD} = \sum_{i \in \mathcal{W}} |X_i - Y_i|,$$
where i ∈ W denotes the set of pixels in the rectangular support window W. The zeromean<br />
normalized cross correlations is defined as follows:<br />
�<br />
i∈W<br />
NCC =<br />
(Xi − ¯ X) (Yi − ¯ Y )<br />
�� i∈W (Xi − ¯ �<br />
X) 2<br />
=<br />
By the shifting property one gets:<br />
NCC =<br />
i∈W (Yi − ¯ Y ) 2<br />
�<br />
i∈W (Xi − ¯ X) (Yi − ¯ Y )<br />
�<br />
σ2 X σ2 Y<br />
�<br />
XiYi − 1<br />
N (� Xi) ( � Yi)<br />
�<br />
σ2 X σ2 , (7.1)<br />
Y<br />
with σ 2 X = � X 2 i − (� Xi) 2 /N <strong>and</strong> σ 2 Y = � Y 2<br />
i − (� Yi) 2 /N. Hence, it is possible<br />
to compute the cross correlation solely from several sums aggregated within the support<br />
window.<br />
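For illustration, the shifting property can be checked with a short sketch (plain Python with made-up window contents; the helper names are not part of the original implementation):

```python
import random

def ncc_direct(X, Y):
    """Zero-mean NCC computed directly from the definition."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(X, Y))
    den = (sum((a - mx) ** 2 for a in X) * sum((b - my) ** 2 for b in Y)) ** 0.5
    return num / den

def ncc_from_sums(sx, sy, sxx, syy, sxy, n):
    """NCC from five window-aggregated sums via the shifting property (Eq. 7.1)."""
    var_x = sxx - sx * sx / n      # sigma_X^2 = sum X_i^2 - (sum X_i)^2 / N
    var_y = syy - sy * sy / n
    return (sxy - sx * sy / n) / (var_x * var_y) ** 0.5

random.seed(0)
X = [random.random() for _ in range(81)]            # e.g. a 9 x 9 support window
Y = [0.7 * a + 0.1 * random.random() for a in X]    # correlated "sensor" pixels
val_direct = ncc_direct(X, Y)
val_sums = ncc_from_sums(sum(X), sum(Y),
                         sum(a * a for a in X), sum(b * b for b in Y),
                         sum(a * b for a, b in zip(X, Y)), len(X))
assert abs(val_direct - val_sums) < 1e-9
```

Both evaluations agree; this is what allows the GPU implementation to aggregate only running sums instead of zero-mean products.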
If multiple sensor images are provided, the total matching cost for a depth hypothesis is the sum of the individual (optionally truncated) matching costs between the key view and each sensor image. Using 8- or 16-bit resolution for the correlation values, this sum can be obtained by utilizing the blending (i.e. in-place accumulation) stage of recent graphics hardware.
7.3.2 Similarity Scores based on Incremental Summation
If one employs a plane-sweep approach combined with a purely local winner-takes-all depth extraction method (see Figure 7.5), spatial aggregation within the support window is easily performed. Warping the sensor images onto the current depth plane and the spatial aggregation can be substantially accelerated by graphics hardware due to its specific projective texture sampling capabilities (see Chapter 4 and [Yang et al., 2002, Yang and Pollefeys, 2003, Cornelis and Van Gool, 2005]).
On the other hand, if a global depth extraction method is utilized, the matching cost values conceptually comprise a disparity space image (DSI), which stores the matching score for every pixel in the key view and every candidate depth value. Hence, the DSI is a 3D data array with w × h × D elements. When using scanline optimization to find the optimal depth assignments for horizontal scanlines in the key view, the matching costs for every pixel and depth value are accessed only once. Consequently, the matching scores can be calculated on demand for vertical lines in the key view as the algorithm successively updates the C̄ array from left to right. Due to this simple observation the memory-consuming construction of the DSI can be avoided. In the following paragraphs we describe this on-the-fly matching cost computation for multiple view configurations in more detail.
Chapter 7. Scanline Optimization for Stereo On Graphics Hardware

Figure 7.5: Plane-sweep approach to multiple view matching (key view and sensor view)
In contrast to plane-sweep approaches, which warp the sensor images onto a plane parallel to the key image plane positioned at a certain depth, we project the sensor images onto a plane induced by a vertical scanline x = const in the key image (Figure 7.6). This plane is formed by all rays K_0^{-1} (x, y, 1)^T for a fixed x value and varying y.

Figure 7.6: Plane sweep from left to right (key view with depth range [z_near, z_far])
If the aggregation (correlation) window size is W × H, then (at least conceptually) W slices around the current x-value must be stored. For image dissimilarity functions which can be computed by appropriate box filters, like the sum of absolute differences (SAD), the sum of squared differences (SSD) and the normalized cross correlation (NCC), the aggregated sums can be maintained in an incremental manner by providing the new incoming slice and the outgoing slice to the updating procedures.
7.3.3 Sensor Image Warping
We assume that the key view has a canonical position, i.e. P_0 = K_0 (I|0) with the known camera intrinsic matrix K_0. The sensor view i has the projection matrix P_i = (M_i|m_i) = K_i (R_i|t_i). Then the 2D point (x, y) wrt. the key view combined with a depth z maps into the sensor images in the following manner:

q_i ∼ z A_i (x, y, 1)^T + m_i,

with A_i = M_i K_0^{-1}. q_i is a homogeneous quantity (a 3-vector). Using projective texture mapping, the correct intensity values from the sensor images can be sampled.
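A small sketch of this mapping (plain Python with NumPy; the camera parameters are hypothetical values, not from the thesis's datasets):

```python
import numpy as np

# Hypothetical camera setup: key view P0 = K0 [I|0], sensor view Pi = Ki (Ri|ti).
K0 = np.array([[500.0, 0.0, 256.0],
               [0.0, 500.0, 256.0],
               [0.0, 0.0, 1.0]])
Ki = K0.copy()
a = np.deg2rad(5.0)                                  # small rotation about y
Ri = np.array([[np.cos(a), 0.0, np.sin(a)],
               [0.0, 1.0, 0.0],
               [-np.sin(a), 0.0, np.cos(a)]])
ti = np.array([-0.1, 0.0, 0.0])                      # small baseline
Mi, mi = Ki @ Ri, Ki @ ti
Ai = Mi @ np.linalg.inv(K0)                          # A_i = M_i K_0^{-1}

def map_to_sensor(x, y, z):
    """Map key-view pixel (x, y) with depth z into sensor image coordinates."""
    qi = z * (Ai @ np.array([x, y, 1.0])) + mi       # homogeneous 3-vector
    return qi[:2] / qi[2]

# Consistency check: back-project the pixel to 3D and project through Pi.
x, y, z = 300.0, 200.0, 2.5
X3d = z * (np.linalg.inv(K0) @ np.array([x, y, 1.0]))
q = Mi @ X3d + mi
uv, uv_ref = map_to_sensor(x, y, z), q[:2] / q[2]
assert np.allclose(uv, uv_ref)
```

The one-matrix-one-vector form q_i = z A_i (x, y, 1)^T + m_i is exactly what the fragment program evaluates per pixel and depth hypothesis.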
Warping the sensor images onto the planar slices as indicated in Figure 7.6 can be performed by rendering an aligned quadrilateral into a buffer of dimensions h × D. In world space the quad is determined by a constant x value and varying y ∈ [1, h] and z ∈ [z_near, z_far]. Rasterizing this quadrilateral amounts to sampling the pixels from the sensor images using projective texture mapping. Consequently, the sensor image intensity values wrt. all depth hypotheses for the current vertical scanline can be easily retrieved.
Note that during rendering of this slice additional operations can be performed for higher efficiency. For instance, the corresponding key view pixels (comprising a vertical line at the current x position) can be sampled as well, and a binary operation can be applied to the sampled key image pixel and the sensor image pixel. This feature is utilized as described in the next sections.
Sensor Image Sampling  In a plane-sweep approach the rendered quadrilateral corresponding to a depth plane matches the assumed fronto-parallel surface geometry. Consequently, higher quality sensor image sampling using mipmapped trilinear or anisotropic filtering is immediately available. Since our rendered slices do not match the assumed (fronto-parallel) object surface, the texture space to screen space derivatives interpolated by the rasterization hardware from the provided quadrilateral geometry are incorrect. The simplest solution is to revert to basic linear filtering without using derivative information at all. Another solution is to provide derivatives computed in the fragment program to the texture lookup functions, which is possible on newer graphics hardware. If q_i = (q_i^x, q_i^y, q_i^z) is the homogeneous position in the sensor image for a given key image pixel (x, y) and depth z (as described above), then the texture coordinates are (s, t) = (q_i^x / q_i^z, q_i^y / q_i^z). Additionally, we have for the texture space derivatives

∂s/∂x = z (A_11 X_3 − A_31 X_1) / (X_3)^2,

with X = (X_1, X_2, X_3)^T = z A_i (x, y, 1)^T + m_i, where A_kl are the elements of A_i. The other derivatives ∂s/∂y, ∂t/∂x and ∂t/∂y are calculated in an analogous manner. Using these derivatives the texture footprint of a fronto-parallel surface can be simulated. The projective texture lookup to sample the sensor images is then replaced by a 2D lookup with supplied texture space derivatives.
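The analytic derivative above can be cross-checked numerically (a self-contained sketch with an arbitrary, made-up matrix A and offset m standing in for A_i and m_i):

```python
# Fixed toy values standing in for A_i = M_i K_0^{-1} and m_i.
A = [[1.2, 0.1, -0.3],
     [0.0, 0.9, 0.2],
     [0.05, -0.02, 1.0]]
m = [0.4, -0.1, 3.0]
z = 2.0

def X_of(x, y):
    """X = z A (x, y, 1)^T + m."""
    return [z * (A[r][0] * x + A[r][1] * y + A[r][2]) + m[r] for r in range(3)]

def s_of(x, y):
    X = X_of(x, y)
    return X[0] / X[2]            # texture coordinate s = X1 / X3

def ds_dx(x, y):
    """Analytic derivative: ds/dx = z (A11 X3 - A31 X1) / X3^2."""
    X = X_of(x, y)
    return z * (A[0][0] * X[2] - A[2][0] * X[0]) / X[2] ** 2

x, y, eps = 0.3, -0.7, 1e-6
numeric = (s_of(x + eps, y) - s_of(x - eps, y)) / (2 * eps)
assert abs(numeric - ds_dx(x, y)) < 1e-6
```

The central finite difference matches the closed-form expression, confirming the derivative used for the 2D texture lookup.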
In our evaluated datasets the results using linear resp. anisotropic texture sampling are effectively indistinguishable due to the small-baseline multiview geometry. If several surface orientations are evaluated to obtain more accurate reconstructions [Akbarzadeh et al., 2006], higher quality sensor image sampling could be beneficial. Enabling fourfold anisotropic texture filtering increased the total runtime by about 5–10% in our experiments.
7.3.4 Slice Management
The scanline optimization procedure stores the epipolar volume slices around the current x position, i.e. the slices corresponding to the columns x − W/2, …, x + W/2. When the matching cost computation and the update of C̄ for the current x position are finished, the new slice corresponding to x + W/2 + 1 is rendered into a temporary buffer. The matching cost update routines are invoked with the now obsolete slice at x − W/2 and the newly generated slice at x + W/2 + 1 provided. This allows the cost update functions to perform an incremental update of their stored values. Afterwards, the buffer holding the obsolete slice can be reused as the target slice at x + W/2 + 2 in the next iteration.
Figure 7.7 illustrates the incremental update of the accumulated values. Note that several different accumulation results may be required depending on the employed matching cost function.

Figure 7.7: Spatial aggregation for the correlation window (previous sum, incoming slice, outgoing slice). At first, the pixels are aggregated in the x-direction by incremental summation of multiple slices. The final aggregated value is obtained by vertical summation of these intermediate pixels.
7.3.5 SAD Calculation
If the SAD is chosen as the image dissimilarity cost, the incremental update is very simple: when rendering the 3D quadrilateral to sample the sensor images, the absolute differences between the sensor image and the key image pixels are calculated on the fly. The procedure to calculate the SAD matching cost maintains only the horizontal sums of absolute differences for j ∈ {x − W/2, …, x + W/2}. This can be easily achieved, since the update procedure takes the obsolete and the newly generated slice as input. Computing the actual matching score is performed by vertical aggregation of H pixels.
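The incremental bookkeeping can be sketched on the CPU as follows (a minimal Python model with random toy images; slice_cost stands in for the on-the-fly absolute differences computed during slice rendering):

```python
import random

W, H = 9, 9                     # aggregation window
w, h = 40, H                    # toy key/sensor image patch of h rows
random.seed(2)
key = [[random.randrange(256) for _ in range(w)] for _ in range(h)]
sen = [[random.randrange(256) for _ in range(w)] for _ in range(h)]

def slice_cost(j):
    """Absolute differences of column j (one incoming/outgoing slice)."""
    return [abs(key[y][j] - sen[y][j]) for y in range(h)]

r = W // 2
# Initialize the horizontal sums for the first valid center column x = r.
hsum = [sum(slice_cost(j)[y] for j in range(W)) for y in range(h)]
sad_incremental, sad_brute = [], []
for x in range(r, w - r):
    if x > r:  # incremental update: add the incoming slice, drop the outgoing one
        inc, out = slice_cost(x + r), slice_cost(x - r - 1)
        hsum = [hsum[y] + inc[y] - out[y] for y in range(h)]
    sad_incremental.append(sum(hsum))          # vertical aggregation of H rows
    sad_brute.append(sum(abs(key[y][j] - sen[y][j])
                         for y in range(h) for j in range(x - r, x + r + 1)))
assert sad_incremental == sad_brute
```

Each new center column costs one slice addition and one subtraction instead of a full W × H summation.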
7.3.6 Normalized Cross Correlation
The basic method to maintain the sums for the NCC calculation is essentially similar to the SAD version. In this case, three horizontal sums need to be maintained: Σ_i Y(i, y), Σ_i Y(i, y)^2, and Σ_i X(i, y) Y(i, y), where X(·) denotes key image pixels and Y(·) refers to sampled sensor image pixels. The epipolar volume slice extraction calculates Y(i, y) and the product X(i, y) Y(i, y) and stores these values in two of the color channels.

The standard deviation σ_X wrt. the aggregation window for every pixel in the key image and the box filtering result Σ_{i∈W} X_i can be precomputed and are immediately available during the iterations at no additional cost.

The calculation of the final correlation score involves vertical aggregation of Σ_i Y(i, y) and Σ_i X(i, y) Y(i, y) to obtain the sums for the rectangular window W, i.e. Σ_{i∈W} Y_i resp. Σ_{i∈W} X_i Y_i. The squared sum Σ_{i∈W} Y_i^2 can be generated simultaneously while aggregating Σ_{i∈W} Y_i. A final fragment program calculates the NCC using Equation 7.1 from these intermediate values.

Note that this approach requires additional buffers to store the appropriate horizontal sums for each sensor image.
In practice we use the square root of the NCC as the employed matching cost for the following reasons: at first, discretizing the NCC directly into e.g. 255 different values induces inaccuracies, especially for small matching costs. In contrast, the graph of √NCC has a more linear shape, hence a uniform discretization is feasible. Secondly, the NCC behaves qualitatively like a squared difference between normalized intensities, since

Σ_{i∈W} ( (X_i − X̄)/σ_X − (Y_i − Ȳ)/σ_Y )^2 = 2 − 2 NCC(X, Y).

Hence we consider it reasonable to adapt the matching cost to the linear regularization cost model by taking the square root.
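The identity above can be verified numerically (plain Python; note that σ_X here denotes the window-aggregated deviation √(Σ (X_i − X̄)^2), consistent with Equation 7.1):

```python
import random

def ncc(X, Y):
    """Zero-mean normalized cross correlation over one window."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(X, Y))
    den = (sum((a - mx) ** 2 for a in X) * sum((b - my) ** 2 for b in Y)) ** 0.5
    return num / den

random.seed(3)
X = [random.random() for _ in range(81)]
Y = [random.random() for _ in range(81)]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sx = sum((a - mx) ** 2 for a in X) ** 0.5   # sigma_X over the window
sy = sum((b - my) ** 2 for b in Y) ** 0.5
lhs = sum(((a - mx) / sx - (b - my) / sy) ** 2 for a, b in zip(X, Y))
rhs = 2 - 2 * ncc(X, Y)
assert abs(lhs - rhs) < 1e-9
```

Expanding the square gives 1 + 1 − 2 NCC, which is the algebraic content of the identity.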
7.3.7 Depth Extraction by Scanline Optimization
The matching costs for the currently active vertical scanline are used to update the accumulated cost array C̄. In order to have a pure GPU implementation, this step is performed by graphics hardware as well, as described in Section 7.2. Alternatively, reading back the matching scores and performing CPU-based depth extraction by dynamic programming is possible as well [Wang et al., 2006].
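The CPU variant can be sketched as follows, assuming the standard scanline optimization recurrence with a linear smoothness term, C̄(x, d) = c(x, d) + min_{d₁} (C̄(x−1, d₁) + λ|d − d₁|); the toy cost values are made up, and the naive O(D²) transition replaces the recursive minimum of the GPU formulation:

```python
def scanline_optimize(cost, lam):
    """Dynamic programming over one scanline with linear smoothness cost."""
    w, D = len(cost), len(cost[0])
    Cbar = [row[:] for row in cost]
    for x in range(1, w):
        prev = Cbar[x - 1]
        for d in range(D):     # naive O(D^2) transition for clarity
            Cbar[x][d] += min(prev[d1] + lam * abs(d - d1) for d1 in range(D))
    # Backtrack the optimal depth assignment from right to left.
    path = [min(range(D), key=lambda d: Cbar[w - 1][d])]
    for x in range(w - 2, -1, -1):
        d_next = path[-1]
        path.append(min(range(D),
                        key=lambda d: Cbar[x][d] + lam * abs(d - d_next)))
    return path[::-1]

# Toy example: a step edge in depth plus a gross outlier in the cost volume.
costs = [[10 * abs(d - (1 if x < 4 else 5)) for d in range(8)] for x in range(8)]
costs[2][1] = 4    # slight noise on the true depth
costs[2][7] = 0    # outlier that winner-takes-all would select at x = 2
depths = scanline_optimize(costs, lam=3)
assert depths == [1, 1, 1, 1, 5, 5, 5, 5]
```

Unlike a per-pixel winner-takes-all decision, the smoothness term suppresses the outlier while still allowing the genuine depth discontinuity.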
In Section 7.2.3 the vector processing capability of the fragment processor (operating on 4-component vectors simultaneously without additional costs) is utilized by a bidirectional approach: the accumulated costs C̄ are calculated in parallel starting from x = 1 in the forward direction and from x = w backwards, meeting in the central position. Backtracking of the optimal depth values is subsequently performed towards the left and right borders starting from the central pixel. This approach halves the number of iterations in the multipass methods and doubles the employed parallelism in the fragment programs. Additionally, two vertically adjacent pixels are treated within the same fragment, requiring a compact encoding of C̄ and the corresponding depth value in one floating point number. We apply the first, bidirectional scanning technique to improve the parallelism in this work as well. This implies that matching costs are computed simultaneously for the vertical scanlines at x_1 = x and x_2 = w − x. The intermediate values and correlation scores for x_1 and x_2 are stored in the red and green channel resp. the blue and alpha channel.

We do not utilize the second method, since it limits the image and depth resolution to ensure accurate results. Nevertheless, we substantially improved the performance of the GPU-based scanline optimization method using the following approach: we restrict the precision of C̄ stored in GPU memory to 16-bit float values (fp16), which allow accurate representation of integer values in the range [−2047, 2047]. Using fp16 values instead of the full IEEE precision floating point range halves the memory bandwidth required by the GPU-based scanline optimization method. Since this procedure is bandwidth limited (recall Algorithm 4), the performance of this step is approximately doubled.
In order to maintain the accuracy of the generated depth maps we assume that the matching cost is an integral value from the range [0, 255] and that λ is integral as well. Hence C̄ is an integral quantity, too. In order to avoid overflows of C̄, we perform frequent renormalization of C̄ using the following update:

C̄(x, d) ← C̄(x, d) − min_{d_1} C̄(x, d_1) − 2047.

We subtract 2047 to exploit the sign bit of the fp16 representation, too. Using C̄(x + n, d) − C̄(x, d) ≤ 255 n and C̄(x, d) − min_{d_1} C̄(x, d_1) ≤ λD, we can calculate the frequency of updates from

C̄(x + n, d) − min_{d_1} C̄(x, d_1) ≤ λD + 255 n.

For the fp16 representation we require that the right hand side is at most 4094 (i.e. 2 × 2047), hence

n ≤ (4094 − λD) / 255.

This means that n vertical scanlines can be updated without renormalization. For D = 200 and λ = 2 we get n = 14. For the experiments we fixed n = 16 without visible degradation of the obtained depth map.
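The derivation reduces to a one-line formula, sketched here for completeness (the parameter names are ours):

```python
def renorm_interval(D, lam, cost_max=255, fp16_max=2047):
    """Scanlines that can be processed between renormalizations:
    n <= (2 * fp16_max - lam * D) / cost_max."""
    return (2 * fp16_max - lam * D) // cost_max

# Reproduces the value quoted in the text: D = 200, lambda = 2 gives n = 14.
assert renorm_interval(D=200, lam=2) == 14
```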
7.3.8 Memory Requirements
The parallel computing pattern of our approach, treating whole vertical scanlines at once, requires saving the full data needed for the final backtracking procedure. After updating C̄, this data is read back from GPU memory into main memory. If the depth range contains less than 256 entries, the required memory is w × h × D bytes, which is e.g. less than 190 MB for datasets with 768 × 1024 × 250 resolution.
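The quoted figure follows directly from one byte per (pixel, depth) entry:

```python
def backtrack_memory_bytes(w, h, D):
    """One byte per (pixel, depth-candidate) entry suffices for D < 256."""
    assert D < 256
    return w * h * D

mb = backtrack_memory_bytes(768, 1024, 250) / 2**20
assert 187 < mb < 190    # approximately 187.5 MB, i.e. "less than 190 MB"
```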
7.3.9 Results
The timing results reported in this section are obtained on a Linux PC equipped with a Pentium IV 3 GHz main processor and an NVidia GeForce 6800 graphics card with 12 pixel pipelines.
The first dataset, depicted in Figure 7.8, consists of a virtual turntable sequence displaying a simple building model. The synthetically rendered images are resized to 512 × 512 pixels, which is the resolution of the obtained depth image as well. Since a turntable is emulated, the scene objects are rotated, but the light sources remain fixed. Hence the surface shading changes substantially between the views. Consequently, the resulting depth maps calculated with the SAD matching cost function shown in Figure 7.9(a) and (b) have many significant defects. All these depth maps are computed for a depth range containing 200 equally spaced values. Figure 7.9(c) displays the depth image obtained by a plane-sweep approach using a winner-takes-all depth extraction method (Chapter 4). There are still visible mismatches in textureless regions. Finally, Figure 7.9(d) is the result of the proposed NCC + scanline optimization implementation. The scanline optimization procedure is performed on the GPU as well. In all cases the correlation window is set to 9 × 9 pixels. As an alternative to the pure GPU method, we implemented a mixed CPU/GPU approach: while the GPU calculates the matching cost for the next vertical scanline, the CPU updates C̄ for the current vertical scanline in parallel (using a straightforward C++ implementation). The runtime of this mixed approach is almost identical to the GPU method for this dataset.
Figure 7.8: The three input views of the synthetic dataset: (a) left view, (b) center view, (c) right view.
Table 7.2 displays the runtimes of our implementation at different resolutions. We evaluated pure GPU approaches (GPU-fp32 and GPU-fp16) and mixed implementations utilizing the CPU for the scanline optimization part. GPU-fp32 denotes the pure GPU implementation without successive renormalization every 16 scanlines; hence, 32-bit floating point values are used to store the accumulated costs C̄. GPU-fp16 indicates the pure GPU algorithm using 16-bit values for C̄ utilizing frequent renormalization. We give timing results for two mixed CPU/GPU approaches as well: the first is a synchronous approach, where the matching cost calculation on the GPU and the dynamic programming on the CPU are performed in a sequential manner (4th column). These timings allow a direct comparison of the scanline optimization part with the corresponding runtimes on the GPU. The asynchronous version of the mixed approach calculates the matching cost for the next vertical scanline on the GPU while C̄ is updated by the CPU (5th column). The runtime of this parallel approach is the fastest of all dynamic programming implementations, since the total runtime is dominated solely by the NCC computation (and the update of C̄ is basically free). Finally, WTA denotes the local plane-sweep approach from Chapter 4.

Figure 7.9: The obtained depth maps and timing results for the synthetic dataset: (a) WTA, SAD: 0.82s; (b) SO, SAD: 5.1s; (c) WTA, NCC: 2.86s; (d) SO, NCC: 6.21s. WTA denotes a GPU plane-sweep approach with a winner-takes-all depth extraction (Chapter 4); SO designates the scanline optimization implementation proposed in this work.
Resolution          GPU-fp32   GPU-fp16   Mixed sync.   Mixed async.   WTA
256 × 256 × 100     0.79s      0.69s      0.66s         0.55s          0.34s
512 × 512 × 200     6.2s       5.1s       5.0s          3.9s           2.7s
512 × 768 × 200     9.2s       7.7s       7.7s          6.0s           4.1s
768 × 1024 × 250    27.1s      21.4s      20.6s         16.5s          10.9s
768 × 1024 × 250    10.1s      9.4s       9.6s          6.1s           5.0s

Table 7.2: Runtimes of scanline optimization using a 9 × 9 NCC at different resolutions using three views. The last row displays the runtimes on a PC equipped with an Athlon64 X2 4400+ and a GeForce 7800GT.
The comparison of the last two columns (asynchronous CPU/GPU and winner-takes-all depth extraction) reveals the performance penalty induced by the different sweep directions. The main reason for the higher performance of the WTA approach is that this method utilizes all 4 components in the fragment processor, whereas the proposed implementation calculates only two matching scores per pixel.
The scanline optimization time alone for GPU-fp32 is approximately twice the time needed by GPU-fp16, as predicted. To see this, the NCC calculation time given in the next-to-last column must be subtracted from the total time given in the respective columns. Finally, CPU scanline optimization using integer arithmetic is still slightly faster than our GPU-fp16 implementation (columns 3 and 4).
The last row of Table 7.2 depicts the runtimes observed on more recent PC hardware equipped with an Athlon64 X2 4400+ and a GeForce 7800GT. The performance difference between the local approach and the fastest scanline optimization method is smaller than the gap observed on our main PC. Additionally, the performance gain of GPU-fp16 over GPU-fp32 is less pronounced. These partially unexpected, but still preliminary, results on current 3D hardware need further analysis.
Figure 7.10 provides visual results for a dataset consisting of three images showing a wooden Bodhisattva statue. The source images and the depth maps have a resolution of 512 × 768 pixels, and the depth range contains 200 values. The lighting conditions change slightly between the input views (Figure 7.10(a)–(c)). The depth image obtained by a pure winner-takes-all approach using a 9 × 9 NCC is shown in Figure 7.10(d). The result of our multiview scanline optimization method is displayed as a depth map (Figure 7.10(e)) and as a triangulated surface mesh (Figure 7.10(f)). The computation times for the local method and the proposed one are 4.1s and 6s, respectively.
7.4 Discussion
In this chapter we propose a scanline optimization procedure for disparity estimation suitable for stream architectures like modern programmable graphics processing units. Although the direct implementation of scanline optimization using destructive (i.e. in-place) value updates must be replaced by a more expensive recursive approach, the huge computational power of current GPUs turns out to be beneficial for larger image resolutions and disparity ranges. Consequently, the entire disparity estimation pipeline, comprising matching score computation and semi-global disparity extraction, can be performed on graphics hardware, thereby avoiding the relatively costly data transfer between the GPU and the CPU and leaving the CPU idle for other tasks.
Additionally, the basic GPU-friendly approach to scanline optimization for a rectified stereo pair is extended to the multiple view case utilizing the more robust cross correlation matching score. The matching costs are generated on demand as required by the main dynamic programming procedure. When using more complex dissimilarity scores, it turns out to be most efficient to employ the GPU and the CPU in parallel: while the GPU calculates the next set of matching scores, the CPU updates the accumulated costs for the current vertical scanline.
From the timing results presented in Section 7.2.4 it can be concluded that a GPU-based scanline optimization procedure is mostly suitable for larger images and disparity ranges, but not truly appropriate for realtime applications. For small image resolutions the overhead of multipass rendering is still too significant to take advantage of the processing power of modern GPUs. Additionally, a scanline optimization procedure using a linear smoothness cost model is better suited for larger disparity ranges, where a (potentially truncated) linear model is preferable over the Potts model. If the disparity range contains only a few values, enforcing smooth disparity maps is futile, since consecutive values in the disparity range typically correspond to substantial depth discontinuities. Hence, a linear model is not effective in the case of few potential disparities, and a different approach like the near-realtime reliable dynamic programming (RDP) approach [Gong and Yang, 2005b] is better suited. On the other hand, we believe that the Potts model used in the RDP approach is not appropriate for high-quality reconstruction applications.
If object silhouettes are available (e.g. by background segmentation), the quality of the depth map can be improved due to the knowledge of the visual hull. Datasets comprising turntable sequences with a known background (e.g. the reference multiview stereo datasets presented in [Seitz et al., 2006]) allow a particularly simple background segmentation. Additionally, the depth estimation performance can be increased by using the z-buffer test to avoid matching cost calculation for background pixels. Incorporating these improvements in such cases is ongoing work.
In order to obtain better depth maps and to reduce the influence of the actual setting of the smoothness weight, the benefit of an adaptive smoothness weight, based e.g. on the source image gradients [Fua, 1993, Scharstein and Szeliski, 2002], needs to be investigated.
Figure 7.10: The three input views of a wooden Bodhisattva statue and the corresponding results: (a) left view, (b) center view, (c) right view, (d) depth map (WTA), (e) depth map (SO), (f) mesh view. The depth maps use the local depth extraction approach indicated by WTA and the proposed scanline optimization method; (f) shows a view of the triangulated mesh.
Chapter 8

Volumetric 3D Model Generation

Contents
8.1 Introduction
8.2 Selecting the Volume of Interest
8.3 Depth Map Conversion
8.4 Isosurface Determination and Extraction
8.5 Implementation Remarks
8.6 Results
8.7 Discussion

8.1 Introduction
With the exception of our voxel coloring approach, all methods presented so far generate a set of depth images resp. 2.5D height fields. In order to create true 3D models, this set of depth maps must be combined into a common representation. The method proposed in this chapter to create proper 3D models is based on an implicit volumetric representation, from which the final surface can be extracted by any implicit surface polygonization technique. The principles of robust fusion of several depth maps in the context of laser-scanned data were developed by Hilton et al. [Hilton et al., 1996] and Curless and Levoy [Curless and Levoy, 1996]. We apply essentially the same technique to depth maps obtained by dense depth estimation procedures, but the basic approach needs to be modified to be more robust against outliers occurring in the input depth maps. The basic idea of volumetric depth image integration is the conversion of depth maps to corresponding 3D distance fields and the subsequent robust averaging of these distance fields. The resolution and the accuracy of the final model are determined by the quality of the source depth images and the resolution of the target volume.

Instead of using an implicit representation of the surfaces induced by the depth images, one can merge a set of polygonal models directly [Turk and Levoy, 1994]. Such an approach is sensitive to outliers and mismatches occurring in the depth images. A volumetric approach can combine several surface hypotheses and perform a robust voting in order to extract a more reliable surface. On the other hand, a volumetric range image fusion approach limits the size of 3D features found in the final model depending on the voxel size.
Our implementation of the purely software-based (i.e. unaccelerated) approach, which is based on [Curless and Levoy, 1996], uses compressed volumetric representations of the 3D distance fields and can handle high resolution voxel spaces. Merging (averaging) many distance fields induced by the corresponding depth maps is possible, since it is sufficient to traverse the compressed distance fields on a single voxel basis. Nevertheless, our original implementation has substantial space requirements on external memory and consumes significant time to generate the final surface (usually in the order of several minutes). Hence this approach is not suitable for immediate visual feedback to the user. At least for fast and direct inspection of the 3D model, it is reasonable to develop a very efficient volumetric range image integration approach, again accelerated by the computing power of modern graphics hardware. Many steps in the range image integration pipeline are very suitable for processing on graphics hardware, and a significant speedup can be expected.
The overall procedure traverses the voxel space defined by the user slice by slice and generates a section of the final implicit 3D mesh representation in every iteration. Consequently, the memory requirements are very low, but immediate postprocessing (e.g. filtering) of the generated slices is limited. Although the general idea is very close to [Curless and Levoy, 1996], several modifications are required to allow an efficient GPU implementation in the first instance. More importantly, the sensitivity to gross outliers frequently occurring in input depth maps is reduced by a robust voting approach. The details of our implementation are given in the next sections.
8.2 Selecting the Volume of Interest
The first step of the proposed volumetric depth image integration pipeline is the specification of the 3D domain for which the volumetric representation of the final model is built. Generally, it is not possible to determine this volume of interest automatically. In the case of small objects entirely visible in each of the source images, the intersection of the viewing frusta can serve as an indicator for the volume to be reconstructed. Larger objects only partially visible in the source images (e.g. large buildings) require human interaction to select the reconstruction volume. Consequently, there exists a user interface for manual selection of the reconstructed volume. This application displays a set of e.g. 3D feature points generated by the image orientation procedure, or 3D point clouds generated from dense depth maps. The user can select and adjust the 3-dimensional bounding box of the region of interest. Additionally, the user specifies the intended resolution of the voxel space, which is set to 256³ voxels in our experiments.
8.3 Depth Map Conversion
With the knowledge of the volume of interest and its orientation, the voxel space is traversed slice by slice, and the values of the depth images are sampled according to the projective transformation induced by the camera parameters and the position of the slice. Since the sampled depth values denote the perpendicular distance of the surface to the camera plane, the distance of a voxel to the surface can be estimated easily as the difference between the depth value and the distance of the voxel to the image plane (see also Figure 8.1). This difference is an estimated signed distance to the surface; positive values indicate voxels in front of the surface, and negative values correspond to voxels hidden by the surface. Of course, the accuracy of this approximation depends on the angle between the principal direction of the camera and the normal vector of the surface. Nevertheless, this efficiently computed approximation to the true distance transform gives very good results in practice. Additionally, we incorporated the angle between the surface normal and the viewing direction to scale this distance, but this modification had no apparent effect on the resulting models.
The source depth maps contain two additional special values: one value (chosen as -1 in our implementation) indicates absent depth values, which may occur due to a depth postprocessing procedure eliminating unreliable matches from the depth map. Another value (0 in our implementation) corresponds to pixels outside some foreground region of interest, which is based on an optional silhouette mask in our workflow [Sormann et al., 2005].
Consequently, the processed voxels fall into one of the following categories:
1. Voxels that are outside the camera frustum are labeled as culled.
2. Voxels with an estimated distance D to the surface smaller than a user-specified threshold Tsurf are labeled as near-surface voxels (|D| ≤ Tsurf).
3. Voxels with a signed distance greater than this threshold are considered as definitely empty (D > Tsurf).
4. The fourth category includes occluded voxels, which have a negative distance with a magnitude larger than the threshold (D < −Tsurf).
5. If the depth value of the back-projected voxel indicates an absent value, the voxel is labeled as unfilled.
6. Voxels back-projecting to pixels outside the foreground regions are considered as empty.
These categories are illustrated in Figure 8.1. The threshold Tsurf essentially specifies the amount of noise that is expected in the depth images.
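The six categories above amount to a small decision function per voxel. A minimal Python sketch follows; the sentinel values (-1 for absent depth, 0 for outside the silhouette) are taken from the text, while the function name and string labels are purely illustrative.

```python
def classify_voxel(depth, dist, T_surf, inside_frustum=True):
    """Assign one of the six voxel categories listed above.
    depth is the sampled depth-map value (-1 = absent, 0 = outside the
    foreground silhouette); dist is the estimated signed distance D."""
    if not inside_frustum:
        return "culled"            # category 1
    if depth == -1:
        return "unfilled"          # category 5: absent depth value
    if depth == 0:
        return "empty"             # category 6: outside foreground region
    if abs(dist) <= T_surf:
        return "near-surface"      # category 2
    return "empty" if dist > T_surf else "occluded"   # categories 3 and 4
```

On the GPU this branch structure is evaluated per fragment; the sketch only makes the classification logic explicit.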
Figure 8.1: Classification of the voxels according to the depth map and camera parameters. Voxels outside the camera frustum are initially labeled as culled. Voxels close to the surface induced by the depth map are near-surface voxels (on both sides of the surface, indicated by shaded regions). Voxels with a distance larger than a threshold are either empty or occluded, depending on the sign of the distance.
In many reconstruction setups it is possible to classify culled voxels immediately. If the object of interest is visible in all images, culled voxels are outside the region to be reconstructed and can be classified as empty instantly. Declaring culled voxels as unfilled may generate unwanted clutter due to outliers in the depth maps. If the object to be reconstructed is only partially visible in the images, voxels outside the viewing frustum of a particular depth map do not contribute information and are therefore labeled as unfilled. The choice between these two policies for handling culled data is specified by the user. Consequently, the six branches described above map to four accumulated voxel categories.
A fragment program determines the status of voxels and updates an accumulated slice buffer for every given depth image. This buffer consists of four channels in accordance with the categories described above:
1. The first channel accumulates the signed distances, if the voxel is a near-surface voxel.
2. The second channel counts the number of depth images for which the voxel is empty.
3. The third channel tracks the number of depth images for which the voxel is occluded.
4. The fourth channel counts the number of depth images for which the status of the voxel is unfilled.
Thus, a simple but sufficient statistic is accumulated for every voxel, which is the basis for the final isosurface determination. Algorithm 6 outlines the incremental accumulation of the statistic for a voxel, which is executed for every provided depth image. The accumulated statistic for a voxel is a quadruple comprising the components described above. In addition to the user-specified parameter Tsurf, another threshold Tocc can be specified, which determines the border between occluded voxels and, again, unfilled voxels located behind the surface. This threshold is set to 10 · Tsurf in our experiments.
Algorithm 6 Procedure to accumulate the statistic for a voxel
Procedure stat = AccumulateVoxelStatistic
Input: camera image plane imagePlane, near-surface threshold Tsurf, Tocc > Tsurf, #Images
Input: depth image D, projective texture coordinate stq, 3D voxel position pos
Input: voxel statistic stat = (ΣDi, #Empty, #Occluded, #Unfilled) (a quadruple)
  st ← stq.xy / stq.z  {perspective division}
  if st is inside [0, 1] × [0, 1] then
    depth ← tex2D(D, st)  {gather depth from range image}
    if depth > 0 then
      dist ← depth − imagePlane · pos  {calculate signed distance to the surface}
      if dist > Tsurf then
        increment #Empty  {too far in front of the surface}
      else if dist < −Tocc then
        increment #Unfilled  {very far behind the surface}
      else if dist < −Tsurf then
        increment #Occluded  {too far behind the surface}
      else
        ΣDi ← ΣDi + dist  {near-surface voxel}
      end if
    else
      if depth = 0 then
        stat ← (0, #Images + 1, 0, 0)  {declare voxel definitely as empty}
      else
        increment #Unfilled
      end if
    end if
  else
    {execute one of the following lines, depending on the handling of culled voxels:}
    increment #Empty, or  {handle culled voxel as empty}
    increment #Unfilled  {alternatively, handle culled voxel as unfilled}
  end if
  Return stat
This algorithm is very close to the range image integration approach proposed in [Curless and Levoy, 1996]. The main user-given parameter is the threshold Tsurf,
which determines the set of near-surface voxels. This parameter is related to the accuracy of the depth maps and should, in theory, be set to half of the uncertainty interval. Since the uncertainty of depth images generated by dense estimation approaches depends on many parameters like the view geometry, scene content and surface properties, this threshold is determined empirically.
Algorithm 6 differs from the method proposed in [Curless and Levoy, 1996] in the following respects:
• Culled voxels (i.e. outside the viewing frustum) can be immediately carved away, depending on the user-specified policy.
• Voxels very far behind the estimated surface are considered unreliable and are labeled as unfilled instead of being classified as occluded. A user-specified threshold Tocc is introduced to distinguish between occluded (solid) voxels and unfilled ones. The choice of this parameter does not critically affect the obtained model. We use a default value of Tocc = 10 Tsurf in our experiments.
Weighted Accumulation for Near-Surface Voxels
It is possible to compute a weighted average for the near-surface voxels by accumulating weighted distances. If the signed distance of a voxel for depth image i is Di, and the corresponding weight (resp. confidence) is Wi, then the averaged distance value is

    (Σi Wi Di) / (Σi Wi).

Because the weights do not sum to one, a weighted scheme requires tracking the total sum Σi Wi of the weights in addition to the parameters described above. This can be achieved either by writing to a fifth channel, which requires the recent multi-render-target graphics extension, or alternatively by merging two of the other parameters. Depending on the object to be reconstructed, culled voxels can be counted as empty or occluded without decreasing the accuracy of the final model. For free-standing objects like statues it is reasonable to declare culled voxels as empty, since the object of interest is typically visible in all images. In other cases occluded and culled voxels can be treated equivalently.
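The weighted mean above only requires the two running sums Σ WiDi and Σ Wi. A short sketch, assuming the weights are arbitrary positive confidences (in the GPU version the two sums would occupy buffer channels):

```python
def weighted_average(distances, weights):
    """Weighted mean of per-image signed distances: sum(Wi*Di) / sum(Wi).
    Only the two running sums need to be stored during accumulation."""
    num = sum(w * d for d, w in zip(distances, weights))
    den = sum(weights)
    return num / den

# A high-confidence observation at +0.1 outweighs a low-confidence one at -0.1
d = weighted_average([0.1, -0.1], [3.0, 1.0])
```

With all weights equal to one this reduces to the unweighted average of Algorithm 7.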
8.4 Isosurface Determination and Extraction
After all available depth images are processed, the target buffer holds the coarse statistic for all voxels of the current slice. The classification pass to determine the final status of every voxel is essentially a voting procedure. This step assigns to every voxel the depth distance to the final surface, such that the isosurface at level 0 corresponds to the merged 3D model. For efficiency, the voting procedure uses only the statistics acquired for the current voxel and does not inspect neighboring voxels. Algorithm 7 presents the
utilized averaging procedure to assign the signed distance to the final surface. There is one parameter which must be specified by the user: #RequiredDefinite denotes the minimum number of near-surface entries accumulated in the voxel statistic. This means that at least #RequiredDefinite depth maps must agree that the current voxel is close to the estimated surface. The choice of this parameter depends on the redundancy in the images and on the quality of the provided depth maps. A larger choice for #RequiredDefinite reduces the clutter induced by outliers in the input depth maps, but may lead to holes in the final surface if parts of the surface are visible in too few views.
Algorithm 7 Procedure to calculate the final surface distance for a voxel
Procedure result = AverageDistance
Input: user-specified constant #RequiredDefinite
Input: voxel statistic ΣDi, #Empty, #Occluded, #Unfilled
  #Definite ← #Images − #Occluded − #Unfilled
  if #Definite < #RequiredDefinite then
    result ← UnknownLabel (e.g. NaN)
  else
    #NearSurface ← #Images − #Empty − #Unfilled
    if #NearSurface ≥ #Empty then
      result ← ΣDi / #NearSurface
    else
      result ← +∞
    end if
  end if
  Return result
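A direct Python transcription of Algorithm 7 (the function name and list-based statistic are illustrative; the thesis runs this as a fragment program):

```python
import math

def average_distance(stat, n_images, required_definite):
    """Voting step of Algorithm 7. stat = (sum_Di, n_empty, n_occluded,
    n_unfilled). Returns the signed distance to the fused surface, NaN
    for unknown voxels, and +inf for voxels voted empty."""
    sum_d, n_empty, n_occ, n_unf = stat
    n_definite = n_images - n_occ - n_unf
    if n_definite < required_definite:
        return math.nan                 # too few reliable observations
    n_near = n_images - n_empty - n_unf
    if n_near >= n_empty:
        return sum_d / n_near           # near-surface/solid votes win
    return math.inf                     # empty votes win
```

Note that, as in the pseudocode, #NearSurface is computed as #Images − #Empty − #Unfilled and therefore also counts occluded observations toward the solid side of the vote.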
Up to now, the discussed steps in the volumetric range image integration pipeline (depth map conversion and fusion) run entirely on graphics hardware. After the GPU-based computation for one slice of the voxel space is finished, the isovalues of the current slice are transformed into a triangular mesh on the CPU [Lorensen and Cline, 1987] and added to the final surface representation. This mesh can be directly visualized and is ready for additional processing like texture map generation. Instead of generating a surface representation from the individual slices, a 3D texture can alternatively be accumulated, which is suitable for volume rendering techniques. The main portion of this approach is again performed entirely on the GPU and does not involve substantial CPU computations. In contrast to a slice-based incremental isosurface extraction method, this direct approach requires the space for a complete 3D texture in graphics memory. Since modern 3D graphics hardware is equipped with large amounts of video memory, the 16MB required by a 256³ voxel space are affordable. Rendering an isosurface directly from the volumetric data requires the additional calculation of surface normals, which are directly derived from the gradients at every voxel. By using a deferred rendering approach, computation of the gradient can be limited to the actual surface voxels, and the additional memory consumption is minimal.
8.5 Implementation Remarks
Tracking the statistic for each voxel in the current slice requires a four-channel buffer with floating point precision to accumulate the distance values for near-surface voxels. By normalizing the distance of these voxels from [−T, T] to [−1, 1], a half-precision buffer (16-bit floating point format) is usually sufficient. Furthermore, the final voxel values can be transformed to the range [0, 1], and a traditional 8-bit fixed-point buffer offers adequate precision. Using low-precision buffers decreases the volume integration time by about 30%.
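The 8-bit variant amounts to a simple fixed-point quantization of the clamped signed distance. A sketch of one plausible mapping (the exact encoding used in the implementation is not specified in the text, so the rounding details here are assumptions):

```python
def quantize_distance(d, T):
    """Clamp a signed distance to [-T, T] and map it to an 8-bit
    fixed-point code in [0, 255]; the surface (d = 0) maps near 127.5."""
    d = max(-T, min(T, d))
    return round((d + T) / (2 * T) * 255)

def dequantize_distance(code, T):
    """Inverse mapping back to a signed distance in [-T, T]."""
    return code / 255 * 2 * T - T
```

The quantization step is 2T/255, so with T = Tsurf the round-trip error stays well below the expected depth-map noise, which is why 8 bits offer adequate precision here.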
8.6 Results
This section provides visual and timing results for some real datasets. The timings are given for a PC consisting of a Pentium 4 3GHz processor and an NVidia GeForce 6800 graphics card. All source views are resized to 512 × 512 pixels beforehand, and the obtained depth images have the same resolution (unless noted otherwise). Partially available foreground segmentation data is not used in these experiments.
The first dataset, depicted in Figure 8.2(a), shows one source image (out of 47) displaying a small statue. The images are taken in a roughly circular sequence around the statue. The camera is precalibrated, and the relative poses of the images are determined from point correspondences found in adjacent views. From the correspondences and the camera parameters a sparse reconstruction can be triangulated, which is used by a human operator to determine a 3D box enclosing the voxel space of interest. The extension of this box is used to determine the depth range employed in the subsequent plane-sweep step, which took 53s to generate 45 depth images in total (Figure 8.2(b)). In this depth estimation procedure (recall Chapter 4), 200 evenly distributed depth hypotheses are tested using the SAD for a 5 × 5 window. In order to compensate for illumination changes in several view triplets, each source image was normalized by subtracting its local mean image. Black pixels indicate unreliable matches, which are labeled as unfilled before the depth integration procedure. These depth maps are integrated in just over 4 seconds to obtain a 256³ volume dataset as illustrated in Figure 8.2(c). The isosurface displayed in Figure 8.2(d) can be directly extracted using a ray-casting approach on the GPU [Stegmaier et al., 2005]. Almost all of the clutter and artefacts outside the proper statue are eliminated by requiring at least 7 definite values for the statistic of a voxel.
The result for another dataset consisting of 43 images is shown in Figure 8.3(b); one source image is depicted in Figure 8.3(a). The same procedure as for the previous dataset is applied, from which a set of 41 depth images is obtained in the first instance. Plane-sweep depth estimation using the ZNCC correlation with 200 depth hypotheses requires 97.7s in all to generate the depth maps. The subsequent depth image fusion step requires 4s to yield the volumetric data illustrated in Figure 8.3(b).
Note that these timings reflect the creation time for rather high-resolution models. If all resolutions are halved (256 × 256 depth images with 100 depth hypotheses and a 128³ volume resolution),
(a) One source image. (b) One depth image. (c) Direct volume rendering. (d) Shaded isosurface.
Figure 8.2: Visual results for a small statue dataset generated from a sequence of 47 images. The total time to generate the depth maps and the final volumetric representation is less than 1 min. The left image (a) shows one source view, and the corresponding depth map generated by a plane-sweep approach is illustrated in (b). The 3D volume obtained by depth image integration is displayed using direct volume rendering in (c). The outline of the isosurface corresponding to the integrated model is clearly visible. Additionally, the region of near-surface voxels is indicated by the blur next to the surface. The right image shows the isosurface extracted from the volume data using GPU-based raycasting. Both images are generated by the volume raycasting software made available by S. Stegmaier et al. [Stegmaier et al., 2005].
the total depth estimation time is 13s and the volumetric integration time is less than 1s for this dataset. We believe that these timing results allow our method to qualify as an interactive modeling approach.
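The gain from halving all resolutions can be checked with a back-of-the-envelope work estimate, assuming matching cost scales linearly with pixels times depth hypotheses (an assumption, since fixed per-view overheads are ignored):

```python
def plane_sweep_ops(width, height, hypotheses):
    """Per-pixel matching-cost evaluations in one plane-sweep pass,
    assuming cost is proportional to pixels times depth hypotheses."""
    return width * height * hypotheses

full = plane_sweep_ops(512, 512, 200)   # full-resolution configuration
half = plane_sweep_ops(256, 256, 100)   # all resolutions halved
ratio = full / half                     # ideal speedup factor: 8
```

The measured speedups (53s to 13s for depth estimation, just over 4s to under 1s for integration) stay below this ideal factor of 8, which is consistent with fixed per-view and readback overheads.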
The visual result for another dataset consisting of 16 source views is shown in Figure 8.3(c) and (d). Depth estimation for 14 views took 34.2s using a 5 × 5 ZNCC with a best-half-sequence occlusion strategy (200 tentative depth values). Without an implicit occlusion handling approach, parts of the sword are missing. Volumetric integration requires another 1.8s to generate the isosurface shown in Figure 8.3(d).
8.7 Discussion
In this work we demonstrated that generating proper 3D models from a set of depth images can be achieved at interactive rates using the processing power of modern GPUs. The quality of the obtained 3D models depends on the quality of the source depth maps
(a) One source image (of 43). (b) Shaded isosurface (102s). (c) One source image (of 16). (d) Shaded isosurface (36s).
Figure 8.3: Source views and isosurfaces for two real-world datasets.
and on the redundancy within the provided data, but the voting scheme is robust against the outliers usually generated by purely local depth estimation procedures.
Although the proposed method is efficient and often provides 3D geometry suitable for visualization and further processing, the results are inferior in many cases with low
redundancy in the source depth maps. In these settings, the purely local averaging and voting approach to combine the depth maps is not sufficient. Global surface reconstruction methods resulting in smoother and often watertight 3D geometry were recently proposed. Volumetric graph-cut approaches [Vogiatzis et al., 2005, Tran and Davis, 2006, Hornung and Kobbelt, 2006b, Hornung and Kobbelt, 2006a] appear highly successful at creating smooth models, but they are computationally expensive and provide only limited choices for regularization terms. Moreover, graph-cut methods in general do not benefit much from GPU- or SIMD-accelerated implementations.
Consequently, future work will likely focus on variational reconstruction approaches. Since determining the surface of an imaged object from multiple depth maps can be seen as a segmentation problem (separation of empty space and interior volume), variational image segmentation methods (e.g. [Caselles et al., 1997, Westin et al., 2000, Appleton and Talbot, 2006]) could be adapted for multiple-view surface reconstruction tasks. The nature of the underlying implementations enables substantial performance gains by employing graphics processing units for these methods.
Chapter 9
Results
Contents
9.1 Introduction
9.2 Synthetic Sphere Dataset
9.3 Synthetic House Dataset
9.4 Middlebury Multi-View Stereo Temple Dataset
9.5 Statue of Emperor Charles VI
9.6 Bodhisattva Figure
9.1 Introduction
This chapter provides results illustrating the complete GPU-based workflow on several datasets. First, two synthetic datasets are discussed, which allow a comparison of the purely image-based reconstruction with known ground truth. Thereafter, several real-world datasets from various domains and the respective generated 3D models are presented. The focus of the discussion of these datasets lies on the comparison between medium-resolution and high-resolution results. Consequently, the potential gain of more expensive computations at higher resolution is visually illustrated.
The depth maps for the real-world datasets are generated using the plane-sweep (Chapter 4) and scanline optimization approaches (Chapter 7), since these methods are less vulnerable to illumination changes in the images and do not require a suitable initialization, as the iterative methods (Chapters 3 and 6) do.
9.2 Synthetic Sphere Dataset
The first presented dataset is a synthetically rendered perfect sphere with radius 1 (see Figure 9.1). The surface is textured using a procedurally generated stone texture. 36
views at 512 × 512 resolution are created using the Persistence of Vision raytracer (www.povray.org). The cameras are placed at even intervals around the sphere center, looking towards the center.
Figure 9.1: Three source views of the synthetic sphere dataset.
Choosing a sphere as the ground-truth geometry has the advantage that the comparison of the reconstructed model with the ground truth is extremely simple: the offset of an arbitrary 3D point from the sphere surface is just the difference between the sphere radius and the distance of the point to the center. This allows an easy evaluation of the reconstructed meshes, and the regular structure of the target model allows the identification of systematic errors and biases in the reconstruction methods.
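This evaluation can be sketched in a few lines. The sketch below computes the per-vertex offsets (signed as distance-to-center minus radius) plus the two summary statistics reported later in this section, the mean reported radius and the fraction of vertices within 0.5% of the radius; the function names and the toy vertices are illustrative.

```python
import numpy as np

def sphere_offsets(vertices, center, radius):
    """Signed offsets of mesh vertices from a ground-truth sphere,
    computed as distance-to-center minus radius."""
    d = np.linalg.norm(np.asarray(vertices, float) - np.asarray(center, float),
                       axis=1)
    return d - radius

def sphere_statistics(vertices, center, radius, tol=0.005):
    """Mean reported radius and fraction of vertices within tol
    (0.5% by default) of the sphere radius."""
    off = sphere_offsets(vertices, center, radius)
    return radius + off.mean(), float(np.mean(np.abs(off) <= tol * radius))

# Toy example: two vertices near the unit sphere, one gross outlier
r_mean, frac = sphere_statistics([[1, 0, 0], [0, 1.004, 0], [0, 0, 0.9]],
                                 [0, 0, 0], 1.0)
```

Applied to the full reconstructed meshes, these two numbers correspond to the last two columns of Table 9.1.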
We compare three depth estimation methods in this section:
1. a plane-sweep approach using winner-takes-all depth extraction as described in Chapter 4 (denoted by WTA),
2. the GPU-based scanline optimization procedure presented in Chapter 7, and
3. a GPU-accelerated variational approach to depth estimation as described in Chapter 6 (denoted by PDE).
All methods take a triplet of images as input, with the central view designated as the key image. The image dissimilarity function is the SAD aggregated in a 5 × 5 window for the first two methods, and the single-pixel SSD for the variational approach. The plane-sweep and the scanline optimization procedures evaluate 400 potential depth values for every pixel of the key image. Figure 9.2 displays the results of the three depth estimation methods for one particular key view. The discrete set of depth values can be clearly seen in Figure 9.2(a) and (b).
The three obtained sets, each comprising 36 depth maps, are merged into a final 3D model using the procedure described in Chapter 8. We set the main parameters Tsurf and
Figure 9.2: Depth estimation results for a view triplet of the sphere dataset. (a) WTA, (b) scanline optimization, (c) PDE.
#RequiredDefinite to 0.03 and 7, respectively. This step requires about 5.5s to combine the 36 depth maps. The final meshes for the three depth estimation methods are depicted in Figure 9.3. The visual appearance is quite similar; the staircasing artefacts of the WTA and the scanline optimization approach are removed by the depth integration step. The polar regions of the sphere are not visible in the source views; hence those parts are not reconstructed.
Figure 9.3: Fused 3D models for the sphere dataset with respect to the depth estimation method. (a) WTA, (b) scanline optimization, (c) PDE.
In order to provide a quantitative evaluation, the final meshes are compared with the ground-truth sphere. In Table 9.1 the total depth estimation runtime for 36 views is given in the second column. The third column reports the average sphere radius induced by the generated final mesh (with respect to the true sphere center). The final column specifies the percentage of vertices of the final meshes which lie within 0.5% of the sphere radius.
Depth est. method    Total runtime    Reported radius    Points within 0.5%
Winner-takes-all     83s              1.0012             97.7%
Scanline opt.        350s             0.9992             97.4%
PDE                  125s             0.9987             95.5%

Table 9.1: Quantitative evaluation of the reconstructed spheres
Of course, the figures in Table 9.1 indicate the accuracy achievable under the best circumstances.
9.3 Synthetic House Dataset
Another synthetic dataset, depicting a simple textured house model, is illustrated in Figure 9.4. 36 views of the VRML model were generated, and the source images were resized to 512 × 512 pixels. Since the model house is rotated during the virtual capturing process, but the (virtual) lights remain in a constant position, this dataset simulates a turntable sequence with a moving object and fixed light sources. Consequently, purely intensity-based image dissimilarity measures fail in this case. Therefore we excluded the variational approach from the evaluation.
Figure 9.4: Three source views of the synthetic house dataset.
In order to obtain a 3D model, 36 triplets of views were used to create depth images using the plane-sweep approaches with either winner-takes-all or scanline optimization for depth extraction. A 5 × 5 ZNCC image similarity score was employed in the experiments. The purely local approach is further divided into two variants: a plain method taking the depth maps as is (denoted by WTA (1)) and a conservative method marking pixels with a low matching score in the depth map as invalid (WTA (2)). Since the difference between these two variants lies only in a depth map post-processing step, the runtimes are equivalent.
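The distinction between the two variants can be sketched on the CPU as follows; the score-volume layout and the threshold parameter are illustrative assumptions and not the GPU plane-sweep implementation described earlier in the thesis:

```python
import numpy as np

def wta_depth(scores, min_score=None, invalid=-1):
    """Winner-takes-all depth extraction from a similarity volume.

    `scores[d, y, x]` holds the (e.g. ZNCC) matching score of depth
    hypothesis d at pixel (y, x); higher is better.  Without
    `min_score`, the plain WTA (1) map is returned.  With `min_score`
    set, pixels whose best score stays below the threshold are marked
    invalid, i.e. the conservative WTA (2) variant.  Both variants
    share the same sweep, hence their runtimes are identical.
    """
    depth = scores.argmax(axis=0)
    if min_score is not None:
        best = scores.max(axis=0)
        depth = np.where(best >= min_score, depth, invalid)
    return depth
```

Thresholding on the best score is what removes unreliable depth values in textureless regions at the price of completeness, as the house results below illustrate.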
The depth maps were again combined using the volumetric integration approach, which took 5.2 s. The reconstruction volume encloses the house model and its proximity, but does not include the complete ground plane.
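The underlying idea of this volumetric integration, accumulating a weighted average of truncated signed distances per voxel whose zero level set is the fused surface, can be illustrated with a minimal NumPy sketch. For brevity an orthographic camera looking along the z-axis is assumed; the actual implementation uses the full projective mapping and runs on the GPU, so everything below is a simplified illustration:

```python
import numpy as np

def fuse_depth_map(tsdf, weights, depth, z_coords, trunc=3.0):
    """Accumulate one depth map into a truncated signed-distance volume.

    tsdf, weights : (Z, H, W) accumulators, initialised to zero
    depth         : (H, W) depth map (np.nan marks invalid pixels)
    z_coords      : (Z,) z coordinate of every voxel slab

    Each voxel stores a running weighted average of the truncated
    signed distance to the surface observed through its pixel; the
    fused surface is the zero level set of `tsdf`.
    """
    valid = ~np.isnan(depth)
    safe = np.where(valid, depth, 0.0)
    # signed distance of each voxel to the observed surface, truncated
    d = np.clip(z_coords[:, None, None] - safe[None, :, :], -trunc, trunc)
    w = valid[None, :, :].astype(tsdf.dtype)
    tsdf[:] = (tsdf * weights + d * w) / np.maximum(weights + w, 1e-9)
    weights += w
```

Calling the function once per depth map and then extracting the zero crossing (e.g. by marching cubes) yields the fused mesh; invalid pixels simply contribute zero weight.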
Figure 9.5: Fused 3D models for the synthetic house dataset w.r.t. the depth estimation method: (a) WTA (1), (b) WTA (2), (c) scanline optimization.
As expected, the purely local methods encounter problems in homogeneous regions (Figure 9.5(a) and (b)). Surprisingly, employing scanline optimization to fill the depth images in textureless areas does not yield the expected high-quality result. An explanation can be given if the depth maps displayed in Figure 9.6 are examined: the depth maps generated by the local methods contain mismatches and unreliable depth values in textureless regions (Figure 9.6(a) and (b); recall Figure 9.4(c)).
Scanline optimization (Figure 9.6(c)) fills homogeneous regions with reasonable depth values, but because of the linear discontinuity cost model there is an ambiguity in perfectly homogeneous regions: in such cases, the smoothness cost Σ_x |d(x) − d(x + 1)| is minimized over a set of pixels that do not provide discriminative matching costs. The minimum is not unique, and the method may report any of these optima. Our implementation reports piecewise constant depth maps (as illustrated e.g. in the right section of Figure 9.6(c)) instead of the expected piecewise planar ones.
Figure 9.6: Three generated depth maps of the synthetic house dataset: (a) WTA (1), (b) WTA (2), (c) SO. The results of the local approaches show incorrect depth estimations in textureless regions. Scanline optimization with a linear discontinuity cost fills the pixels in the depth image suboptimally due to the ambiguity of the optimal path.
This surprising behavior is caused by the 1-dimensional depth optimization in combination with the linear discontinuity cost model. If a quadratic smoothness cost model is utilized, the minimum is unique even in textureless regions, yielding a planar map. Performing full 2-dimensional depth optimization (e.g. by graph-cut methods) also gives a unique optimum and is not vulnerable to this ambiguity.
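The ambiguity can be checked numerically. Over a textureless run with fixed boundary depths, a piecewise constant profile and the expected planar profile attain the same total linear smoothness cost, while a quadratic cost uniquely favors the planar one. The sketch below is purely illustrative:

```python
def l1_cost(d):
    """Total linear discontinuity cost sum_x |d(x) - d(x+1)|."""
    return sum(abs(a - b) for a, b in zip(d, d[1:]))

def l2_cost(d):
    """Total quadratic smoothness cost sum_x (d(x) - d(x+1))^2."""
    return sum((a - b) ** 2 for a, b in zip(d, d[1:]))

# A textureless run of 5 pixels with boundary depths 0 and 4:
planar = [0, 1, 2, 3, 4]   # the expected piecewise planar solution
step   = [0, 0, 0, 0, 4]   # a piecewise constant path with one jump

# Any monotone path has the same linear cost: the optimum is ambiguous.
assert l1_cost(planar) == l1_cost(step) == 4
# The quadratic cost uniquely prefers the planar profile (4 < 16).
assert l2_cost(planar) < l2_cost(step)
```

This is exactly why the linear model may report any optimum, including the piecewise constant maps observed in Figure 9.6(c), while the quadratic model does not.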
In order to evaluate the obtained final 3D models w.r.t. the ground truth, two measures are employed: the model accuracy specifies the ratio of the model surface which lies within a given distance threshold of the ground-truth model. The model completeness depicts the portion of the ground-truth model which is covered by the reconstructed mesh (i.e. where the reconstructed surface is close to the ground truth w.r.t. a given threshold). For the completeness calculation the wide-stretched ground plane is omitted from the reference model, since it is only reconstructed in the proximity of the house. Measuring the completeness of a model accurately is difficult, since small holes may not have any influence and larger holes shrink depending on the tolerated distance. Consequently, we set the distance threshold for the completeness evaluation in the order of the average inlier distance reported by the accuracy evaluation (about 0.2% of the diameter of the reconstructed box). The obtained values are still only approximately accurate, but they match the visual appearance of the models. For instance, the conservative winner-takes-all approach has the highest accuracy (since only reliable depth values are retained), but the lowest completeness result (unreliable regions remain unfilled).
The surface-to-surface distance computations are approximated by converting the triangular mesh models into point sets by uniformly sampling the meshes and calculating the closest point pairs for these sets. Table 9.2 presents the results of this evaluation. Besides the total runtime, the model accuracy and the completeness are given for two distance thresholds each. These thresholds are indicated as fractions of the diameter of the reconstructed volume.
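The point-sampling evaluation can be sketched as follows. The brute-force nearest-neighbour search and all names are illustrative assumptions; in practice the meshes are uniformly sampled first, and a spatial index would replace the quadratic search:

```python
import numpy as np

def nn_dist(a, b):
    """Distance from every point in `a` to its nearest neighbour in `b`."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

def accuracy_completeness(recon_pts, gt_pts, acc_tol, comp_tol):
    """Approximate mesh evaluation on uniformly sampled point sets.

    Accuracy: fraction of reconstructed points within `acc_tol` of the
    ground truth.  Completeness: fraction of ground-truth points covered
    by the reconstruction within `comp_tol`.  Thresholds are absolute
    distances, e.g. fractions of the reconstruction volume diameter.
    """
    accuracy = float(np.mean(nn_dist(recon_pts, gt_pts) <= acc_tol))
    completeness = float(np.mean(nn_dist(gt_pts, recon_pts) <= comp_tol))
    return accuracy, completeness
```

Note the asymmetry: a very conservative reconstruction scores high on accuracy but low on completeness, matching the WTA (2) row in Table 9.2.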
Depth est. method      Runtime   Accuracy (1% / 0.5%)   Completeness (0.4% / 0.2%)
Winner-takes-all (1)   120 s     92.54% / 83.7%         95.65% / 75.99%
Winner-takes-all (2)   120 s     99.07% / 93.47%        90.78% / 63.51%
Scanline opt.          170 s     96.27% / 90.51%        95.91% / 82.30%

Table 9.2: Quantitative evaluation of the reconstructed synthetic house.
9.4 Middlebury Multi-View Stereo Temple Dataset<br />
This dataset is one of the currently two proposed datasets with known ground-truth geometry [Seitz et al., 2006]†. The images show a replication of an ancient temple (see Figure 9.7). The ground-truth geometry was obtained by laser-scanning the miniature model. There are three variants of the dataset: first, a large set of images is provided, which contains more than 300 source views acquired using a spherical gantry and a moving camera. Additionally, two smaller subsets are supplied: a dense ring set of images consisting of 47 views, and a sparse ring with 16 images. All images have 640 × 480 pixels resolution. We used the medium-sized dense ring dataset to generate the results presented below.
We provide two final results for this dataset: the first mesh, displayed in Figure 9.8(a) and (b), is created using the camera matrices and orientations supplied by the originators. Since the authors of this dataset do not claim high accuracy for their camera parameters, we additionally calculated the relative poses between the views from scratch using our multi-view reconstruction pipeline. Two views of the resulting mesh are shown in Figure 9.9(a) and (b). In both cases the same parameters for depth estimation and volumetric integration are used. The initial depth maps are computed employing a 3 × 3 SAD matching score and scanline optimization for depth extraction. 255 potential depth values are evaluated for every pixel. This procedure takes 3m7s to finish. Subsequent fusion of all depth maps into a volumetric model with 288³ voxels resolution requires another 12 s to complete.
The surface mesh created with our own calculated camera matrices appears smoother and less noisy than the one based on the supplied camera poses. The drawback of camera poses computed from scratch is that the obtained 3D model is expressed with respect to a local camera coordinate system and cannot be compared with the laser-scanned model directly.
† http://vision.middlebury.edu/mview/
Figure 9.7: Three (out of 47) source images of the temple model dataset. The images are taken approximately evenly spaced on a circular sequence around the model.
9.5 Statue of Emperor Charles VI<br />
Figure 9.10 displays two source views (out of 42) showing a statue of the Austrian Emperor Charles VI inside the state hall of the Austrian National Library. The source images exhibit significant variation in brightness due to the backlight induced by the large windows of the hall.
A set of 40 depth maps is generated, one for every triplet of source images, which are subsequently fused using our volumetric depth image integration approach. We calculated the final model at two different resolutions: first, a medium resolution model is generated from depth images with 336 × 512 pixels, with 256 × 256 × 384 voxels used for volumetric integration. Further, a high resolution result at 676 × 1016 pixels and 384 × 384 × 512 voxels is created to evaluate the benefit of increased resolution. Table 9.3 depicts the required runtimes to generate the 40 depth maps using 250 depth hypotheses at the specified image resolution. Volumetric fusion takes 8.5 s at medium resolution and 27 s at high resolution, respectively.
Resolution    Depth est. method   Runtime
336 × 512     Winner-takes-all    1m40s
              Scanline opt.       2m10s
676 × 1016    Winner-takes-all    5m30s
              Scanline opt.       7m40s

Table 9.3: Timing results for the Emperor Charles dataset. These figures represent the time needed to generate 40 depth maps at the specified resolution. 250 depth hypotheses are evaluated for every pixel.
Figure 9.8: Front (a) and back (b) view of the fused 3D model of the temple dataset based on the original camera matrices (1,095,000 triangles).
The meshes obtained at medium resolution using the winner-takes-all and the scanline optimization depth extraction methods are illustrated in Figure 9.11(a)–(d). The surface mesh generated using the simple winner-takes-all approach is essentially as good as the scanline optimization based result.
Figures 9.12(a)–(f) depict the meshes obtained at the higher resolution. Again, a winner-takes-all and a scanline optimization approach are used for depth extraction. At this resolution the WTA result shows more noise, as illustrated in the close-up views of the cloak in Figure 9.12(c) and (f). The corresponding depth maps generated by the WTA and SO approaches can be seen in Figure 9.13. Volumetric fusion evidently removes the mismatches occurring in the WTA-based depth image only partially, which leads to holes in the final mesh.
If one compares the outcomes at the two resolutions directly, e.g. Figure 9.11(c) and Figure 9.12(d), the increased geometric detail of the high resolution result is clearly visible. Nevertheless, the high resolution mesh, containing approximately 1 000 000 triangles, is too complex for real-time display and requires geometric simplification and other enhancements to be suitable for further visualization.
Figure 9.9: Front (a) and back (b) view of the fused 3D model of the temple dataset based on newly calculated camera matrices (857,000 triangles).
9.6 Bodhisattva Figure<br />
The final dataset is a set of images displaying a wooden Bodhisattva statue inside a Buddhist stupa building (Figure 9.14). These images were taken with a digital single-lens reflex camera under difficult lighting conditions. Additionally, some of the views are widely separated due to the narrow interior of the stupa. This dataset focuses directly on the digital preservation of cultural heritage, since the wooden statue weathers slowly due to atmospheric conditions. Furthermore, this and similar religious artefacts are highly in demand by collectors and consequently susceptible to theft.
The complete set of images contains 13 views of the statue. Two sequences of depth images (using scanline optimization) are generated: a medium resolution set at 512 × 768 pixels and a high resolution one at 1000 × 1504 pixels, for which a few depth maps are depicted in Figure 9.15. In both cases the number of depth hypotheses is set to 250. The medium resolution result utilized a ZNCC correlation using a 5 × 5 support window. The generation of 11 depth images using triplets of source views needed 1m12s. Volumetric fusion was applied in a 256 × 512 × 512 voxel space, yielding the mesh displayed in Figure 9.16(a). In the high resolution case a 7 × 7 aggregation window was applied for the matching cost computation, and the volumetric fusion is based on a 384 × 768 × 768 voxel space. Depth map generation took 5m to complete. The finally extracted mesh is illustrated in Figure 9.16(b).
Figure 9.10: Front (a) and back (b) view of the statue showing Emperor Charles VI inside the state hall of the Austrian National Library.
For this dataset the lower resolution mesh appears smoother and less noisy in comparison with the high resolution outcome. There are two reasons for this behavior: first, several depth maps contain a substantial amount of noise and mismatches due to the widely separated views of some triplets (e.g. Figure 9.15(d)). During volumetric fusion this noise is largely suppressed at the medium resolution. Additionally, the lack of a global smoothing term in the “greedy” depth map fusion procedure does not inhibit high variations (i.e. local noise) in the extracted surface mesh. Future work needs to address an efficient depth map integration approach which incorporates some discontinuity cost to prevent unnecessary noise in the final outcome. In any case, a feature-preserving mesh simplification procedure is required to enable further processing and visualization.
Figure 9.11: Medium resolution mesh for the Charles VI dataset: (a) WTA front view, (b) WTA back view, (c) SO front view, (d) SO back view. Figures (a) and (b) show the surface mesh obtained from a winner-takes-all plane-sweep approach to depth map generation. Figures (c) and (d) illustrate the results using scanline optimization.
Figure 9.12: High resolution mesh for the Charles VI dataset. Figures (a) and (b) show the surface mesh obtained from a winner-takes-all plane-sweep approach to depth map generation. (c) displays a close-up view of the cloak revealing substantial noise in the mesh. Figures (d)–(f) illustrate the results using scanline optimization. The cloak in Figure (f) is much smoother in this setting.
Figure 9.13: Two depth maps for the same reference view of the Charles dataset, generated by the winner-takes-all (a) and the scanline optimization (b) approach, respectively.
Figure 9.14: Every second of the 13 source images of the Bodhisattva statue dataset.
Figure 9.15: Several depth images for the Bodhisattva statue.
Figure 9.16: Medium (a: 512 × 768 pixels, ≈ 1 million triangles) and high resolution (b: 1000 × 1504 pixels, ≈ 2.7 million triangles) results for the Bodhisattva statue images. The depth images for the left model are computed at 512 × 768 pixels resolution, and the subsequent volumetric depth map integration is performed at 256 × 512 × 512 voxels. The depth map and voxel resolutions for the right model are 1000 × 1504 and 384 × 768 × 768, respectively. For this dataset the inherent smoothing induced by the lower resolution yields slightly more appealing results.
Chapter 10<br />
Concluding Remarks<br />
This thesis outlines high-performance approaches to several stages of the reconstruction pipeline concerning dense depth and mesh generation using modern GPUs. Several approaches to multi-view reconstruction benefit substantially from the data-parallel computing model and the processing power of modern GPUs. The accuracy of arithmetic operations provided by the GPU is sufficient for most image processing and computer vision methods not relying on high-precision computations.
The range of described methods starts with GPU-based correlation calculation followed by a simple winner-takes-all depth extraction procedure, and extends to semi-global methods using dynamic programming and to volumetric methods that merge a set of depth images into a final 3D model. So far, several important global methods for depth estimation can only partially benefit from GPUs: graph cut approaches are currently too sophisticated for substantial GPU acceleration, and loopy belief propagation methods have memory requirements too high to be useful for high-resolution reconstructions. Hence, we believe that the methods proposed in this thesis are good candidates for GPU utilization to generate high-resolution models from multiple views.
It is natural to ask whether other steps in the pipeline can be accelerated by graphics hardware as well. Several processing steps early in the pipeline, like distortion correction, basic corner extraction and similar low-level image processing tasks, can easily exploit the processing power of modern GPUs (e.g. [Sugita et al., 2003] and [Colantoni et al., 2003]). Other important procedures, mostly related to pose estimation, like tracking and matching of correspondences and RANSAC-based relative pose estimation, require too sophisticated control flow mechanisms to be rewarding targets for the SIMD processing model offered by current GPUs. There might be the possibility of hybrid approaches for these tasks, incorporating CPU and GPU processing power in equal parts. In particular, the estimation of sparse correspondences is still a relatively slow procedure within our current reconstruction pipeline. Accelerating this stage of the pipeline seems to be the most worthwhile goal for the near future. Sinha et al. [Sinha et al., 2006] recently addressed KLT tracking for video streams and SIFT key extraction using the GPU and reported substantial performance gains. Incorporating and extending these techniques is part of future investigations.
147
148 Chapter 10. Concluding Remarks<br />
With the emergence of more general programming models for graphics hardware, more sophisticated depth estimation and other computer vision methods may become relevant targets for a GPU-based implementation. According to current technical proposals, next-generation graphics hardware will provide a more flexible and dynamic programming approach, which potentially allows more control flow and more dynamic behavior to be assigned to the GPU. Additionally, the strict locality in our algorithms induced by the current GPU programming model might be softened, and more global knowledge of the views and the depth hypotheses could be incorporated into future procedures. In particular, the introduction of geometry shaders as an additional step in the rendering pipeline [Blythe, 2006] adds extended dynamic behavior by allowing vertices to be created and removed by shader programs executed on the GPU. Sophisticated use of this and other currently emerging features may yield interesting and efficient approaches to computer vision problems.
Every long-term prognosis about future graphics hardware and its non-graphical applications is highly speculative. Similar objections apply to the future of CPUs. Nevertheless, we outline two recent developments which may provide some insight into future graphics and parallel processing technology in general. First, we mention the highly innovative (and unconventional) design of the Cell microprocessor [Kahle et al., 2005], which essentially consists of a traditional CPU core tightly coupled with eight SIMD co-processors providing the computing power e.g. for multimedia tasks. The most prominent use of the Cell architecture will be a video gaming console still equipped with a dedicated graphics processing unit, but the main goal of substantially enhancing the SIMD capabilities of general-purpose processors is obvious. One important application of this design is physically correct simulation of objects in computer games. Another forthcoming development in SIMD processing hardware is the unification of the previously distinct vertex and fragment shaders on GPUs. This means that the shader pipelines on the GPU can execute either vertex programs or fragment programs as requested by the application or the graphics driver software. Consequently, the shader pipelines closely resemble the SIMD co-processors of the Cell architecture. This evolution of CPUs and GPUs is partially driven by the need for efficient physics simulation engines used in modern computer games. Hence, one can expect arrays of versatile SIMD co-processors in future computer hardware, located either close to the CPU (as in the Cell model) or close to the GPU (in the unified shader case).
These developments will substantially change the programming model used to implement multimedia tasks and related high-performance applications. Current technological trends indicate that CPUs augmented with data-parallel co-processors will be the dominant future computing device. Several techniques developed to utilize the GPU for computer vision tasks can be transferred to this new architecture, whereas other performance optimizations specifically targeted at GPUs (e.g. using the z-buffer for conditional evaluation) have no general SIMD counterpart. Since every new generation of computer hardware, and graphics hardware in particular, provides a set of new features, the required frequent adaptation of GPU-based implementations will likely enable a smooth transition to future computer architectures.
Currently, the programming interface for GPU applications is a graphics library (mainly OpenGL and Direct3D). It is at the least counter-intuitive and error-prone to use graphics commands to implement non-graphical methods and computations. Consequently, there are forthcoming proposals to interact with the GPU as a non-graphical device: Accelerator [Tarditi et al., 2005] provides a high-level SIMD programming model and translates the library calls into suitable fragment shaders and graphical commands of the underlying graphics library. Peercy et al. [Peercy et al., 2006] present a library which exposes the data-parallel capabilities of the GPU directly, without invocation of the system's graphics library. These trends illustrate the transition of hardware and software vendors from handling the GPU exclusively as a graphics device to treating it as a more general parallel computing device.
Nevertheless, the main focus of future work is not the mere acceleration of computer vision methods using off-the-shelf parallel computing devices (most notably the GPU), but the enhancement of the underlying computer vision algorithms. As an example, semantic segmentation of the input images into relevant regions (facades, static objects) and irrelevant ones (sky, vegetation, moving objects) allows the exclusion of undesirable values in the depth map. Consequently, the fusion of the depth images is more robust, and the final model omits unnecessary clutter induced by negligible objects.
The presented volumetric approach to 3D model generation from several depth maps is very efficient, but yields water-tight models only in ideal cases. Additionally, the extracted meshes have poor overall smoothness due to the lack of appropriate neighborhood handling. Recently, volumetric mesh extraction approaches based on graph cuts incorporating global smoothness were proposed (e.g. [Vogiatzis et al., 2005, Hornung and Kobbelt, 2006c]), but these methods have their own difficulties besides the increased computational complexity. For instance, some volumetric graph-cut procedures work best only if a suitable visual hull is available. Furthermore, graph cut solutions prefer minimal surfaces, hence an ad-hoc ballooning term needs to be added to the cost functional. The limitations of current methods imply that there is still room for further research in range image integration.
Finally, there is often a requirement for human interaction in the reconstruction pipeline. In particular, post-processing steps like model trimming and the integration of independently reconstructed objects into one common model commonly depend on a human operator. The topic of providing user interfaces for the efficient execution of such tasks is not directly a subject of our future research. More promising is the integration of efficient model computation methods with manual interaction schemes in order to intervene in the depth map or 3D model generation procedure: for instance, manual labeling of unmodeled surface properties like specular highlights, combined with a real-time update of the final 3D model, may yield highly effective modeling applications.
149
Appendix A<br />
Selected Publications<br />
A.1 Publications Related to this Thesis<br />
The original approach to mesh-based stereo reconstruction on the GPU as described in Chapter 3 can be found in [Zach et al., 2003a]. The performance of the proposed method was substantially increased using the techniques presented in [Zach et al., 2003b].
Material from Chapter 4 (plane-sweep depth estimation on the GPU) and Chapter 8 (fast volumetric integration of depth maps) appeared in [Zach et al., 2006a].
The scanline optimization implementation on the GPU (Chapter 7) is published as [Zach et al., 2006b].
A.2 Other Selected Scientific Contributions<br />
Most work in the first half of my time as a PhD student addressed rendering of large 3D environments, which were typically generated by remote sensing methods (e.g. satellite laser scans) and photogrammetric methods. Hence, early papers covered the task of interactive visualization of such datasets using view-dependent multi-resolution geometry.
In [Zach and Karner, 2003a] an efficient algorithm for selective refinement of view-dependent meshes is presented. View-dependent refinement of meshes typically requires a top-down traversal of a tree-like structure, which affects the obtained frame rate significantly. The proposed method is an event-driven approach to the dynamic mesh refinement procedure, which exploits temporal coherence explicitly and achieves significantly reduced refinement times.
Mapping textures onto multiresolution meshes is straightforward if texture coordinates can be interpolated across all levels of detail (e.g. when only one texture is applied to the geometry). If the geometry is texture-mapped with several images, the displayed level of detail is constrained, or artifacts occur if no additional processing is performed. [Zach and Bauer, 2002] and [Sormann et al., 2003] generalize clipmap-like approaches for texturing multiresolution heightfields to more general 3D models by generating a texture hierarchy in correspondence with the vertex hierarchy used for view-dependent rendering of multiresolution meshes.
Efficient external encoding of multiresolution meshes suitable for view-dependent access to relevant fractions of the complete 3D model was mainly addressed by M. Grabner [Grabner, 2003]. In [Zach et al., 2004a] we replace the originally proposed topology encoding method for multiresolution meshes with a different encoding scheme. Our new encoding method is superior in worst-case examples and on real-world data sets. We prove that two vertices of a triangle can be encoded with 1 bit on average, whereas the third vertex requires O(log n) bits in the worst case.
[Zach and Karner, 2003b] again addresses the compression of model data for efficient transmission over a network. This time, the compressed encoding of precomputed visibility information for walk-through applications is described. It is assumed that the user navigates in an urban scenario with the virtual camera fixed at a predefined eye height. For every node in the view-dependent mesh hierarchy a conservative estimate of visibility is precomputed using software provided by P. Wonka and M. Wimmer [Wonka et al., 2000]. The result of this calculation is a set of visible nodes for each cell of the navigable space. These data essentially comprise a large binary matrix, which is appropriately encoded for use in remote visualization applications.
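To illustrate the idea, the per-cell visibility sets can be viewed as rows of a binary matrix and compressed with a simple run-length scheme, since neighboring nodes tend to share visibility. The sketch below is a minimal, hypothetical Python illustration of that flavor of encoding, not the actual scheme of [Zach and Karner, 2003b]:

```python
# Hypothetical sketch: per-cell visibility as a bit row, run-length encoded.

def pack_visibility(visible_nodes, num_nodes):
    """Turn a set of visible node indices into a 0/1 bit row."""
    return [1 if i in visible_nodes else 0 for i in range(num_nodes)]

def rle_encode(bits):
    """Run-length encode a bit row as (bit, run_length) pairs.

    Visibility rows are highly coherent, so long runs of equal bits
    compress well."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([b, 1])       # start a new run
    return [(b, n) for b, n in runs]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a bit row."""
    return [b for b, n in runs for _ in range(n)]

# One cell of the navigable space: nodes 2, 3, 4 and 9 are visible.
row = pack_visibility({2, 3, 4, 9}, 12)
assert rle_decode(rle_encode(row)) == row   # lossless round trip
```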
Rendering a large view-dependent multiresolution model in combination with many view-independent multiresolution objects was addressed in [Zach et al., 2002]. In particular, the real-time rendering of a large digital elevation model augmented with a huge number of trees is discussed. In order to achieve real-time performance, a new level-of-detail selection procedure is proposed, which is fast enough to assign suitable resolutions to more than one million objects. The digital elevation model is represented as a coarse view-dependent hierarchical level of detail, and the tree models are rendered using point-based graphics primitives. An extended version of this paper was recently published [Zach et al., 2004b].
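To convey the flavor of such a selection procedure, the sketch below assigns a discrete level of detail to each object from its viewing distance in a single linear pass over all objects. It is a hypothetical Python illustration; the function names and the distance-doubling rule are assumptions for the example, not the actual procedure of [Zach et al., 2002]:

```python
# Hypothetical sketch: O(n) level-of-detail assignment for many objects.
import math

def select_lods(positions, eye, base_dist=1.0, num_lods=4):
    """Map each object position to a LOD index.

    0 is the finest level, num_lods - 1 the coarsest. Doubling the
    viewing distance coarsens an object by one level, mimicking a
    constant screen-space error threshold."""
    lods = []
    for pos in positions:
        d = math.dist(pos, eye)
        if d <= base_dist:
            lod = 0
        else:
            lod = int(math.log2(d / base_dist)) + 1
        lods.append(min(lod, num_lods - 1))   # clamp to coarsest level
    return lods

# Three objects at increasing distance from the viewpoint.
lods = select_lods([(0, 0, 1), (0, 0, 10), (0, 0, 100)], (0, 0, 0))
```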
Bibliography
[Akbarzadeh et al., 2006] Akbarzadeh, A., Frahm, J.-M., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Merrell, P., Phelps, M., Sinha, S., Talton, B., Wang, L., Yang, Q.-X., Stewénius, H., Yang, R., Welch, G., Towles, H., Nistér, D., and Pollefeys, M. (2006). Towards urban 3D reconstruction from video. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Appleton and Talbot, 2006] Appleton, B. and Talbot, H. (2006). Globally minimal surfaces by continuous maximal flows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):106–118.

[Baker and Binford, 1981] Baker, H. H. and Binford, T. (1981). Depth from edge and intensity based stereo. In Proc. 7th Intl. Joint Conf. Artificial Intelligence, pages 631–636.

[Birchfield and Tomasi, 1998] Birchfield, S. and Tomasi, C. (1998). A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–406.

[Blythe, 2006] Blythe, D. (2006). The Direct3D 10 system. In Proceedings of SIGGRAPH 2006, pages 724–734.

[Bolz et al., 2003] Bolz, J., Farmer, I., Grinspun, E., and Schröder, P. (2003). Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In Proceedings of SIGGRAPH 2003, pages 917–924.

[Bornik et al., 2001] Bornik, A., Karner, K., Bauer, J., Leberl, F., and Mayer, H. (2001). High-quality texture reconstruction from multiple views. Journal of Visualization and Computer Animation, 12(5):263–276.

[Boykov et al., 2001] Boykov, Y., Veksler, O., and Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(11):1222–1239.

[Brown et al., 2003] Brown, M. Z., Burschka, D., and Hager, G. D. (2003). Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):993–1008.
[Brox et al., 2004] Brox, T., Bruhn, A., Papenberg, N., and Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision (ECCV), pages 25–36.

[Brunton and Shu, 2006] Brunton, A. and Shu, C. (2006). Belief propagation for panorama generation. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Buck et al., 2004] Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. (2004). Brook for GPUs: Stream computing on graphics hardware. In Proceedings of SIGGRAPH 2004, pages 777–786.

[Canny, 1986] Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698.

[Caselles et al., 1997] Caselles, V., Kimmel, R., and Sapiro, G. (1997). Geodesic active contours. Int. Journal of Computer Vision, 22(1):61–79.

[Chan and Vese, 2002] Chan, T. F. and Vese, L. A. (2002). A multiphase level set framework for image segmentation using the Mumford and Shah model. Int. Journal of Computer Vision, 50(3):271–293.

[Chefd'Hotel et al., 2001] Chefd'Hotel, C., Hermosillo, G., and Faugeras, O. (2001). A variational approach to multi-modal image matching. In IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 21–28.

[Colantoni et al., 2003] Colantoni, P., Boukala, N., and Rugna, J. D. (2003). Fast and accurate color image processing using 3D graphics cards. In Proc. of Vision, Modeling and Visualization 2002.

[Cornelis and Van Gool, 2005] Cornelis, N. and Van Gool, L. (2005). Real-time connectivity constrained depth map computation using programmable graphics hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1099–1104.

[Criminisi et al., 2005] Criminisi, A., Shotton, J., Blake, A., Rother, C., and Torr, P. (2005). Efficient dense-stereo with occlusions and new view synthesis by four-state DP for gaze correction. Technical report, Microsoft Research Cambridge.

[Crow, 1984] Crow, F. C. (1984). Summed-area tables for texture mapping. In Proceedings of SIGGRAPH 84, pages 207–212.

[Culbertson et al., 1999] Culbertson, W. B., Malzbender, T., and Slabaugh, G. (1999). Generalized voxel coloring. In Proc. ICCV Workshop on Vision Algorithms: Theory and Practice, pages 100–115.
[Curless and Levoy, 1996] Curless, B. and Levoy, M. (1996). A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH '96, pages 303–312.

[Dally et al., 2003] Dally, W. J., Hanrahan, P., Erez, M., Knight, T. J., Labonté, F., Ahn, J.-H., Jayasena, N., Kapasi, U. J., Das, A., Gummaraju, J., and Buck, I. (2003). Merrimac: Supercomputing with streams. In Proceedings of SC2003.

[Darabiha et al., 2003] Darabiha, A., Rose, J., and MacLean, W. J. (2003). Video-rate stereo depth measurement on programmable hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 203–210.

[Davis et al., 2002] Davis, J., Marschner, S., Garr, M., and Levoy, M. (2002). Filling holes in complex surfaces using volumetric diffusion. In First International Symposium on 3D Data Processing, Visualization, and Transmission.

[Devernay and Faugeras, 2001] Devernay, F. and Faugeras, O. (2001). Straight lines have to be straight. Machine Vision and Applications, 13(1):14–24.

[Dixit et al., 2005] Dixit, N., Keriven, R., and Paragios, N. (2005). GPU-cuts and adaptive object extraction. Technical Report 05-07, CERTIS.

[Dominé et al., 2002] Dominé, S., Rege, A., and Cebenoyan, C. (2002). Real-time hatching. Game Developers Conference.

[Dubois and Rodrigue, 1977] Dubois, P. and Rodrigue, G. H. (1977). An analysis of the recursive doubling algorithm. High Speed Computer and Algorithm Organization, pages 299–307.

[Eisert et al., 1999] Eisert, P., Steinbach, E., and Girod, B. (1999). Multi-hypothesis, volumetric reconstruction of 3-D objects from multiple calibrated camera views. In Proc. of International Conference on Acoustics, Speech and Signal Processing, pages 3509–3512.

[Engel and Ertl, 2002] Engel, K. and Ertl, T. (2002). Interactive high-quality volume rendering with flexible consumer graphics hardware. In STAR – State of the Art Report, Eurographics '02.

[Engel et al., 2001] Engel, K., Kraus, M., and Ertl, T. (2001). High-quality pre-integrated volume rendering using hardware-accelerated pixel shading. In Eurographics/SIGGRAPH Workshop on Graphics Hardware '01, pages 9–16.

[Faugeras et al., 1996] Faugeras, O., Hotz, B., Mathieu, H., Viéville, T., Zhang, Z., Fua, P., Théron, E., Moll, L., Berry, G., Vuillemin, J., Bertin, P., and Proy, C. (1996). Real time correlation based stereo: algorithm implementations and applications. The International Journal of Computer Vision.
[Faugeras and Keriven, 1998] Faugeras, O. and Keriven, R. (1998). Variational principles, surface evolution, PDEs, level set methods, and the stereo problem. IEEE Transactions on Image Processing, 7(3):336–344.

[Faugeras et al., 2002] Faugeras, O., Malik, J., and Ikeuchi, K., editors (2002). Special Issue on Stereo and Multi-Baseline Vision. International Journal of Computer Vision.

[Felzenszwalb and Huttenlocher, 2004] Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient belief propagation for early vision. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 261–268.

[Forstmann et al., 2004] Forstmann, S., Ohya, J., Kanou, Y., Schmitt, A., and Thuering, S. (2004). Real-time stereo by using dynamic programming. In CVPR 2004 Workshop on Real-time 3D Sensors and Their Use.

[Förstner and Gülch, 1987] Förstner, W. and Gülch, E. (1987). A fast operator for detection and precise location of distinct points, corners and centres of circular features. In Proc. of the ISPRS Intercommission Workshop on Fast Processing of Photogrammetric Data, Interlaken, pages 285–301.

[Fua, 1993] Fua, P. (1993). A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications, 6:35–49.

[Garland and Heckbert, 1997] Garland, M. and Heckbert, P. S. (1997). Surface simplification using quadric error metrics. In Proceedings of SIGGRAPH '97, pages 209–216.

[Geiger et al., 1995] Geiger, D., Ladendorf, B., and Yuille, A. (1995). Occlusions and binocular stereo. International Journal of Computer Vision, 14:211–226.

[Goesele et al., 2006] Goesele, M., Curless, B., and Seitz, S. (2006). Multi-view stereo revisited. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2402–2409.

[Gong and Yang, 2005a] Gong, M. and Yang, R. (2005a). Image-gradient-guided real-time stereo on graphics hardware. In Fifth International Conference on 3-D Digital Imaging and Modeling, pages 548–555.

[Gong and Yang, 2005b] Gong, M. and Yang, Y.-H. (2005b). Near real-time reliable stereo matching using programmable graphics hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 924–931.

[Goodnight et al., 2003] Goodnight, N., Woolley, C., Lewin, G., Luebke, D., and Humphreys, G. (2003). A multigrid solver for boundary value problems using programmable graphics hardware. In Eurographics/SIGGRAPH Workshop on Graphics Hardware 2003.
[Grabner, 2003] Grabner, M. (2003). Compressed Adaptive Multiresolution Encoding. PhD thesis, Graz University of Technology.

[Hadwiger et al., 2001] Hadwiger, M., Theußl, T., Hauser, H., and Gröller, M. E. (2001). Hardware-accelerated high-quality filtering on PC hardware. In Proc. of Vision, Modeling and Visualization 2001, pages 105–112.

[Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, pages 189–192.

[Harris and Luebke, 2005] Harris, M. and Luebke, D. (2005). SIGGRAPH 2005 GPGPU course notes.

[Harris et al., 2002] Harris, M. J., Coombe, G., Scheuermann, T., and Lastra, A. (2002). Physically-based visual simulation on graphics hardware. In Eurographics/SIGGRAPH Workshop on Graphics Hardware, pages 109–118.

[Hart and Mitchell, 2002] Hart, E. and Mitchell, J. L. (2002). Hardware shading with EXT_vertex_shader and ATI_fragment_shader. ATI Technologies.

[Heckbert, 1986] Heckbert, P. S. (1986). Filtering by repeated integration. In Proceedings of SIGGRAPH 86, pages 315–321.

[Heikkilä, 2000] Heikkilä, J. (2000). Geometric camera calibration using circular control points. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(10):1066–1077.

[Hensley et al., 2005] Hensley, J., Scheuermann, T., Coombe, G., Singh, M., and Lastra, A. (2005). Fast summed-area table generation and its applications. In Proceedings of Eurographics 2005, pages 547–555.

[Hermosillo et al., 2001] Hermosillo, G., Chefd'Hotel, C., and Faugeras, O. (2001). A variational approach to multi-modal image matching. Technical Report RR 4117, INRIA.

[Hilton et al., 1996] Hilton, A., Stoddart, A. J., Illingworth, J., and Windeatt, T. (1996). Reliable surface reconstruction from multiple range images. In European Conference on Computer Vision (ECCV), pages 117–126.

[Hirschmüller, 2005] Hirschmüller, H. (2005). Accurate and efficient stereo processing by semi-global matching and mutual information. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 807–814.

[Hirschmüller, 2006] Hirschmüller, H. (2006). Stereo vision in structured environments by consistent semi-global matching. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2386–2393.
[Hoff III et al., 1999] Hoff III, K. E., Keyser, J., Lin, M., Manocha, D., and Culver, T. (1999). Fast computation of generalized Voronoi diagrams using graphics hardware. In Proceedings of SIGGRAPH '99, pages 277–286.

[Hopf and Ertl, 1999a] Hopf, M. and Ertl, T. (1999a). Accelerating 3D convolution using graphics hardware. In Visualization 1999, pages 471–474.

[Hopf and Ertl, 1999b] Hopf, M. and Ertl, T. (1999b). Hardware-based wavelet transformations. In Workshop of Vision, Modelling, and Visualization (VMV '99), pages 317–328.

[Hornung and Kobbelt, 2006a] Hornung, A. and Kobbelt, L. (2006a). Hierarchical volumetric multi-view stereo reconstruction of manifold surfaces based on dual graph embedding. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 503–510.

[Hornung and Kobbelt, 2006b] Hornung, A. and Kobbelt, L. (2006b). Robust and efficient photo-consistency estimation for volumetric 3D reconstruction. In European Conference on Computer Vision (ECCV), pages 179–190.

[Hornung and Kobbelt, 2006c] Hornung, A. and Kobbelt, L. (2006c). Robust reconstruction of watertight 3D models from non-uniformly sampled point clouds without normal information. In Eurographics Symposium on Geometry Processing, pages 41–50.

[Jia et al., 2003] Jia, Y., Xu, Y., Liu, W., Yang, C., Zhu, Y., Zhang, X., and An, L. (2003). A miniature stereo vision machine for real-time dense depth mapping. In Conference on Computer Vision Systems (ICVS 2003), pages 268–277.

[Jung et al., 2006] Jung, Y. M., Kang, S. H., and Shen, J. (2006). Multiphase image segmentation via Modica-Mortola phase transition. Technical report, Department of Mathematics, University of Kentucky.

[Kahle et al., 2005] Kahle, J. A., Day, M. N., Hofstee, H. P., Johns, C. R., Maeurer, T. R., and Shippy, D. (2005). Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589–604.

[Kanade et al., 1996] Kanade, T., Yoshida, A., Oda, K., Kano, H., and Tanaka, M. (1996). A stereo engine for video-rate dense depth mapping and its new applications. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 196–202.

[Kautz and Seidel, 2001] Kautz, J. and Seidel, H.-P. (2001). Hardware accelerated displacement mapping for image based rendering. In Graphics Interface 2001, pages 61–70.

[Kim and Lin, 2003] Kim, T. and Lin, M. (2003). Visual simulation of ice crystal growth. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation.
[Klaus et al., 2002] Klaus, A., Bauer, J., Karner, K., and Schindler, K. (2002). MetropoGIS: A semi-automatic city documentation system. In Photogrammetric Computer Vision 2002 (PCV'02).

[Kolmogorov and Zabih, 2001] Kolmogorov, V. and Zabih, R. (2001). Computing visual correspondence with occlusions using graph cuts. In IEEE International Conference on Computer Vision (ICCV), pages 508–515.

[Kolmogorov and Zabih, 2002] Kolmogorov, V. and Zabih, R. (2002). Multi-camera scene reconstruction via graph cuts. In European Conference on Computer Vision (ECCV), pages 82–96.

[Kolmogorov and Zabih, 2004] Kolmogorov, V. and Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(2):147–159.

[Kolmogorov et al., 2003] Kolmogorov, V., Zabih, R., and Gortler, S. (2003). Generalized multi-camera scene reconstruction using graph cuts. In Fourth International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR).

[Konolige, 1997] Konolige, K. (1997). Small vision systems: Hardware and implementation. In Proceedings of the 8th International Symposium on Robotic Research, pages 203–212.

[Krishnan et al., 2002] Krishnan, S., Mustafa, N., and Venkatasubramanian, S. (2002). Hardware-assisted computation of depth contours. In 13th ACM-SIAM Symposium on Discrete Algorithms.

[Krüger and Westermann, 2003] Krüger, J. and Westermann, R. (2003). Linear algebra operators for GPU implementation of numerical algorithms. In Proceedings of SIGGRAPH 2003, pages 908–916.

[Kutulakos and Seitz, 2000] Kutulakos, K. and Seitz, S. (2000). A theory of shape by space carving. Int. Journal of Computer Vision, 38(3):198–216.

[Labatut et al., 2006] Labatut, P., Keriven, R., and Pons, J.-P. (2006). A GPU implementation of level set multiview stereo. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Lanczos, 1986] Lanczos, C. (1986). The Variational Principles of Mechanics. Dover Publications, fourth edition.

[Laurentini, 1995] Laurentini, A. (1995). How far 3D shapes can be understood from 2D silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 17(2).
[Lefohn et al., 2003] Lefohn, A., Kniss, J. M., Hansen, C. D., and Whitaker, R. T. (2003). Interactive deformation and visualization of level set surfaces using graphics hardware. In Proceedings of IEEE Visualization 2003, pages 75–82.

[Lei et al., 2006] Lei, C., Selzer, J., and Yang, Y. (2006). Region-tree based stereo using dynamic programming optimization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2378–2385.

[Lévy et al., 2002] Lévy, B., Petitjean, S., Ray, N., and Maillot, J. (2002). Least squares conformal maps for automatic texture atlas generation. In Proceedings of SIGGRAPH 2002, pages 362–371.

[Li et al., 2003] Li, M., Magnor, M., and Seidel, H.-P. (2003). Hardware-accelerated visual hull reconstruction and rendering. In Proceedings of Graphics Interface 2003.

[Li et al., 2004] Li, M., Magnor, M., and Seidel, H.-P. (2004). Hardware-accelerated rendering of photo hulls. In Proceedings of Eurographics 2004, pages 635–642.

[Li et al., 2002] Li, M., Schirmacher, H., Magnor, M., and Seidel, H.-P. (2002). Combining stereo and visual hull information for on-line reconstruction and rendering of dynamic scenes. In Proceedings of IEEE 2002 Workshop on Multimedia and Signal Processing, pages 9–12.

[Lindholm et al., 2001] Lindholm, E., Kilgard, M. J., and Moreton, H. (2001). A user-programmable vertex engine. In Proceedings of SIGGRAPH 2001, pages 149–158.

[Lok, 2001] Lok, B. (2001). Online model reconstruction for interactive virtual environments. In Symposium on Interactive 3D Graphics, pages 69–72.

[Lorensen and Cline, 1987] Lorensen, W. and Cline, H. (1987). Marching Cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH '87, pages 163–170.

[Lourakis and Argyros, 2004] Lourakis, M. and Argyros, A. (2004). The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Technical Report 340, Institute of Computer Science - FORTH. Available from http://www.ics.forth.gr/~lourakis/sba.

[Lowe, 1999] Lowe, D. (1999). Object recognition from local scale-invariant features. In Proc. of the International Conference on Computer Vision (ICCV), pages 1150–1157.

[Lu et al., 2002] Lu, A., Taylor, J., Hartner, M., Ebert, D., and Hansen, C. (2002). Hardware accelerated interactive stipple drawing of polygonal objects. In Proc. of Vision, Modeling and Visualization 2002, pages 61–68.
[Mairal and Keriven, 2006] Mairal, J. and Keriven, R. (2006). A GPU implementation of variational stereo. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Mark et al., 2003] Mark, W., Glanville, R., Akeley, K., and Kilgard, M. (2003). Cg: A system for programming graphics hardware in a C-like language. In Proceedings of SIGGRAPH 2003, pages 896–907.

[Matas et al., 2002] Matas, J., Chum, O., Urban, M., and Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the 13th British Machine Vision Conference, pages 384–393.

[Matusik et al., 2001] Matusik, W., Buehler, C., and McMillan, L. (2001). Polyhedral visual hulls for real-time rendering. In Proceedings of the 12th Eurographics Workshop on Rendering, pages 115–125.

[Mayer et al., 2001] Mayer, H., Bornik, A., Bauer, J., Karner, K., and Leberl, F. (2001). Multiresolution texture for photorealistic rendering. In Proceedings of the Spring Conference on Computer Graphics (SCCG).

[Mendonça and Cipolla, 1999] Mendonça, P. R. S. and Cipolla, R. (1999). A simple technique for self-calibration. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1500–1506.

[Mikolajczyk and Schmid, 2004] Mikolajczyk, K. and Schmid, C. (2004). Scale and affine invariant interest point detectors. Int. Journal of Computer Vision, 60(1):63–86.

[Mitchell, 2002] Mitchell, J. L. (2002). Hardware shading on the Radeon 9700. ATI Technologies.

[Mitchell et al., 2002] Mitchell, J. L., Brennan, C., and Card, D. (2002). Real-time image space outlining for non-photorealistic rendering. In SIGGRAPH 2002. Technical Sketch.

[Moreland and Angel, 2003] Moreland, K. and Angel, E. (2003). The FFT on a GPU. In Eurographics/SIGGRAPH Workshop on Graphics Hardware 2003, pages 112–119.

[Mühlmann et al., 2002] Mühlmann, K., Maier, D., Hesser, J., and Männer, R. (2002). Calculating dense disparity maps from color stereo images, an efficient implementation. Int. Journal of Computer Vision, 47:79–88.

[Mulligan et al., 2002] Mulligan, J., Isler, V., and Daniilidis, K. (2002). Trinocular stereo: a new algorithm and its evaluation. International Journal of Computer Vision, 47:51–61.

[Nagel and Enkelmann, 1986] Nagel, H.-H. and Enkelmann, W. (1986). An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 8:565–593.
[Nistér, 2001] Nistér, D. (2001). Calibration with robust use of cheirality by quasi-affine reconstruction of the set of camera projection centres. In Int. Conference on Computer Vision (ICCV), pages 116–123.

[Nistér, 2004a] Nistér, D. (2004a). An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):756–770.

[Nistér, 2004b] Nistér, D. (2004b). Untwisting a projective reconstruction. Int. Journal of Computer Vision, 60(2):165–183.

[Nistér et al., 2004] Nistér, D., Naroditsky, O., and Bergen, J. (2004). Visual odometry. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–659.

[NVidia Corporation, 2002a] NVidia Corporation (2002a). Cg language specification.

[NVidia Corporation, 2002b] NVidia Corporation (2002b). Developer relations. http://developer.nvidia.com.

[Ohta and Kanade, 1985] Ohta, Y. and Kanade, T. (1985). Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7:139–154.

[Papenberg et al., 2005] Papenberg, N., Bruhn, A., Brox, T., Didas, S., and Weickert, J. (2005). Highly accurate optic flow computation with theoretically justified warping. Technical report, Department of Mathematics, Saarland University.

[Peercy et al., 2006] Peercy, M., Segal, M., and Gerstmann, D. (2006). A performance-oriented data parallel virtual machine for GPUs. In ACM SIGGRAPH Sketches.

[Peercy et al., 2000] Peercy, M. S., Olano, M., Airey, J., and Ungar, P. J. (2000). Interactive multi-pass programmable shading. In Proceedings of SIGGRAPH 2000, pages 425–432.

[Perona and Malik, 1990] Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 12(7):629–639.

[Point Grey Research Inc., 2005] Point Grey Research Inc. (2005). http://www.ptgrey.com.
[Pollefeys et al., 1999] Pollefeys, M., Koch, R., and Van Gool, L. (1999). Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. International Journal of Computer Vision, 32(1):7–25.

[Pons et al., 2005] Pons, J.-P., Keriven, R., and Faugeras, O. (2005). Modelling dynamic scenes by registering multi-view image sequences. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 822–827.

[Prock and Dyer, 1998] Prock, A. and Dyer, C. (1998). Towards real-time voxel coloring. In Proc. Image Understanding Workshop, pages 315–321.

[Proudfoot et al., 2001] Proudfoot, K., Mark, W., Tzvetkov, S., and Hanrahan, P. (2001). A real-time procedural shading system for programmable graphics hardware. In Proceedings of SIGGRAPH 2001, pages 159–170.

[Rodrigues and Ramires Fernandes, 2004] Rodrigues, R. and Ramires Fernandes, A. (2004). Accelerated epipolar geometry computation for 3D reconstruction using projective texturing. In Proceedings of Spring Conference on Computer Graphics 2004, pages 208–214.

[Rudin et al., 1992] Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268.

[Sainz et al., 2002] Sainz, M., Bagherzadeh, N., and Susin, A. (2002). Hardware accelerated voxel carving. In 1st Ibero-American Symposium in Computer Graphics (SIACG 2002), pages 289–297.

[Scharstein and Szeliski, 2002] Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3):7–42.

[Schmidegg, 2005] Schmidegg, H. (2005). Texturing 3D models from historical images. Master's thesis, Graz University of Technology.

[Seitz et al., 2006] Seitz, S., Curless, B., Diebel, J., Scharstein, D., and Szeliski, R. (2006). A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

[Seitz and Dyer, 1997] Seitz, S. and Dyer, C. (1997). Photorealistic scene reconstruction by voxel coloring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1067–1073.

[Seitz and Dyer, 1999] Seitz, S. and Dyer, C. (1999). Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173.
[Seitz and Kutulakos, 2002] Seitz, S. and Kutulakos, K. (2002). Plenoptic image editing. International Journal of Computer Vision, 48(2):115–129.

[Shen, 2006] Shen, J. (2006). A stochastic-variational model for soft Mumford-Shah segmentation. International Journal of Biomedical Imaging, 2006:1–14.

[Sinha et al., 2006] Sinha, S. N., Frahm, J.-M., Pollefeys, M., and Genc, Y. (2006). GPU-based video feature tracking and matching. Technical Report 06-012, Department of Computer Science, UNC Chapel Hill.

[Slabaugh et al., 2001] Slabaugh, G., Culbertson, W. B., and Malzbender, T. (2001). A survey of methods for volumetric scene reconstruction from photographs. In Int. Workshop on Volume Graphics, pages 81–100.

[Slabaugh et al., 2002] Slabaugh, G., Schafer, R., and Hans, M. (2002). Image-based photo hulls. In The 1st International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT).

[Slesareva et al., 2005] Slesareva, N., Bruhn, A., and Weickert, J. (2005). Optic flow goes stereo: A variational method for estimating discontinuity-preserving dense disparity maps. In Proc. 27th DAGM Symposium, pages 33–40.

[Sormann et al., 2005] Sormann, M., Zach, C., Bauer, J., Karner, K., and Bischof, H. (2005). Automatic foreground propagation in image sequences for 3D reconstruction. In Proc. 27th DAGM Symposium, pages 93–100.

[Sormann et al., 2003] Sormann, M., Zach, C., and Karner, K. (2003). Texture mapping for view-dependent rendering. In Proceedings of Spring Conference on Computer Graphics 2003, pages 146–155.

[Sormann et al., 2006] Sormann, M., Zach, C., and Karner, K. (2006). Graph cut based multiple view segmentation for 3D reconstruction. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Stegmaier et al., 2005] Stegmaier, S., Strengert, M., Klein, T., and Ertl, T. (2005). A simple and flexible volume rendering framework for graphics-hardware-based raycasting. In Proceedings of Volume Graphics, pages 187–195.

[Stevens et al., 2002] Stevens, M. R., Culbertson, W. B., and Malzbender, T. (2002). A histogram-based color consistency test for voxel coloring. In Intl. Conference on Pattern Recognition, pages 118–121.

[Strecha et al., 2003] Strecha, C., Tuytelaars, T., and Van Gool, L. (2003). Dense matching of multiple wide-baseline views. In Int. Conference on Computer Vision (ICCV), pages 1194–1201.
[Strecha and Van Gool, 2002] Strecha, C. and Van Gool, L. (2002). PDE-based multi-view depth estimation. In 1st International Symposium on 3D Data Processing Visualization and Transmission, pages 416–425.

[Sugita et al., 2003] Sugita, K., Naemura, T., and Harashima, H. (2003). Performance evaluation of programmable graphics hardware for image filtering and stereo matching. In Proceedings of ACM Symposium on Virtual Reality Software and Technology 2003.

[Sun et al., 2005] Sun, J., Li, Y., Kang, S., and Shum, H.-Y. (2005). Symmetric stereo matching for occlusion handling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 399–406.

[Sun et al., 2003] Sun, J., Shum, H. Y., and Zheng, N. N. (2003). Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(7):787–800.

[Tappen and Freeman, 2003] Tappen, M. F. and Freeman, W. T. (2003). Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In Int. Conference on Computer Vision (ICCV), pages 900–907.

[Tarditi et al., 2005] Tarditi, D., Puri, S., and Oglesby, J. (2005). Accelerator: simplified programming of graphics processing units for general-purpose uses via data-parallelism. Technical Report MSR-TR-2005-184, Microsoft Research.

[Tell and Carlsson, 2000] Tell, D. and Carlsson, S. (2000). Wide baseline point matching using affine invariants computed from intensity profiles. In European Conference on Computer Vision (ECCV), pages 814–828.

[Thompson et al., 2002] Thompson, C. J., Hahn, S., and Oskin, M. (2002). Using modern graphics architectures for general-purpose computing: A framework and analysis. In 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35).

[Tran and Davis, 2006] Tran, S. and Davis, L. (2006). 3D surface reconstruction using graph cuts with surface constraints. In European Conference on Computer Vision (ECCV), pages 219–231.

[Tsai and Lin, 2003] Tsai, D.-M. and Lin, C.-T. (2003). Fast normalized cross correlation for defect detection. Pattern Recognition Letters, 24(15):2625–2631.

[Turk and Levoy, 1994] Turk, G. and Levoy, M. (1994). Zippered polygon meshes from range images. In Proceedings of SIGGRAPH '94, pages 311–318.

[Veksler, 2003] Veksler, O. (2003). Fast variable window for stereo correspondence using integral images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 556–561.
[Vogiatzis et al., 2005] Vogiatzis, G., Torr, P., and Cipolla, R. (2005). Multi-view stereo via volumetric graph-cuts. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages II:391–398.

[Wang et al., 2006] Wang, L., Liao, M., Gong, M., Yang, R., and Nistér, D. (2006). High quality real-time stereo using adaptive cost aggregation and dynamic programming. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Weickert and Brox, 2002] Weickert, J. and Brox, T. (2002). Diffusion and regularization of vector- and matrix-valued images. Inverse Problems, Image Analysis and Medical Imaging. Contemporary Mathematics, 313:251–268.

[Weickert et al., 2004] Weickert, J., Bruhn, A., Papenberg, N., and Brox, T. (2004). Variational optic flow computation: From continuous models to algorithms. In International Workshop on Computer Vision and Image Analysis, pages 1–6.

[Weiskopf et al., 2002] Weiskopf, D., Erlebacher, G., Hopf, M., and Ertl, T. (2002). Hardware-accelerated Lagrangian-Eulerian texture advection for 2D flow. In Proc. of Vision, Modeling and Visualization 2002, pages 77–84.

[Weiss and Freeman, 2001] Weiss, Y. and Freeman, W. T. (2001). On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):723–735.

[Westin et al., 2000] Westin, C.-F., Lorigo, L. M., Faugeras, O. D., Grimson, W. E. L., Dawson, S., Norbash, A., and Kikinis, R. (2000). Segmentation by adaptive geodesic active contours. In Proceedings of MICCAI 2000, Third International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 266–275.

[Wheeler et al., 1998] Wheeler, M., Sato, Y., and Ikeuchi, K. (1998). Consensus surfaces for modeling 3D objects from multiple range images. In Proceedings of ICCV '98, pages 917–924.

[Woetzel and Koch, 2004] Woetzel, J. and Koch, R. (2004). Real-time multi-stereo depth estimation on GPU with approximative discontinuity handling. In 1st European Conference on Visual Media Production (CVMP 2004), pages 245–254.

[Wonka et al., 2000] Wonka, P., Wimmer, M., and Schmalstieg, D. (2000). Visibility preprocessing with occluder fusion for urban walkthroughs. In Rendering Techniques 2000 (Proceedings of the Eurographics Workshop 2000), pages 71–82.

[Woodfill and Herzen, 1997] Woodfill, J. and Herzen, B. V. (1997). Real-time stereo vision on the PARTS reconfigurable computer. In IEEE Symposium on FPGAs for Custom Computing Machines.
[Yang et al., 2006] Yang, Q., Wang, L., and Yang, R. (2006). Real-time global stereo matching using hierarchical belief propagation. In Proceedings of the 17th British Machine Vision Conference.

[Yang and Pollefeys, 2003] Yang, R. and Pollefeys, M. (2003). Multi-resolution real-time stereo on commodity graphics hardware. In Conference on Computer Vision and Pattern Recognition (CVPR).

[Yang et al., 2004] Yang, R., Pollefeys, M., and Li, S. (2004). Improved real-time stereo on commodity graphics hardware. In CVPR 2004 Workshop on Real-Time 3D Sensors and Their Use.

[Yang et al., 2003] Yang, R., Pollefeys, M., and Welch, G. (2003). Dealing with textureless regions and specular highlights – a progressive space carving scheme using a novel photo-consistency measure. In Int. Conference on Computer Vision (ICCV), pages 576–584.

[Yang et al., 2002] Yang, R., Welch, G., and Bishop, G. (2002). Real-time consensus based scene reconstruction using commodity graphics hardware. In Proceedings of Pacific Graphics, pages 225–234.

[Yezzi and Soatto, 2003] Yezzi, A. and Soatto, S. (2003). Stereoscopic segmentation. International Journal of Computer Vision, 53(1):31–43.

[Zach and Bauer, 2002] Zach, C. and Bauer, J. (2002). Automatic texture hierarchy generation from orthographic facade textures. In 26th Workshop of the Austrian Association for Pattern Recognition (AAPR) 2002.

[Zach et al., 2004a] Zach, C., Grabner, M., and Karner, K. (2004a). Improved compression of topology for view-dependent rendering. In Proceedings of Spring Conference on Computer Graphics 2004, pages 174–182.

[Zach and Karner, 2003a] Zach, C. and Karner, K. (2003a). Fast event-driven refinement of dynamic levels of detail. In Proceedings of Spring Conference on Computer Graphics 2003, pages 65–72.

[Zach and Karner, 2003b] Zach, C. and Karner, K. (2003b). Progressive compression of visibility data for view-dependent multiresolution meshes. Journal of WSCG, 11(3):546–553.

[Zach et al., 2003a] Zach, C., Klaus, A., Hadwiger, M., and Karner, K. (2003a). Accurate dense stereo reconstruction using graphics hardware. In Proc. Eurographics 2003, Short Presentations.

[Zach et al., 2003b] Zach, C., Klaus, A., Reitinger, B., and Karner, K. (2003b). Optimized stereo reconstruction using 3D graphics hardware. In Workshop on Vision, Modelling, and Visualization (VMV 2003), pages 119–126.
[Zach et al., 2002] Zach, C., Mantler, S., and Karner, K. (2002). Time-critical rendering of discrete and continuous levels of detail. In Proceedings of ACM Symposium on Virtual Reality Software and Technology 2002, pages 1–8.

[Zach et al., 2004b] Zach, C., Mantler, S., and Karner, K. (2004b). Time-critical rendering of huge ecosystems using discrete and continuous levels of detail. Presence: Teleoperators and Virtual Environments.

[Zach et al., 2006a] Zach, C., Sormann, M., and Karner, K. (2006a). High-performance multi-view reconstruction. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Zach et al., 2006b] Zach, C., Sormann, M., and Karner, K. (2006b). Scanline optimization for stereo on graphics hardware. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Zebedin, 2005] Zebedin, L. (2005). Texturing complex 3D models. Master's thesis, Technical University Graz.