
Graz University of Technology
Institute for Computer Graphics and Vision

Dissertation

High-Performance Modeling From Multiple Views Using Graphics Hardware

Christopher Zach

Graz, Austria, February 2007

Thesis supervisors

Prof. Dr. Franz Leberl, Graz University of Technology
Prof. Dr. Horst Bischof, Graz University of Technology


Abstract

Generating 3-dimensional virtual representations of real-world environments is still a challenging scientific and technological objective. Photogrammetric computer vision methods enable the creation of virtual copies from a set of acquired images. These methods are usually based on either off-the-shelf digital cameras or large-scale sensors. High-quality image-based models with minimal human assistance are achieved by ensuring sufficient redundancy in the image content. As a consequence, a large amount of image data needs to be captured and subsequently processed. Recent advances in the computational performance of graphics processing units (GPUs) and in their programmable features make these devices a natural platform for generic high-performance parallel processing. In particular, several fundamental computer vision methods can be successfully accelerated by graphics hardware due to their intrinsic parallelism and the highly efficient filtered pixel access.

The contribution of this thesis is the development of several new 3D vision algorithms intended for efficient execution on current-generation GPUs. All proposed methods address the fully automated creation of dense 2.5D and 3D geometry of objects and environments captured in a sequence of images. The range of presented methods starts with simple and purely local approaches with very efficient implementations. Furthermore, a novel formulation of a semi-global depth estimation approach suitable for fast execution on the GPU is presented. In addition, it is shown that variational methods for depth estimation can benefit significantly from GPU acceleration as well. Finally, highly efficient methods are presented that generate 3D models from the input image set, either directly from the images or indirectly via intermediate 2.5D geometry. The performance of the developed methods and their respective implementations is evaluated on artificial datasets to obtain quantitative results, and demonstrated in real-world applications as well. The proposed methods are incorporated into a complete 3D vision pipeline, which was successfully applied in several research projects.

Keywords. multiple view reconstruction, depth estimation, dynamic programming, variational depth map evolution, space carving, volumetric range image integration, general purpose programming on graphics processing units (GPGPU), GPU acceleration


Acknowledgments

Writing a PhD thesis is a large-scale project. Everybody with a PhD degree knows this simple fact from his or her own experience. Although the primary responsibility for making progress with the thesis lies with oneself, the support of many other people is essential for a successful completion. This section is the place to mention and to thank the people who helped me directly or indirectly in preparing this thesis.

First I need to thank my thesis supervisors, Prof. Franz Leberl and Prof. Horst Bischof from the Institute for Computer Graphics and Vision, for their advice during my time as a PhD student. In those times when Prof. Leberl was engaged with highly ambitious projects, Prof. Bischof provided significant guidance for my scientific work.

During my PhD time I was a researcher at the VRVis Research Center for Virtual Reality and Visualization, and this thesis was largely funded by this research company. I would like to thank my current and former colleagues from VRVis Graz and Vienna for the opportunity of this position and for their collaboration.

In particular, the full reconstruction pipeline creating virtual copies from a set of images contains many more steps than those developed by me during this thesis. Several stages in the pipeline are the work of my colleagues in the “Virtual Habitat” group at VRVis. First I would like to thank Mario, who acquired many of the source images and is mainly responsible for the first steps in the modeling pipeline. The textures for the final 3D models displayed in this thesis were generated by Lukas as part of his master thesis.

I would like to thank Dr. Ivana Kolingerova and her PhD students from Plzen, who invited me to work for several weeks in this really nice town. I spent almost two months there (including the annual WSCG conference).

During my time as a PhD student I advised three master students: Mario, Lukas and Manni, who all did valuable work for their respective projects. Mario and Lukas started working at VRVis after finishing their master theses. Manni began working at the associated computer vision institute, hence I guess I did not discourage those students too much.

Having the office located directly at the Institute for Computer Graphics and Vision proved highly beneficial. Several new ideas were developed during personal talks with the institute members. In particular, I would like to thank the current and former attendees of the espresso club, namely Bernhard, Horst, Martina, Mike, Tom (2x), Pierre and last but not least Roli, whose legendary parties will be remembered for a long, long time. Additionally, I had fruitful and interesting discussions with Peter, Matthias, Suri, Markus, Alex, and especially with Martin, who has shared the office with me for so many years now.

Finishing this thesis would not have been possible without some additional activities freeing the mind and relaxing the body. First I would like to thank all Aikido teachers and fellows on the tatami from Graz, who have worked hard for the last seven years to make my body less stiff.

Furthermore, I would like to thank Vera for persuading me to start dancing lessons with her. She is not only a clever and ambitious person, but also turned out to be a gifted partner in the dance hall.

Graz, January 2007

Christopher Zach

The problem is not that people will steal your ideas. On the contrary,
your job as an academic is to ensure that they do.

Tom's advice, according to Frank Dellaert


Contents

1 Introduction 1
  1.1 Introduction . . . 1
  1.2 Using Graphics Processing Units for Computer Vision . . . 2
  1.3 3D Models from Multiple Images . . . 5
  1.4 Overview of this Thesis and Contributions . . . 10

2 Related Work 15
  2.1 Dense Depth and Model Estimation . . . 15
    2.1.1 Computational Stereo on Rectified Images . . . 15
    2.1.2 Multi-View Depth Estimation . . . 17
    2.1.3 Direct 3D Model Reconstruction . . . 18
  2.2 GPU-based 3D Model Computation . . . 19
    2.2.1 General Purpose Computations on the GPU . . . 19
    2.2.2 Real-time and GPU-Accelerated Dense Reconstruction from Multiple Images . . . 22

3 Mesh-based Stereo Reconstruction Using Graphics Hardware 27
  3.1 Introduction . . . 27
  3.2 Overview of Our Method . . . 28
    3.2.1 Image Warping and Difference Image Computation . . . 29
    3.2.2 Local Error Summation . . . 30
    3.2.3 Determining the Best Local Modification . . . 31
    3.2.4 Hierarchical Matching . . . 31
  3.3 Implementation . . . 33
    3.3.1 Mesh Rendering and Image Warping . . . 33
    3.3.2 Local Error Aggregation . . . 35
    3.3.3 Encoding of Integers in RGB Channels . . . 35
  3.4 Performance Enhancements . . . 36
    3.4.1 Amortized Difference Image Generation . . . 36
    3.4.2 Parallel Image Transforms . . . 36
    3.4.3 Minimum Determination Using the Depth Test . . . 37
  3.5 Results . . . 38
  3.6 Discussion . . . 40

4 GPU-based Depth Map Estimation using Plane Sweeping 43
  4.1 Introduction . . . 43
  4.2 Plane Sweep Depth Estimation . . . 43
    4.2.1 Image Warping . . . 45
    4.2.2 Image Correlation Functions . . . 45
      4.2.2.1 Efficient Summation over Rectangular Regions . . . 46
      4.2.2.2 Normalized Correlation Coefficient . . . 47
    4.2.3 Sum of Absolute Differences and Variants . . . 48
    4.2.4 Depth Extraction . . . 50
  4.3 Sparse Belief Propagation . . . 50
    4.3.1 Sparse Data Structures . . . 51
      4.3.1.1 Sparse Data Cost Volume During Plane-Sweep . . . 51
      4.3.1.2 Sparse Data Cost Volume for Message Passing . . . 52
    4.3.2 Sparse Message Update . . . 52
      4.3.2.1 Sparse 1D Distance Transform . . . 53
  4.4 Depth Map Smoothing . . . 54
  4.5 Timing Results . . . 55
  4.6 Visual Results . . . 58
  4.7 Discussion . . . 58

5 Space Carving on 3D Graphics Hardware 63
  5.1 Introduction . . . 63
  5.2 Volumetric Scene Reconstruction and Space Carving . . . 64
  5.3 Single Sweep Voxel Coloring in 3D Hardware . . . 66
    5.3.1 Initialization . . . 66
    5.3.2 Voxel Layer Generation . . . 67
    5.3.3 Updating the Depth Maps . . . 69
    5.3.4 Immediate Visualization . . . 70
  5.4 Extensions to Multi Sweep Space Carving . . . 70
  5.5 Experimental Results . . . 72
    5.5.1 Performance Results . . . 72
    5.5.2 Visual Results . . . 72
  5.6 Discussion . . . 73

6 PDE-based Depth Estimation on the GPU 79
  6.1 Introduction . . . 79
  6.2 Variational Techniques for Multi-View Depth Estimation . . . 80
    6.2.1 Basic Model . . . 80
    6.2.2 Regularization . . . 82
    6.2.3 Extensions and Variations . . . 83
      6.2.3.1 Back-Matching . . . 83
      6.2.3.2 Local Changes in Illumination . . . 84
      6.2.3.3 Other Variations . . . 84
  6.3 GPU-based Implementation . . . 85
    6.3.1 Image Warping . . . 85
    6.3.2 Regularization Pass . . . 86
    6.3.3 Depth Update Equation . . . 87
      6.3.3.1 Jacobi Iterations . . . 87
      6.3.3.2 Conjugate Gradient Solver . . . 87
    6.3.4 Coarse-to-Fine Approach . . . 88
  6.4 Results . . . 88
    6.4.1 Facade Datasets . . . 88
    6.4.2 Small Statue Dataset . . . 89
    6.4.3 Mirabellstatue Dataset . . . 92
  6.5 Discussion . . . 92

7 Scanline Optimization for Stereo On Graphics Hardware 97
  7.1 Introduction . . . 97
  7.2 Scanline Optimization on the GPU for 2-Frame Stereo . . . 98
    7.2.1 Scanline Optimization and Min-Convolution . . . 98
    7.2.2 Overall Procedure . . . 101
    7.2.3 GPU Implementation Enhancements . . . 101
      7.2.3.1 Fewer Passes Through Bidirectional Approach . . . 101
      7.2.3.2 Disparity Tracking and Improved Parallelism . . . 102
      7.2.3.3 Readback of Tracked Disparities . . . 103
    7.2.4 Results . . . 104
  7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware . . . 105
    7.3.1 Input Data and General Setting . . . 106
    7.3.2 Similarity Scores based on Incremental Summation . . . 107
    7.3.3 Sensor Image Warping . . . 109
    7.3.4 Slice Management . . . 110
    7.3.5 SAD Calculation . . . 110
    7.3.6 Normalized Cross Correlation . . . 111
    7.3.7 Depth Extraction by Scanline Optimization . . . 111
    7.3.8 Memory Requirements . . . 112
    7.3.9 Results . . . 113
  7.4 Discussion . . . 115

8 Volumetric 3D Model Generation 119
  8.1 Introduction . . . 119
  8.2 Selecting the Volume of Interest . . . 120
  8.3 Depth Map Conversion . . . 121
  8.4 Isosurface Determination and Extraction . . . 124
  8.5 Implementation Remarks . . . 126
  8.6 Results . . . 126
  8.7 Discussion . . . 127

9 Results 131
  9.1 Introduction . . . 131
  9.2 Synthetic Sphere Dataset . . . 131
  9.3 Synthetic House Dataset . . . 134
  9.4 Middlebury Multi-View Stereo Temple Dataset . . . 137
  9.5 Statue of Emperor Charles VI . . . 138
  9.6 Bodhisattva Figure . . . 140

10 Concluding Remarks 147

A Selected Publications 151
  A.1 Publications Related to this Thesis . . . 151
  A.2 Other Selected Scientific Contributions . . . 151

Bibliography 153


List of Figures

1.1 Several reconstructed statue models . . . 3
1.2 A possible pipeline to create virtual models from images . . . 5
1.3 The reconstruction pipeline in an example . . . 13
2.1 The stream computation model of a GPU . . . 20
3.1 Mesh reconstruction from a pair of stereo images . . . 29
3.2 The regular grid as seen from the key camera . . . 30
3.3 The neighborhood of a currently evaluated vertex . . . 30
3.4 The correspondence between vertex indices and grid positions . . . 31
3.5 The basic workflow of the matching procedure . . . 32
3.6 The modified pipeline to minimize P-buffer switches . . . 38
3.7 Fragment program to write the depth component . . . 39
3.8 Results for the artificial earth dataset . . . 39
3.9 Results for a dataset showing the yard inside a historic building . . . 40
3.10 Results for a dataset showing an apartment house . . . 41
3.11 Visual results for the Merton college dataset . . . 42
4.1 Plane sweeping principle . . . 44
4.2 NCC images calculated on the CPU (left) and on the GPU (right) . . . 48
4.3 Determining the lower envelope using a sparse 1D distance transform . . . 53
4.4 Sparse belief propagation timing results wrt. the number of heap entries K . . . 57
4.5 Depth images with and without belief propagation . . . 60
4.6 Point models with and without belief propagation . . . 61
4.7 Point models with and without belief propagation . . . 61
4.8 Depth images with and without belief propagation . . . 62
5.1 A possible configuration for plane sweeping through the voxel space . . . 65
5.2 Perspective texture mapping using visibility information . . . 67
5.3 Evolution of depth maps for two views during the sweep process . . . 69
5.4 Plane sweep with partial knowledge from the preceding sweeps . . . 71
5.5 Timing results for the Bowl dataset . . . 74
5.6 Space carving results for the synthetic Dino dataset . . . 75
5.7 Space carving results for the synthetic Bowl dataset . . . 76
5.8 Space carving results for a statue dataset . . . 77
5.9 Voxel coloring results for a statue dataset . . . 78
6.1 Sparse structure of the linear system obtained from the semi-implicit approach . . . 88
6.2 A reconstructed historical statue displayed as colored point set . . . 89
6.3 The depth maps of the embedded statue reconstructed with the numerical schemes . . . 90
6.4 The effect of bidirectional matching on the embedded statue scene . . . 91
6.5 Two views on the colored point set showing the front facade of a church . . . 92
6.6 The three source images and the resulting unsuccessful reconstruction of the statue . . . 94
6.7 Two of the successfully reconstructed point sets using image segmentation to omit the background scenery . . . 95
6.8 An enhanced depth map and 3D point set obtained using the truncated error model . . . 95
6.9 The effect of image-driven anisotropic diffusion . . . 96
7.1 Graphical illustration of the forward pass using a recursive doubling approach . . . 100
7.2 Parallel processing of vertical scanlines using the bidirectional approach for optimal utilization of the four available color channels . . . 103
7.3 Disparity images for the Tsukuba dataset for several horizontal resolutions generated by the GPU-based scanline approach . . . 105
7.4 Disparity images for the Cones and Teddy image pairs from the Middlebury stereo evaluation datasets . . . 106
7.5 Plane-sweep approach to multiple view matching . . . 108
7.6 Plane sweep from left to right . . . 108
7.7 Spatial aggregation for the correlation window using sliding sums . . . 110
7.8 The three input views of the synthetic dataset . . . 113
7.9 The obtained depth maps and timing results for the synthetic dataset using multiview scanline optimization on the GPU . . . 114
7.10 The three input views of a wooden Bodhisattva statue and the corresponding depth maps . . . 117
8.1 Classification of the voxel according to the depth map and camera parameters . . . 122
8.2 Visual results for a small statue dataset generated from a sequence of 47 images . . . 127
8.3 Source views and isosurfaces for two real-world datasets . . . 128
9.1 Three source views of the synthetic sphere dataset . . . 132
9.2 Depth estimation results for a view triplet of the sphere dataset . . . 133
9.3 Fused 3D models for the sphere dataset wrt. the depth estimation method . . . 133
9.4 Three source views of the synthetic house dataset . . . 134
9.5 Fused 3D models for the synthetic house dataset wrt. the depth estimation method . . . 135
9.6 Three generated depth maps of the synthetic house dataset . . . 136
9.7 Three (out of 47) source images of the temple model dataset . . . 138
9.8 Front and back view of the fused 3D model of the temple dataset based on the original camera matrices . . . 139
9.9 Front and back view of the fused 3D model of the temple dataset based on newly calculated camera matrices . . . 140
9.10 Two views of the statue showing Emperor Charles VI inside the state hall of the Austrian National Library . . . 141
9.11 Medium resolution mesh for the Charles VI dataset . . . 142
9.12 High resolution mesh for the Charles VI dataset . . . 143
9.13 Two depth maps for the same reference view of the Charles dataset generated by the WTA and the SO approach . . . 144
9.14 Every other of the 13 source images of the Bodhisattva statue dataset . . . 144
9.15 Several depth images for the Bodhisattva statue . . . 145
9.16 Medium and high resolution results for the Bodhisattva statue images . . . 145


List of Tables

3.1 Timing results for the sphere dataset on two different graphic cards . . . 40
4.1 Timing results for the plane-sweeping approach on the GPU with winner-takes-all depth extraction at different parameter settings and image resolutions . . . 56
6.1 Regularization terms induced by diffusion processes . . . 82
7.1 Average timing result for various dataset sizes in seconds/frame . . . 104
7.2 Runtimes of GPU-scanline optimization using a 9 × 9 NCC at different resolutions using three views . . . 114
9.1 Quantitative evaluation of the reconstructed spheres . . . 134
9.2 Quantitative evaluation of the reconstructed synthetic house . . . 137
9.3 Timing results for the Emperor Charles dataset . . . 138


Chapter 1

Introduction

Contents
  1.1 Introduction . . . 1
  1.2 Using Graphics Processing Units for Computer Vision . . . 2
  1.3 3D Models from Multiple Images . . . 5
  1.4 Overview of this Thesis and Contributions . . . 10

1.1 Introduction

Creating a 3D virtual representation of a real object or scene from images or other sensory data has many important real-world applications, ranging from city planning tasks performed by surveying offices, to virtual conservation of historic buildings and objects, to entertainment and gaming applications creating virtual models of real and well-known locations. Consequently, automated and reliable 3D model generation work-flows for data acquired by active and passive sensors are still an active research topic. In particular, creating 3D representations of real objects solely from multiple images is a challenging task, since the completely automated work-flow is based only on passive sensory data.

The development of suitable algorithms and methods for a multi-view reconstruction pipeline depends substantially on the objects of interest and on the number and quality of the acquired images. In order to enable a fully automated work-flow, the images must contain substantial redundancy, i.e. the same 3D features must appear in several images. Furthermore, static and rigid objects are assumed in this work to make the traditional multiple view approaches for image registration applicable. A further question addresses the intended accuracy of the obtained models. As explained later in more detail, the major objectives of the methods developed in this thesis are achieving high performance for immediate visual feedback to the user and attaining sufficient accuracy for photorealistic visualization of the virtual models. Dense meshes and depth maps generated from multiple views are usually not directly suitable for accurate 3D measurements, since the achievable accuracy, especially in low-textured regions, is limited. Nevertheless, further knowledge about the object of interest enables, e.g., fitting geometric primitives into the dense mesh, potentially yielding higher accuracy.

The methods proposed in our work-flow are mainly designed for typical close-range imagery, but they are not strictly limited to these settings. In order to illustrate the kind of datasets to be reconstructed using our modeling pipeline, we first give a few examples of virtual models generated by employing the proposed work-flow. Figure 1.1 displays three 3D models generated solely from multiple images using the methods proposed in this thesis in several stages. In particular, efficient dense depth estimation methods (Chapters 4 and 7) were applied to obtain 2.5D height-fields, which were subsequently fused into a final 3D model using a volumetric approach (Chapter 8). All procedures in the 3D reconstruction pipeline to create 3D models solely from images are briefly outlined in Section 1.3.

The models displayed in Figure 1.1 are partially used for a historical documentation system∗. The generated models are high-resolution 3D meshes, which are intended for visualization when combined with a photorealistic texture.

∗ www.josefsplatz.info

1.2 Using Graphics Processing Units for Computer Vision

In this thesis we propose employing the computing power of modern programmable graphics processing units (GPUs) for several essential stages in the 3D reconstruction pipeline. One goal of this work is fast visual feedback to the human operator, who can immediately judge the quality of the results and may optionally adjust suitable parameters if necessary. Further, it is indispensable to have a substantial amount of redundancy in the image content when applying the current methods for reconstruction from multiple views in order to achieve high quality models. This implies that full 3D modeling of even a single object typically requires at least tens of images to be processed. Fast processing of these image sets is desirable, since obtaining the final model after two or 20 minutes makes a substantial difference.† If special-purpose hardware (mainly graphics processing units, but also digital signal processors (DSPs) and field-programmable gate arrays (FPGAs)) is employed in computer vision methods, several types of application can be distinguished:

1. The first scenario enforces real-time response within specified temporal limits, and special-purpose hardware provides the required processing power. Much of the initial research on accelerating computer vision methods is driven by the real-time needs of the particular application.

2. The main objective in the second setting is faster (but not necessarily real-time) processing by using special hardware intensively. Since the computational accuracy and the programming model of special-purpose hardware are often limited, the quality of the result may be reduced compared with the outcome of CPU implementations. Finding an appropriate trade-off between higher performance and the resulting quality degradation is the challenge in this setting. Most methods proposed in this thesis fall into this category.

3. Finally, special-purpose hardware can be used purely as an auxiliary processing unit executing only fractions of the overall method. In this case there is typically no degradation in the quality of the result, but the achieved performance gain can be limited. Special-purpose hardware usually performs its computation asynchronously to the main CPU, hence a load-balanced implementation employing both processing units concurrently gives the largest gain. Most computer vision methods must be redesigned in order to benefit from this combined processing power.

† Especially if the outcome is unsatisfying.

Figure 1.1: Several reconstructed statue models generated by our high-performance modeling pipeline. (a) Small statue of St. Barbara; (b) Emperor Joseph, Josephsplatz; (c) Emperor Karl, Josephsplatz. In (a) the model of a small statue depicting St. Barbara is shown. Figure (b) illustrates the model of an outdoor statue of Emperor Joseph. Finally, (c) shows the virtual model of the Emperor Karl statue inside the Austrian National Library. The displayed models are not post-processed (e.g. smoothed or geometrically simplified). In (a) and (c) some noise and clutter can be seen, which can be removed by incorporating silhouette data.

With the general availability of programmable graphics processing units and their large processing power, it is natural that modern graphics hardware also attracts many researchers who want to accelerate their non-graphical applications. We focus on programmable graphics hardware as a computing device for the following reasons:

• Driven by the needs of the gaming industry, graphics hardware currently evolves much faster than traditional CPUs or other processing devices. Selected numerical operations perform almost 10 times faster on high-end graphics hardware than on high-end CPUs.

• A reasonably fast graphics processing unit is nowadays built into many consumer personal computers. Hence, the necessary hardware equipment is available to virtually everyone.

• Standardized programming interfaces working with hardware from different vendors have recently become available. This allows our procedures to execute on a wider range of hardware not limited to a specific vendor. Additionally, the development cycle is eased by multi-vendor programming interfaces and tools.

• While performing non-graphical computations, the GPU can be used directly to display intermediate and final results to the operator, since the necessary data is already stored in GPU memory.

Due to these factors, modern graphics hardware is currently an ideal target platform for high-performance parallel computing.

Note that the rapid development of new features built into every upcoming generation of graphics hardware requires a constant adaptation of GPU-based methods to obtain maximal performance. Consequently, a continuous redesign of GPU-based implementations is still necessary, since new features may enable significant performance improvements, and various techniques to increase the speed on current hardware may become obsolete in the next generation of graphics hardware. Nevertheless, we assume a stabilizing feature set for GPUs in the medium term.

Using the GPU as a major processing unit for non-graphical problems allows direct visualization of intermediate and final results without an additional performance penalty. We employ this feature in most of our proposed reconstruction methods to give the user direct visual feedback showing the progress of the procedure. Whether immediate visual feedback (i.e. after a few seconds at most) is available depends on the reconstruction pipeline as well. Relatively simple methods, e.g. those developed for small-baseline image sets yielding a depth map, allow sequential processing of the whole dataset, and the first depth images are available with little delay. In these cases the provided intermediate results have full resolution, but refer only to a fraction of the final model. Sophisticated multiple-view methods incorporating all images simultaneously often do not have this fine granularity and generally provide no intermediate result (at full resolution) to the human operator. Typically, a coarse-to-fine scheme forms the basis of these methods, and intermediate results at coarser resolutions can be shown to the operator.


In any case, when processing larger datasets with different characteristics and from different sources, the opportunity to evaluate the outcome of the whole modeling pipeline visually at early processing stages proves very useful.

Although graphics processing units have very high computing power, the programming model of graphics hardware is limited. Consequently, the set of computer vision methods suitable for full acceleration by GPUs is restricted. For example, several highly sophisticated dense depth estimation methods are currently beyond the capabilities of programmable graphics hardware, or allow acceleration of only fractions of the whole procedure in the best case. Hence, only relatively simple (but still nontrivial) computer vision methods can fully benefit from graphics processing units so far.

Nevertheless, in many cases the 3D models created by our high-performance work-flow have sufficient quality for further processing and photorealistic display of the virtual models. The main contribution of this thesis consists of the adaptation of several multi-view reconstruction methods to enable an efficient implementation using graphics hardware in the first place. Further, the actual efficiency and the quality of the obtained 3D models are demonstrated on multiple real-world datasets.

1.3 3D Models from Multiple Images

The creation of virtual 3D models of real objects from a set of digital images requires a pipeline of several stages. The set of procedures applied in this pipeline depends on the actual setup and on the intended use of the generated model. The steps performed to create many of the virtual models shown in this thesis are illustrated in Figure 1.2.

[Figure 1.2: A possible pipeline to create virtual models from images. Digital images pass through feature extraction (features, POIs), correspondence estimation (multi-view geometry, sparse model), dense depth estimation (depth images), multi-view depth integration (raw 3D geometry), geometry processing (refined 3D geometry) and texturing, yielding the textured 3D model.]

The steps in this pipeline are suitable for reconstructing a 3D object from many small-baseline images taken with a high-quality and already calibrated digital single-lens reflex camera. If the images are recorded with a digital video camera or a cheap digital consumer camera, several (especially early) stages in the pipeline will be substantially different. We describe the individual processing steps in this pipeline briefly and outline the necessary adaptations for different source material.
adaptions in case of different source material.


Camera Calibration and Self-Calibration

The term camera calibration often refers to two related, but nevertheless distinct steps to obtain several parameters of the employed digital camera and its lens system. The first procedure determines lens distortion parameters to remove the deviations in the image induced by the optical lenses. Knowledge of the lens distortion and subsequent resampling of the source images allow the application of the simple pinhole camera model in the successive processing stages. The second part of the camera calibration step addresses the determination of the main parameters of the now applicable idealized pinhole camera model. These parameters are typically comprised in a 3-by-3 upper triangular matrix

K = \begin{pmatrix} f & s & x_0 \\ 0 & a f & y_0 \\ 0 & 0 & 1 \end{pmatrix}.

Knowledge of this matrix allows the obtained 3D reconstructions to reside in a metric space, i.e. the obtained angles and length ratios correspond to the ones of the true model. Without additional knowledge it is not possible to determine the overall scale (or object size) solely from images.

The most important parameter in this matrix is the focal length f. If the focal length is incorrectly estimated, the resulting 3D model is severely distorted. The skew parameter s is determined by the x- and y-axes of the sensor pixels and is very close to zero for all practical cameras. Many calibration and especially self-calibration techniques assume orthogonal sensor axes and consequently s = 0. The aspect ratio parameter a is one for square-shaped sensor pixels, which is a very common assumption. The intersection of the optical axis with the image plane is called the principal point (x_0, y_0) and is usually close to the image center. Accurate estimation of the principal point is difficult (since moving the principal point can be largely compensated by a world-space translation), but the quality of the 3D model is only weakly affected by an incorrect principal point.
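To make the role of these intrinsic parameters concrete, the following minimal sketch (Python with NumPy; the numerical values are purely illustrative and not taken from any camera used in this thesis) projects a point given in camera coordinates onto the image plane using the pinhole model with the matrix K defined above. Lens distortion is assumed to have been removed beforehand, as described above.

    import numpy as np

    # Illustrative intrinsic parameters (assumed values, not a real calibration):
    f, s, a = 1000.0, 0.0, 1.0      # focal length, skew, aspect ratio
    x0, y0 = 320.0, 240.0           # principal point near the image center

    K = np.array([[f,     s, x0],
                  [0.0, a*f, y0],
                  [0.0, 0.0, 1.0]])

    def project(K, X_cam):
        """Project a 3D point given in camera coordinates to pixel coordinates."""
        x = K @ X_cam               # homogeneous image coordinates
        return x[:2] / x[2]         # perspective division

    print(project(K, np.array([0.1, -0.2, 2.0])))   # -> [370. 140.]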

Since we focus mainly on generating 3D models from images taken with precalibrated cameras, a standard camera calibration procedure [Heikkilä, 2000] using predefined targets is typically employed in our work-flow. Several images of a planar target with known circular control points are taken, and camera matrices and lens distortion parameters are determined using a nonlinear optimization approach. The advantage of using precalibrated cameras is the high accuracy of the estimated intrinsic parameters of the camera. Hence, the subsequently calculated relative orientation and the dense depth estimation are based on reliable camera parameters and yield high quality results.

On the other hand, good calibration results are mainly available for high-quality cameras, and usually fixed lenses set to infinite focus are required. A work-flow based on target calibration is only partially applicable to cheap consumer cameras with zooming and automatic focusing, and it typically fails for video sequences.

Self-calibration methods attempt to recover the intrinsic camera parameters solely from image information like correspondences between multiple views. Radial distortion parameters can be determined even from single images using extracted 2D lines [Devernay and Faugeras, 2001], but for real datasets some manual intervention is often necessary in order to connect short line segments belonging to the same object line [Schmidegg, 2005]. Of course, this approach requires that e.g. a building with dominant feature lines or even a printed page with straight lines is captured by the camera.

During self-calibration the parameters of the pinhole camera model are determined by utilizing certain analytic properties of the epipolar geometry. Several self-calibration methods start with a projective reconstruction based on point correspondences and the induced fundamental matrices between the images. The inherent projective ambiguity can be resolved using algebraic invariants and reasonable assumptions on the camera model (like zero skew and square pixels) [Pollefeys et al., 1999, Nistér, 2001, Nistér, 2004b]. The main difficulty of these approaches is the creation of an initial accurate and outlier-free projective reconstruction, since the self-calibration procedures are very sensitive to incorrect input data. A simple self-calibration method not requiring a projective 3D reconstruction is proposed in [Mendonça and Cipolla, 1999]. This approach refines the intrinsic camera parameters to upgrade the supplied fundamental matrices to essential matrices, which have stronger algebraic properties. The essential matrix encodes the relative pose between two views and has fewer degrees of freedom than the fundamental matrix. In particular, the two non-zero singular values of an essential matrix are equal. This property is utilized in [Mendonça and Cipolla, 1999] to adjust the initially provided camera intrinsic parameters such that the non-zero singular values of the upgraded fundamental matrices are as close as possible. We optionally employ this method even in the calibrated case to refine the camera intrinsic parameters for highest accuracy.
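As an illustration of the criterion exploited by [Mendonça and Cipolla, 1999] (a sketch of the underlying idea only, not their implementation or ours), the following Python/NumPy function measures how far the upgraded matrices deviate from having two equal non-zero singular values; a generic optimizer can then minimize this cost over the unknown intrinsic parameters, starting in our case from the target-calibration estimate of K.

    import numpy as np

    def self_calibration_cost(F_list, K):
        """Normalized gap between the two largest singular values of each
        upgraded matrix E = K^T F K, assuming a single camera with shared
        intrinsics. For correct intrinsics K, every E is an essential matrix
        and the gap vanishes (cf. Mendonca and Cipolla, 1999)."""
        cost = 0.0
        for F in F_list:
            E = K.T @ F @ K
            s = np.linalg.svd(E, compute_uv=False)   # singular values, descending
            cost += (s[0] - s[1]) / s[1]
        return cost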

Feature Extraction

Feature extraction selects image points or regions that provide significant structural information and can be identified in other images showing the same objects of interest. Commonly used point features are Harris corners [Harris and Stephens, 1988] and Förstner points [Förstner and Gülch, 1987]. Point features are well suited for sparse correspondence search, but extracting lines may be beneficial for images showing man-made structures. Instead of extracting isolated corner points, a set of edge elements (edgels for short) is determined [Canny, 1986] and subsequently grouped to obtain geometric line segments.

If the provided images are taken from rather different positions, more advanced features and local image descriptors are required. In particular, the projected size and shape of objects varies substantially in wide-baseline setups, which is addressed by scale- and affine-invariant feature detectors and descriptors, including the scale invariant feature transform [Lowe, 1999], intensity profiles [Tell and Carlsson, 2000], maximally stable extremal regions [Matas et al., 2002] and scale- and affine-invariant Harris points [Mikolajczyk and Schmid, 2004].

In our current work-flow we utilize Harris corners as primary point features, which are extended with either local image patches or intensity profiles as feature descriptors.
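For reference, a minimal CPU sketch of the Harris corner response (Python with NumPy/SciPy; the constant k and the smoothing scale are common textbook defaults, not the settings of our work-flow):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def harris_response(img, k=0.04, sigma=1.0):
        """Harris corner response R = det(M) - k * trace(M)^2, where M is the
        Gaussian-smoothed structure tensor of the image gradients."""
        gy, gx = np.gradient(img.astype(np.float64))
        Ixx = gaussian_filter(gx * gx, sigma)
        Iyy = gaussian_filter(gy * gy, sigma)
        Ixy = gaussian_filter(gx * gy, sigma)
        det = Ixx * Iyy - Ixy ** 2
        trace = Ixx + Iyy
        return det - k * trace ** 2

Corner candidates are then taken as local maxima of the response above a threshold; as noted above, each corner is additionally described by a local image patch or an intensity profile.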

Correspondence and Pose Estimation

In order to relate a set of images geometrically it is necessary to find correspondences, i.e. the images of identical scene points. For the task of calculating the relative orientation between images it is suitable to extract features with good point localization, as provided by the feature extraction step. In a calibrated setting the relative orientation between two views can be calculated from five point correspondences; hence a RANSAC-based approach is used for robust initial estimation of the relative pose between two adjacent views. In order to test many samples, an efficient procedure for relative pose estimation is utilized [Nistér, 2004a]. With the knowledge of the relative poses between all consecutive views and of corresponding point features visible in at least three images, the orientations of all views in the sequence can be upgraded to a common coordinate system. The camera poses and the sparse reconstruction, consisting of 3D points triangulated from point correspondences, are refined using a simple but efficient implementation of sparse bundle adjustment [Lourakis and Argyros, 2004]. This step concludes the pipeline to establish the 3D relationship for a sequence of images. The essential data generated by this pipeline are distortion-free images and the camera matrices relating positions in 3D space with 2D image locations.

In the case of video sequences it is sufficient to track simple point features over time and to apply a RANSAC scheme to obtain the relative poses of the images, which can optionally be accomplished in real time [Nistér et al., 2004]. In our setting, targeted at off-line reconstructions using high-resolution images, real-time behavior for determining the geometric relationship between the views is not necessary. Nevertheless, high processing performance of these early reconstruction stages is relevant due to the number of images taken. Even reconstructing a small, isolated object like a statue easily results in 50 images of that object, which must be integrated into a common coordinate system.
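The robust pose estimation described above follows the generic RANSAC pattern sketched below (Python/NumPy). The functions solve_essential_5pt and count_inliers are placeholders standing in for the minimal five-point solver [Nistér, 2004a] and an inlier test such as a thresholded Sampson error; neither is reproduced here.

    import numpy as np

    def ransac_relative_pose(pts1, pts2, solve_essential_5pt, count_inliers,
                             n_iter=500, seed=0):
        """Generic RANSAC loop for calibrated two-view relative pose.
        `solve_essential_5pt` returns candidate essential matrices for a
        5-point sample; `count_inliers` scores a hypothesis on all points.
        Both are supplied by the caller (placeholders in this sketch)."""
        rng = np.random.default_rng(seed)
        best_E, best_score = None, -1
        for _ in range(n_iter):
            sample = rng.choice(len(pts1), size=5, replace=False)
            for E in solve_essential_5pt(pts1[sample], pts2[sample]):
                score = count_inliers(E, pts1, pts2)
                if score > best_score:
                    best_E, best_score = E, score
        return best_E, best_score

The winning hypothesis is subsequently refined together with the triangulated 3D points by sparse bundle adjustment, as described above.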

Foreground Segmentation

If the 3D reconstruction of individual or free-standing objects is desired, an image segmentation procedure separating foreground objects from the unwanted background is suitable. If the result of this segmentation step accurately represents the silhouette of the object of interest, any shape-from-silhouette technique [Laurentini, 1995, Lok, 2001, Matusik et al., 2001, Li et al., 2003] can be applied to obtain a first coarse 3D model called the visual hull. When confronted with many small-baseline images, manual segmentation of foreground pixels against a complex background is a tedious task. Hence, an automated or semi-automatic approach to generate the object silhouettes is reasonable in these cases. Specifying an initial object silhouette and propagating it through the image sequence is described in [Sormann et al., 2005, Sormann et al., 2006]. Silhouette information is partially used in the subsequent dense matching procedures to suppress unintended fragments in the final model.
matching procedures to suppress unintended fragments in the final model.


Dense Depth Estimation

With the knowledge of the camera parameters and the relative poses between the source views, dense correspondences for all pixels of a particular key view can be estimated. Since the epipolar geometry is already known, this procedure is basically a one-dimensional search along the epipolar line for every pixel. Triangulation of these correspondences results in a dense 3D model, which reflects the true surface geometry of the captured object in ideal settings.

In order to simplify the depth estimation task and to make it more robust, almost all dense depth estimation methods assume that opaque surfaces with diffuse reflection properties are to be reconstructed. In some approaches the lighting conditions and the exposure settings of the camera may change between the captured views to some degree. The depth map for a particular key view is usually estimated from a set of nearby views having a large overlap in their image content.

The major part of this thesis addresses the generation of dense depth maps, in particular Chapters 3, 4, 6 and 7. The main differences between dense depth estimation approaches in general are the utilized image dissimilarity function, which ranks potential correspondences on the epipolar line, and the handling of textureless regions, where the dissimilarity score is ambiguous and unreliable. Both factors influence the range of potential applications for a method and its performance in terms of time and 3D model quality. The main contribution of the chapters discussing dense depth estimation is the efficient generation of depth maps by utilizing the computational power and programming model of modern graphics hardware. The presented methods and implementations include several dissimilarity scores and different approaches to cope with regions containing indiscriminative surface texture.

Multiview Depth Integration The set of depth images obtained from dense depth estimation needs to be combined in order to obtain a consistent final geometric model of the captured scene or object. If we assume a redundancy of depth information, potential outliers generated by the previous depth estimation procedure can be detected and removed at this point. A successful method for multiple depth map fusion is the volumetric range image integration approach [Curless and Levoy, 1996, Wheeler et al., 1998]. Chapter 8 describes our fast depth integration procedure. Alternatively, proper 3D models can be generated directly using voxel coloring methods (see Chapter 5).

Geometry Processing Depending on the actual depth image integration method, the obtained 3D mesh may contain holes and may still appear somewhat noisy. Furthermore, the generated mesh is almost always over-tessellated and is not directly appropriate for further processing or visualization. Consequently, a final geometry processing step may include mesh simplification techniques and other mesh refinement and cleaning procedures. In particular, we apply a mesh simplification tool [Garland and Heckbert, 1997] to reduce the geometric complexity of the model.



Photorealistic Texturing The simplified and enhanced geometry of the imaged object still lacks an appropriate texture for photorealistic display within virtual scenes. Texture map generation for arbitrary 3D shapes requires cutting the original polygonal representation into several disk-like patches. Each of these patches has its own texture coordinate mapping associated with it. In order to obtain few distortions and better visual quality, these patches should preferably be flat. Our implementation [Zebedin, 2005] combines the texture atlas generation procedure described in [Lévy et al., 2002] with robust multi-view texturing techniques in the presence of occlusions [Mayer et al., 2001, Bornik et al., 2001]. If a surface element is visible in several images (which is usually the case), unmodeled occlusions can be detected and removed using a robust color averaging method. Additionally, the orientation of a surface patch with respect to the source images and its projected footprint provide reliability information, which can be used to weight the color contributions from the source images.
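The weighting idea can be sketched for a single surface element as follows. This is only an illustration of orientation- and footprint-based weighting combined with a simple robust rejection step; it is not the procedure of [Zebedin, 2005], and the function name, inputs and outlier tolerance are assumptions.

```python
import numpy as np

def fuse_texel_color(colors, view_dirs, normal, footprints, outlier_tol=30.0):
    """Robustly average per-view color samples for one surface element.

    colors:     Nx3 RGB samples from the views seeing the element.
    view_dirs:  Nx3 unit vectors from the element towards each camera center.
    normal:     unit surface normal (3-vector).
    footprints: length-N projected pixel footprint per view (larger = better sampled).
    """
    colors = np.asarray(colors, dtype=float)
    # Reliability weight: favor head-on views with a large projected footprint.
    w = np.clip(view_dirs @ normal, 0.0, None) * np.asarray(footprints, dtype=float)
    # Robust step: discard samples far from the per-channel median,
    # a simple stand-in for rejecting unmodeled occlusions.
    keep = np.linalg.norm(colors - np.median(colors, axis=0), axis=1) < outlier_tol
    if not np.any(keep):
        keep = np.ones(len(colors), dtype=bool)
    w = w[keep]
    return (w[:, None] * colors[keep]).sum(axis=0) / max(w.sum(), 1e-9)
```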

An Illustrative Example We illustrate various stages of this pipeline with a statue example in Figure 1.3. In addition to two (out of 47) input images, we show two dense depth estimation results based on a GPU-accelerated plane-sweep ((c) and (d)). These small-baseline reconstructions are still noisy and contain outliers. Volumetric depth image integration uses all available depth images to remove the artifacts and creates a suitable geometry representing the statue (images (e) and (f)). Finally, the decimated and textured mesh is illustrated ((g) and (h)).

With this coarse presentation of the modeling pipeline we have also provided a more in-depth description of the various stages in the work-flow which are not directly related to this thesis.

1.4 Overview of this Thesis and Contributions

Chapter 2 presents work and publications related to this thesis. It is divided into two major sections: Section 2.1 presents important approaches and work focusing on dense depth estimation and computational stereo in general. From the vast number of publications in this field only a few seminal ones are briefly presented. Some of these form the basis for our procedures and are described in more detail in the appropriate chapters. Section 2.2 gives a general overview of GPU-accelerated approaches and algorithms that have appeared in recent years. Furthermore, several research lines for real-time and GPU-based methods for computational stereo and multi-view reconstruction are presented.

Our first computational stereo method accelerated by graphics hardware is described in Chapter 3. This dense stereo reconstruction procedure is essentially an iterative local mesh refinement method to generate a surface consistent with the given views. The main motivation for this approach is the fast projective texturing capability provided by graphics hardware since its beginnings. With the emergence of programmable GPUs, it became possible to calculate simple image dissimilarity functions on the GPU as well. CPU intervention is necessary to update the current mesh hypothesis according to the determined best local modifications and to occasionally smooth the mesh. Since this approach works on meshes, it is the only method presented in this thesis making extensive use of vertex programs. The obtained software performs reconstructions at interactive or near real-time rates.

This chapter contains material from two publications ([Zach et al., 2003a] and [Zach et al., 2003b]).

Note that all other procedures presented in the following chapters are performed purely on the graphics hardware, with the CPU only executing the flow control for the GPU routines. Given the source images, the camera parameters and the poses, the full reconstruction pipeline up to the final 3D model visualization runs entirely on the graphics hardware, and no expensive data transfer from GPU memory to main memory is necessary. Consequently, these methods are perfectly suited for fast visual feedback to the human operator.

Plane-sweep methods for depth estimation are still the most suitable approaches for efficient implementation on the GPU. So far, most algorithms presented in the literature require images with exactly the same lighting conditions, since very simple correlation measures like the sum of absolute differences (SAD) or the sum of squared differences (SSD) are utilized. In Chapter 4 we propose an approximated zero-mean normalized sum of absolute differences correlation function, which produces results similar to the widely used NCC function and can be calculated more efficiently on current generation graphics hardware. Using GPU-based summed area tables (also known as integral images), the computation time for this image correlation measure is independent of the template window size. Furthermore, a sparse belief propagation method is proposed to obtain depth maps incorporating smoothness constraints. Material from this chapter can be found in [Zach et al., 2006a].
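The window-size independence stems directly from the summed area table: after a single prefix-sum pass, the cost aggregated over any rectangular window can be obtained with four lookups. The following CPU-side sketch illustrates just this building block (the thesis' GPU version additionally uses the approximated zero-mean normalized SAD; the cost image below is an arbitrary placeholder).

```python
import numpy as np

def summed_area_table(cost):
    """Inclusive 2D prefix sums of a per-pixel cost image."""
    return cost.cumsum(axis=0).cumsum(axis=1)

def box_sum(sat, y0, x0, y1, x1):
    """Sum of the cost over the window [y0, y1] x [x0, x1] (inclusive)
    using four table lookups, independent of the window size."""
    total = sat[y1, x1]
    if y0 > 0:
        total -= sat[y0 - 1, x1]
    if x0 > 0:
        total -= sat[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += sat[y0 - 1, x0 - 1]
    return total

# Example: aggregate an absolute-difference cost over an 11x11 window.
ref = np.random.rand(240, 320).astype(np.float32)      # placeholder images
warped = np.random.rand(240, 320).astype(np.float32)
sat = summed_area_table(np.abs(ref - warped))
aggregated = box_sum(sat, 100, 100, 110, 110)
```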

Chapter 5 describes how a voxel-coloring technique can be executed entirely on graphics hardware by combining plane-sweep approaches with correct visibility handling. Thus, 3D volumetric models from many images can be obtained at interactive rates. Additionally, several voxel-coloring passes can be applied in orthogonal directions to obtain true 3D models from a complete sequence around the object of interest. However, this particular space carving technique on the GPU requires a 3D volume texture to be stored in video memory, thereby limiting the resolution of the voxel space.

A very fast variational approach to depth estimation is presented in Chapter 6. At first sight it seems unlikely that graphics hardware can accelerate the numerical calculations required to solve the partial differential equations derived from variational formulations of depth estimation. However, it turns out that the current programming features of GPUs substantially decrease the run-time of iterative PDE solvers on regular grids. Variational depth estimation methods can provide very high quality models, but they are very sensitive to parameter settings and to the initial depth hypothesis in general, hence immediate feedback is very useful to a human operator.
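As an illustration of why such iterative solvers map well to the GPU, the sketch below performs plain Jacobi sweeps for a discretized equation of the form u - λΔu = f on a regular grid: every pixel is updated independently from its four neighbors, which is exactly the access pattern of a per-pixel fragment program. This is a generic example, not the variational depth method of Chapter 6, and the parameter values are arbitrary.

```python
import numpy as np

def jacobi_sweeps(f, lam=10.0, iterations=200):
    """Solve u - lam * laplace(u) = f on a regular grid with Jacobi iterations."""
    u = f.copy()
    for _ in range(iterations):
        # Neighbor values with replicated boundaries (Neumann-like handling).
        up    = np.vstack([u[:1],  u[:-1]])
        down  = np.vstack([u[1:],  u[-1:]])
        left  = np.hstack([u[:, :1], u[:, :-1]])
        right = np.hstack([u[:, 1:], u[:, -1:]])
        # Per-pixel update derived from the 5-point Laplacian stencil.
        u = (f + lam * (up + down + left + right)) / (1.0 + 4.0 * lam)
    return u

smoothed = jacobi_sweeps(np.random.rand(120, 160).astype(np.float32))
```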


The most versatile method for dense depth estimation which can be performed entirely by the GPU is scanline optimization, as described in Chapter 7. Conceptually, the technique described in this chapter extends the plane-sweep method from Chapter 4 with a semi-global depth extraction technique. The key innovation in this chapter is the formulation of a specific dynamic programming approach to depth estimation in a manner suitable for the programming model of GPUs. Although the time complexity after the transformation is O(N log N) instead of O(N), the observed timing results are promising. The core method from this chapter is presented in [Zach et al., 2006b].

The final algorithmic contribution of this thesis, discussed in Chapter 8, is a volumetric approach to generate proper 3D models from multiple depth maps at interactive rates. The final 3D model is represented implicitly as an isosurface in a scalar volume dataset, and the corresponding mesh geometry can be extracted using marching cubes or marching tetrahedra methods. Alternatively, the isosurface can be directly visualized from the volume data using recent methods of volume visualization. A condensed version of this chapter appeared in [Zach et al., 2006a].
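A minimal CPU sketch of volumetric range image integration in the spirit of [Curless and Levoy, 1996] is given below. It is not the Chapter 8 implementation; the grid layout, truncation distance and weighting are assumptions. Each depth map contributes a truncated signed distance along its viewing rays, and the per-voxel running weighted average defines the final surface as the zero level set of the volume.

```python
import numpy as np

def integrate_depth_map(tsdf, weights, grid_points, depth, K, R, t, trunc=0.05):
    """Fuse one depth map into a truncated signed distance volume.

    tsdf, weights: flat per-voxel accumulators (running weighted average).
    grid_points:   Nx3 voxel centers in world coordinates.
    depth:         HxW depth map of the current view (0 marks invalid pixels).
    K, R, t:       intrinsics and pose such that X_cam = R @ X_world + t.
    """
    cam = grid_points @ R.T + t            # voxel centers in camera coordinates
    z = cam[:, 2]
    proj = cam @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros(len(z))
    d[valid] = depth[v[valid], u[valid]]
    valid &= d > 0
    sdf = d - z                            # positive in front of the observed surface
    valid &= sdf > -trunc                  # ignore voxels far behind the surface
    sdf = np.clip(sdf / trunc, -1.0, 1.0)
    # Running weighted average; outliers of single depth maps average out.
    tsdf[valid] = (tsdf[valid] * weights[valid] + sdf[valid]) / (weights[valid] + 1)
    weights[valid] += 1
```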

Chapter 9 presents several multi-view datasets and the associated depth maps and models generated with the proposed methods. In the few cases where ground truth is available, a quantitative accuracy evaluation is provided as well.

Figure 1.3: Several steps in the reconstruction pipeline illustrated with a statue example. (a) and (b) are two source images out of 47 images in total. The result of GPU-based dense depth estimation for two views is shown in (c) and (d). Two views of the resulting mesh after volumetric depth image integration are given in (e) and (f). The final simplified and textured 3D geometry of the statue is displayed in (g) and (h).


Chapter 2

Related Work

Contents

2.1 Dense Depth and Model Estimation

2.2 GPU-based 3D Model Computation

2.1 Dense Depth and Model Estimation

There is a huge bibliography on the generation of depth images and dense geometry from multiple views, hence we focus on seminal work in this field. We divide the approaches to computational stereo into three subtopics for a better structure: at first, important publications dealing with the classical stereo setup consisting of two images with vertically aligned epipolar geometry are discussed. Subsequently, major approaches to depth estimation from multiple, not necessarily rectified images are presented. Finally, true multi-view methods generating a 3D model (and not just depth images) directly are briefly sketched. Note that computational stereo and depth estimation can be seen as a subtopic of the more general optical flow computation between images. The main difference between the former and optical flow is the reduced (one-dimensional) search space for stereo methods, since knowledge of the epipolar geometry is assumed. In order to obtain metric models the internal camera parameters are required to be known, too.

2.1.1 Computational Stereo on Rectified Images

The minimal requirement to obtain a depth map, or equivalently a 2.5D height field, solely from images is a pair of input images with a typically convergent view on the scene to be reconstructed. Many methods generating depth maps from such input data work on rectified images with aligned epipolar geometry, mostly for efficiency reasons, since vertically aligned epipolar lines allow efficient image dissimilarity calculations and the reuse of already computed values. Recent surveys of computational stereo methods are given in [Scharstein and Szeliski, 2002], [Faugeras et al., 2002] and [Brown et al., 2003]. Additionally, in [Scharstein and Szeliski, 2002] an evaluation framework is proposed, which is still widely used to compare stereo methods in terms of their ability to recover the true geometry.

Many depth estimation methods typically perform the following four subsequent steps to constitute a depth map (after [Scharstein and Szeliski, 2002]); a minimal code sketch of the first three steps follows the list:

1. matching cost (i.e. image dissimilarity score) computation;

2. an aggregation procedure to accumulate the matching costs within some region;

3. depth map extraction;

4. and an optional refinement of the depth map.
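The following sketch runs steps 1 to 3 with a block-wise SAD cost and a purely local winner-takes-all extraction. It is an illustration only, under the assumption of rectified input with horizontal disparities; the window size and the dissimilarity measure are arbitrary choices, not a particular published method.

```python
import numpy as np

def wta_disparity(left, right, max_disp=32, radius=3):
    """Steps 1-3 of the taxonomy: per-pixel absolute-difference cost,
    box aggregation, and winner-takes-all disparity extraction.

    left, right: rectified grayscale images (float arrays of equal shape);
    disparities are assumed to be horizontal shifts for this illustration.
    """
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf, dtype=np.float32)
    kernel = np.ones(2 * radius + 1, dtype=np.float32)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])        # step 1: matching cost
        # step 2: aggregate within a (2*radius+1)^2 window via separable sums
        agg = np.apply_along_axis(np.convolve, 0, diff, kernel, 'same')
        agg = np.apply_along_axis(np.convolve, 1, agg, kernel, 'same')
        cost[d, :, d:] = agg
    return cost.argmin(axis=0)                               # step 3: winner takes all
```

The global methods discussed below replace the final argmin by an optimization over the whole cost volume.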

Often, the first two steps cannot be separated, e.g. if the utilized matching score is already based on some measure involving pixel neighborhoods. The major difference between the various computational stereo approaches lies in the method of depth map extraction given the matching cost data structure. Purely local methods apply a very greedy winner-takes-all approach, which assigns the depth value with the lowest matching cost to each pixel. Global methods for depth map extraction apply an optimization procedure which takes the matching scores and the spatial smoothness of the depth map into account. Smoothness is typically modeled by a regularization function, which takes the depth values assigned to adjacent pixels as input and yields a (positive) penalty value for unequal depths. If smoothness of the depth map is enforced only on vertical scanlines (which coincide with the epipolar lines), very efficient and elegant algorithms based on the dynamic programming principle can be devised. Earlier work includes [Baker and Binford, 1981, Ohta and Kanade, 1985, Geiger et al., 1995, Birchfield and Tomasi, 1998]. Although dynamic programming approaches to stereo have been known for a long time, there is still ongoing research on this topic [Veksler, 2003, Criminisi et al., 2005, Hirschmüller, 2005, Hirschmüller, 2006, Lei et al., 2006]. A more detailed discussion of one employed dynamic programming approach to stereo and its GPU-based implementation is provided in Chapter 7.
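To make the dynamic programming recurrence concrete, the sketch below optimizes the labels of a single scanline with a constant penalty P for every label change. This is a textbook-style illustration rather than the formulation used in Chapter 7; the cost layout and the penalty model are assumptions.

```python
import numpy as np

def scanline_dp(cost, P=0.2):
    """Optimal depth labels along one scanline.

    cost: L x D array; cost[i, d] is the matching cost of label d at position i.
    P:    constant penalty for a label change between neighboring positions.
    Returns the label sequence minimizing sum(cost) + P * (#label changes).
    """
    L, D = cost.shape
    acc = np.array(cost, dtype=float)       # accumulated costs
    back = np.zeros((L, D), dtype=int)      # backpointers to the predecessor label
    for i in range(1, L):
        prev = acc[i - 1]
        best_prev = prev.min()
        # Either keep the previous label (no penalty) or switch (penalty P).
        acc[i] += np.minimum(prev, best_prev + P)
        back[i] = np.where(prev <= best_prev + P, np.arange(D), prev.argmin())
    labels = np.empty(L, dtype=int)
    labels[-1] = acc[-1].argmin()
    for i in range(L - 1, 0, -1):           # backtrack the optimal path
        labels[i - 1] = back[i, labels[i]]
    return labels
```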

More recently, many proposed global methods for stereo focus on enforcing smoothness in both directions, not just within the same scanline. Since finding the true global optimum is not feasible, various approximation schemes have been presented in the literature. Largely, two lines of global optimization procedures have been applied successfully to stereo problems: maximum network flow methods (usually called graph-cut approaches in the computer vision literature [Boykov et al., 2001, Kolmogorov and Zabih, 2001, Kolmogorov and Zabih, 2002]), and Markov random field methods based on iterative belief updating (belief propagation [Sun et al., 2003, Felzenszwalb and Huttenlocher, 2004, Sun et al., 2005]). Although the depth maps obtained from these advanced procedures are generally better than those generated by dynamic programming methods, their time and space complexities are substantially higher than those of 1-dimensional optimization procedures.

Graph-cut methods are iterative procedures which update the current labeling (i.e. depth values in the stereo case) of pixels to obtain a lower total energy value. The initial depth labeling can be computed e.g. by purely local stereo methods. In every iteration a greedy, but large∗ relabeling of pixels is determined, which yields the lowest total energy. A suitable graph network is built in every iteration, and the maximum flow solution corresponds to an optimal greedy relabeling. These iterations are repeated until a (strong) local minimum is reached.

∗ Meaning that the subset of pixels with a newly assigned label is as large as possible.

While dynamic programming, belief propagation and graph cut approaches to computational stereo treat the underlying energy minimization problem as a combinatorial problem with a discrete set of pixels and disparity labels, it is nevertheless possible to employ variational methods, developed to solve problems on a continuous domain, for stereo vision. Since many of the proposed variational approaches for multi-view reconstruction are typically formulated for a general multiple view setup, these methods are discussed below in Section 2.1.2.

The depth maps returned by any of the above-mentioned methods may still contain wrong depth values for certain pixels, e.g. due to occlusions, specular reflections etc. These mismatches can potentially be detected by a very simple left-right consistency check [Fua, 1993] (also called bidirectional matching or back-matching). This technique reverses the roles of the input images and generates two depth maps (one with respect to the first image and one with respect to the second image). Only depth values for pixels which agree in both depth maps (according to some metric) are retained.
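For rectified disparity maps this consistency test can be sketched as follows (an illustration only; the agreement threshold is a hypothetical parameter):

```python
import numpy as np

def left_right_check(disp_left, disp_right, max_diff=1.0):
    """Invalidate pixels whose left and right disparities disagree.

    disp_left[y, x] maps pixel x of the left image to x - d in the right image;
    disp_right is computed with the roles of the images reversed.
    Returns a copy of disp_left with inconsistent pixels set to -1.
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    x_right = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    back = disp_right[ys, x_right]            # disparity seen from the other view
    checked = disp_left.copy()
    checked[np.abs(disp_left - back) > max_diff] = -1
    return checked
```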

2.1.2 Multi-View Depth Estimation

In this section we summarize work on dense depth estimation from multiple, but usually still small-baseline views. In general, more than two views cannot be rectified in order to simplify and accelerate the depth estimation procedure. Since small baselines between the images are assumed, explicit or implicit occlusion detection and handling strategies are possible. Implicit occlusion handling approaches typically use truncated matching scores or multiple scores between pairs of images to reduce the influence of occluded pixels on the estimation procedure (e.g. [Woetzel and Koch, 2004] and Chapter 4).

Several approaches developed for a multi-view setup utilize variational methods to search for a 3D surface or depth map that is color-consistent with the provided input images. A hypothetical surface or depth map (together with the known epipolar geometry between the views) induces a (nonlinear) 2D transfer between the images. If the correct depth map is found, all warped source images are very similar according to a provided image similarity metric. Additionally, surface smoothness is assumed if the image data is ambiguous (i.e. lacking sufficient texture). Variational approaches to multi-view stereo formulate the reconstruction problem as a continuous energy optimization task and apply methods from the variational calculus (most notably the Euler-Lagrange equation) to determine a suitable gradient descent direction in function space. The current mesh (or depth map hypothesis) is updated according to this direction until convergence. All variational methods for stereo employ a coarse-to-fine strategy to avoid reaching a weak local minimum in early stages of the procedure.

If a surface is evolved within a variational framework to obtain a final mesh consistent with the images, an implicit level-set representation of the current mesh hypothesis allows simple handling of topological changes of the mesh [Faugeras and Keriven, 1998, Yezzi and Soatto, 2003, Pons et al., 2005]. Generating depth images instead of meshes from multiple views within a continuous framework yields a set of partial differential equations, which are numerically solved to obtain the final depth map [Strecha and Van Gool, 2002, Strecha et al., 2003, Slesareva et al., 2005]. Chapter 6 describes depth estimation using variational principles more precisely and presents an efficient GPU-based implementation of one particular approach.

Combinatorial and graph optimization methods can be applied in the multi-view stereo case as well: Kolmogorov et al. [Kolmogorov and Zabih, 2002, Kolmogorov et al., 2003] employ graph-cut optimization to obtain a depth map from multiple views. In addition to image similarity and smoothness terms, the energy function is augmented with an explicit visibility term derived from the current depth map.

2.1.3 Direct 3D Model Reconstruction

This section outlines several approaches for multi-view reconstruction targeted at using all available images from different viewpoints simultaneously. Early methods include space carving and its variants, which project 3D voxels into the available images according to the current visibility and calculate an image consistency score from the sampled pixels. If a voxel is declared as inconsistent, it is classified as empty and the current model and visibility information are updated. The variants of the basic space carving principle mostly differ in their employed consistency function and the voxel traversal order [Seitz and Dyer, 1997, Prock and Dyer, 1998, Seitz and Dyer, 1999, Culbertson et al., 1999, Kutulakos and Seitz, 2000, Slabaugh et al., 2001, Sainz et al., 2002, Stevens et al., 2002] (see also Chapter 5). All space carving methods compute the so-called photo hull (the set of image-consistent voxels), which typically contains the true geometry, but in practice the photo hull can be a substantial over-estimate of the true model. Textureless regions in particular yield poor photo hulls because of the absence of a smoothing force.

In order to address the shortcomings of pure space carving methods with their instant classification of voxels, volumetric graph cut extraction of surface voxels incorporating image consistency and smoothness constraints was recently proposed [Vogiatzis et al., 2005, Tran and Davis, 2006, Hornung and Kobbelt, 2006b, Hornung and Kobbelt, 2006a]. Since individual voxels essentially correspond to nodes in the network graph used to determine the maximum flow, these methods still rely on existing object silhouettes in order to consider only voxels close to the visual hull. Additionally, approximate visibility is inferred from the visual hull to determine occluded views for each voxel.

Instead of a direct, one-pass reconstruction approach from multiple views, one can utilize a two-pass method, which at first generates a set of depth images from small-baseline subsets of the provided source views, and subsequently creates a full 3D model by merging the depth maps. Goesele et al. [Goesele et al., 2006] employ a simple plane-sweep based depth estimation approach followed by a volumetric range image integration procedure [Curless and Levoy, 1996] to obtain the final 3D model. Only relatively confident depth values are retained in the depth maps, hence the final model may still contain holes, e.g. in textureless regions. Additionally, the range image integration is based on weighted depth values with the weights induced from the corresponding matching score. This approach is very similar to our purely GPU-based reconstruction pipeline comprising the methods presented in Chapter 4 and Chapter 8 (see also [Zach et al., 2006a]). In contrast to volumetric graph cut methods, which generate watertight surfaces, the result of the purely locally working volumetric range image method may contain holes, which can be geometrically filled e.g. using volumetric diffusion processes [Davis et al., 2002].

2.2 GPU-based 3D Model Computation

2.2.1 General Purpose Computations on the GPU

Because of the rapid development and performance increase of current 3D graphics hardware, the goal of using graphics processing units for non-graphical purposes became appealing. The SIMD design of graphics hardware allows much higher peak performance in certain applications than is achievable with a general purpose CPU. Whereas a traditional CPU like a 3 GHz Pentium 4 achieves a theoretical performance of 6 GFlops and a memory bandwidth of about 6 GByte/sec, a high-end graphics card such as an NVidia GeForce 6800 achieves 53 GFlops at 34 GByte/sec [Harris and Luebke, 2005]. Furthermore, the annual increase in performance of graphics processing units is significantly higher than that of CPUs. In contrast to the MIMD programming model of traditional processing units, the computational model of GPUs is a stream processing approach applying the same instructions to multiple data items. Consequently, existing CPU-based algorithms must be mapped onto this computational model, and not every algorithm can benefit from the processing power of the GPU.

Since the emergence of programmable graphics hardware in the year 2001, a huge number of research papers has addressed the acceleration of known algorithms and numerical methods using the GPU as a specialized but fast coprocessor. In this section we only refer to seminal work in this area.

At first we give a brief overview of the computational model of GPU-based computations (Figure 2.1). The incoming vertex stream with several attributes per vertex (vertex position, color, texture coordinates) is processed by a vertex program and transformed into normalized screen space. A set of three vertices constitutes a triangle, which is prepared for the rasterization step. The rasterizer generates fragments and interpolates vertex attributes. An optional fragment program takes the incoming fragments and may perform additional calculations, thereby modifying the outgoing fragment color and depth. The blending stage performs optional alpha blending and combines several fragment samples into one pixel if multi-sampling based antialiasing is enabled. Fragment programs, and recently vertex programs as well, can perform texture lookups to retrieve arbitrary image data.

Figure 2.1: The stream computation model of a GPU (adapted from [Harris and Luebke, 2005]). The vertex stream is transformed by the vertex program, assembled and clipped into screen-space triangles, rasterized into an unprocessed fragment stream, processed by the fragment program, and blended into framebuffer pixels forming the output image; both the vertex and the fragment program can access textures.

Most applications using the GPU as a general purpose SIMD processor employ the fragment shaders to perform computational tasks, since most of the processing power of modern graphics hardware is concentrated in the fragment units. Additionally, direct and dependent texture lookups provided by fragment shaders constitute a powerful instrument for data array access. Consequently, general purpose computing on the GPU focuses on the second row of the pipeline depicted in Figure 2.1 (notably fragment programs and blending). Textures act as data array sources, on which the same set of instructions is applied. The resulting fragments represent the calculated outcome of these computations. Hence, in most applications a screen-aligned quadrilateral with appropriate texture coordinates is drawn and the requested computation is performed entirely in the fragment processing units.

Vertex and fragment programs are specified in an assembly-like language in the first instance. Several higher level specification languages for vertex and fragment programs were developed to ease the development of GPU programs. A commonly used language for visual effects and general purpose programming on the GPU is Cg [NVidia Corporation, 2002a, Mark et al., 2003], which provides a C-like specification language for GPU programs and a compiler for translation to the native instruction set of graphics hardware. Brook is a language designed specifically for parallel numerical algorithms [Dally et al., 2003], and an implementation is now available for current programmable graphics hardware [Buck et al., 2004]. The two main concepts of Brook (and of parallel numerical approaches in general) are kernels and reductions. A kernel is a procedure applied to a large set of data items and represents a more powerful version of a SIMD instruction. Since the computation of a kernel only depends on the incoming data and a kernel has no additional side-effects, a kernel can be executed for many data values in parallel. Application of a kernel is similar to the higher-order map function found in most functional programming languages. A reduction operation combines the elements of a data array to generate a single result. In functional programming this operation corresponds to the (again higher-order) fold function. On graphics hardware kernels correspond mainly to fragment programs and can be applied in a straightforward manner. Reductions require a rather expensive multipass procedure based on recursive doubling with a logarithmic number of passes.
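The recursive doubling scheme can be sketched on the CPU as follows; this is not Brook or shader code, but on the GPU each halving pass would correspond to rendering a half-sized quad whose fragment program combines two texels of the previous pass.

```python
import numpy as np

def reduce_by_recursive_doubling(values, combine=np.add):
    """Reduce a 1D array with a logarithmic number of halving passes.

    Each pass combines pairs of elements independently of all other pairs,
    mirroring one GPU render pass at half the previous resolution.
    """
    buf = np.asarray(values, dtype=np.float64)
    passes = 0
    while buf.size > 1:
        if buf.size % 2:                       # pad odd-sized buffers
            buf = np.append(buf, 0.0 if combine is np.add else buf[-1])
        buf = combine(buf[0::2], buf[1::2])    # one half-resolution pass
        passes += 1
    return buf[0], passes

total, n_passes = reduce_by_recursive_doubling(np.arange(1000))   # 499500.0 in 10 passes
```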

Because of the close relationship between the computational model of modern GPUs and general stream processing concepts, similar benefits and limitations for algorithm implementations can be found in both models. Nevertheless, there are significant differences between general stream processors and graphics hardware: in contrast to general parallel programming and stream computation models, a GPU only provides very limited support for scatter operations (i.e. indexed array updates) and other general purpose operations (e.g. bit-wise integer manipulation). On the other hand, linearly filtered data access is performed very efficiently by the GPU, since this is an intrinsic feature of the texture units. In spite of these (and many other) differences between stream processing models and modern GPUs, essentially the same set of algorithms can be accelerated by both architectures.

Even before programmable graphics hardware was available, the fixed function pipeline of 3D graphics processors was utilized to accelerate several numerical [Hopf and Ertl, 1999a, Hopf and Ertl, 1999b] and geometric calculations [Hoff III et al., 1999, Krishnan et al., 2002] and even to emulate programmable shading not available at that time [Peercy et al., 2000]. The introduction of a quite general programming model for vertex and pixel processing [Lindholm et al., 2001, Proudfoot et al., 2001] opened a very active research area. The primary application for programmable vertex and fragment processing is the enhancement of photorealism and visual quality in interactive visualization systems (e.g. [Engel et al., 2001, Hadwiger et al., 2001]) and entertainment applications ([Mitchell, 2002, NVidia Corporation, 2002b]). Additionally, several non-photorealistic rendering techniques can be effectively implemented in modern graphics hardware [Lu et al., 2002, Mitchell et al., 2002, Weiskopf et al., 2002, Dominé et al., 2002].

Thompson et al. [Thompson et al., 2002] implemented several non-graphical algorithms to run on programmable graphics hardware and profiled the execution times against CPU-based implementations. They concluded that an efficient memory interface (especially when transferring data from graphics memory into main memory) is still an unsolved issue. For the same reason our implementations are designed to minimize the memory traffic between graphics hardware and main memory.

Naturally, the texture handling capability and especially the free bilinear and accelerated anisotropic texture fetch operations make graphics hardware suitable for image processing tasks, e.g. filtering with linear kernels. Sugita et al. [Sugita et al., 2003] and Colantoni et al. [Colantoni et al., 2003] compared the performance of CPU-based and GPU-based implementations of several image filters and image transforms, and observed substantial performance gains using the GPU over optimized CPU implementations.

Numerical methods and simulations became feasible on the GPU with the emergence of floating point texture capabilities, which enable the specification and handling of floating point values on the GPU (instead of the 8 bit fixed point precision provided before). Numerical solvers for sparse matrix equations were proposed by Bolz et al. [Bolz et al., 2003] and by Krüger and Westermann [Krüger and Westermann, 2003]. Note that the system matrices appearing in variational methods for optical flow and depth estimation are huge, but sparse matrices with usually 4 or 8 off-diagonal bands. Consequently, variational methods exploiting the computational power of modern GPUs are now feasible and substantially outperform CPU-based implementations. Of course, the limited floating point precision of current GPUs (essentially an IEEE 32 bit float format) is an obstacle to high precision numerical computations. Actual numerical or physical simulations are described in [Harris et al., 2002, Kim and Lin, 2003, Lefohn et al., 2003, Goodnight et al., 2003, Moreland and Angel, 2003].

2.2.2 Real-time and GPU-Accelerated Dense Reconstruction from Multiple Images

In this section we focus on multi-view reconstruction methods that are either aimed at real-time execution or use programmable 3D graphics hardware to accelerate the depth estimation procedure.

Vision-based dense depth estimation methods performing at interactive rates or even in real-time were initially implemented using special hardware and digital signal processors [Faugeras et al., 1996, Kanade et al., 1996, Konolige, 1997, Woodfill and Herzen, 1997, Jia et al., 2003, Darabiha et al., 2003]. With the appearance of SIMD instruction sets like MMX and SSE, primarily intended for multimedia applications on general purpose CPUs, several implementations targeted the efficient use of these extensions for computational stereo applications [Mühlmann et al., 2002, Mulligan et al., 2002, Forstmann et al., 2004]. The basic ideas of high performance CPU depth estimation methods include a cache friendly design of the algorithm to minimize CPU pipeline stalls, and exploiting the SIMD functionality e.g. by rating four disparity values simultaneously. All these approaches usually work with very simple image similarity measures like the SSD or SAD.

The Triclops vision system [Point Grey Research Inc., 2005] is a commercially available real-time stereo implementation. Typically the setup consists of two or three cameras and appropriate software for real-time stereo matching. Depending on the image resolution and the disparity range, the system is able to generate depth images at a rate of about 30 Hz for images of 320x240 pixels on current PC hardware. The software exploits the particular L-shaped arrangement of the cameras and the MMX/SSE instructions available on current CPUs.

Probably the first multi-view depth estimation approach executed on programmable graphics hardware was presented by Yang et al. [Yang et al., 2002], who developed a fast stereo reconstruction method performed in 3D hardware by utilizing a plane-sweep approach to find correct depth values. The proposed method uses the projective texturing capabilities of 3D graphics hardware to project the given images onto the reference plane. Furthermore, single pixel error accumulation for all given views is performed on the GPU as well. The number of iterations is linear in the requested resolution of depth values, therefore this method is limited to rather coarse depth estimation in order to fulfill the real-time requirements of their video conferencing application. Further, their approach requires a true multi-camera setup to be robust, since the error function is only aggregated in single pixel windows. Since the application behind this method is a multi-camera teleconferencing system, accuracy is less important than real-time behavior. In later work the method was made more robust using trilinear texture access to accumulate error differences within a window [Yang and Pollefeys, 2003]. These ideas were later reused and improved to obtain a GPU-based dense matching procedure for a rectified stereo setup [Yang et al., 2004].

The basic GPU-based plane-sweep technique for depth estimation can be enhanced with implicit occlusion handling and smoothness constraints to obtain depth maps of higher quality. Woetzel and Koch [Woetzel and Koch, 2004] addressed occlusions occurring in the source images by a best n out of m and by a best half-sequence multi-view selection policy to limit the impact of occlusions on the resulting depth map. In order to obtain sharper depth discontinuities a shiftable correlation window approach was utilized. The employed image similarity measure is a truncated sum of squared differences, which is sensitive to changing lighting conditions.

Cornelis and Van Gool [Cornelis and Van Gool, 2005] proposed several refinement steps performed after a plane-sweep procedure used to obtain an initial depth map with a single pixel truncated SSD correlation measure. Outliers in the initially obtained depth map are removed by a modified median filtering procedure, which may destroy fine 3D structures. These fine details are recovered by a subsequent depth refinement pass. Since this approach is based on single pixel similarity instead of a window-based one, slanted surfaces and depth discontinuities are reconstructed more accurately compared with window-based approaches.


Typically, the correlation windows used in real-time dense matching have a fixed size, which causes inaccuracies close to depth discontinuities. Since large depth changes are often accompanied by color or intensity changes in the corresponding image, adapting the correlation window to extracted edges is a reasonable approach. Gong and Yang [Gong and Yang, 2005a] investigated a GPU-based computational stereo procedure with an additional color segmentation step to increase the quality of the depth map near object borders.

A GPU-based plane-sweeping technique suitable for sparse 3D reconstructions was presented by Rodrigues and Fernandes [Rodrigues and Ramires Fernandes, 2004]. They used projective texturing hardware to map rays going through interest points into the other views according to the epipolar geometry. In contrast to the dense depth plane-sweeping methods, a true multi-view configuration of the cameras can be used. The result of the procedure is a sparse 3D point cloud corresponding to 2D interest points seen in several input images.

For several applications, e.g. video teleconferencing and mixed reality applications, it is sufficient to reconstruct the visual hull, which is the intersection of the generalized cones generated by the silhouette of the object and the optical center of each camera. Even with the non-programmable traditional graphics pipeline, real-time generation and rendering of visual hulls can be accelerated by 3D graphics hardware. Lok [Lok, 2001], Matusik et al. [Matusik et al., 2001] and Li et al. [Li et al., 2003] present on-line visual hull reconstruction systems mostly aimed at video conferencing and mixed reality applications. In order to improve the visual quality of the reconstructed models, the visual hull can be upgraded with depth information generated by computational stereo algorithms [Slabaugh et al., 2002, Li et al., 2002].

Li et al. [Li et al., 2004] present a method for GPU-based photo hull generation used for viewpoint interpolation, which is in some aspects similar to the material presented in Chapter 5. Essentially, their work combines the plane-sweep approach proposed by Yang [Yang et al., 2002] with the visibility handling used in the space carving framework [Seitz and Dyer, 1997, Kutulakos and Seitz, 2000]. In contrast to our approach, only depth maps suitable for view interpolation are generated, whereas our approach creates proper 3D models as obtained by other voxel coloring and space carving techniques.

Recently, Gong and Yang [Gong and Yang, 2005b] implemented a dynamic programming approach to computational stereo with a simple discontinuity cost model on the GPU and achieved at least interactive rates. In contrast to the other GPU-based depth estimation methods, this approach belongs to the category of global matching procedures (as opposed to the winner-takes-all local methods). Although their framework can be implemented entirely on the GPU, they report higher performance using a hybrid CPU/GPU approach, in which the dynamic programming step is performed on the CPU. Currently, GPU-based global methods for disparity assignment are slowly emerging in the literature. Dixit et al. [Dixit et al., 2005] present a GPU implementation of a graph cut optimization method called GPU-cut used for image segmentation. Since graph cut based approaches to computational stereo are highly successful, further investigations of GPU-cut for dense stereo are expected.

Mairal and Keriven [Mairal and Keriven, 2006] propose a GPU-based variational stereo framework, which iteratively refines a 3D mesh hypothesis until convergence. The basic framework and goals are similar to our system presented in Chapter 3. A variational multi-view approach for 3D reconstruction using graphics hardware is proposed by Labatut et al. [Labatut et al., 2006], which uses a level-set approach to deform an initial mesh to match the image similarity constraint. The authors reported a performance speedup by a factor of approximately four compared with their CPU implementation. The overall time required to obtain the final model using a 128³ volumetric grid is about 5 to 7 minutes depending on the dataset.

Loopy belief propagation with its basically parallel message update scheme is ostensibly an ideal candidate for GPU-based methods: Brunton and Shu [Brunton and Shu, 2006] and Yang et al. [Yang et al., 2006] describe implementations utilizing the GPU. The main disadvantage of belief propagation is the huge memory consumption for large images and depth resolutions, requiring either a limited depth range [Brunton and Shu, 2006] or a limited image resolution [Yang et al., 2006]. Additionally, the purely parallel (synchronous) message update feasible on the GPU converges more slowly than the sequential update available on the CPU [Tappen and Freeman, 2003].


Chapter 3

Mesh-based Stereo Reconstruction Using Graphics Hardware

3.1 Introduction

This chapter describes a computational stereo method generating a 2.5D height-field, represented as a triangular mesh, from a pair of images with known relative pose. The key idea is a generate-and-test approach, which successively modifies a mesh hypothesis and evaluates an image correlation measure to rate the refined hypothesis. The current 3D mesh geometry and the relative pose between the images can be used to generate virtual views of the source images with respect to one particular view. The generated images of the virtual views should match closely if the correct 3D geometry is found.

The procedure works iteratively: mesh modifications resulting in better image correlation are kept, whereas mesh variations lowering the image similarity are discarded. These iterations are embedded in a coarse-to-fine framework to avoid convergence to purely local minima. This procedure can be seen as a simple and discrete formulation of a variational, mesh-based dense stereo approach.

The virtual view generation and the subsequent image similarity calculation are performed by programmable graphics processing units. In contrast to several GPU-based 3D reconstruction methods described in the following chapters, the feature set required from the GPU for this method is very small. Consequently, the proposed stereo approach described in this chapter works on early generations of programmable graphics hardware.

Unlike the approaches proposed in later chapters, this approach still uses a mixed computation model, employing the GPU for many portions of the procedure but nevertheless relying on CPU-based computations in some aspects. Essentially, only those parts of the method which can be efficiently implemented on DirectX 8.1 class GPUs are accelerated by graphics hardware.∗ The proposed approach in this chapter substantially exploits the main capabilities of graphics hardware by repeated rendering of multi-textured mesh geometry for virtual view generation. Virtual view creation induces a non-linear deformation of the source image, hence we refer to this operation as the image warping procedure.

∗ DirectX 8.1 class GPUs provide relatively powerful vertex shaders, but only very limited pixel shaders with a small number of instructions are available. Additionally, floating point accuracy for textures and pixel shaders is not supported.

3.2 Overview of Our Method

The input for our procedure consists of two gray-scale images with known relative pose and camera calibration suitable for stereo reconstruction, and a coarse initial mesh to start with. This mesh can be based on a sparse reconstruction obtained by the relative orientation procedure (e.g. a mesh generated from a sparse set of corresponding points by some triangulation). In our experiments we use a planar mesh as the starting point for dense reconstruction. One image of the stereo pair is referred to as the key image, whereas the other one is denoted as the sensor image.† Consequently, the cameras (resp. their positions) are designated as the key camera and the sensor camera.

† There is no unique fixed convention to denote the roles of the two views. Sometimes the images are called master and slave views to indicate the key resp. the sensor view. In medical image processing the notions of template and moving image are very common.

The overall idea of the dense stereo procedure is that, if the current mesh hypothesis corresponds to the true model, the appropriately warped sensor image virtually created for the key camera position resembles the original key image. This similarity is quantified by some suitable error metric on images, which is the sum of absolute difference values in our current implementation. Modifying the current mesh results in different warped sensor images with potentially higher similarity to the key image (see Figure 3.1). The current mesh hypothesis is iteratively refined to generate and evaluate improved hypotheses. The huge space of possible mesh hypotheses can be explored efficiently, since local mesh refinements have only local impact on the warped image; therefore many local modifications can be applied and evaluated in parallel.

The matching procedure consists of three nested loops:<br />

1. The outermost loop determines the mesh <strong>and</strong> image resolutions. In every iteration<br />

the mesh <strong>and</strong> image resolutions are doubled. The refined mesh is obtained by linear<br />

(<strong>and</strong> optionally median) filtering of the coarser one. This loop adds the hierarchical<br />

strategy to our method.<br />

2. The inner loop chooses the set of vertices to be modified <strong>and</strong> updates the depth<br />

values of these vertices after per<strong>for</strong>ming the innermost loop.<br />

3. The innermost loop evaluates depth variations <strong>for</strong> c<strong>and</strong>idate vertices selected in the<br />

enclosing loop. The best depth value is determined by repeated image warping<br />

<strong>and</strong> error calculation wrt. the tested depth hypo<strong>thesis</strong>. The body of this loop runs<br />

entirely on 3D graphics hardware.<br />

with a small number of instructions are available. Additionally, floating point accuracy <strong>for</strong> textures <strong>and</strong><br />

pixel shaders is not supported.<br />

† There is no unique fixed convention to denote the role of the two views. Sometimes the images are<br />

called master <strong>and</strong> slave views to indicate the key resp. the sensor view. In medical image processing the<br />

notion of template <strong>and</strong> moving image are very common.
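The following Python sketch illustrates this loop nesting only structurally; the helper callables (evaluate, upsample, the block partition and the offset schedule) are hypothetical placeholders and not part of the actual GPU implementation.

def hierarchical_matching(mesh, levels, blocks, offsets, evaluate, upsample):
    # mesh: mapping vertex -> depth value
    # evaluate(mesh, block, offset): per-vertex error for the tested depth offset
    # upsample(mesh): mesh at doubled resolution (optionally median filtered)
    for level in range(levels):                      # outermost loop: resolutions
        for block in blocks:                         # middle loop: vertex blocks
            best = {v: (float("inf"), 0.0) for v in block}
            for offset in offsets(level):            # innermost loop: depth offsets
                errors = evaluate(mesh, block, offset)
                for v, err in errors.items():
                    if err < best[v][0]:
                        best[v] = (err, offset)
            for v, (_, offset) in best.items():      # apply the best offset per vertex
                mesh[v] += offset
        if level + 1 < levels:
            mesh = upsample(mesh)
    return mesh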


Figure 3.1: Mesh reconstruction from a pair of stereo images (sketched elements: key camera, secondary camera, camera ray, mesh vertex, tested displacement, mesh to reconstruct). Vertices of the current mesh hypothesis are translated along the back-projected ray of the key camera. The image obtained from the sensor camera is warped onto the mesh and the effect in the local neighborhood of the modified vertex is evaluated.

To perform image warping, the current mesh hypothesis is rendered like a regular height-field as illustrated in Figure 3.2. As can be seen in Figure 3.3, a change of the depth value of one vertex influences only a few adjacent triangles. Therefore one fourth of the vertices can be modified simultaneously without affecting each other. The optimization procedure to minimize the error between the key image and the warped image is a sequence of determining the best depth values for alternating fractions of the mesh vertices. Since the vertices of the grid are numbered such that vertices which are modified and evaluated in the same pass comprise a connected block (Figure 3.4), we denote the fraction of vertices to change as a block.

In every step the depth values of one fourth of the vertices are modified, and the local error between the key image and the warped image in the affected neighborhood of each vertex is evaluated. For every modified vertex the best depth value is determined and the mesh is updated accordingly. The procedure to calculate and update error values for modified vertices is outlined in Figure 3.5.

3.2.1 Image Warping and Difference Image Computation

Since the vertices of the mesh are moved along the back-projected rays of the key camera, the mesh as seen from the first camera is always a regular grid and mesh modifications do not distort the key image. The appearance of the sensor image as seen from the key camera depends on the mesh geometry.


Figure 3.2: The regular grid as seen from the key camera. This grid structure allows fast rendering of the mesh using triangle strips with only one call. The marked vertices comprise one block. These vertices are shifted along the back-projected ray and evaluated simultaneously in every iteration.

Figure 3.3: The neighborhood of a currently evaluated vertex (modified vertex, affected triangles, accumulated neighborhood). Moving this vertex along the back-projected ray affects only the 6 shaded triangles. The actual error for this vertex is calculated over the enclosing rectangle, which is still disjoint from the neighborhoods of all other tested vertices.

From the 3D positions of the current mesh vertices and the known relative orientation between the cameras, it is easy to use automatic texture coordinate generation with appropriate coefficients to perform the image warping step. To minimize updates of mesh geometry we use our own vertex program to calculate texture coordinates for the sensor image. This vertex shader is described in more detail in Section 3.3.1.

3.2.2 Local Error Summation

After the difference between the key image and the warped image is computed and stored in a pixel buffer, we need to accumulate the error in the neighborhoods of the modified vertices. In order to sum the values within a rectangular window, we employ a variant of a recursive doubling scheme. The required modification of the recursive approach concerns the encoding and accumulation of larger integer values when only traditional 8 bit color channels are available (see Section 3.3.3). Essentially, we perform a repeated downsampling procedure, which sums up four adjacent pixels into one resulting pixel. The target pixel buffer has half the resolution of the source buffer in every dimension. If one vertex is located every four pixels, the downsampling is performed three times to sum the error in an 8 by 8 pixel window.

Figure 3.4: The correspondence between vertex indices and grid positions; the vertices are partitioned into Block 0 to Block 3.

Note that only 2^n × 2^n error values are computed for a mesh with (2^n + 1) × (2^n + 1) vertices. Vertices at the right and lower edges of the grid do not have an associated error value. For these vertices we propagate the depth values from the left resp. upper neighbors.
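As an illustration of this accumulation step, the following numpy sketch (CPU-side only, not the GPU implementation) performs the repeated 2x2 downsampling; three passes accumulate the error over an 8x8 window per output cell.

import numpy as np

def downsample_sum(img):
    # sum each 2x2 block of pixels into one output pixel
    h, w = img.shape
    return (img[0:h:2, 0:w:2] + img[1:h:2, 0:w:2] +
            img[0:h:2, 1:w:2] + img[1:h:2, 1:w:2])

def window_error(diff_image, passes=3):
    acc = diff_image.astype(np.int64)
    for _ in range(passes):
        acc = downsample_sum(acc)
    return acc   # one summed error value per 8x8 window for passes=3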

3.2.3 Determining the Best Local Modification

If δ denotes the largest allowed depth change, then the tested depth variations are sampled regularly from the interval [−δ, δ]. To minimize the amount of data that needs to be copied from graphics memory to main memory, we do not directly read back the local errors to determine the best local modification in software. Instead, we store the currently best local error and the corresponding index in a texture and update these values within an additional pass. These values are read back only after all depth variations for one block of vertices are evaluated.

3.2.4 Hierarchical Matching

In order to avoid local optima during dense matching we utilize a hierarchical approach. The coarsest level consists of a mesh with 9 by 9 vertices and an image resolution of 32 by 32 pixels. The initial model comprises a planar mesh with approximately correct depth values known from the points of interest generated by the relative pose estimation procedure. After a fixed number of iterations the mesh calculated at the coarser level is upsampled (using a bilinear filter) and used as input to the next level. A median filter is optionally applied to the mesh to remove potential outliers, which are found especially in homogeneous image regions.

The largest allowed displacement for mesh vertices is decreased for higher levels to enable higher precision. It is assumed that the model generated at the previous level is already a sufficiently accurate approximation of the true model, and only local refinements to the mesh are required at the next level. In the current implementation we halve the largest evaluated depth variation when entering the next hierarchy level. The coarsest level starts with a maximum depth variation roughly equal to the distance of the object to the key camera.

Figure 3.5: The basic workflow of the matching procedure (block labels: key image, sensor image, absolute difference, sum of absolute differences, minimum calculation, old minimal error, new minimal error and optimal depth, update mesh hypothesis, range image). For the current mesh hypothesis a difference image between the key image and the warped sensor image is calculated in hardware. The error in the local neighborhood of the modified vertices is accumulated and compared with the previous minimal error value. The results of these calculations are the minimal error values (stored in the red, green and blue channels) and the index of the best vertex modification so far (stored in the alpha channel). All these steps are executed in graphics hardware and do not require transfer of large datasets between main memory and video memory.


3.3 Implementation

In this section we describe some aspects of our approach in more detail. Our implementation is based on OpenGL extensions available for the ATI Radeon 9700 Pro, namely VERTEX_OBJECT_ATI, ELEMENT_ARRAY_ATI, VERTEX_SHADER_EXT and FRAGMENT_SHADER_ATI [Hart and Mitchell, 2002]. These extensions are available on the Radeon 8500 and 9000 as well, therefore our method can be applied with these older (and cheaper) cards, too. For better readability we sketch the vertex program in Cg notation [NVidia Corporation, 2002a].

The major design criterion is to minimize the amount of data transferred between CPU memory and GPU memory. In particular, reading back data from the graphics card is very slow, therefore only absolutely necessary information is copied from video memory.

3.3.1 Mesh Rendering and Image Warping

For maximum performance we employ the VERTEX_OBJECT_ATI and ELEMENT_ARRAY_ATI OpenGL extensions to store mesh vertices and connectivity information directly in graphics memory. In every iteration one fourth of the vertices needs to be updated to test mesh modifications. In order to reduce memory traffic we update the mesh only after all modifications are evaluated and the best one is determined. The currently tested offset is a parameter of a vertex program that moves vertices along the camera ray as indicated by the given offset.

Additionally, the mesh vertices are ordered such that vertices modified in the same pass comprise a single connected block; therefore only one fourth of the vertex array object stored in video memory needs to be updated.

We sketch the vertex program that calculates the appropriate texture coordinates for the sensor image in Algorithm 1. The vertex attributes consist of the position and the block mask encoded in the primary color attribute. Program parameters common to all vertices are

1. the currently tested depth displacement for the active block,

2. a matrix M1 transforming pixel positions into back-projected rays of the key camera,

3. and a matrix M2 representing the transformation from the key camera into image positions of the sensor camera.

If a vertex belongs to block i, then the i-th component of the block mask attribute of this vertex is set to one; the other components are set to zero. If all vertices of block j are currently evaluated, the displacement, represented as a 4-component vector, has the current offset value at position j and zeros otherwise. Therefore the four-component dot product between the mask and the given displacement is either the displacement or zero, depending on whether the block numbers match.
depending whether the block numbers match.


Algorithm 1 The vertex program responsible for warping the sensor image. This vertex shader calculates appropriate texture coordinates for the second image based on the relative orientation of the cameras and the currently evaluated offset.

Procedure: Vertex program for sensor image warping
Input: Constant parameters: matrices M1 and M2, displacement (a 4-vector)
Input: Vertex attributes: position (homogeneous 3D position), mask (a 4-vector, provided in the associated vertex color)

  depth_old ← position.z
  {Inner product to determine the actual depth displacement}
  delta ← displacement · mask
  depth_new ← depth_old + delta
  {Back-project the pixel to obtain the corresponding ray of the key camera}
  ray ← M1 · position
  position_new ← depth_new · ray
  {Position on the 2D screen, to be transformed by the modelview-projection matrix}
  windowPosition ← (position.x, position.y, 0, 1)
  {Project the perturbed 3D position to obtain the final texture coordinate to sample the sensor image}
  texcoord ← M2 · position_new

If K1 and K2 are the internal parameters of the key resp. the sensor camera (arranged in an upper-triangular matrix) and O is the relative orientation between the cameras, i.e. O is the 4 × 4 matrix

O = \begin{pmatrix} R & t \\ 0^\top & 1 \end{pmatrix},

then M1 and M2 are calculated as follows:

M_1 = \begin{pmatrix} K_1^{-1} & 0 \\ 0^\top & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\quad and \quad
M_2 = \begin{pmatrix} 1/w & 0 & 0 & 0 \\ 0 & 1/h & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} K_2 & 0 \\ 0^\top & 1 \end{pmatrix} O,

where w and h represent the image width and height in pixels. If M1 is applied to a vector (x, y, ·, 1), the result is the direction (∆x, ∆y, 1, 1) of the camera ray going through the pixel at (x, y). This direction is scaled by the target depth value to obtain the vertex in the key camera space. Consequently, the vertex data for mesh points consists of vectors (x, y, z, 1), where (x, y) are the pixel coordinates in the key image and z is the current depth value. The obtained texture coordinates (s, t, q, q) for the sensor image are subject to perspective division prior to texture lookup. On current hardware perspective texture lookup is performed for every texel, hence the correct perspective projection (and warping) is achieved.

Additionally we remark that the texture coordinate transformation from one image to another cannot be accomplished by only one transformation matrix: in that case the depth changes would be applied in screen space, which maps world coordinates non-linearly due to perspective division.

The described image warping transformation can result in texture coordinates lying outside the sensor image. It is possible to explicitly ignore mesh regions outside the sensor image, but in our experience simple clamping of texture coordinates is sufficient in those cases.
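For illustration, the following numpy sketch (CPU side; the names K1, K2, R, t, w, h are assumed inputs, not identifiers from the actual implementation) assembles the two matrices in the form given above.

import numpy as np

def make_M1(K1):
    # select (x, y, 1, 1) from (x, y, ., 1), then back-project with K1^{-1}
    S = np.array([[1., 0., 0., 0.],
                  [0., 1., 0., 0.],
                  [0., 0., 0., 1.],
                  [0., 0., 0., 1.]])
    K1inv = np.eye(4)
    K1inv[:3, :3] = np.linalg.inv(K1)
    return K1inv @ S

def make_M2(K2, R, t, w, h):
    O = np.eye(4)
    O[:3, :3], O[:3, 3] = R, t          # relative orientation [R t; 0 1]
    K2h = np.eye(4)
    K2h[:3, :3] = K2                    # K2 embedded as a 4x4 matrix
    N = np.array([[1. / w, 0., 0., 0.],
                  [0., 1. / h, 0., 0.],
                  [0., 0., 1., 0.],
                  [0., 0., 1., 0.]])    # map to texture coordinates (s, t, q, q)
    return N @ K2h @ O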

3.3.2 Local Error Aggregation

Aggregating the intensity difference values between the key image and the warped sensor image is performed by a recursive doubling approach, which is basically a successive downsampling procedure.

One iteration of the downsampling procedure is quite simple: the input texture is bound to four texture units and a quadrilateral covering the whole viewport is rendered. The texture coordinates for the 4 texturing units are jittered slightly, such that the correct adjacent pixels are accessed for each final fragment. The filtering mode for the source textures is set to GL_NEAREST. Since the aggregation window is fixed to an 8 × 8 rectangle, three iterations are applied.

3.3.3 Encoding of Integers in RGB Channels

Although the input images are grayscale images and one 8 bit gray channel is sufficient to represent the absolute difference image, summation of local errors is likely to generate overflows. Current generations of graphics cards support float textures, but at the time of our first attempts to employ the GPU for computer vision applications no pixel buffer format allowed color channels with floating point precision. Therefore we decided to utilize a slightly more complex method to perform error summation with 8 bit RGB channels. In the proposed implementation floating point textures are not required.

Our integer encoding assigns the least significant 6 bits of a larger integer value to the red channel, the middle 6 bits to the green channel and the remaining bits to the blue channel. The two most significant bits of the red and green channels are always zero. This encoding allows summation of four error values without loss of precision using a fragment program utilizing a dependent texture lookup. After (component-wise) summation of 4 input values the most significant bits of the red and green components of the register storing the sum are possibly set, hence this register requires an additional conversion to obtain the final error value with the desired encoding. This conversion is performed using a 256 by 256 texture map.

If more than four values are summed in one step, the number of spare bits needs to be adjusted, e.g. if 8 values are summed in one pass, the three most significant bits of the red and green channels must be reserved to avoid overflows.
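The following Python sketch illustrates only the arithmetic of this encoding; on the GPU the renormalization step is realized with the 256 by 256 lookup texture.

def encode(value):
    # red and green hold 6 bits each, blue holds the remaining high bits
    return (value & 0x3F, (value >> 6) & 0x3F, value >> 12)

def decode(rgb):
    r, g, b = rgb
    return r + (g << 6) + (b << 12)

def renormalize(rgb):
    # after summing four encoded values the red/green channels may exceed 6 bits;
    # re-encoding pushes the overflowing bits into the next channel
    return encode(decode(rgb))

# summing four encoded values channel-wise stays within 8 bits per channel
vals = [300, 170, 63, 500]
summed = tuple(sum(c) for c in zip(*(encode(v) for v in vals)))
assert decode(renormalize(summed)) == sum(vals)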

3.4 Performance Enhancements

As it turns out, the implementation described above still has performance bottlenecks that can be avoided by a careful design of the particular implementation.

3.4.1 Amortized Difference Image Generation

For larger image resolutions (e.g. 1024 × 1024) rendering of the corresponding mesh generated by the sampling points takes a considerable amount of time. In the 1-megapixel case the mesh consists of approximately 131 000 triangles, which must be rendered for every depth value (several hundred times in total). Especially on mobile graphics boards, mesh processing implies a severe performance penalty: stereo matching of two 256 × 256 pixel images shows similar performance on the evaluated desktop GPU and on the employed mobile GPU of a laptop, but matching 1-megapixel images takes twice as long on the mobile GPU.

In order to reduce the number of mesh drawings, up to four depth values are evaluated in one pass. We use multitexturing facilities to generate four texture coordinates for different depth values within the vertex program. The fragment shader calculates the absolute differences for these deformations simultaneously and stores the results in the four color channels (red, green, blue and alpha). Note that the mesh hypothesis is updated infrequently and the actually evaluated mesh is generated within the vertex shader by deforming the incoming vertices according to the current displacement.

The vertex program now has more work to perform, since four transformations (matrix-vector multiplications) are executed to generate texture coordinates for the right image for each vertex. Nevertheless, the obtained timing results (see Section 3.5) indicate a significant performance improvement by utilizing this approach. Several operations are executed only once for up to 4 mesh hypotheses: transferring vertices and transforming them into window coordinates, triangle rasterization setup and texture access to the left image.

3.4.2 Parallel Image Transforms

In contrast to Yang and Pollefeys [Yang and Pollefeys, 2003] we calculate the error within a window explicitly using multiple passes. In every pass four adjacent pixels are accumulated and the result is written to a temporary off-screen frame buffer (usually called pixel buffer or P-buffer for short). It is possible to set pixel buffers as destination for rendering operations (write access) or to bind a pixel buffer as a texture (read access), but combined read and write access is not available. In the default setting the window size is 8 × 8, therefore 3 passes are required. Note that we use a specific encoding of summed values to avoid overflow due to the limited accuracy of one color channel.

Executing this multipass pipeline to obtain the sum of absolute differences within a window requires several P-buffer activations to select the correct target buffer for writing. These switches turned out to be relatively expensive (about 0.15 ms per switch). In combination with the large number of switches, the total time spent within these operations comprises a significant fraction of the overall matching time (about 50% for 256 × 256 images). If the number of these operations can be reduced, one can expect a substantial increase in the performance of the matching procedure.

Instead of directly executing the pipeline in the innermost loop (requiring 5 P-buffer switches), we reorganize the loops to accumulate several intermediate results in one larger buffer, with temporary results arranged in tiles (see Figure 3.6). Therefore P-buffer switches are amortized over several iterations of the innermost loop. This flexibility in the control flow is completely transparent and does not need to be coded explicitly within the software. Those stages in the pipeline waiting for the input buffer to become ready are skipped automatically.

3.4.3 Minimum Determination Using the Depth Test

We have two procedures available to update the minimal error and optimal depth value: the first approach utilizes a separate pass employing a simple fragment program for the conditional update. This method works on a wider range of graphics cards (on some mobile GPUs in particular), but it is rather slow due to the necessary P-buffer activations (since the minimum computation cannot be done in-place). The alternative implementation employs Z-buffer tests for the conditional updates of the frame buffer in-place, but the range of supported graphics hardware is more limited. In order to utilize this simpler (and faster) method, the GPU must support user-defined assignment of z-values within the fragment shader (e.g. by using the ARB_FRAGMENT_PROGRAM OpenGL extension). Older hardware always interpolates z-values from the given geometry (vertices).

We use the rather simple fragment program shown in Figure 3.7 to obtain one scalar error value from the color coded error and to move this value to the depth register used by graphics hardware to test the incoming depth against the z-buffer. Using the depth test provided by 3D graphics hardware, the given index of the currently evaluated depth variation and the corresponding sum of absolute differences are written into the destination buffer if the incoming error is smaller than the minimum already stored in the buffer. Therefore a point-wise optimum over the evaluated depth values for the mesh vertices can be computed easily and efficiently.
easily <strong>and</strong> efficiently.


Figure 3.6: The modified pipeline to minimize P-buffer switches (diagram labels: difference images 1–4, buffers of size n × n and n/2 × n/2, pixel summation for every iteration, every four iterations, and once per block). Several temporary results are accumulated in larger pixel buffers arranged like tiles. Later passes operate on all those intermediate results and are therefore executed less frequently.

3.5 Results

We tested our hardware-based matching procedure on artificial and on real datasets. In all test cases the source images are grayscale images with a resolution of 1024 by 1024 pixels. For the real datasets the relative orientations between the stereo images are determined using the method described by Klaus et al. [Klaus et al., 2002].

We ran the timing experiments on a desktop PC with an Athlon XP 2700 and an ATI Radeon 9700 and on a laptop PC with a mobile Athlon XP 2200 and an ATI Radeon 9000 Mobility.

The artificial dataset comprises two images of a sphere mapped with an earth texture rendered by the Inventor scene viewer (Figure 3.8). The meshes obtained by our reconstruction method are displayed as point sets for easier visual evaluation.


PARAM depth_index = program.env[0];             # index of the currently tested depth variation
PARAM coeffs = { 1/256, 1/16, 1, 0 };           # weights used to decode the color coded error
TEMP error, col;
TEX col, fragment.texcoord[0], texture[0], 2D;  # fetch the color coded error value
DP3 error, coeffs, col;                         # restore the scalar error value
MOV result.color, depth_index;                  # output the depth index as the fragment color
MOV result.depth, error;                        # route the error to the depth test

Figure 3.7: Fragment program to transfer the incoming, color coded error value to the depth component of the fragment. The dot product (DP3) between the texture element and the coefficient vector restores the scalar error value encoded in the color channels.

Timing statistics for this dataset reconstructed at different resolutions are given in Table 3.1. The matching procedure performs 8 iterations with 7 tested depth variations for each hierarchy level. These values result in high quality reconstructions in reasonable time. Therefore the pipeline shown in Figure 3.5 is executed 56 times for each level. The number of levels varies from 4 to 6 depending on the given image resolution. The total number of evaluated mesh hypotheses is 224 (256x256), 280 (512x512) and 336 (1024x1024). At the highest resolution (1024x1024) each vertex is actually tested with 84 depth values out of a range of approximately 600 possible values. Because of limitations in graphics hardware we are currently restricted to images with power-of-two dimensions.

Figure 3.8: Results for the artificial earth dataset: (a) the key image, (b) the second image, (c) the reconstructed model.

In addition to the timing experiments we applied the proposed procedure to several real-world datasets consisting of stereo image pairs showing various buildings. The source images of these datasets are grayscale images resampled to 1024 × 1024 pixels to meet the power-of-two graphics hardware requirement. The source images and the reconstructed models are visualized in Figures 3.9–3.11. In Figure 3.10 the homogeneously textured regions showing the sky yield particularly poor reconstructions in these areas. The same holds for the repetitive pattern on the foreground lawn in Figure 3.11.


Hardware               Resolution   Matching time
Radeon 9700 Pro        256x256      0.106 s
                       512x512      0.198 s
                       1024x1024    0.501 s
Radeon 9000 Mobility   256x256      0.095 s
                       512x512      0.31 s
                       1024x1024    1.05 s

Table 3.1: Timing results for the sphere dataset on two different graphics cards.

Since the number of iterations is equal to the one chosen for the artificial dataset, the times required for dense reconstruction are similar.

Figure 3.9: Results for a dataset showing the yard inside a historic building: (a) the key image, (b) the second image, (c) the reconstructed model.

3.6 Discussion

This chapter presents a method to reconstruct dense meshes from stereo images with known relative pose, which is performed almost completely in programmable graphics hardware. Dense reconstructions can be generated for pairs of images with one-megapixel resolution in less than one second on the evaluated hardware platforms.

With the emergence of additional features provided by the GPU, the approach proposed in this chapter is extended and enhanced as described in the following chapters. The simple sum of absolute differences image similarity measure can be replaced by a more robust correlation function to achieve better results for real-world datasets. Additionally, the presented method can easily be extended to a multi-view setup at the cost of higher execution times. A true variational multi-view dense depth estimation framework performed by the GPU is presented in Chapter 6.


Figure 3.10: Results for a dataset showing an apartment house: (a) the key image, (b) the second image, (c) the reconstructed model. Unstructured regions showing the sky are poorly reconstructed due to the ambiguity in the local image similarity.

Another straightforward extension of the method described in this chapter addresses the generation of an optical flow field between two views. If no epipolar geometry is known or the static scene assumption is violated, the one-dimensional search along back-projected rays is replaced by a 2D disparity search space. Since a 3D reconstruction from a sole disparity field is not possible, we focused on the setting with known epipolar geometry, which allows 3D models to be generated.

Figure 3.11: Visual results for the Merton College dataset: (a) left image, (b) right image, (c) the depth image, (d) the reconstructed model as a 3D point cloud. The source images have a resolution of 1024 × 1024 pixels.


Chapter 4

GPU-based Depth Map Estimation using Plane Sweeping

Contents
4.1 Introduction
4.2 Plane Sweep Depth Estimation
4.3 Sparse Belief Propagation
4.4 Depth Map Smoothing
4.5 Timing Results
4.6 Visual Results
4.7 Discussion

4.1 Introduction

This chapter describes the implementation of a multi-view depth estimation method based on a plane-sweeping approach, which is accelerated by 3D graphics hardware. The goal of our approach is the generation of depth maps with suitable quality at interactive rates. The final depth extraction can be performed using a fast and simple winner-takes-all approach; alternatively, a time- and memory-efficient variant of belief propagation can be employed to obtain higher quality depth images.

4.2 Plane Sweep Depth Estimation

Plane sweep techniques in computer vision are simple and elegant approaches to image-based reconstruction from multiple views, since a rectification procedure as required by many traditional computational stereo methods is not needed. The 3D space is iteratively traversed by parallel planes, which are usually aligned with a particular key view (Figure 4.1). The plane at a certain depth from the key view induces homographies for all other views, thus the sensor images can be mapped onto this plane easily.

Figure 4.1: Plane sweeping principle (key view and sensor view). For different depths the homography between the reference plane and the sensor view varies. Consequently, the projected image of the sensor view changes with the depth according to the epipolar geometry.

If the plane at a certain depth passes exactly through the surface of the object to be reconstructed, the color values from the key image and from the mapped sensor images should coincide at appropriate positions (assuming constant brightness conditions). Hence, it is reasonable to assign the best matching depth value (according to some image correlation measure) to the pixels of the key view. By sweeping the plane through 3D space (i.e. varying the plane depth with respect to the key view) a 3D volume can be filled with image correlation values, similar to the disparity space image (DSI) in traditional stereo. Therefore the dense depth map can be extracted using global optimization methods if depth continuity or any other constraint on the depth map is required.

Note that a plane sweep technique in a two-frame rectified stereo setup coincides with traditional stereo methods for disparity estimation. In this case the homography between the plane and the sensor view is solely a translation along the X-axis.

There are several techniques to make dense reconstruction approaches more robust in the case of occlusions in a multi-view setup. Typically, occlusions are only modeled implicitly, in contrast to e.g. space carving methods, where the model generated so far directly influences the visibility information. Here we briefly discuss two approaches to implicit occlusion handling:

• Truncated scores: The image correlation measure is calculated between the key view and the sensor view, and the final score for the current depth hypothesis is the accumulated sum of the truncated individual similarities. The reasoning behind this approach is that the effect of occlusions between a pair of views on the total score should be limited, in order to favor good depth hypotheses supported by other image pairs.

• Best half-sequence selection: In many cases the set of images comprises a logical sequence of views, which can be totally ordered (e.g. if the camera positions are approximately on a line). Hence the set of images used to determine the score in terms of the key view can be split into two half-sequences, and the final score is the better score of these subsets. The motivation behind this approach is that occlusions with respect to the key view appear either in the left or in the right half-sequence.

Dense depth estimation using plane sweeping as described in this chapter is restricted to small-baseline setups, since for larger baselines occlusions should be modeled explicitly. Additionally, the inherent fronto-parallel surface assumption of correlation windows yields inferior results in wide-baseline cases.

4.2.1 Image Warping

In the first step, the sensor images are warped onto the current 3D key plane π = (n^⊤, d) using the projective texturing capability of graphics hardware. If we assume the canonical coordinate frame for the key view, the sensor images are transformed by the appropriate homography H with

H = K \left( R - t\, n^\top / d \right) K^{-1}.

K denotes the intrinsic matrix of the camera and (R|t) is the relative pose of the sensor view.

In order to utilize the vector processing capabilities of the fragment pipeline in an optimal manner, the (grayscale) sensor images are warped wrt. four plane offset values d simultaneously. All further processing works on a packed representation, where the four values in the color and alpha channels correspond to four depth hypotheses.
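A minimal numpy sketch of this plane-induced homography (variable names assumed; the intrinsic matrix K is taken to be the same for both views, as in the formula above):

import numpy as np

def plane_homography(K, R, t, n, d):
    # H = K (R - t n^T / d) K^{-1}, mapping key-view pixels to sensor-view pixels
    return K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)

def warp_pixel(H, u, v):
    # apply the homography to a single pixel and dehomogenize
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]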

4.2.2 Image Correlation Functions

After a sensor image is projected onto the current plane hypothesis, a correlation score for the current sensor view is calculated, and the scores for all sensor views are integrated into a final correlation score of the current plane hypothesis. The accumulation of the single image correlation scores depends on the selected occlusion handling policy: simple additive blending operations are sufficient if no implicit occlusion handling is desired. If the best half-sequence policy is employed, additive blending is performed for each individual subsequence and a final minimum-selection blending operation is applied.

To our knowledge, all published GPU-based dense depth estimation methods use the simple sum of absolute differences (SAD) or squared differences (SSD) for image dissimilarity computation (usually for performance reasons). By contrast, we have a set of GPU-based image correlation functions available, including the SAD, the normalized cross correlation (NCC) and the zero-mean NCC (ZNCC) similarity functions. The NCC and ZNCC implementations optionally use sum tables for an efficient implementation [Tsai and Lin, 2003]. Small row and column sums can be generated directly by sampling multiple texture elements within the fragment shader. Summation over larger regions can be performed using a recursive doubling approach similar to the GPU-based generation of integral images [Hensley et al., 2005]. Full integral image generation is also possible, but precision loss is observed for the NCC and ZNCC similarity functions in this case (see Section 4.2.2.2).

For longer image sequences one cannot presume constant brightness conditions across all images, hence an optional prenormalization step is performed, which subtracts the box-filtered image from the original one to compensate for changes in illumination conditions. If this prenormalization is applied, the depth maps obtained using the different correlation functions have similar quality.

4.2.2.1 Efficient Summation over Rectangular Regions

The image similarity functions described in the following section can be implemented efficiently by utilizing integral images (also known as summed-area tables in computer graphics). Integral images allow constant-time box filtering regardless of the window size [Crow, 1984]. Given the integral image of a source image, any box filtering can be performed in constant time using four image accesses (resp. texture lookups). This efficient box filtering approach can be extended to more complex higher-order filtering operations [Heckbert, 1986].

The single-pass procedure to calculate the integral image efficiently on a general purpose processor is slow when mapped onto SIMD architectures. Consequently, a different approach using a logarithmic number of passes to generate the integral image on the GPU is much more efficient [Hensley et al., 2005]. Note that the integral image requires a much higher precision of the color channels than the source image precision. Calculating and using integral images on the GPU has only become feasible with the emergence of floating point support on current graphics hardware.

Note that for very small window sizes the utilization of bilinear texture fetches, which are available on current graphics hardware essentially for free, is usually more efficient than the computation and application of integral images. Bilinear texturing allows the summation of four adjacent pixels with just one texture access, e.g. summing the values inside a 4x4 window can be done using 4 bilinear texture lookups (instead of 16 individual accesses). Consequently, in order to obtain the highest performance, suitably customized procedures are best for very small correlation windows.
best <strong>for</strong> very small correlation windows.


4.2.2.2 Normalized Correlation Coefficient

The widely used (zero-mean) normalized correlation coefficient for window-based local matching of two images X and Y is (where \bar X and \bar Y denote the means inside the rectangular region W)

ZNCC = \frac{\sum_{i \in W} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i \in W} (X_i - \bar X)^2} \, \sqrt{\sum_{i \in W} (Y_i - \bar Y)^2}},

which is invariant under (affine linear) changes of luminance between the images, but relatively costly to calculate. Using integral images the ZNCC can be calculated in constant time regardless of the correlation window size [Tsai and Lin, 2003], since

ZNCC = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/N}{\sqrt{\sum X_i^2 - (\sum X_i)^2/N} \, \sqrt{\sum Y_i^2 - (\sum Y_i)^2/N}},

where all sums range over W and N = |W|. From the above formula it can be seen that five integral images are required to calculate the ZNCC: the integral images for \sum X_i, \sum Y_i, \sum X_i^2, \sum Y_i^2 and finally \sum X_i Y_i. The precision requirement for the higher order sums is 8 + 8 + log2(512) + log2(512) = 34 bit for 512 × 512 source images. The 32 bit floating point format of current GPUs has a mantissa of 23 bit, and artefacts due to precision loss may occur. Figure 4.2 illustrates the reduced precision by depicting a ZNCC error image generated in software on a CPU and another one computed on the GPU. An increasing loss of precision can be seen towards the lower right corner of the image. Since the integral image generation starts from the upper left corner, the lower right portion has the highest precision requirements within the integral image.

Note that the precision requirements for the simple sums \sum X_i and \sum Y_i are 26 bit for 8 bit images with 512 × 512 pixels resolution. By subtracting the image mean in advance from the source image two additional precision bits can be saved: one by halving the magnitude of the source values and another one by exploiting the sign bit in the integral image.

Instead of creating full integral images, which allow box filtering with arbitrary window sizes, it is usually sufficient to sum the values within a given specific window, since we do not vary the aggregation window size during similarity score computation. Accumulation over larger windows can be performed using a recursive doubling scheme similar to the one used for integral image generation. Consequently, the precision requirements on the target buffer storing the aggregated values depend on the window size, and these are substantially lower than the requirements for integral images.
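A minimal numpy sketch of the shifted ZNCC formula above; box() is a placeholder for a window-sum operator (for instance built from the integral image sketch in Section 4.2.2.1), and N is the number of pixels in the window.

import numpy as np

def zncc_from_sums(X, Y, box, N):
    sx, sy = box(X), box(Y)
    sxx, syy, sxy = box(X * X), box(Y * Y), box(X * Y)
    num = sxy - sx * sy / N
    den = np.sqrt(np.maximum((sxx - sx**2 / N) * (syy - sy**2 / N), 1e-12))
    return num / den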


Figure 4.2: NCC images calculated on the CPU (left, (a)) and on the GPU (right, (b)) using integral images. Darker pixels indicate smaller similarity values. The image computed on the GPU has significant deviations, especially in the lower right regions.

4.2.3 Sum of Absolute Differences and Variants

The sum of absolute differences (SAD) is a widely used image similarity function because of its simple computation, its minimal precision requirements and its high performance:

SAD = \sum_{i \in W} |X_i - Y_i|,

where W denotes the aggregation window. It is, however, sensitive to illumination changes, which limits the use of the SAD for real-world applications.

Lighting changes in the scene can be incorporated by subtracting the local mean from the original image values, yielding a zero-mean sum of absolute differences (ZSAD):

ZSAD = \sum_{i \in W} |(X_i - \bar X) - (Y_i - \bar Y)|.

In contrast to the correlation coefficient, the subtracted local means cannot be moved outside the absolute value bars. Hence a technique similar to the shifting theorem for the correlation coefficient is not applicable, and the ZSAD is not directly suitable for efficient computation. In a first step we replace the true zero-mean intensity values X_i - \bar X resp. Y_i - \bar Y by the differences X_i - X^σ_i, where X^σ is a smoothed version of the image X, typically generated by box-filtering the original image. The same applies to Y.


The net effect of this approximation is that the normalization of the images can be performed once for the input images.

Hence, the first step is to calculate images \tilde X and \tilde Y, which are difference images between the original image and the smoothed one (i.e. \tilde X = X - X^σ and \tilde Y = Y - Y^σ). The approximated zero-mean sum of absolute differences then reads as a simple SAD operating on the transformed images:

ZSAD ≈ \sum_{i \in W} |\tilde X_i - \tilde Y_i|.

The SAD (and the approximated ZSAD) can be normalized to the range [0, 1] by appropriate division:

SAD = \frac{1}{|W|} \sum_{i \in W} |X_i - Y_i|,

if X_i ∈ [0, 1] and Y_i ∈ [0, 1] is assumed. An alternative normalized variant of the SAD is known as the Bray-Curtis (respectively Sorensen) distance:

NSAD = \frac{\sum_{i \in W} |X_i - Y_i|}{\sum_{i \in W} |X_i| + \sum_{i \in W} |Y_i|}

and

ZNSAD = \frac{\sum_{i \in W} |\tilde X_i - \tilde Y_i|}{\sum_{i \in W} |\tilde X_i| + \sum_{i \in W} |\tilde Y_i|}.

These similarity scores are between 0 and 1, where 0 indicates a perfect match between the two local windows.

Computing the NSAD (and the ZNSAD) between two images requires three integral images II(·) to be generated for every depth value:

• II(|X_i - Y_i|) to calculate the numerator of the NSAD efficiently,

• II(|X_i|) and II(|Y_i|) to compute the denominator of the NSAD formula.

For the ZNSAD, the integral images are computed for \tilde X_i and \tilde Y_i.

If the plane sweep is performed normal to an input view, II(|X_i|) must be calculated only once before the sweep. In the case of a rectified stereo setup, the integral images (resp. the box filtered images) of the mean-normalized inputs can be precomputed entirely before the sweep. For every depth (resp. disparity) value the integral image of the absolute difference image |X_i - Y_i| between the two views must be calculated.

Of course, the required sums over rectangular regions can also be obtained by direct summation, but such an approach is only suitable and efficient for small support window sizes.
sizes.


4.2.4 Depth Extraction

In order to achieve high performance for depth estimation, we primarily employ a simple winner-takes-all strategy to assign the final depth values. This approach can be implemented easily and efficiently on the GPU using the depth test for a conditional update of the current depth image hypothesis (see [Yang et al., 2002] and Section 3.4.3).

Unreliable depth values can be masked by a subsequent thresholding pass, which removes pixels with low image correlation from the obtained depth map.

If the resulting depth map is converted to 3D geometry, staircasing artefacts are typically visible in the obtained model. In order to reduce these artefacts an optional selective, diffusion-based depth image smoothing step is performed, which preserves true depth discontinuities larger than the steps induced by the discrete set of depth hypotheses (see Section 4.4).

4.3 Sparse Belief Propagation

Belief propagation (e.g. [Weiss and Freeman, 2001]) is an approximation technique for global optimization on graphs, which is based on passing messages along the arcs of the underlying graph structure. The algorithm iteratively refines the estimated probabilities of the hypotheses within the graph structure by updating the probability weighting of neighboring nodes. These updates are referred to as message passing between adjacent nodes. The belief propagation method maintains an array of probabilities called messages for every arc in the graph, hence this method requires substantial memory for larger graphs and hypothesis spaces. We denote the value of a message from node p going to node q for hypothesis d at time t by m^{(t)}_{p→q}(d). Here d ranges over the possible hypotheses at node q. After the belief propagation procedure has converged to a stable solution, the final hypothesis assignment to every node is typically extracted by taking the hypothesis with the maximum estimated a posteriori probability. We refer to Section 4.3.2 for the details on message passing and hypothesis extraction.

In image processing and computer vision applications this graph is usually induced by the rectangular image grid, with nodes representing pixels and arcs connecting adjacent pixels. Depth estimation integrating smoothness weights and occlusion handling can be formulated as a global optimization problem and solved with belief propagation methods [Sun et al., 2003]. Basic belief propagation methods are computationally demanding, but the special structure of the regularization function typically used in computer vision to enforce smooth depth maps can be exploited to obtain more efficient implementations [Felzenszwalb and Huttenlocher, 2004]. In particular, the Potts discontinuity cost function and the optionally truncated linear cost model allow an efficient linear-time message passing method. In the Potts model, equal depth values assigned to adjacent pixels imply no smoothness penalty, whereas any different adjacent depth values result in a constant regularization penalty. More formally, the smoothness cost V(d_p, d_q)
result in a constant regularization penalty. More <strong>for</strong>mally, the smoothness cost V (dp, dq)


4.3. Sparse Belief Propagation 51<br />

is zero, if dp = dq, <strong>and</strong> a constant λ otherwise. In the linear smoothness model we have<br />

V (dp, dq) = λ|dp − dq|.<br />

Our implementation of belief propagation to extract the depth map from image correlation values is based on the work proposed in [Felzenszwalb and Huttenlocher, 2004]. In contrast to previously proposed depth estimation techniques based on belief propagation, we apply the message passing procedure only to a promising subset of depth/disparity values. Consequently, the consumed memory and time are a fraction of those required by the original method.

Consider the following concrete example: a depth map with 512×512 pixels resolution should be extracted from 200 potential depth values. Traditional (dense) belief propagation requires about 4 × 512 × 512 × 200 message components to be stored (the factor 4 results from the utilized 4-neighborhood of pixels), which gives 800MB for 32 bit floating point components. But most of the 200 depth hypotheses per pixel can be rejected immediately because of low image similarities. If on average only 10 tentative depth hypotheses survive for every pixel, only 4 × 512 × 512 × 10 message components need to be stored, which results in 40MB of memory consumption. The actual memory footprint is somewhat larger, since additional data structures are required for sparse belief propagation.

We can adopt two of the three ideas proposed in [Felzenszwalb and Huttenlocher, 2004] for sparse belief propagation:

• The checker-board update pattern for messages can be used directly to halve the memory requirements.

• The two-pass method to compute the message updates in linear time for the Potts and the linear regularization can be modified to work for sparse representations as well (see Section 4.3.2).

Additionally, a coarse-to-fine approach to belief propagation for vision is proposed in [Felzenszwalb and Huttenlocher, 2004] to accelerate the convergence. The basic idea is the hierarchical grouping of pixels into coarser levels and to perform message passing on the reduced graphs. The results from coarser levels are used as initialization values for the next finer level. Since the hypothesis space (i.e. the range of admissible depth values) for a group of pixels in a coarser level consists of the union of all depth hypotheses valid for the individual pixels, the data structures become less sparse. In the example above, starting with 10 tentative depth values for every pixel, the next coarser level is comprised of 2 × 2 pixel blocks associated with up to 40 possible depth values. Hence, there is no direct improvement in the time complexity using a hierarchical approach for our proposed sparse belief propagation method.

4.3.1 Sparse Data Structures

4.3.1.1 Sparse Data Cost Volume During Plane-Sweep

Since belief propagation is a global optimization framework, a data structure similar to the disparity space image must be maintained, which stores the correlation value for every depth hypothesis and pixel. We propose a sparse representation to store tentative depth/correlation value pairs. One simple implementation would store exactly K depth/correlation pairs for every pixel, which is an appropriate approach in practice. In certain situations this uniform choice of the number of hypotheses to be stored for every pixel is not appropriate: in highly textured regions there are possibly very few tentative depth hypotheses, whereas in low textured areas the similarity measure is not discriminative and the choice of K may be too low to include all potential depth candidates. Consequently, we choose a more dynamic data structure, which stores at least K depth hypotheses (together with the corresponding correlation value) and additionally allocates a pool of a user-defined size, which stores the globally next best depth hypotheses.

For efficient update of this data structure after computing the image similarity for a certain depth plane, the K entries associated with every pixel comprise a heap sorted with respect to the correlation value. Maintaining the heaps for every pixel is relatively cheap, since every heap has exactly K elements. The dynamically assigned depth hypotheses are maintained in a heap structure as well. Updating this pool is more costly due to its relatively large size.
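A minimal CPU sketch of such a per-pixel structure is given below; it is not the actual GPU data structure, but illustrates the intended behaviour: each pixel keeps its K best (lowest-dissimilarity) depth/score pairs in a small heap, and a new candidate replaces the currently worst retained entry only if it scores better. The names PixelHeap and push_candidate are illustrative.

#include <vector>
#include <algorithm>
#include <cstddef>

struct DepthSample {
    float depth;   // tentative depth value of the swept plane
    float score;   // dissimilarity score (lower is better)
};

// Keeps the K best depth hypotheses of one pixel as a max-heap on the score,
// so the worst retained candidate is always at the front and cheap to evict.
struct PixelHeap {
    explicit PixelHeap(std::size_t k) : k_(k) { samples_.reserve(k); }

    void push_candidate(float depth, float score) {
        auto worse = [](const DepthSample& a, const DepthSample& b) {
            return a.score < b.score;   // max-heap on dissimilarity
        };
        if (samples_.size() < k_) {
            samples_.push_back({depth, score});
            std::push_heap(samples_.begin(), samples_.end(), worse);
        } else if (score < samples_.front().score) {
            std::pop_heap(samples_.begin(), samples_.end(), worse);
            samples_.back() = {depth, score};
            std::push_heap(samples_.begin(), samples_.end(), worse);
        }
        // Otherwise the candidate is worse than all retained hypotheses and
        // would at best be handed over to the global overflow pool.
    }

    std::size_t k_;
    std::vector<DepthSample> samples_;
};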

4.3.1.2 Sparse Data Cost Volume for Message Passing

After finishing the plane-sweep procedure to generate the data costs associated with every pixel and every tentative depth value, the gathered sparse data cost volume is restructured for efficient access during message passing. Whereas during the plane-sweep the image similarity value serves as primary key for efficient incremental updates, the sparse 1D distance transform performed during message updates requires a depth-sorted list of items. Consequently, the sparse data cost volume used in the message passing stage consists of an array of depth value/similarity value pairs for every pixel. In order to avoid memory fragmentation, a scheme similar to the compressed row storage format for sparse matrix representations is employed.
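One possible layout for this restructured cost volume, analogous to compressed row storage, is sketched below (the field names are illustrative, not taken from the actual implementation): all surviving depth/similarity pairs are packed into one contiguous array, sorted by depth per pixel, and an offset table records where each pixel's entries start.

#include <vector>
#include <cstdint>
#include <cstddef>

// Depth/similarity pair as used during message passing (sorted by depth per pixel).
struct CostEntry {
    float depth;
    float similarity;
};

// CRS-like sparse data cost volume: the entries of pixel i occupy
// entries[offset[i] .. offset[i+1]-1], sorted by increasing depth.
struct SparseCostVolume {
    std::vector<std::uint32_t> offset;   // size: width*height + 1
    std::vector<CostEntry>     entries;  // size: total number of surviving hypotheses

    const CostEntry* begin(std::size_t pixel) const { return entries.data() + offset[pixel]; }
    const CostEntry* end(std::size_t pixel)   const { return entries.data() + offset[pixel + 1]; }
    std::size_t count(std::size_t pixel)      const { return offset[pixel + 1] - offset[pixel]; }
};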

4.3.2 Sparse Message Update

Belief propagation uses repeated communication between adjacent pixels to strengthen or weaken the support of depth hypotheses. The iterative procedure updates the value of a message going from pixel p to its neighbor q at iteration t, m^{(t)}_{p→q}, according to the following rule:

\[ m^{(t)}_{p\to q}(d_q) := \min_{d_p} \Big( V(|d_p - d_q|) + D(d_p) + \sum_{s \in \mathcal{N}(p)\setminus q} m^{(t-1)}_{s\to p}(d_p) \Big), \qquad (4.1) \]

where d_p and d_q are tentative depth values at pixels p and q, respectively. V(·) is the regularization term and D(d_p) is the image similarity value for the depth d_p. The sum Σ_{s∈N(p)\q} m^{(t−1)}_{s→p}(d_p) denotes the incoming messages from the neighborhood of p excluding q. The values from the previous iteration are used to determine the incoming messages (as denoted by the superscript (t − 1)).

We utilize a linear regularization model, i.e.

\[ V(d) = \lambda\, d, \]

or a truncated linear approach with

\[ V(d) = \min \{ V_{\max}, \lambda\, d \}, \]

with a regularization weight λ.

After a user-specified number of iterations T, for each pixel p the depth hypothesis with the highest support (belief) is chosen as the actual depth:

\[ d^{\mathrm{result}}_p = \arg\min_{d_p} \Big\{ D(d_p) + \sum_{s \in \mathcal{N}(p)} m^{T}_{s\to p}(d_p) \Big\}. \]

4.3.2.1 Sparse 1D Distance Transform

For the linear regularization model the quadratic time complexity of message updates can be reduced to linear complexity using a two-pass scheme to calculate the min-convolution [Felzenszwalb and Huttenlocher, 2004]. Computing the min-convolution can be easily extended for sparse belief propagation. The procedure for the sparse 1D distance transform is illustrated in Figure 4.3 and outlined in Algorithm 2.

Figure 4.3: Determining the lower envelope using a sparse 1D distance transform. Solid lines represent given values of h[p_i] = D[p_i] + Σ_{s≠q} m_{s→p}[p_i] and dashed lines indicate inferred values h[q_i] from the distance transform.

The algorithm applies a forward and a backward pass to calculate the lower envelope in essentially the same manner as in the basic belief propagation framework. The main observation for the distance transform in the sparse setting is that only the potential depth hypotheses of the nodes forming the arc p → q of interest need to be considered. Consequently, the lower envelope is derived solely from the potential depth hypotheses associated with pixels p and q. In order to apply the forward and backward pass, these two sets of selected depth values need to be sorted into a common sequence. This is the first step in Algorithm 2.

Subsequently, the procedure embeds the given samples stored in the array h at the corresponding positions in the combined sequence f. The subsequent forward and backward passes propagate the distance values through the sequence. Focusing on the forward pass, the successive element f[i + 1] is updated to

min( f[i + 1], f[i] + λ |mergeddepths[i + 1] − mergeddepths[i]| ).

The backward pass is analogous.

Algorithm 2 Sparse variant of the 1D distance transform

Procedure Sparse-DT-1D
Input: h[], depths_p[], size_p, depths_q[], size_q, result m_{p→q}[]
  Do a merge-sort step to combine depths_p and depths_q to obtain mergeddepths with at most size_p + size_q entries
  Simultaneously, fill a temporary array f such that
    f[j] := h[i], if mergeddepths[j] = depths_p[i]
    f[j] := ∞, otherwise
  Perform the forward pass on f
  Perform the backward pass on f
  Fill in the result array m_{p→q}:
    m_{p→q}[i] = f[j], if mergeddepths[j] = depths_q[i]

The merge sort step stated in Algorithm 2 can be avoided by precomputing suitable arrays, but this approach is only slightly faster than using the inlined merge sort step and requires additional memory.
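The following C++ sketch of Algorithm 2 is a plain CPU version written under the assumption of ascending depth arrays, a pure linear penalty (the truncation against Vmax and any message normalization are omitted), and depth lists that share exact values where they overlap. It illustrates the control flow rather than the GPU implementation.

#include <vector>
#include <limits>
#include <cmath>
#include <cstddef>

// Sparse 1D distance transform (min-convolution with a linear penalty).
// h[i] is the cost of the i-th depth hypothesis of pixel p; the result
// m_pq[i] is the outgoing message for the i-th depth hypothesis of pixel q.
void sparse_dt_1d(const std::vector<float>& h,
                  const std::vector<float>& depths_p,
                  const std::vector<float>& depths_q,
                  float lambda,
                  std::vector<float>& m_pq)
{
    const float INF = std::numeric_limits<float>::infinity();

    // 1. Merge the two sorted depth lists into a common sequence and embed h.
    std::vector<float> merged, f;
    std::size_t i = 0, j = 0;
    while (i < depths_p.size() || j < depths_q.size()) {
        const bool take_p = (j == depths_q.size()) ||
                            (i < depths_p.size() && depths_p[i] <= depths_q[j]);
        if (take_p) {
            merged.push_back(depths_p[i]);
            f.push_back(h[i]);                                  // given sample
            if (j < depths_q.size() && depths_q[j] == depths_p[i]) ++j;  // shared depth
            ++i;
        } else {
            merged.push_back(depths_q[j]);
            f.push_back(INF);                                   // value to be inferred
            ++j;
        }
    }

    // 2. Forward and backward passes propagate the lower envelope.
    for (std::size_t k = 1; k < f.size(); ++k)
        f[k] = std::min(f[k], f[k - 1] + lambda * std::fabs(merged[k] - merged[k - 1]));
    for (std::size_t k = f.size(); k-- > 1; )
        f[k - 1] = std::min(f[k - 1], f[k] + lambda * std::fabs(merged[k] - merged[k - 1]));

    // 3. Read the message values back at the depth positions of pixel q.
    m_pq.assign(depths_q.size(), INF);
    std::size_t k = 0;
    for (std::size_t q = 0; q < depths_q.size(); ++q) {
        while (merged[k] < depths_q[q]) ++k;   // merged contains every depth of q
        m_pq[q] = f[k];
    }
}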

4.4 Depth Map Smoothing

If the 3D models generated by the plane-sweep procedure are visualized directly, staircase artefacts induced by the discrete set of depth hypotheses are often clearly visible. If several individual depth maps resp. the induced 3D meshes are combined into one final model (e.g. as described in Chapter 8), these artefacts are typically removed by suitable averaging of the single models, and the smoothing procedure proposed in this section is not necessary. Otherwise, a depth smoothing approach as described in this section, which selectively removes the staircase effects without filtering larger depth discontinuities, can be applied.

In the following we assume that the tentative depth values of every pixel are evenly spaced in a user-specified interval, and successive depth values vary by a constant depth difference T. Hence, depth variations between neighboring pixels in the magnitude of T (or a small multiple of T) indicate potential regions for depth map smoothing. We perform

this selective filtering approach by applying a diffusion procedure to minimize

\[ \min_d \int \Big( (d - d_0)^2 + \mu \, \| W(p) \cdot \nabla d \|^2 \Big) \, dp . \]

In this term d_0(·) denotes the depth map (a function of the pixel position p) generated by the plane-sweeping method in the first place, d(·) is the final smoothed depth map, and W(·) is a weighting vector described below. µ is a user-specified weight to balance the data term (d − d_0)^2 and the regularization term ‖W(p) · ∇d‖^2.

In order to define the weight W(p) at pixel position p, the original depth map d_0 is sampled at position p and at its four neighbors, comprising a vector N = (d_0^E, d_0^W, d_0^N, d_0^S). If the depth difference |d_0 − d_0^{(·)}| is smaller than T (or another user-given threshold), the diffusion process is allowed in the corresponding direction and the appropriate component of W(p) is set to one. All other components are set to zero to inhibit the diffusion.

In addition to the directional gradient (i.e. the finite differences) in the source depth map, confidence information can be incorporated into W as well. Depth values for pixels with low confidence (e.g. detected by low image similarity) result in directional diffusion from confident pixels to unconfident ones by an appropriate update of W. We build a confidence map by assigning one to pixels with confident depths and zero otherwise. This map is based on hard-thresholding of the employed image similarity measure for the extracted depth value. The corresponding component of W is multiplied by the confidence map entries sampled for the neighboring pixels.
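To make the selective diffusion concrete, the sketch below performs one Jacobi-style relaxation step of the discretized energy on the CPU; depth0 and conf stand for the initial depth map and the binary confidence map, thresh for the user-given threshold (e.g. a small multiple of the depth step T), and the whole routine only illustrates the weighting logic, not the GPU solver described in Chapter 6.

#include <vector>
#include <cmath>

// One Jacobi-style relaxation step for the selective depth diffusion.
// depth0: initial depth map from plane sweeping, conf: binary confidence map (0 or 1),
// depth_in/depth_out: current and updated smoothed depth maps; all arrays have size w*h.
// thresh: maximum depth step across which diffusion is allowed, mu: smoothness weight.
void diffuse_step(const std::vector<float>& depth0,
                  const std::vector<float>& conf,
                  const std::vector<float>& depth_in,
                  std::vector<float>& depth_out,
                  int w, int h, float thresh, float mu)
{
    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const int idx = y * w + x;
            float num = depth0[idx];   // the data term pulls towards the original depth
            float den = 1.0f;
            for (int k = 0; k < 4; ++k) {
                const int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                const int nidx = ny * w + nx;
                // Diffusion is enabled only across small (staircase-sized) steps of the
                // original depth map and only towards confident neighboring pixels.
                const float wgt = (std::fabs(depth0[idx] - depth0[nidx]) < thresh)
                                  ? conf[nidx] : 0.0f;
                num += mu * wgt * depth_in[nidx];
                den += mu * wgt;
            }
            depth_out[idx] = num / den;
        }
    }
}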

This diffusion procedure can again be executed by graphics hardware to increase the performance. Since Chapter 6 is entirely dedicated to variational methods for multi-view vision, we postpone the detailed description of the GPU-based implementation of diffusion processes and variational approaches in general to that chapter.

4.5 Timing Results

In this section we provide more detailed timing results for GPU-based depth estimation using the plane-sweeping approach. The benchmarking platform uses a P4 3GHz CPU and a NVidia GeForce 6800GTO GPU. Since the adjustable parameters of our implementation have many degrees of freedom (image similarity score, aggregation window dimensions, number of used source images, etc.), a tabular representation of the obtained timing results, given in Table 4.1, is preferred over a graphical representation. The input for the depth estimation method consists of three grayscale source images at the resolution specified in the appropriate column (512 × 512 or 1024 × 1024). The use of power-of-two image dimensions is caused by the partial support of graphics hardware for non-power-of-two textures. The timing results given in this table essentially reflect the performance of applying the homography to the sensor images and calculating the stated dissimilarity score, since the time used for the actual depth extraction is negligible. Note that these timings are mostly insensitive to the provided image content.

Resolution    #depth planes  Aggr. window  Dissimilarity score  Time
512 × 512     200            5 × 5         SAD                   0.918s
                                           ZNSAD                 1.573s
                                           NCC                   1.647s
                                           ZNCC                  2.344s
                             9 × 9         SAD                   1.362s
                                           ZNSAD                 2.426s
                                           NCC                   2.481s
                                           ZNCC                  3.591s
              400            5 × 5         SAD                   1.699s
                                           ZNSAD                 3.058s
                                           NCC                   3.188s
                                           ZNCC                  4.611s
                             9 × 9         SAD                   2.579s
                                           ZNSAD                 4.774s
                                           NCC                   4.855s
                                           ZNCC                  7.103s
1024 × 1024   200            5 × 5         SAD                   3.772s
                                           ZNSAD                 7.096s
                                           NCC                   7.402s
                                           ZNCC                 10.861s
                             9 × 9         SAD                   6.059s
                                           ZNSAD                11.446s
                                           NCC                  11.656s
                                           ZNCC                 17.206s
              400            5 × 5         SAD                   7.540s
                                           ZNSAD                14.146s
                                           NCC                  14.842s
                                           ZNCC                 21.684s
                             9 × 9         SAD                  11.973s
                                           ZNSAD                22.863s
                                           NCC                  23.281s
                                           ZNCC                 34.379s

Table 4.1: Timing results for the plane-sweeping approach on the GPU with winner-takes-all depth extraction at different parameter settings and image resolutions.

At higher resolutions, the expected theoretical ratios between the run-times of the various similarity scores are attained. Every score uses one or several accumulation passes to calculate Σ_{i∈W} op(X_i, Y_i), which comprises the dominant fraction of the total run-time. The SAD requires only one accumulation pass (Σ_{i∈W} |X_i − Y_i|), whereas the ZNSAD resp. the NCC need two passes, and finally the ZNCC performs three invocations of the accumulation procedure.∗ Hence, the observed ratios of approximately 1:2:2:3 for the run-times of the evaluated correlation scores can be explained.

Sparse belief propagation for the final depth extraction is much more costly in terms of computation time, as illustrated in Figure 4.4. The solid graph displays the required total run-time against the number of maintained heap entries for sparse belief propagation. This graph shows essentially linear behavior, since the linear-time message passing dominates the heap construction with its O(K log K) time complexity. For comparison, the dashed line depicts the runtime of the pure winner-takes-all approach. Sparse belief propagation with just one heap entry requires about 5.8s, whereas the equivalent winner-takes-all method needs approximately 3s for these settings. The corresponding depth images obtained for the utilized dataset are shown later in Section 4.6.

Figure 4.4: Sparse belief propagation timing results wrt. the number of heap entries K. The image and depth map resolution is 512 × 512 pixels and 200 depth hypotheses are evaluated using a 7 × 7 ZNCC image similarity score. (Plot axes: time in sec vs. number of sparse BP entries; curves: BP times and WTA time.)

∗ Recall Section 4.2.2. Additionally, the summations involving only the key image can be precomputed.



4.6 Visual Results

In this section we provide depth maps and 3D models for real datasets in order to demonstrate the performance of our GPU-based depth estimation procedure and to indicate the differences between the winner-takes-all (WTA) depth extraction approach and the sparse belief propagation method. All source images are resampled to a resolution of 512 × 512 pixels, since images with power-of-two dimensions are still better supported on graphics hardware.

The Landhaus dataset shown in Figures 4.5 and 4.6 represents a historical statue embedded into a building facade. Three grayscale images with small baselines are used for depth estimation. At first, Figure 4.5 shows depth images generated by the winner-takes-all and by the sparse belief propagation approach at different numbers of maintained heap entries K. 200 potential depth values are examined in all cases. The reported timings correspond to the values displayed in Figure 4.4. Most notably, belief propagation enhances the depth maps in the textureless wall regions on either side of the statue itself. Additionally, Figure 4.6 shows two 3D models represented as colored point sets, obtained by a WTA depth extraction step and by a sparse belief propagation procedure using 20 surviving depth entries. Both models look relatively similar and only a closer inspection reveals the outliers. If the models are rendered as shaded triangular meshes as in Figure 4.7, the noisy structure of the WTA result is clearly manifested. Note that many outliers found in the initial depth maps can be removed by the subsequent depth image fusion procedure, which generates a proper 3D model from a set of depth maps.

Three source images of another statue dataset and the respective depth results are shown in Figure 4.8. 400 tentative depth planes are evaluated on three adjacent images with small baselines. Since the dark background scenery to the left and right of the statue is out of the plane-sweep range, the depth image has poor quality in these regions. Belief propagation significantly smooths the depth map, especially near depth discontinuities.

4.7 Discussion

GPU-based plane-sweeping procedures allow the efficient generation of depth images from multiple small-baseline images. Several image dissimilarity measures are available in our implementation, which are efficiently calculated on graphics hardware and give good results even for varying lighting conditions.

In case of highly textured scenes a final winner-takes-all depth extraction method is sufficient and fast enough to allow almost interactive feedback to the user. Optionally, a sparse belief propagation method is proposed, which significantly enhances the depth map in ambiguous regions.

Future work needs to address a qualitative and quantitative comparison of traditional belief propagation and our proposed sparse counterpart. The question whether the early rejection of unpromising depth values can have a negative impact on the extracted depth maps is still unresolved. Additionally, even sparse belief propagation is 5 to 10 times slower than the (fully hardware accelerated) winner-takes-all strategy, which opens the question whether further performance enhancements are possible for sparse BP.

In Chapter 7 a GPU-based one-dimensional energy minimization approach based on the dynamic programming principle is presented.



(a) Sensor image. (b) Without BP (WTA); 3s. (c) BP, K = 10; 16.5s. (d) BP, K = 20; 29.5s. (e) BP, K = 30; 40.3s. (f) BP, K = 40; 50.1s.

Figure 4.5: Depth images with and without belief propagation for the Landhaus dataset. With more allowed heap entries K, the amount of noisy pixels in textureless regions is reduced, but the runtime increases accordingly.



(a) Without BP (WTA). (b) With BP (K = 20).

Figure 4.6: Point models with and without belief propagation.

(a) Without BP (WTA). (b) With BP (K = 20).

Figure 4.7: Shaded triangular mesh models with and without belief propagation.



(a) Left image (b) Middle (sensor) image (c) Right image<br />

(d) Without BP (WTA), 6.7s (e) With BP, 37s<br />

Figure 4.8: Depth images with <strong>and</strong> without belief propagation


Chapter 5

Space Carving on 3D Graphics Hardware

Contents

5.1 Introduction . . . 63
5.2 Volumetric Scene Reconstruction and Space Carving . . . 64
5.3 Single Sweep Voxel Coloring in 3D Hardware . . . 66
5.4 Extensions to Multi Sweep Space Carving . . . 70
5.5 Experimental Results . . . 72
5.6 Discussion . . . 73

5.1 Introduction<br />

This chapter presents a direct scene reconstruction approach fully accelerated by graphics<br />

hardware. It shares the plane-sweep principle to obtain a model from multiple images with<br />

the method discussed in the previous chapter. In contrast to the plane sweep based depth<br />

estimation approach, the voxel coloring <strong>and</strong> space carving implementations proposed in<br />

this chapter generate a true 3D model from a large set of input views directly.<br />

Voxel coloring [Seitz <strong>and</strong> Dyer, 1997] <strong>and</strong> its derivatives incorporate multiple, optionally<br />

wide-baseline views simultaneously, <strong>and</strong> produce directly volumetric 3D models.<br />

Methods derived from the voxel coloring approach test a large number of voxels <strong>for</strong> photoconsistency<br />

<strong>and</strong> are there<strong>for</strong>e rather slow. Reported calculation times <strong>for</strong> voxel coloring<br />

range from several seconds <strong>for</strong> low resolutions up to hours <strong>for</strong> high quality models.<br />

In this chapter we address efficient implementations <strong>for</strong> voxel coloring <strong>and</strong> space carving<br />

exploiting commodity 3D graphics cards. Our current implementation is based on OpenGL<br />

using the fragment shader extension (ATI fragment shader in particular). The hardware requirements are rather modest; in particular any ATI Radeon 8500 or better is supported by our implementation. Medium resolution models are generated at interactive rates on

present-day graphics hardware, whereas high resolution models are typically obtained after<br />

a few seconds. There are at least two application scenarios, which can benefit from a<br />

fast voxel coloring implementation: at first, our implementation provides a fast preview<br />

<strong>for</strong> more highly sophisticated algorithms. The second scenario addresses improved functionality<br />

of plenoptic image editing: modifications in one or several images can be used<br />

to update the 3D model instantly. After recalculating the new model, these changes are<br />

propagated to the remaining images as well. Thus, specular highlights on surfaces <strong>and</strong><br />

similar flaws can be removed interactively to improve the quality of the generated 3D<br />

model.<br />

5.2 Volumetric Scene Reconstruction <strong>and</strong> Space Carving<br />

Voxel coloring [Seitz <strong>and</strong> Dyer, 1997] generates a volumetric model by analyzing the consistency<br />

of scene voxels. As the voxel space is traversed using a plane sweeping approach,<br />

the state of each voxel is determined. For scenes without translucent objects a voxel can<br />

be classified either as empty or opaque. During the voxel coloring procedure voxels are<br />

projected into the input images <strong>and</strong> the distribution of the corresponding pixel values<br />

is used to determine the state of each voxel. A so-called photo-consistency (or color-consistency) measure decides whether a voxel is on the surface of a scene object, i.e. the

voxel is opaque. This method is conservative in the sense that only assured inconsistent<br />

voxels are labeled as empty. There<strong>for</strong>e already processed voxels can be used to determine<br />

visibility of voxels with respect to the input views.<br />

In order to traverse the voxels in correct depth by a simple plane sweep, the placement<br />

of cameras is restricted by the so called ordinal visibility constraint. This constraint<br />

ensures, that voxels are visited prior to voxels they occlude. In [Seitz <strong>and</strong> Dyer, 1999] it is<br />

shown, that this visibility constraint is satisfied if the scene to be reconstructed is outside<br />

the convex hull of the camera centers. One typical camera configuration suitable <strong>for</strong> voxel<br />

coloring <strong>and</strong> possible slices used <strong>for</strong> reconstruction are shown in Figure 5.1.<br />

Several extensions of voxel coloring were proposed to allow more general<br />

camera placements. Space carving [Kutulakos <strong>and</strong> Seitz, 2000], generalized voxel<br />

coloring [Culbertson et al., 1999] <strong>and</strong> multi-hypo<strong>thesis</strong> voxel coloring [Eisert et al., 1999]<br />

remove the limitations on camera positions. Space carving per<strong>for</strong>ms multiple iterations<br />

of voxel coloring <strong>for</strong> different sweep directions. Only a suitable subset of all input views<br />

is used <strong>for</strong> each sweep.<br />

A crucial question is how to measure color consistency: the original voxel coloring<br />

approach utilized the variance of colors from projected voxels to determine consistency.<br />

Stevens et al. [Stevens et al., 2002] propose a histogram-based consistency metric. In their<br />

approach the footprint of a voxel in an image contains several pixels, which are organized<br />

in a histogram. A voxel is consistent, if the histograms of the footprints are not pairwise<br />

disjoint. The consistency measure presented by Yang et al. [Yang et al., 2003] handles non-Lambertian, specular surfaces explicitly.

Figure 5.1: A possible configuration for plane sweeping through the voxel space. The camera positions are restricted, such that voxels in subsequent layers can only be occluded by already processed voxels. (The sketch shows the voxel volume traversed as parallel layers with depth indices 1, 2, 3, …, 8.)

Voxel coloring is a computationally expensive procedure, which typically requires<br />

at least tens of seconds up to tens of minutes to compute the reconstruction. Several<br />

researchers proposed improved implementations <strong>for</strong> voxel coloring, e.g. Prock <strong>and</strong><br />

Dyer [Prock <strong>and</strong> Dyer, 1998] primarily utilize a hierarchical oct-tree representation to<br />

speed up voxel coloring. Additionally, they use graphics hardware to speed up certain calculations.<br />

Their multi-resolution voxel coloring method needs about 15s to generate a reconstruction with 256³ voxels. However, a hierarchical, multi-resolution approach to volumetric 3D reconstruction can potentially miss scene details. Sainz et al. [Sainz et al., 2002] use texture mapping features of 3D graphics hardware to accelerate the computations. Nevertheless, a 256³ voxel model requires several minutes to be computed even on recent hardware.

Seitz <strong>and</strong> Kutulakos [Seitz <strong>and</strong> Kutulakos, 2002] present an image editing approach<br />

<strong>for</strong> multiple images of a 3D scene. Changes in one image are propagated to other images<br />

by using an initially generated voxel model of the scene. There<strong>for</strong>e direct manipulation of<br />

surface textures <strong>and</strong> other image editing operations are possible. Image editing is limited to<br />

methods, which do not require a complete volumetric reconstruction step to propagate the<br />

modifications. With our efficient space carving implementation, it is possible to allow more<br />

general editing methods useful <strong>for</strong> a user-driven interactive refinement of voxel models,<br />

since the volumetric reconstruction can be generated almost instantly from altered input<br />

images.



5.3 Single Sweep Voxel Coloring in 3D Hardware<br />

In this section we describe the hardware based implementation of voxel coloring. This<br />

description applies to the case of a single sweep <strong>for</strong> camera configurations satisfying the<br />

ordinal visibility constraints [Seitz <strong>and</strong> Dyer, 1997]; we will discuss the extensions required<br />

<strong>for</strong> the multi sweep case in Section 5.4.<br />

The input <strong>for</strong> our method consists of N resampled color images <strong>and</strong> the corresponding<br />

projection matrices, <strong>and</strong> a bounding box denoting the space volume to be reconstructed.<br />

The bounding box of the volume to be reconstructed is organized as a stack of parallel planes. These planes are traversed in a front-to-back ordering during the reconstruction procedure. The algorithm maintains a depth map for every camera, which stores the depth with respect to the camera position of the model reconstructed so far. For each plane the algorithm executes the following steps:

1. The images of the camera views are projected onto the current plane and a consistency measure is evaluated.

2. Surface pixels (voxels) are determined by thresholding the consistency map.

3. For each camera view the associated depth map is updated by rendering the currently reconstructed voxel layer according to the input views.

At the end of each iteration a layer of voxels is obtained and can be used for further processing.
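To make the control flow of one sweep explicit, the loop below iterates over the depth planes and performs the three steps listed above; the three passes are supplied as callbacks, since they correspond to the GPU render passes described in the following subsections, and all names here are illustrative rather than part of the actual implementation.

#include <vector>
#include <functional>

// One consistency/opacity layer of the voxel volume.
struct Layer { std::vector<float> consistency; std::vector<unsigned char> opaque; };

// One front-to-back sweep through the voxel volume. The three GPU passes of
// Section 5.3 are supplied as callbacks, so this sketch only fixes the control flow.
void single_sweep(int num_planes,
                  const std::function<Layer(int)>& project_and_score,              // step 1
                  const std::function<void(Layer&)>& threshold_layer,              // step 2
                  const std::function<void(const Layer&, int)>& update_depth_maps, // step 3
                  std::vector<Layer>& model)
{
    for (int plane = 0; plane < num_planes; ++plane) {
        Layer layer = project_and_score(plane);  // project views onto the plane, score consistency
        threshold_layer(layer);                  // keep only photo-consistent, non-background voxels
        update_depth_maps(layer, plane);         // record new opaque voxels in every view's depth map
        model.push_back(layer);                  // finished layer, e.g. copied into a 3D texture
    }
}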

Figure 5.2 illustrates the first step in the procedure to obtain the color of a voxel<br />

with respect to a particular input view. Perspective texture mapping is combined with<br />

a depth test against the so far available depth map to sieve unoccluded voxels. This<br />

procedure is repeated <strong>for</strong> every input view to accumulate the necessary in<strong>for</strong>mation <strong>for</strong><br />

color consistency calculation.<br />

The following sections describe the steps per<strong>for</strong>med in our implementation in more<br />

detail.<br />

5.3.1 Initialization<br />

In addition to the currently calculated voxel layer, the algorithm maintains a depth map for every input view to test the visibility of voxels. Since voxel layers are processed in a front-to-back ordering, it is sufficient to use bitmaps to represent the depth map (pixels with value 1 indicate empty space along the line-of-sight, whereas value 0 denotes rays with already processed opaque voxels). In this work we use range images for the depth maps, with gray levels indicating the depth of the voxel layer, for better visual feedback.

At the beginning of the sweep these depth maps are cleared with a value indicating empty voxels (i.e. 1). Additionally, we need to handle voxels that are outside the viewing volume of a camera as well (since other cameras can possibly see these voxels). We set the texture coordinate wrapping mode to GL_CLAMP to handle voxels outside the frustum correctly. Whenever a depth outside the frustum is accessed, a minimal depth value (0) is returned. Note that only voxels in front of the camera can be culled against the viewing frustum, therefore all camera positions must be entirely outside the reconstructed volume.

Figure 5.2: Perspective texture mapping using visibility information. The original input image (depicted on the leftmost quad) is filtered using the depth map (in the middle), and only unoccluded pixels are rendered on the current voxel layer.

5.3.2 Voxel Layer Generation<br />

With the knowledge of the depth maps generated <strong>for</strong> every view so far, an estimate <strong>for</strong><br />

photo-consistency can be calculated. We accumulate the consistency value very similar<br />

to the method proposed by Yang et al. [Yang et al., 2002]. In order to obtain the color<br />

of a voxel as seen from a particular input view, projective texture mapping is applied to<br />

determine the color hypo<strong>thesis</strong> <strong>for</strong> every voxel in the current layer. The color hypotheses<br />

<strong>for</strong> all visible views are accumulated to obtain a consistency score <strong>for</strong> each voxel.<br />

Using the color variance as the consistency function is suboptimal on graphics hardware. First, a significant number of passes is needed to calculate the variance∗, and the squaring operation causes numerical problems due to the limited precision available on the GPU.

∗ One sweep over all input views is required to count the number of visible views for every voxel; another sweep is required to calculate the mean, and a third sweep is required to obtain the variance.

A simple consistency measure is the length of the interval generated by the color hypotheses for a voxel, which can be easily computed on graphics hardware and turned out to result in reasonable reconstructions. More formally, the consistency value c of a voxel projected to the pixel with color c_i = (c_i.r, c_i.g, c_i.b) in input view i is assigned to

\[ c = \max_{j \in \{r,g,b\}} \big( \max_i \, c_i.j - \min_i \, c_i.j \big) . \]

If the color hypotheses have a significant disparity, then the interval is too large and the voxel is labeled as inconsistent. Calculation of the interval length can be done with two complete sweeps over the input views: the first sweep uses a blending equation set to GL_MIN and the second sweep sets the blending equation to GL_MAX. A final pass calculates the length of the interval, but this step can be integrated into the thresholding step to determine consistent voxels.
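For clarity, the interval-length consistency measure can also be stated as the following CPU reference; the GPU implementation accumulates the same per-channel minima and maxima over all visible views using the GL_MIN/GL_MAX blending passes mentioned above, and the threshold is user-defined as in the text.

#include <vector>
#include <algorithm>

struct Rgb { float r, g, b; };

// Interval-length consistency: the largest per-channel spread of the color
// hypotheses gathered from all views that see the voxel (smaller is more consistent).
float interval_consistency(const std::vector<Rgb>& hypotheses)
{
    if (hypotheses.empty()) return 0.0f;
    Rgb lo = hypotheses.front(), hi = hypotheses.front();
    for (const Rgb& c : hypotheses) {          // corresponds to the GL_MIN and GL_MAX passes
        lo.r = std::min(lo.r, c.r); hi.r = std::max(hi.r, c.r);
        lo.g = std::min(lo.g, c.g); hi.g = std::max(hi.g, c.g);
        lo.b = std::min(lo.b, c.b); hi.b = std::max(hi.b, c.b);
    }
    return std::max({hi.r - lo.r, hi.g - lo.g, hi.b - lo.b});
}

// A voxel is labeled consistent (opaque) if the spread stays below the threshold.
bool is_consistent(const std::vector<Rgb>& hypotheses, float threshold)
{
    return interval_consistency(hypotheses) < threshold;
}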

The final result of this step is an opacity bitmap (stored in an off-screen pixel buffer)<br />

indicating consistent voxels of the currently processed layer. This binary image constitutes<br />

one slice of the final volumetric model <strong>and</strong> is used to update the visibility in<strong>for</strong>mation<br />

(Section 5.3.3). In our implementation the opacity of a voxel is stored in the alpha channel<br />

<strong>and</strong> the mean color of the voxel is stored in the remaining channels.<br />

In order to achieve high per<strong>for</strong>mance we exploit several features of graphics hardware:<br />

Visibility Determination Only views that are actually able to see a voxel contribute to<br />

the consistency value <strong>and</strong> image pixels from occluded cameras should be ignored. We employ<br />

the alpha test functionality <strong>for</strong> visibility calculation. The depth index of the current<br />

voxel layer is compared with the value stored in the depth map <strong>for</strong> the appropriate view.<br />

Pixels that fail the alpha test are discarded <strong>and</strong> are there<strong>for</strong>e ignored during consistency<br />

calculation.<br />

Note that it is possible to count the number of visible cameras <strong>for</strong> a voxel efficiently<br />

using the stencil buffer. Using this count it is easily possible to extract only surface voxels<br />

of the model.<br />

Selection of Consistent Voxels Voxels of the current layer are labeled as opaque if<br />

they are photo-consistent <strong>and</strong> if they are not part of the background. In our implementation<br />

dark pixels with an intensity value below some user-defined threshold are treated as<br />

background pixels <strong>and</strong> the state of the voxels is set to empty.<br />

Additional Processing At this stage of the procedure, additional processing of the voxel bit-plane can be applied. In particular, prior knowledge from previous sweeps (see Section

5.4) can be used to refine the generated slice. Furthermore, the generated voxel slice<br />

can be copied into a 3D texture used <strong>for</strong> direct visualization of the obtained volumetric<br />

model.



5.3.3 Updating the Depth Maps<br />

After determining filled voxels in the current layer, the depth maps must be updated to<br />

reflect occlusions of the additional solid voxels. For each input view the depth map is<br />

selected as rendering target <strong>and</strong> the corresponding camera matrix is used <strong>for</strong> projection.<br />

The blending mode is set to GL_MIN to achieve a conditional update of depth values. We apply a small fragment program to filter empty voxels by assigning a maximum depth value to these pixels. Consequently, transparent voxels do not affect the depth map.
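The effect of this update can be written as a small CPU reference (array names are illustrative): opaque voxels conditionally lower the stored depth, while empty voxels are mapped to the maximum depth, which is exactly why they cannot affect the depth map under GL_MIN blending.

#include <vector>
#include <algorithm>
#include <cstddef>

// Update one view's depth map after a voxel layer has been classified.
// opaque[i] marks filled voxels of the current layer projected into this view,
// layer_depth is the depth index of the layer, and max_depth plays the role
// assigned to empty voxels by the fragment program.
void update_depth_map(std::vector<float>& depth_map,
                      const std::vector<unsigned char>& opaque,
                      float layer_depth, float max_depth)
{
    for (std::size_t i = 0; i < depth_map.size(); ++i) {
        // Empty voxels contribute max_depth and therefore never change the map;
        // opaque voxels conditionally lower the stored depth (GL_MIN behaviour).
        const float candidate = opaque[i] ? layer_depth : max_depth;
        depth_map[i] = std::min(depth_map[i], candidate);
    }
}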

Figure 5.3 shows the successive update of depth maps <strong>for</strong> two input views. Snapshots<br />

of the depth map were taken after 25%, 50% <strong>and</strong> 100% of the reconstruction process.<br />

(a) view 1: 25% (b) 50% (c) 100%<br />

(d) view 2: 25% (e) 50% (f) 100%<br />

Figure 5.3: Evolution of depth maps <strong>for</strong> two views during the sweep process. Darker<br />

regions are closer to the camera. The images show depth maps obtained after processing<br />

25%, 50% <strong>and</strong> 100% of the reconstructed volume.



5.3.4 Immediate Visualization<br />

Immediate visual feedback is necessary to evaluate the quality of the reconstructed model<br />

rapidly. Reading back the voxel model from graphics memory into main memory to<br />

generate a surface representation is expensive <strong>and</strong> time-consuming, there<strong>for</strong>e direct volume<br />

rendering methods [Engel <strong>and</strong> Ertl, 2002] are more appropriate. The individual slices<br />

obtained by voxel coloring can be copied into a 3D texture <strong>and</strong> visualized immediately.<br />

Alternatively, the depth images generated <strong>for</strong> the input views can be displayed as displacement<br />

map [Kautz <strong>and</strong> Seidel, 2001], which allows the height-field stored in a texture<br />

to be rendered from novel views <strong>for</strong> visual inspection.<br />

5.4 Extensions to Multi Sweep Space Carving<br />

The procedure described in Section 5.3 is limited to cameras fulfilling the ordinal visibility<br />

constraints. In order to obtain reconstructions <strong>for</strong> more general camera setups, the plane<br />

sweep procedure is repeated several times <strong>for</strong> different sweep directions. Only a compatible<br />

set of cameras is used in each iteration. The difference to the single sweep approach lies<br />

in the amount of knowledge from the prior sweeps used in the current sweep. We have<br />

tested three alternatives:<br />

Independent Sweeps All sweeps are per<strong>for</strong>med independently <strong>and</strong> no prior in<strong>for</strong>mation<br />

is used in the current sweep. The reconstructed volumetric model is the intersection of<br />

the models generated by the independent sweeps. The intersection of the obtained voxel<br />

models is per<strong>for</strong>med by the main CPU. This approach has no restriction on the resolution<br />

of the voxel space, but the frequent transfer of voxel data from graphics memory imposes a<br />

severe per<strong>for</strong>mance penalty. In our experiments we observed significantly longer running<br />

times, when voxel data is read back into main memory. Copying image data from the<br />

frame buffer or texture memory into main memory is a rather slow operation (in contrast<br />

to the reverse direction). This performance penalty depends on the resolution, and results in more than doubled execution time, e.g. at 256³ scene resolution.

Complete Prior Knowledge The opacity value of the voxels generated in the previous<br />

sweep is stored in a 3D texture, which is used in the subsequent sweep to determine already<br />

carved voxels. The need <strong>for</strong> a 3D texture residing on graphics memory limits the maximum<br />

resolution of the voxel space. On consumer level graphics hardware the resolution of the voxel space is typically bounded by 256³. Two 3D textures are required simultaneously;

one texture represents the previous model <strong>and</strong> the other one serves as destination <strong>for</strong> the<br />

model generated in the current sweep. Additionally, the continuous access of a 3D texture<br />

lowers the runtime per<strong>for</strong>mance of the implementation. A significant advantage of this<br />

approach is the opportunity to visualize the generated model immediately using direct<br />

volume rendering methods.



Partial Prior Knowledge In order to avoid the expensive 3D texture representing<br />

complete prior knowledge, a height field can be used as a trade-off between the <strong>for</strong>mer two<br />

alternatives. In the following we assume orthogonal sweep directions along the major axis<br />

of the voxel space. In addition to the depth maps <strong>for</strong> the input views, the preceding sweep<br />

maintains a depth map in the sweep direction. This height-field is used to inhibit already<br />

carved voxels from being classified as opaque in the current sweep. This can be achieved<br />

by comparing the appropriate component of the voxel position with the value stored in<br />

the height field (see Figure 5.4).<br />

Figure 5.4: Plane sweep with partial knowledge from the preceding sweeps. Carved voxels remain unfilled by using a depth image. The shaded region is known to be empty from the previous sweep, therefore filling voxels inside this region is prohibited. (The sketch shows the current sweep direction with depth indices 1, 2, 3, …, 8, orthogonal to the previous sweep direction.)

The final model is again the intersection of the volumetric models generated by the<br />

sweeps, since the incoming knowledge <strong>for</strong> each sweep is only a partial model. In order to<br />

avoid the expensive transfer of data from graphics memory to per<strong>for</strong>m this intersection<br />

in software, we display the result of the final sweep to the user. Additionally, we use the<br />

height-fields of all prior sweeps to approximate the volumetric model.<br />

In this approach the available graphics memory does not limit the voxel space resolution,<br />

but the depth of the color channel is a restricting factor, if high precision depth<br />

buffers are not available.
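A minimal sketch of the height-field test used with partial prior knowledge is given below, assuming orthogonal sweeps along the major axes as in the text; the struct layout and function name are illustrative only.

#include <vector>

// Height-field recorded along the previous sweep direction: for every ray it stores
// the first depth index at which an opaque voxel was found (or a sentinel beyond the
// volume if the whole ray was carved).
struct HeightField {
    int width, height;
    std::vector<int> first_opaque;   // size: width * height
};

// A voxel may only be filled in the current sweep if the previous sweep did not
// already carve it, i.e. if it does not lie in front of the recorded surface.
// (u, v, d) are the voxel coordinates expressed in the previous sweep's frame,
// with d along the previous sweep direction.
bool allowed_by_previous_sweep(const HeightField& hf, int u, int v, int d)
{
    return d >= hf.first_opaque[v * hf.width + u];
}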



5.5 Experimental Results<br />

5.5.1 Per<strong>for</strong>mance Results<br />

We have implemented voxel coloring <strong>and</strong> space carving as described in Sections 5.3<br />

<strong>and</strong> 5.4. Our implementation is based on fragment shader features as exposed by the<br />

ATI fragment shader OpenGL extension. Hence it is possible to per<strong>for</strong>m hardware<br />

accelerated voxel coloring <strong>and</strong> space carving on low-end or mobile graphics hardware as<br />

well.<br />

At first we give per<strong>for</strong>mance results obtained by our implementation. The benchmarking<br />

system is equipped with an AMD Athlon XP2000 as CPU <strong>and</strong> an ATI Radeon 9700<br />

Pro as graphics hardware. The per<strong>for</strong>mance plots are created <strong>for</strong> the synthetic “Bowl”<br />

dataset (see Figure 5.7). 36 views of the model were captured using a virtual turntable<br />

software. Each sweep uses 9 views <strong>for</strong> reconstruction. Figure 5.5(a) presents timing results<br />

<strong>for</strong> the voxel coloring implementation at different resolutions. The required time <strong>for</strong><br />

voxel coloring is approximately linear in the depth resolution (i.e. the number of generated slices). Surprisingly, the time needed for resolutions from 32 × 32 × d up to 128 × 128 × d is close to the time required for 256 × 256 × d. The runtime for lower resolutions is dominated

by the expensive pixel buffer switches (which is linear in the number of slices, but<br />

independent of the resolution). At higher resolutions the fill rate of the graphics hardware<br />

becomes more dominant. For 256 × 256 × d scene resolutions our implementation of the<br />

voxel coloring approach generates 3D models at interactive rates.<br />

Figure 5.5(b) compares the observed timings <strong>for</strong> the proposed space carving methods.<br />

The final 3D model was generated using four sweeps in order to utilize all 36 captured<br />

views. The timings <strong>for</strong> single sweep voxel coloring are displayed <strong>for</strong> comparison, too. For<br />

resolutions up to 128 3 space carving is slightly more expensive than per<strong>for</strong>ming four voxel<br />

coloring sweeps, since some time is required to merge the individual sweeps. At 256 3<br />

resolution, space carving maintaining the full voxel model in graphics memory runs out of<br />

memory <strong>and</strong> requires substantially more time.<br />

5.5.2 Visual Results<br />

In this section we illustrate the visual quality of the obtained reconstructions. At first we<br />

demonstrate our implementation on a synthetic dataset obtained by off-screen rendering<br />

<strong>and</strong> capturing a 3D dinosaur model. The resolution of the input images is 256 × 256.<br />

Several input images are shown in Figure 5.6(a)–(c). The volumetric texture directly<br />

obtained by the space carving procedure is shown in Figure 5.6(d). In order to reduce the<br />

size of the 3D texture, only luminance values instead of colors are stored in the texture.<br />

Figure 5.6(e) <strong>and</strong> (f) are snapshots showing the 3D model as a point cloud within a VRML<br />

viewer.<br />

Another synthetic dataset, the “Bowl” dataset, is shown in Figure 5.7. The images<br />

were obtained under the same conditions as the Dino dataset. In Figure 5.7(d) complete prior knowledge stored in a 3D texture is used, whereas in Figure 5.7(e) the already carved model is approximated by height-fields. The latter model contains more outliers and noise, but the memory requirement is substantially reduced.

The real dataset consists of images showing a historic statue (Figure 5.8(a)–(c)). In Figure 5.8(d) the surface voxels of the reconstructed model generated from 7 input views are shown as a point cloud. The number of voxels is 1024 × 1024 × 250 and the pure voxel coloring took about 4.8s. Reading the voxels back into main memory and generating the VRML file requires an additional 40s. A lower resolution version (256³) of the same dataset, generated in 0.77s, is shown in Figure 5.9.

5.6 Discussion<br />

This chapter described a hardware accelerated approach <strong>for</strong> voxel coloring <strong>and</strong> space carving<br />

scene reconstruction methods. Voxel coloring can be per<strong>for</strong>med at interactive rates<br />

<strong>for</strong> medium scene resolutions, <strong>and</strong> volumetric models can be obtained with space carving<br />

very quickly (in the order of seconds). Despite the simple consistency measure used in<br />

our implementation, the obtained 3D models are suitable <strong>for</strong> visual feedback to the user<br />

to estimate the parameters used <strong>for</strong> the final high-quality, software-based reconstruction.<br />

With new features provided by modern graphics processors, more sophisticated consistency<br />

measures can be implemented. In particular, a histogram-based consistency measure<br />

[Stevens et al., 2002] is a potential c<strong>and</strong>idate <strong>for</strong> efficient implementation in graphics<br />

hardware.<br />

At low resolution the performance of our implementation is dominated by the multi-pass rendering overhead. Consequently, reducing the number of passes, especially at coarse resolutions, may yield near real-time generation of volumetric models. Such improvements need further investigation.


Figure 5.5: Timing results for the Bowl dataset. Each sweep used 9 views to calculate the consistency of voxels. (a) shows timing results for voxel coloring using a single plane sweep at different resolutions. (b) illustrates timing results for space carving using multiple sweeps at various voxel space resolutions. With the exception of voxel coloring, which is depicted for comparison, four sweeps are performed to obtain the final model. Space carving with complete prior knowledge requires almost 33s at 256³ resolution; this behavior is caused by a shortage of graphics memory. (Plot axes: time in millisecs vs. depth resolution in (a), with curves for 256×256×d, 512×512×d and 1024×1024×d; time in millisecs vs. voxel space resolution from 32³ to 256³ in (b), with curves for partial knowledge, independent sweeps, complete knowledge and voxel coloring.)


Figure 5.6: (a)–(c) Three input views (of 36) from the synthetic Dino dataset. (d) The obtained volumetric model visualized with a 3D texture. We use only luminance and alpha channels for the texture to reduce the memory footprint of the 3D texture. (e) and (f) show the 3D model rendered as a point cloud. In our current implementation, colors for surface voxels are assigned in the final sweep, hence surface voxels not seen in the final sweep have a default color.


Figure 5.7: (a)–(c) Three input views (of 36) from the synthetic Bowl dataset. (d) The obtained volumetric model visualized with a 3D texture. The model was generated in 1.4s. (e) is generated by approximating the result of previous sweeps with height-fields instead of a full 3D texture.


Figure 5.8: (a)–(c) Three input views from an image sequence showing a statue. (d) shows a high resolution reconstruction generated by carving 250 million initial voxels. Pure voxel coloring done in graphics hardware required less than 5s. Only surface voxels are shown as a point cloud.


Figure 5.9: (a) A 3D reconstruction generated by single sweep voxel coloring using a space of 256 × 256 × 250 voxels. 7 input views are used for the reconstruction. Voxel coloring and VRML generation required about 3s. The displayed geometry consists of surface voxels rendered as points, hence several holes are apparent. (b) A depth image for the same dataset generated in 0.77s.


Chapter 6

PDE-based Depth Estimation on the GPU

Contents

6.1 Introduction
6.2 Variational Techniques for Multi-View Depth Estimation
6.3 GPU-based Implementation
6.4 Results
6.5 Discussion

6.1 Introduction<br />

This chapter describes a variational approach to multi-view depth estimation, which is accelerated by 3D graphics hardware. Variational methods for multi-view depth estimation have their foundations in variational calculus and numerical analysis. The result of these procedures is a depth image which minimizes an energy functional incorporating image similarity and smoothness regularization terms. In contrast to many window-based dense matching approaches favoring fronto-parallel surfaces, the utilized variational depth estimation method is based on per-pixel image similarities and works well for slanted surfaces. Depth values interact with the surrounding depth hypotheses through the regularization term.

Energy-based approaches to dense correspondence estimation incorporate image similarity and smoothness constraints into the objective function and search for an appropriate minimum. Consequently, these methods allow the propagation of depth values into textureless regions, where no robust correspondences are available. Variational techniques express the discrete energy function in continuous terms and solve the corresponding Euler-Lagrange partial differential equation numerically.




In contrast to energy-based methods for image restoration and segmentation, variational techniques for multi-view depth require successive deformation (warping) of the sensor images according to the current depth map hypothesis. In particular, this step can be significantly accelerated by the texture units of graphics hardware, which offer the necessary image interpolation virtually for free. Furthermore, the numerical procedures to solve variational problems are typically algorithms with high parallelism and can be transferred to current generation graphics hardware for optimal performance.

This chapter outlines our implementation of the hardware-accelerated approach to variational depth estimation and presents the obtained results. We demonstrate that a substantial performance gain is achieved by our approach. Additionally, difficult settings for variational stereo methods resulting in incorrect 3D models are discussed and possible solutions proposed. Notice that very fast numerical solvers allow the convenient investigation of potentially more complex and robust image similarity measures and other extensions to the basic model of variational depth estimation.

6.2 Variational Techniques for Multi-View Depth Estimation

6.2.1 Basic Model

This section describes a variational approach to depth estimation following mostly [Strecha and Van Gool, 2002, Strecha et al., 2003]. In order to allow a one-dimensional search for a depth value at every pixel, the camera calibration matrices and the external orientations are assumed to be known. In order to utilize a true multi-view setup, pixels in one image are transferred by the epipolar geometry (as described below), and an image rectification procedure is not required. In the set of employed images one image Ii represents the key image, for which the depth map is generated. The other images, Ij, j ≠ i, are sensor images. The camera imaging Ii is assumed to be in canonical position (Pi = Ki [I|0]), the external orientation for Ij is [Rj|tj] and the camera calibration matrix is Kj. The depth map is calculated with respect to Ii, and depth values assigned to pixels in Ii transfer to the other images as follows: the corresponding pixel qij for a pixel pi in Ii with associated depth di is given by
$$ q_{ij}(p_i) = H_{ij}\, p_i + T_j / d_i, $$
where $H_{ij} = K_j R_j^t K_i^{-1}$ and $T_j = K_j t_j$. Note that pi and qij refer to homogeneous pixel positions and qij must be normalized by its third component.
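For illustration only, the following minimal C++ sketch evaluates this pixel transfer for a single pixel on the CPU. The matrix/vector types, the function name, and the toy values are chosen here for convenience and are not part of the thesis implementation.

```cpp
#include <array>
#include <cstdio>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// q_ij = H_ij * p_i + T_j / d_i, followed by normalization by the third component.
Vec3 transferPixel(const Mat3& H, const Vec3& T, const Vec3& p, double depth)
{
    Vec3 q{};
    for (int r = 0; r < 3; ++r)
        q[r] = H[r][0] * p[0] + H[r][1] * p[1] + H[r][2] * p[2] + T[r] / depth;
    // Convert the homogeneous result to pixel coordinates.
    return Vec3{q[0] / q[2], q[1] / q[2], 1.0};
}

int main()
{
    // Identity homography and a purely horizontal translation component (toy values).
    Mat3 H = {{ {{1, 0, 0}}, {{0, 1, 0}}, {{0, 0, 1}} }};
    Vec3 T = {100.0, 0.0, 0.0};
    Vec3 p = {320.0, 240.0, 1.0};   // homogeneous pixel position in the key image
    Vec3 q = transferPixel(H, T, p, 2.5);
    std::printf("q = (%.2f, %.2f)\n", q[0], q[1]);
    return 0;
}
```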

The primary goal of depth estimation is the assignment of depth values to every pixel of Ii, such that a cost function incorporating image similarity terms and smoothness terms is minimized. In particular, the following objective function is often used in variational stereo methods:



$$ E(d_i) = \sum_{p_i} \Big( \sum_j \big( I_j(q_{ij}(d_i(p_i))) - I_i(p_i) \big)^2 + \lambda\, \|\nabla d_i(p_i)\|^2 \Big) \;\rightarrow\; \min \qquad (6.1) $$

Since the depth map di is defined on a grid, ∇di refers to a suitable finite difference scheme to calculate the gradient. We omit the explicit dependence of di on the pixel pi and abbreviate Ij(qij(di(pi))) as Ij(di). Minimizing Eq. 6.1 using discrete (non-continuous) methods can be achieved using e.g. graph cut methods [Boykov et al., 2001, Kolmogorov and Zabih, 2001, Kolmogorov and Zabih, 2002, Kolmogorov and Zabih, 2004]. Alternatively, Eq. 6.1 can be seen as a discrete approximation to a continuous minimization problem, and techniques from variational calculus can be applied. The continuous formulation of Eq. 6.1 is

$$ S(d_i) = \int_p \Big( \sum_j \big( I_j(d_i) - I_i \big)^2 + \lambda\, \|\nabla d_i\|^2 \Big)\, dp \;\rightarrow\; \min \qquad (6.2) $$

The Euler-Lagrange equation states a necessary condition for the function di to be a stationary value with respect to S [Lanczos, 1986]:
$$ \frac{\delta S}{\delta d_i} = \sum_j \frac{\partial I_j}{\partial d_i} \big( I_j(d_i) - I_i \big) - \lambda \nabla^2 d_i \overset{!}{=} 0 \qquad (6.3) $$

Note that this equation holds for every pixel p in Ii. The spatial derivative ∂Ij/∂di is the intensity change along the epipolar line in image Ij. By discretizing Eq. 6.3 one can solve the associated partial differential equation using a numerical scheme on the grid of pixels. We describe a particular approach, which is very suitable for a GPU-based implementation.

At first, the image intensities Ij(di) are locally linearized around $d_i^0$ using the first order Taylor expansion:
$$ I_j(d_i) = I_j(d_i^0 + \Delta d_i) \approx I_j(d_i^0) + \frac{\partial I_j(d_i^0)}{\partial d_i}\, \Delta d_i. $$

Applying this expansion to the Euler-Lagrange equation yields
$$ \sum_j \frac{\partial I_j}{\partial d_i} \Big( I_j(d_i^0) + \frac{\partial I_j(d_i^0)}{\partial d_i}\, \Delta d_i - I_i \Big) - \lambda \nabla^2 d_i = 0. \qquad (6.4) $$

In combination with a (linear) finite differencing scheme for ∇²di, the equation above results in a huge but sparse linear system to solve for di. This scheme iteratively refines the estimate of the depth map di given its previous estimate.

In order to prevent the scheme from converging to a suboptimal local minimum, a coarse-to-fine approach is mandatory.



Diffusion type | Term in S | Derivative
Homogeneous diffusion | $\nabla^t d\, \nabla d = \|\nabla d\|^2$ | $\nabla^2 d$
Image-driven isotropic diffusion | $\nabla^t d\; g(\|\nabla I\|^2)\, \nabla d$ | $\mathrm{div}(g(\|\nabla I\|^2)\, \nabla d)$
Image-driven anisotropic diffusion | $\nabla^t d\; D(\nabla I)\, \nabla d$ | $\mathrm{div}(D(\nabla I)\, \nabla d)$
Flow-driven isotropic diffusion | $\nabla^t d\; g(\|\nabla d\|^2)\, \nabla d$ | $\mathrm{div}(g(\|\nabla d\|^2)\, \nabla d)$
Flow-driven anisotropic diffusion | $\nabla^t d\; D(\nabla d)\, \nabla d$ | $\mathrm{div}(D(\nabla d)\, \nabla d)$

Table 6.1: Regularization terms induced by diffusion processes

6.2.2 Regularization<br />

Taking the Laplacian of the depth map, ∇²di, to guide the regularization usually gives too smooth results, and the obtained depth maps lack sharp depth discontinuities. Table 6.1 lists several regularization functions based on diffusion processes, mostly in accordance with the taxonomy of Weickert et al. [Weickert et al., 2004]. In this table the function g(s²) is a decreasing scalar function based solely on the magnitude of the gradient, e.g. g(s²) = exp(−Ks²) (for a user specified K). D(∇c) denotes the diffusion tensor

$$ D(\nabla c) = \frac{1}{\|\nabla c\|^2 + 2\nu^2} \left( \begin{pmatrix} \frac{\partial c}{\partial y} \\ -\frac{\partial c}{\partial x} \end{pmatrix} \begin{pmatrix} \frac{\partial c}{\partial y} \\ -\frac{\partial c}{\partial x} \end{pmatrix}^{\!t} + \nu^2 I \right). $$

ν is a small constant to prevent singularities in perfectly homogeneous regions; setting ν to 0.001 is a common choice. Note that D(∇c) is very similar to the structure tensor used to detect image corners. If for example |∂c/∂x| ≫ |∂c/∂y| (a vertical edge in the image), the diffusion is inhibited in the x-direction.
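As a plain illustration of this tensor, the following C++ sketch evaluates D(∇c) for a single gradient value; the function name is ours and the gradient would in practice come from finite differences on the image.

```cpp
#include <array>
#include <cstdio>

// 2x2 matrix stored row-major.
using Mat2 = std::array<double, 4>;

// Diffusion tensor as defined above:
// D(grad c) = ( (c_y, -c_x)(c_y, -c_x)^t + nu^2 I ) / ( |grad c|^2 + 2 nu^2 ).
Mat2 diffusionTensor(double cx, double cy, double nu)
{
    const double px = cy, py = -cx;   // direction perpendicular to the gradient
    const double denom = cx * cx + cy * cy + 2.0 * nu * nu;
    return Mat2{ (px * px + nu * nu) / denom, (px * py) / denom,
                 (py * px) / denom,           (py * py + nu * nu) / denom };
}

int main()
{
    // Strong vertical edge: |dc/dx| >> |dc/dy|, so diffusion across the edge is inhibited.
    Mat2 D = diffusionTensor(10.0, 0.5, 0.001);
    std::printf("D = [%g %g; %g %g]\n", D[0], D[1], D[2], D[3]);
    return 0;
}
```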

Isotropic diffusion inhibits diffusion at discontinuities regardless of the direction of the gradient, whereas anisotropic regularization allows diffusion parallel to edge discontinuities. Image-driven regularization is based solely on the gradients calculated in the source data (images), and the numerical scheme results in linear expressions. Hence, image-driven diffusion is also called linear diffusion [Weickert and Brox, 2002]. In flow-based regularization the diffusion stops at discontinuities of the current flow or depth map, respectively. Consequently, the equation system derived from finite differencing is a nonlinear system and requires e.g. fixed-point iterations to be solved.

Note that the terminology is not uniform in the literature: flow-driven isotropic diffusion is often referred to as nonlinear anisotropic diffusion [Perona and Malik, 1990]. In addition to homogeneous diffusion we employ an image-driven (linear) anisotropic regularization approach [Nagel and Enkelmann, 1986] for the following reasons:

• The anisotropy of this regularization adapts very well to homogeneous image region boundaries and allows smoothing along image edges.

• The linear nature of the numerical scheme allows efficient sparse matrix solvers to be utilized.



Pure image-driven diffusion as employed for image smoothing and denoising will fail in highly textured regions, but in this case the discriminative image data will result in a correct determination of the final depth map.

6.2.3 Extensions and Variations

In the literature several extensions and enhancements have been proposed to increase the quality and reliability of variational approaches to depth estimation. We summarize a few important concepts in this section.

6.2.3.1 Back-Matching<br />

In order to increase the robustness of the variational depth estimation method and to detect mismatches, a back-matching scheme can be utilized to assign confidence values to the depth values. Confident depth estimates should have a higher influence in the regularization term for adjacent pixels with lower confidence.

In a back-matching setting, every image Ii takes the role of a key image and a dense depth map is computed with several Ij, j ≠ i, as sensor images. If di denotes the depth map computed for Ii and qij(p, di) represents the transfer of a pixel p in image Ii with the associated depth into Ij, then the forward-backward error is
$$ e_{ij} = \| p - q_{ji}(q_{ij}(p, d_i),\, d_j) \|. $$

The confidence cij is now a function of eij, e.g.
$$ c_{ij} = \frac{1}{1 + k\, e_{ij}} \qquad\text{or}\qquad c_{ij} = \exp\!\Big(-\frac{e_{ij}^2}{k}\Big). $$

If cij is close to 1, the depth value is highly confident; values of cij close to zero indicate unreliable depth values. In [Strecha et al., 2003] the following energy functional is proposed:
$$ S(d_i) = \int_p \Big( \sum_j c_{ij}\, \big( I_j(d_i) - I_i \big)^2 + \lambda\, \nabla^t d_i\, D(\nabla C_i)\, \nabla d_i \Big)\, dp \;\rightarrow\; \min, $$
where Ci = maxj(cij) and D(∇Ci) is an anisotropic diffusion operator. The corresponding Euler-Lagrange equation reads
$$ \frac{\delta S}{\delta d_i} = \sum_j c_{ij}\, \frac{\partial I_j}{\partial d_i}\, \big( I_j(d_i) - I_i \big) - \lambda\, \mathrm{div}\big( D(\nabla C_i)\, \nabla d_i \big) \overset{!}{=} 0. $$
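A minimal CPU-side sketch of the back-matching confidence for one pixel is given below. The transfer functions passed in stand for the epipolar transfers qij and qji; the struct, names, and toy mappings in main are our own and only illustrate the forward-backward consistency check.

```cpp
#include <cmath>
#include <cstdio>
#include <functional>

// 2D pixel position.
struct Pixel { double x, y; };

// transferIJ(p, d) maps a pixel of image I_i with depth d into image I_j,
// transferJI is the reverse mapping (both stand in for q_ij / q_ji).
double backMatchingConfidence(const Pixel& p,
                              double depthInI, double depthInJ, double k,
                              const std::function<Pixel(const Pixel&, double)>& transferIJ,
                              const std::function<Pixel(const Pixel&, double)>& transferJI)
{
    // Forward transfer into I_j, then back into I_i using the depth of I_j.
    const Pixel q  = transferIJ(p, depthInI);
    const Pixel pp = transferJI(q, depthInJ);
    // Forward-backward error e_ij and confidence c_ij = 1 / (1 + k e_ij).
    const double e = std::hypot(pp.x - p.x, pp.y - p.y);
    return 1.0 / (1.0 + k * e);
}

int main()
{
    auto shiftRight = [](const Pixel& p, double d) { return Pixel{p.x + 100.0 / d, p.y}; };
    auto shiftLeft  = [](const Pixel& p, double d) { return Pixel{p.x - 100.0 / d, p.y}; };
    // Consistent depths give zero error and confidence 1; inconsistent depths lower it.
    std::printf("consistent:   %.3f\n", backMatchingConfidence({50, 60}, 2.0, 2.0, 0.5, shiftRight, shiftLeft));
    std::printf("inconsistent: %.3f\n", backMatchingConfidence({50, 60}, 2.0, 4.0, 0.5, shiftRight, shiftLeft));
    return 0;
}
```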



6.2.3.2 Local Changes in Illumination<br />

If the scene to be reconstructed does not consist solely of purely Lambertian surfaces with diffuse reflection behavior, illumination changes appear between the images. These local lighting changes can be modeled by an additional intensity scaling function κij, which scales the intensity values of Ii to match the intensities in Ij. The extended energy function is
$$ S(d_i, \kappa_{ij}) = \int_p \Big( \sum_j \big( I_j(d_i) - \kappa_{ij} I_i \big)^2 + \lambda \|\nabla d_i\|^2 + \lambda_2 \|\nabla \kappa_{ij}\|^2 \Big)\, dp \;\rightarrow\; \min, $$

since both di and κij are assumed to change smoothly over the image domain. The corresponding Euler-Lagrange equations for di and κij are now:
$$ \frac{\delta S}{\delta d_i} = \sum_j \frac{\partial I_j}{\partial d_i}\, \big( I_j(d_i) - \kappa_{ij} I_i \big) - \lambda \nabla^2 d_i $$
$$ \frac{\delta S}{\delta \kappa_{ij}} = I_i \big( I_j(d_i) - \kappa_{ij} I_i \big) - \lambda_2 \nabla^2 \kappa_{ij}. $$

Of course, confidence evaluation using back-matching and the estimation of local lighting changes can be combined into one framework.

In case of local illumination changes both the intensity scaling and the depth map will be affected. It is impossible to correctly estimate the depth from the available local information only, since both the depth and the intensity scaling process will adapt to match the pixel intensity values.

6.2.3.3 Other Variations<br />

The energy functional presented in Eq. 6.2 and used in the previous sections can be modified in various ways. At first, the L2 data term (Ij(di) − Ii)² can be replaced by a suitable function Ψ on the intensity differences, e.g.
$$ \Psi\big( I_j(d_i) - I_i \big) = \sqrt{ \big( I_j(d_i) - I_i \big)^2 + \varepsilon^2 } $$
for small ε [Brox et al., 2004, Slesareva et al., 2005]. This choice of Ψ is a smooth, differentiable L1 norm. Additionally, the data term may incorporate intensity gradients and other higher order information as well [Papenberg et al., 2005].

If the L1 image data term is utilized, it is common to employ a total variation regularization [Rudin et al., 1992], ‖∇d‖, instead of the quadratic one. In general, the choice of the regularization significantly affects the results, especially close to discontinuities.



6.3 GPU-based Implementation<br />

This section describes our implementation of the variational depth estimation technique on a GPU. Depth estimation in our application is performed on a set of three images (one key image plus two sensor images). In general, three passes are performed in every iteration of depth refinement:

1. In the first pass the sensor images Ij are warped according to the current depth map hypothesis and the spatial derivatives ∂Ij/∂di are calculated.

2. Expressions used in the regularization term are precomputed, e.g. the Laplacian or the anisotropic flow used in the subsequent semi-implicit solvers.

3. Finally, the depth estimates are updated using a semi-implicit strategy derived from Eq. 6.4.

The next sections describe each pass in more detail. These iterations are embedded in a coarse-to-fine framework using a Gaussian image pyramid to avoid immediate convergence to a local minimum. The depth map obtained after convergence at the coarser level is used as the initial depth map at the next finer level.

6.3.1 Image Warping<br />

The first pass of the GPU-based depth estimation implementation consists of warping the sensor images Ij according to the depth map di. The lookup in image Ij is performed using the epipolar parametrization
$$ q_{ij} = (x, y, 1)^t = H_{ij}\, p_i + T_{ij}/d_i. $$
Consequently, the warped image according to the current depth hypothesis can be obtained by dependent texture lookups. The required spatial derivative ∂Ij(di)/∂di can be efficiently calculated by the chain rule:
$$ \frac{\partial I_j(d_i)}{\partial d_i} = \frac{\partial I_j(q_{ij})}{\partial q_{ij}}\, \frac{\partial q_{ij}}{\partial d_i} = \begin{pmatrix} \partial I_j/\partial x \\ \partial I_j/\partial y \end{pmatrix}^{\!t} \begin{pmatrix} \partial x/\partial d_i \\ \partial y/\partial d_i \end{pmatrix}. $$



If we define $X = (X^{(1)}, X^{(2)}, X^{(3)})^t = H_{ij}\, p_i + T_{ij}/d_i$ and $T_{ij} = (T_{ij}^{(1)}, T_{ij}^{(2)}, T_{ij}^{(3)})^t$, then we have
$$ \frac{\partial x}{\partial d_i} = \frac{T_{ij}^{(1)} X^{(3)} - T_{ij}^{(3)} X^{(1)}}{(X^{(3)})^2\, d_i^2}, \qquad \frac{\partial y}{\partial d_i} = \frac{T_{ij}^{(2)} X^{(3)} - T_{ij}^{(3)} X^{(2)}}{(X^{(3)})^2\, d_i^2}. $$

The advantage of this scheme is that, with precomputed gradient images ∇Ij, the spatial derivative along the epipolar line, ∂Ij(di)/∂di, can be easily calculated, and the computation of X = Hij pi + Tij/di can be shared if Ij(di) and its derivative are calculated in the same fragment program. In our implementation, a texture representing Ij holds the intensity value and the horizontal and vertical gradients in its three channels. Image warping assigns Ij(di) and its derivative to the two channels of the target buffer.

Note that Hij pi need not be calculated for every pixel, but can be linearly interpolated by the GPU rasterizer like any other texture coordinate. On our hardware the performance gain was rather minimal, since the matrix-vector multiplication in the fragment program is mostly hidden by the required texture fetches.

The GPU version of this step performs approximately 100 times faster than a straightforward, but otherwise completely equivalent, software implementation.
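For reference, a CPU-side sketch of this warping step for a single pixel might look as follows. The image struct, the bilinear sampling helper, and all names are our own choices and the derivative expression simply mirrors the formulas above; this is not the fragment program used in the thesis.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Grayscale image with precomputed gradients, stored row-major.
struct GradImage {
    int w, h;
    std::vector<float> I, Ix, Iy;   // intensity and horizontal/vertical gradients
    float sample(const std::vector<float>& c, float x, float y) const {
        // Bilinear lookup with clamping, mimicking the GPU texture unit.
        x = std::fmin(std::fmax(x, 0.0f), float(w - 1));
        y = std::fmin(std::fmax(y, 0.0f), float(h - 1));
        int x0 = int(x), y0 = int(y);
        int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
        float fx = x - x0, fy = y - y0;
        float top = (1 - fx) * c[y0 * w + x0] + fx * c[y0 * w + x1];
        float bot = (1 - fx) * c[y1 * w + x0] + fx * c[y1 * w + x1];
        return (1 - fy) * top + fy * bot;
    }
};

// Warp one pixel of the sensor image I_j and return the warped intensity and
// its derivative along the epipolar line, dI_j(d_i)/dd_i.
void warpWithDerivative(const GradImage& Ij,
                        const double H[3][3], const double T[3],
                        double px, double py, double depth,
                        float& warped, float& dI_dd)
{
    double X[3];
    for (int r = 0; r < 3; ++r)
        X[r] = H[r][0] * px + H[r][1] * py + H[r][2] + T[r] / depth;

    const double x = X[0] / X[2], y = X[1] / X[2];
    // Derivatives of the normalized coordinates with respect to the depth
    // (same expressions as in the text above).
    const double denom = X[2] * X[2] * depth * depth;
    const double dx_dd = (T[0] * X[2] - T[2] * X[0]) / denom;
    const double dy_dd = (T[1] * X[2] - T[2] * X[1]) / denom;

    warped = Ij.sample(Ij.I, float(x), float(y));
    dI_dd  = float(Ij.sample(Ij.Ix, float(x), float(y)) * dx_dd +
                   Ij.sample(Ij.Iy, float(x), float(y)) * dy_dd);
}
```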

6.3.2 Regularization Pass<br />

If Laplacian regularization is employed, a simple fragment program is sufficient to calculate ∇²di. The more interesting case is the utilization of image-based or confidence-based anisotropic diffusion to control the depth map regularization. Both regularization approaches yield linear numerical schemes, since the diffusion weights remain constant for the current level in the image pyramid.

Confidence images are created as follows: after determining the depth maps at the next-coarser resolution, a confidence map cij between views i and j is generated with cij = 1/(1 + k eij), where eij = ‖p − qji(qij(p, di), dj)‖ is the back-matching error. This confidence map remains constant for the current resolution level. The confidence values cij adjacent to a pixel are normalized such that their sum is one. For every pixel this results in a weight vector W with four components. The regularization term is calculated as

$$ \begin{pmatrix} W^{[x-1]} \\ W^{[x+1]} \\ W^{[y-1]} \\ W^{[y+1]} \end{pmatrix}^{\!t} \begin{pmatrix} d_i^{[x-1]} - d_i \\ d_i^{[x+1]} - d_i \\ d_i^{[y-1]} - d_i \\ d_i^{[y+1]} - d_i \end{pmatrix}. $$
This is proportional to the standard Laplacian if W is set to $(1/4, 1/4, 1/4, 1/4)^t$.
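A short CPU sketch of this confidence-weighted regularization term at one interior pixel is shown below; the array layout (four normalized weights per pixel, neighbours in the order left, right, up, down) is our own convention for illustration.

```cpp
#include <vector>

// Depth map d and per-pixel neighbour weights W, both stored row-major.
// W holds the four (already normalized) weights per pixel in the order
// left, right, up, down. Border pixels are not handled here for brevity.
float weightedRegularizer(const std::vector<float>& d,
                          const std::vector<float>& W,
                          int width, int x, int y)
{
    const int i = y * width + x;
    const float di = d[i];
    const float* w = &W[4 * i];
    // Weighted sum of neighbour differences; with w = (1/4, 1/4, 1/4, 1/4)
    // this is proportional to the standard 4-neighbour Laplacian.
    return w[0] * (d[i - 1]     - di) +
           w[1] * (d[i + 1]     - di) +
           w[2] * (d[i - width] - di) +
           w[3] * (d[i + width] - di);
}
```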



6.3.3 Depth Update Equation<br />

The finite difference scheme of equation 6.4 (respectively one of its extensions) is a large system of equations in the unknowns ∆di for every pixel:
$$ \sum_j \frac{\partial I_j}{\partial d_i} \Big( I_j(d_i) + \frac{\partial I_j(d_i)}{\partial d_i}\, \Delta d_i - I_0 \Big) - \lambda \nabla^2 (d_i + \Delta d_i) = 0 \qquad (6.5) $$
Approximating the Laplacian (resp. the employed diffusion term) by a linear operator, the system becomes a sparse one, and the unknowns ∆di are coupled only for adjacent pixels through the regularization term, yielding a sparse system matrix.

Using the standard 4-star scheme to calculate the Laplacian, the matrix of the sparse linear system obtained from the above equation has a special structure containing 5 diagonal bands (Figure 6.1). Two iterative numerical schemes to solve sparse linear systems are currently applicable for the GPU: the Jacobi method and the conjugate gradient method.

6.3.3.1 Jacobi Iterations<br />

In order to solve a linear system Ax = b with diagonally dominant matrix A, the Jacobi method performs the following iteration:
$$ x^{(n+1)} = D^{-1}\big( (D - A)\, x^{(n)} + b \big), $$
where D is the diagonal part of A. Consequently, the new components of x^{(n+1)} depend only on the old values of x^{(n)}. The update procedure for every pixel according to Eq. 6.5 is now
$$ \Delta d_i^{(n+1)} = \frac{ \lambda \Big( \nabla^2 d_i + \tfrac{1}{4} \sum_{p \in N} \Delta d_p^{(n)} \Big) - \sum_j \frac{\partial I_j(d_i)}{\partial d_i} \big( I_j(d_i) - I_0 \big) }{ \lambda + \sum_j \Big( \frac{\partial I_j(d_i)}{\partial d_i} \Big)^2 }, $$
where p ∈ N runs over the four pixels adjacent to the current pixel. After several iterations of this inner loop to obtain a converged $\Delta d_i^{\mathrm{final}}$, the depth map is updated as $d_i^{(k+1)} = d_i^{(k)} + \Delta d_i^{\mathrm{final}}$.
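The following CPU sketch performs one such Jacobi sweep over the depth-update field; the precomputed input arrays and their names are our own packaging of the terms appearing in the update formula, not the GPU data layout of the thesis implementation.

```cpp
#include <vector>

// One Jacobi sweep for the depth update dd (all sensor images already warped).
// lapD     : per pixel, the (scaled) Laplacian of the current depth map
// sumGradI : per pixel, sum_j (dI_j/dd)^2
// sumRes   : per pixel, sum_j (dI_j/dd) * (I_j(d) - I_0)
void jacobiSweep(const std::vector<float>& lapD,
                 const std::vector<float>& sumGradI,
                 const std::vector<float>& sumRes,
                 const std::vector<float>& ddOld,
                 std::vector<float>& ddNew,
                 int width, int height, float lambda)
{
    for (int y = 1; y < height - 1; ++y)
        for (int x = 1; x < width - 1; ++x) {
            const int i = y * width + x;
            // Average of the previous updates at the four neighbours.
            const float nbr = 0.25f * (ddOld[i - 1] + ddOld[i + 1] +
                                       ddOld[i - width] + ddOld[i + width]);
            // Update formula from the text (border pixels skipped for brevity).
            ddNew[i] = (lambda * (lapD[i] + nbr) - sumRes[i]) /
                       (lambda + sumGradI[i]);
        }
}
```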

6.3.3.2 Conjugate Gradient Solver

In addition to the Jacobi method we implemented a conjugate gradient procedure on the GPU to solve the sparse linear system. This implementation is based on the ideas presented by Krüger and Westermann [Krüger and Westermann, 2003].

On the GPU the system matrix with five diagonal bands is stored in two textures: the off-diagonal bands are stored in a four-component texture image, which remains constant. The main diagonal is represented as a single-component render target, since it must be updated after every warping pass. Analogous to the Jacobi method, the result of the conjugate gradient approach is a stabilized depth update ∆di.
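The core operation of such a conjugate gradient solver is the product of the five-band system matrix with a vector. A CPU sketch using the same main-diagonal/off-diagonal split as described above (with the off-diagonal entries in the order left, right, up, down, a layout we choose here for illustration) could look as follows.

```cpp
#include <vector>

// y = A x for a matrix with five diagonal bands: the main diagonal plus the
// four bands coupling each pixel to its left/right/upper/lower neighbour.
// diag holds one entry per pixel, offDiag four entries per pixel.
void bandedMatVec(const std::vector<float>& diag,
                  const std::vector<float>& offDiag,
                  const std::vector<float>& x,
                  std::vector<float>& y,
                  int width, int height)
{
    for (int r = 1; r < height - 1; ++r)
        for (int c = 1; c < width - 1; ++c) {
            const int i = r * width + c;
            const float* a = &offDiag[4 * i];
            y[i] = diag[i] * x[i]
                 + a[0] * x[i - 1] + a[1] * x[i + 1]
                 + a[2] * x[i - width] + a[3] * x[i + width];
        }
}
```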



Figure 6.1: The sparse structure of the linear system obtained from the semi-implicit<br />

approach. Dark pixels indicate non-zero entries.<br />

6.3.4 Coarse-to-Fine Approach<br />

In order to avoid reaching a local minimum immediately, we utilize a coarse-to-fine scheme. We chose a usual image pyramid, which halves the image dimensions at every level. After downsampling the image of the next finer level, the obtained image was additionally smoothed. When going to the next coarser level, the regularization weight λ should be halved as well, but in practice scaling λ by a factor of $\sqrt{1/2}$ gave better results.
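As a small illustration of this per-level weighting, the sketch below derives the λ values for a pyramid; the function name, the level ordering (index 0 = finest), and the example weight are our own assumptions.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Regularization weights for an L-level pyramid (level 0 = finest):
// each coarser level scales lambda by sqrt(1/2), as found to work well in practice.
std::vector<double> pyramidLambdas(double lambdaFinest, int numLevels)
{
    std::vector<double> lambdas(numLevels);
    const double scale = std::sqrt(0.5);
    lambdas[0] = lambdaFinest;
    for (int l = 1; l < numLevels; ++l)
        lambdas[l] = lambdas[l - 1] * scale;
    return lambdas;
}

int main()
{
    for (double l : pyramidLambdas(10.0, 6))
        std::printf("%.3f ", l);   // 10.000 7.071 5.000 3.536 2.500 1.768
    std::printf("\n");
    return 0;
}
```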

6.4 Results<br />

This section presents several depth maps <strong>and</strong> 3D models to illustrate the benefits <strong>and</strong><br />

possible shortcomings of the variational depth estimation method.<br />

6.4.1 Facade Datasets<br />

The first dataset depicts a historical statue embedded in a facade. The resolution of the grayscale source images and the resulting depth map is 512 × 512 pixels. Figure 6.2 illustrates the obtained range map, based on three small-baseline source images, as a colored 3D point set. Figure 6.3 shows the corresponding depth images using the implemented numerical solvers and gives timing information. Six pyramid levels are generated for the coarse-to-fine approach. The Jacobi and the CG solvers execute 50 iterations in the outer loop (image warping) and 3 iterations in the inner loop to calculate the actual depth update. The Jacobi solver runs fastest with 1.15s, whereas the conjugate gradient solver requires significantly more time. The obtained depth maps are almost identical for both approaches.



Figure 6.2: A reconstructed historical statue displayed as a colored point set with a resolution of 512 × 512 points. Three small-baseline images are used to generate the model.

Figure 6.4 shows the consequences of back-matching. Without back-matching a severe mismatch appears near the feet of the statue (Figure 6.4(a)). Back-matching uses a larger sequence of images to mutually verify the depth maps, as described in Section 6.2.3.1. Figure 6.4(b) shows the same close-up view of the feet with a significantly better geometry.

Another result of the variational depth estimation approach is shown in Figure 6.5. The resolution of the depth map for this dataset is 1024 × 640.

6.4.2 Small Statue Dataset<br />

This section addresses the reconstruction of another dataset, which requires additional methods to obtain a suitable model. The object to be reconstructed is a small statue, for which more than 40 images were taken on a circular path around the statue.

Using the source images directly to generate the depth maps is not successful, as can be seen in Figure 6.6. Even including the back-matching approach does not improve the result. The reason for this failure is the very large depth discontinuities between the foreground statue and the background scenery. Consequently, the smoothness and ordering constraints are violated in these images (see Figure 6.6(a–c)).

The first approach to obtain better reconstructions is to perform an image segmentation procedure to separate foreground and background regions. The initial manual segmentation for one image is propagated through the complete sequence, such that only little further manual interaction is necessary [Sormann et al., 2005]. Background pixels are set to a uniform color before applying the depth estimation procedure. Two of the obtained point sets are shown in Figure 6.7.

(a) Jacobi (n=3), 1.15s  (b) CG (n=3), 3.15s

Figure 6.3: The depth maps of the embedded statue reconstructed with both numerical schemes. The two solvers yield almost identical results, with the Jacobi solver being faster.

Alternatively we introduced a more robust image intensity error term in order to handle the changing background and occlusions. The energy function to be optimized includes a truncated intensity difference:
$$ S(d_i) = \int_p \Big( \sum_j \min\big( T,\, (I_j(d_i) - I_i)^2 \big) + \lambda \|\nabla d_i\|^2 \Big)\, dp \;\rightarrow\; \min, \qquad (6.6) $$

with a thresholding parameter T. Instead of replacing the thresholding operator by a differentiable soft-min function, we chose a very different approach: since we have two sensor images, Ij1 and Ij2, zero, one, or both data terms may be saturated, and the corresponding term is then missing from the Euler-Lagrange equation. Consequently, the new depth is taken from the following set of decoupled solutions:

$$ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \sum_j \frac{\partial I_j(d_i^{(k)})}{\partial d_i} \big( I_j(d_i^{(k)}) - I_0 \big) }{ \lambda + \sum_j \Big( \frac{\partial I_j(d_i^{(k)})}{\partial d_i} \Big)^2 } $$
$$ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \frac{\partial I_{j_1}(d_i^{(k)})}{\partial d_i} \big( I_{j_1}(d_i^{(k)}) - I_0 \big) }{ \lambda + \Big( \frac{\partial I_{j_1}(d_i^{(k)})}{\partial d_i} \Big)^2 } $$
$$ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \frac{\partial I_{j_2}(d_i^{(k)})}{\partial d_i} \big( I_{j_2}(d_i^{(k)}) - I_0 \big) }{ \lambda + \Big( \frac{\partial I_{j_2}(d_i^{(k)})}{\partial d_i} \Big)^2 } $$
$$ \Delta d_i^{(k+1)} = \nabla^2 d_i^{(k)} $$

Note that the three lower equations are obtained by removing one or both image terms from the first equation. In case of truncation of the intensity error, the derivative of the constant threshold is zero. The depth value with the lowest actual error term is selected as the result for this iteration. In Figure 6.8 the resulting enhanced depth map and 3D model are illustrated. Although the depth image and the reconstructed model are far superior to the original model depicted in Figure 6.6, the obtained statue model still has some flaws, and a more refined approach requires further investigation.

(a) Without back-matching  (b) With back-matching

Figure 6.4: The effect of bidirectional matching on the embedded statue scene.



Figure 6.5: Two views of the colored point set showing the front facade of a church.

6.4.3 Mirabellstatue Dataset<br />

The source images of this dataset display an outdoor statue (see Figure 6.9(a)). Depth map generation is restricted to the statue using silhouette masks to separate the foreground statue object from the background scenery. Three images with 512 × 512 pixel resolution are used to compute the depth maps illustrated in Figure 6.9(b)–(d). The differences between the displayed meshes come from the employed regularization approach. The first two meshes are acquired using homogeneous regularization with different values for the weight λ. The third mesh is obtained utilizing image-driven anisotropic diffusion for a selective regularization in textureless image regions, as discussed in Section 6.2.2.

The mesh shown in Figure 6.9(b) uses a small value for λ, which results in noisy mesh geometry, especially in textureless regions. The mesh displayed in Figure 6.9(c) is obtained by using a larger value for λ and appears clearly smoother, but sharp creases at depth discontinuities are missing. Image-driven anisotropic diffusion yields a generally smooth mesh, but includes sharp edges at depth discontinuities.

6.5 Discussion<br />

Variational approaches to depth estimation provide a mathematically sound tool for generating 3D models from multiple images. These methods work best for images with constant lighting conditions and if only few occlusions and depth discontinuities are present in the imaged scene. Under these requirements high-quality depth maps can be generated at interactive rates.



Nevertheless, there are several issues that must be addressed. At first, scenes with large depth discontinuities and violated ordering constraints must be handled in a more robust manner. The approach presented in Section 6.4.2 is only a first step in this direction, since the results are still not completely satisfying. Incorporating segmentation information to detect piecewise connected objects can be based on color clustering, as partially employed in Section 6.4.2. Alternatively, combining a segmentation procedure based on initial and coarser depth hypotheses with the described variational approach appears to be promising. Variational multi-phase approaches (e.g. [Chan and Vese, 2002, Shen, 2006, Jung et al., 2006]) are potential candidates to generate the combined initial depth and segmentation hypothesis.

Incorporating lighting changes into a variational framework for optical flow and depth estimation can be accomplished using techniques proposed by Hermosillo et al. [Hermosillo et al., 2001, Chefd'Hotel et al., 2001]. Whether such approaches are suitable for 3D modeling at interactive rates is an open question.

Another item which needs to be addressed is the image smoothing used in the coarse-to-fine hierarchy. In a multi-view setup the epipolar lines run arbitrarily through the source images, and the usual Gaussian smoothing possibly moves corresponding features away from the appropriate epipolar line. Consequently, the recovered geometry at a coarser scale is not a smoothed version of the true geometry, but only loosely coupled with the true underlying model. In a rectified stereo setup pure horizontal blurring has the advantage that features are smoothed along the epipolar lines, but not in their orthogonal direction. Extending this approach to a multi-view setting is a topic for future research.



Figure 6.6: The three source images and the resulting unsuccessful reconstruction of the statue.



Figure 6.7: Two of the successfully reconstructed point sets using image segmentation to omit the background scenery.

Figure 6.8: An enhanced depth map and 3D point set obtained using the truncated error model.



(a) One source view  (b) Homogeneous, λ = 3
(c) Homogeneous, λ = 10  (d) Image-driven anisotropic, λ = 10

Figure 6.9: The effect of image-driven anisotropic diffusion. Two meshes generated using homogeneous regularization with different values of λ are shown in (b) and (c). The choice of λ = 3 in (b) yields a noisy result, whereas setting λ = 10 in (c) gives a significantly better geometry. Employing image-driven anisotropic diffusion yields the visually most appealing mesh, with sharp creases but without noise in textureless regions (d).


Chapter 7

Scanline Optimization for Stereo on Graphics Hardware

Contents

7.1 Introduction
7.2 Scanline Optimization on the GPU for 2-Frame Stereo
7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware
7.4 Discussion

7.1 Introduction<br />

In this chapter we propose a GPU-based computational stereo approach using scanline optimization to achieve optimal intra-scanline disparity maps. Since we employ a linear discontinuity cost model, the central part of the procedure is the calculation of the appropriate min-convolution, which is usually implemented as a two-pass method using destructive array updates. We replace these in-place updates by a recursive doubling scheme better suited for stream programming models. Consequently, the entire dense estimation pipeline, from matching cost computation to global optimization, to obtain the disparity resp. depth map is performed by the GPU, and only the control flow is maintained by the CPU.

Since the material of this chapter is rather technical, it is divided into two parts: the first section (Section 7.2) focuses on the details of a GPU-based scanline optimization procedure for the rectified stereo setup employing very simple image matching scores. The second section (Section 7.3) addresses the incorporation of the GPU-based scanline optimization implementation in a multiview setup. The focus of that section lies in particular on the efficient utilization of 'sliding' sums to calculate the zero mean normalized cross-correlation score.




7.2 Scanline Optimization on the GPU for 2-Frame Stereo

This section describes the core of the GPU implementation of scanline optimization. The main idea is the transformation of the main dynamic programming step (which has linear time complexity on sequential processors) into an equivalent procedure suitable for parallel computing (with O(N log N) time complexity). Additionally, several techniques to exploit the parallelism within the fragment processor to its full extent are presented. Not all of these methods are applicable for high-resolution depth maps (see Section 7.3.7 for one approach to overcome this limitation).

7.2.1 Scanline Optimization and Min-Convolution

Scanline optimization [Scharstein and Szeliski, 2002] searches for a globally optimal assignment of disparity values to pixels in the current (horizontal) scanline, i.e. it finds
$$ \arg\min_{d_x} \sum_{x=1}^{W} \big( D(x, d_x) + \lambda\, V(d_x, d_{x-1}) \big), $$
where D(x, d) is the image dissimilarity cost and V(d, d′) is the regularization cost. As in all dynamic programming approaches to stereo, different scanlines are treated independently of the neighboring ones (which may result in vertical streaks visible in the disparity image).

The optimal assignment can be efficiently found using a dynamic programming approach maintaining the minimal accumulated costs ¯C(x, d) up to the current position x:
$$ \bar C(x+1, d) = D(x+1, d) + \min_{d_1} \big( \bar C(x, d_1) + V(d, d_1) \big). $$
In a linear discontinuity cost model we have V(d, d1) = λ|d − d1|, and the calculation of
$$ \min_{d_1} \big( \bar C(x, d_1) + \lambda\, |d - d_1| \big) $$
for every d can be performed in linear time using a forward and a backward pass to compute the lower envelope [Felzenszwalb and Huttenlocher, 2004]. The linear-time procedure to calculate the min-convolution is given in Algorithm 3.

This procedure is not directly suitable for GPU implementation, since, first, it relies on in-place array updates and, second, a linear number of passes is required to update the entire array h.*

* Using the depth test with the same depth buffer as texture source and target buffer would allow a direct implementation, but this approach results in undefined behavior according to the specifications. Such an approach would have additional disadvantages, mainly the reduced ability to utilize the parallelism of the GPU.



Algorithm 3 Procedure to calculate the lower envelope efficiently
Procedure Min-Convolution
Input: ¯C(x, ·); Output: h[ ]
for d = 1 . . . k do
    h[d] ← ¯C(x, d)
end for
{Forward pass}
for d = 2 . . . k do
    h[d] ← min(h[d], h[d − 1] + λ)
end for
{Backward pass}
for d = k − 1 . . . 1 do
    h[d] ← min(h[d], h[d + 1] + λ)
end for

The basic idea to enable a GPU implementation of the min-convolution is to utilize a recursive doubling approach, which is outlined in Algorithm 4. Recursive doubling [Dubois and Rodrigue, 1977] is a common technique in high-performance computing to enable parallelized implementations of sequential algorithms. This technique is frequently used in GPU-based applications to perform stream reduction operations like accumulating all values of a texture image [Hensley et al., 2005].

If we focus on the forward pass in Algorithm 4, the procedure calculates the result of the forward pass for subsequently longer sequences ending in d. Initially, h⁺₀[d] contains the min-convolution of the single element sequence [d, d]. In every outer iteration with index L the handled sequence is extended to [d − 2^L, d] and its length is doubled. Note that h⁺[d] is defined to be ∞ (i.e. a large constant) if d is outside the valid range [1 . . . k]. After all iterations, h⁺[d] contains the correct result of the forward pass, which can easily be shown by induction. The same argument applies to the backward pass, hence this procedure yields the desired result. In addition to the lower envelope h, the disparity values for which the minimum is attained are tracked in the array disp[].

Note that the updates in the loops over d are independent and can be performed as a parallel loop. In GPGPU terminology, the bodies of these loops are computational kernels [Buck et al., 2004]. Additionally, the scanlines of the images are treated independently, therefore the min-convolution can be performed for all scanlines in parallel.

Figure 7.1 gives an illustration of the first few iterations in the forward pass of Algorithm 4. Since the next iteration of the outer loops in the min-convolution algorithm refers only to values generated in the previous iteration, only two arrays must be maintained (instead of a logarithmic number of arrays). The roles of these two arrays are swapped after every iteration; the destination array becomes the new source and vice versa. In GPU terminology, these arrays correspond to render-to-texture targets, and alternating the roles of these textures is referred to as ping-pong rendering.



Algorithm 4 Procedure to calculate the lower envelope using recursive doubling
Procedure Min-Convolution using Recursive Doubling
{Forward pass}
for d = 1 . . . k do
    h⁺₀[d] ← ¯C(x, d)
    disp[d] ← d
end for
for L = 0 . . . ⌈log₂(k − 1)⌉ do
    for d = 1 . . . k do
        d1 ← d − 2^L
        h⁺_L[d] ← min(h⁺_{L−1}[d], h⁺_{L−1}[d1] + λ 2^L)
        disp[d] ← arg min(h⁺_{L−1}[d], h⁺_{L−1}[d1] + λ 2^L)
    end for
end for
{Backward pass}
for d = 1 . . . k do
    h⁻₀[d] ← h⁺_{⌈log₂(k−1)⌉}[d]
end for
for L = 0 . . . ⌈log₂(k − 1)⌉ do
    for d = 1 . . . k do
        d1 ← d + 2^L
        h⁻_L[d] ← min(h⁻_{L−1}[d], h⁻_{L−1}[d1] + λ 2^L)
        disp[d] ← arg min(h⁻_{L−1}[d], h⁻_{L−1}[d1] + λ 2^L)
    end for
end for
Return h⁻_{⌈log₂(k−1)⌉} and disp
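For illustration, the following CPU sketch mimics the recursive doubling min-convolution with two ping-pong arrays; on the GPU the inner loop would be a fragment program over render targets, and the disparity tracking of Algorithm 4 is omitted here for brevity. The function name and structure are our own.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Min-convolution with linear cost lambda*|d - d1| via recursive doubling.
// Entries outside [0, k) are treated as +infinity.
std::vector<float> minConvDoubling(const std::vector<float>& cost, float lambda)
{
    const int k = static_cast<int>(cost.size());
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> src(cost), dst(k);

    auto sweep = [&](int dir) {                        // dir = -1: forward, +1: backward
        for (int step = 1; step < k; step *= 2) {      // step = 2^L
            for (int d = 0; d < k; ++d) {
                const int d1 = d + dir * step;
                const float other = (d1 >= 0 && d1 < k) ? src[d1] + lambda * step : INF;
                dst[d] = std::min(src[d], other);
            }
            std::swap(src, dst);                       // ping-pong between the two arrays
        }
    };
    sweep(-1);                                         // forward pass
    sweep(+1);                                         // backward pass
    return src;
}
```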

[Figure 7.1 shows entries A–E with λ = 1: after the first iteration e.g. B' = min(A + 1, B) and C' = min(B + 1, C); after the second iteration C'' = min(A' + 2, C'), D'' = min(B' + 2, D'), and so on.]

Figure 7.1: Graphical illustration of the forward pass using a recursive doubling approach.

The full linear discontinuity cost model is often not appropriate, and a truncated linear cost model with V(d, d1) = λ min(T, |d − d1|) is preferable. If T is chosen to be a power of two, the truncated cost model can be incorporated without an additional performance penalty into Algorithm 4 by replacing the λ 2^L smoothness cost term in the



min-convolution algorithm by λ min(T, 2^L). For other values of T an additional pass over the ¯C(x, ·) array is required [Felzenszwalb and Huttenlocher, 2004]. For optimal performance we restrict our implementation to the pure linear model resp. to the truncated model with power-of-two thresholds.

7.2.2 Overall Procedure<br />

This section describes the basic procedure for scanline optimization on the GPU, which consists of several steps. The outline of the overall procedure is presented in Algorithm 5. The input consists of two rectified images with resolution W × H. The range of potential disparity values is [dmin, dmax] with k elements.

The procedure traverses vertical scanlines positioned at x from left to right. At first the dissimilarity of the current scanline at x in the left image with the set of vertical scanlines [x + dmin, x + dmax] is calculated, resulting in a texture image with dimensions H and k. The dissimilarity is either a sum of absolute differences aggregated in a rectangular window or the sampling-insensitive pixel dissimilarity score proposed in [Birchfield and Tomasi, 1998].

If the first scanline is processed, the texture storing ¯C is initialized with the dissimilarity score. For all subsequent scanlines the lower envelope of ¯C is computed using Algorithm 4 to obtain min_{d1}(¯C(x − 1, d1) + λ|d − d1|) for every row y and disparity value d. The computation of the lower envelope keeps track of the disparity value where the minimum is attained (we refer to Section 7.2.3.2 for a detailed description of the efficient disparity tracking). These tracked disparities are read back into main memory for the subsequent optimal disparity map extraction. Afterwards, the ¯C array is incremented by the dissimilarity score of the current vertical scanline.

If the final scanline is reached, the total accumulated ¯C is read back in order to determine the optimal disparities for the last column, given by arg min_d ¯C(W, d). With the knowledge of the disparities for the final column, the disparities for previous columns can be assigned by a backtracking procedure.

7.2.3 GPU Implementation Enhancements<br />

The basic method outlined in the last section does not utilize the free parallelism of fragment program operations, which work on four-component vectors simultaneously. Consequently, the performance of the method can be substantially improved if this inherent parallelism is taken into account.

7.2.3.1 Fewer Passes Through Bidirectional Approach<br />

Essentially, W passes of the min-convolution procedure are required to obtain the final ¯C values and the corresponding disparity map. This number can be effectively halved if scanline optimization is applied at two opposing horizontal positions simultaneously, finally meeting in the central position.



Algorithm 5 Outline of the scanline optimization procedure on the GPU
Procedure Scanline optimization on the GPU
for x = 1 . . . W do
    Compute the image dissimilarity for the vertical scanline at x and all possible disparities, resulting in scoreTex
    if x = 1 then
        sumCostTex := scoreTex
    else
        Calculate the lower envelope h of sumCostTex, resulting in lowerEnvTex.
        Read back tracked disparities from lowerEnvTex.
        sumCostTex := lowerEnvTex + scoreTex
    end if
    if x = W then
        Read back the accumulated cost for the final column from sumCostTex.
    end if
end for
Extract final disparity map by backtracking

More formally, let ¯Cfw(x, d) be the accumulated cost starting from x = 1 and ¯Cbw the cost beginning at x = W, both computed simultaneously using parallel fragment operations. If we assume even W, in every iteration the values for ¯Cfw(x, d) and ¯Cbw(W − x + 1, d) are determined. The iterations stop at x_{1/2} := W/2 + 1, and the total cost for optimal paths with disparity d at position x_{1/2} is
$$ \bar C_{\mathrm{fw}}(x_{1/2}, d) + \bar C_{\mathrm{bw}}(x_{1/2}, d) - D(x_{1/2}, d). $$
Hence the initial disparity assigned to x_{1/2} is the disparity attaining the minimum of this sum, and the complete disparity map can be extracted by the backtracking procedure as already outlined. This approach better utilizes the essentially free vector processing capabilities, and this modification reduces the total runtime by approximately 45% for 384 × 288 images.

7.2.3.2 Disparity Tracking and Improved Parallelism

Using a bidirectional approach does not only reduce the number of passes, but the parallelism of the fragment processor is also employed to some extent – two ¯C values are handled in parallel (¯Cfw and ¯Cbw). Since GPUs are designed to operate on vector values with four components, an additional performance gain can be expected if four ¯C values are stored in the color channels for every pixel.

Note that the calculation of the lower envelope for ¯C is not enough, since the disparity values attaining the minimum must be stored as well in order to enable an efficient backtracking phase. If one assumes integral disparity values, integral image dissimilarity scores, and an integral smoothness weight λ, then ¯C and h are integer numbers as well. Hence, the associated disparity can be encoded in the fractional part of h. Furthermore, no additional operations are needed to track the disparities attaining the minimal accumulated costs. Of course, in case of ties in the min-convolution procedure, disparities with smaller encoded fractions are preferred (which is as good as any other strategy).

Encoding the disparity value in the fractional part of floating point numbers limits the image resolution in order to avoid precision loss. If the dissimilarity score is an integer from the interval [0, T], then the total accumulated cost is at most (W/2 + 1) × T, where W is the source image width. If the dissimilarity score is discretized into the range [0, 255], 16 bits of the mantissa are required to encode ¯C for half PAL resolution (W = 384), which leaves enough accuracy to encode the disparities in the fractional part. The sign bit of the floating point representation can additionally be incorporated by centering the range of dissimilarity scores around 0.

Utilizing this compact representation for accumulated cost/disparity pairs allows us to handle two horizontal scanlines in parallel, thereby reducing the effective image height to half for the min-convolution. Figure 7.2 illustrates the parallel processing of two vertical scanlines in the bidirectional approach, and the assignment of the RGBA channels to pixel positions.
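A minimal sketch of this cost/disparity packing is given below; the function names and the normalization of the disparity by its maximum value are our own illustrative choices, not the exact encoding used in the fragment programs.

```cpp
#include <cmath>
#include <cstdio>

// Pack an integral accumulated cost and a tracked disparity d in [0, maxDisp)
// into a single float: the integer part carries the cost, the fraction
// d / maxDisp carries the disparity.
float packCostDisparity(int cost, int disparity, int maxDisp)
{
    return static_cast<float>(cost) + static_cast<float>(disparity) / maxDisp;
}

int unpackDisparity(float packed, int maxDisp)
{
    const float frac = packed - std::floor(packed);     // fractional part
    return static_cast<int>(frac * maxDisp + 0.5f);     // round back to an integer
}

int main()
{
    const int maxDisp = 16;
    // Adding an integral smoothness penalty during min-convolution keeps the
    // fraction (and thus the tracked disparity) intact.
    float h = packCostDisparity(12345, 7, maxDisp);
    std::printf("cost = %d, disparity = %d\n",
                int(std::floor(h)), unpackDisparity(h, maxDisp));
    return 0;
}
```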


Figure 7.2: Parallel processing of vertical scanlines using the bidirectional approach for optimal utilization of the four available color channels. The arrows indicate the progression of the processed scanlines in consecutive passes.

7.2.3.3 Readback of Tracked Disparities<br />

After the lower envelope is computed, the encoded tracked disparities are read back into main memory to be available for the final backtracking procedure. The tracked disparity



values encoded in the fractional part of the lower envelope are extracted directly on the GPU into an 8-bit framebuffer (which is efficient, since fragment programs on NVidia hardware support native instructions to obtain the fractional part of a floating point number). The tracked disparities are then read back as byte channels. We found that this approach is the fastest, since the usually expensive conversion from floating point numbers to integers is performed on the GPU without a performance penalty, and the amount of data to be read back is substantially reduced.

7.2.4 Results

First we give timing results for the CPU and GPU implementations of the scanline optimization software. The CPU version is a straightforward C++ implementation using the min-convolution as described in Algorithm 3. The disparity map is determined for successive scanlines, and code optimization is left to the compiler. The GPU implementation is based on OpenGL using the frame buffer extension and the Cg language.

The timing tests are performed on two hardware platforms: the first platform is a PC with a 3 GHz Pentium 4 CPU (CPUA) and an NVidia GeForce 6800 graphics board (GPUA) running Linux. The C++ source is compiled with gcc 3.4.3 and -O2 optimization. The second system is a PC with an AMD Athlon64 X2 4400+ CPU (CPUB) and a GeForce 7800GT graphics board (GPUB). The employed compiler is gcc 4.0.1, again with -O2 optimization.

Table 7.1 displays the obtained timing results. Tsukuba 1x denotes the original well-known dataset with 384 × 288 image resolution and 15 possible disparity values. Tsukuba 2x and 4x denote the same dataset resized to 768 × 288 and 1536 × 288 pixels, respectively; the possible disparity range then consists of 30 and 60 values, respectively. We use horizontal stretching of the images to simulate sub-pixel disparity estimation.

The Pentagon dataset is another common stereo dataset with 512 × 512 pixels resolution and 16 potential disparity values (Pentagon 1x). Resizing the images to 1024 × 1024 resolution yields the Pentagon 2x dataset (32 disparities). The image similarity function in all datasets is the SAD using a 3 × 1 window calculated on grayscale images. In order to avoid the memory-consuming 3D disparity space image, the image dissimilarity is calculated on demand for the current vertical scanline.

             CPUA    GPUA    CPUB    GPUB
Tsukuba 1x   0.0462  0.1180  0.0373  0.0678
Tsukuba 2x   0.1891  0.2911  0.1387  0.1565
Tsukuba 4x   0.7257  1.0082  0.5655  0.4566
Pentagon 1x  0.1261  0.1877  0.0953  0.1165
Pentagon 2x  0.9458  1.0381  0.7065  0.4930

Table 7.1: Average timing results for various dataset sizes in seconds per frame.


The results in Table 7.1 clearly indicate that the multi-pass GPU method is significantly slower than the CPU version for small image resolutions. For higher resolutions the speed is roughly equal, or the GPU version shows better performance, depending on the hardware. Note that most of the time is actually spent in the scanline optimization procedure itself; only about 15–20% of the frame time is spent on calculating this particularly simple image dissimilarity. Additionally, we observed that the CPU-based backtracking part extracting the optimal disparities has a negligible impact on the total runtime.

On the CPU the required time grows almost linearly with increasing resolution, which is in contrast to the GPU curve. In theory, the 4-times stretched Tsukuba dataset should require 16-fold runtime (fourfold number of disparities and of horizontal pixels). The CPU version largely matches this expectation (15.1- and 15.7-fold runtime), whereas the GPU shows a sublinear behavior (8.5- and 6.7-fold runtime, respectively). At low resolutions the setup times for frame buffers etc. become a more dominant fraction of the total runtime.

In order to provide a visual proof of the correctness of the proposed GPU implementation, the disparity maps for several standard stereo datasets are shown in Figure 7.3 and Figure 7.4. Additionally, the depth maps obtained using subpixel disparity estimation for the Tsukuba images are displayed in Figure 7.3(b) and (c).

Figure 7.3: Disparity images for the Tsukuba dataset for several horizontal resolutions, generated by the GPU-based scanline approach: (a) 1x, (b) 2x, (c) 4x.

7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware

This section extends and modifies the approach to depth estimation using scanline optimization on the GPU presented in Section 7.2. The value of the formerly presented method is increased by enabling multiple views to be handled. Additionally, the SAD matching cost function can be replaced by the usually more robust cross-correlation similarity score.


Figure 7.4: Disparity images for the Cones (a) and Teddy (b) image pairs from the Middlebury stereo evaluation datasets. These disparity images merely illustrate the correctness of the GPU implementation; they are not intended to indicate superior matching performance.

7.3.1 Input Data and General Setting

The input data for this method consists of n ≥ 2 grayscale source images of dimension w × h with lens distortion already removed. Additionally, the camera intrinsic parameters and the relative poses between the views are known. One source image plays the particular role of a key view, for which the depth map is calculated. The other views are used to evaluate the depth hypotheses and are called sensor images. The depth image assigns one depth value from the range [znear, zfar], with D possible values from that range. In our implementation the potential depth values are taken equally spaced from this interval.

The viewing frustum induced by the key view, limited to the depth range [znear, zfar], comprises a 3D volume which encloses the feasible surface to be reconstructed. Plane-sweep methods and our approach traverse this volume using a sequence of 3D planes and warp the sensor images onto each plane (or rather the corresponding quadrilateral formed by intersection with the view frustum). Plane-sweep methods typically use 3D planes parallel to the key image plane, whereas our method uses planes induced by vertical scanlines in the key image.

In the later sections we describe the implementation of several image dissimilarity functions, which are calculated for a user-specified aggregation (support) window of W × H pixels. The sum of absolute differences (SAD) between two rectangular sets of pixels is defined as
\[
  \mathrm{SAD} = \sum_{i \in W} |X_i - Y_i|,
\]


where i ∈ W denotes the set of pixels in the rectangular support window W. The zero-mean normalized cross correlation is defined as follows:
\[
  \mathrm{NCC} = \frac{\sum_{i \in W} (X_i - \bar{X})(Y_i - \bar{Y})}
                      {\sqrt{\sum_{i \in W} (X_i - \bar{X})^2 \; \sum_{i \in W} (Y_i - \bar{Y})^2}}
              = \frac{\sum_{i \in W} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sigma_X^2 \, \sigma_Y^2}}.
\]
By the shifting property one gets
\[
  \mathrm{NCC} = \frac{\sum_i X_i Y_i - \tfrac{1}{N} \left(\sum_i X_i\right)\left(\sum_i Y_i\right)}{\sqrt{\sigma_X^2 \, \sigma_Y^2}}, \tag{7.1}
\]
with $\sigma_X^2 = \sum_i X_i^2 - (\sum_i X_i)^2 / N$ and $\sigma_Y^2 = \sum_i Y_i^2 - (\sum_i Y_i)^2 / N$. Hence, it is possible to compute the cross correlation solely from several sums aggregated within the support window.
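As a concrete illustration of Equation 7.1, the following C++ sketch computes the NCC of two pixel windows purely from the five aggregated sums (ΣXi, ΣYi, ΣXi², ΣYi², ΣXiYi). It is a plain restatement of the formula above, not the thesis' fragment program.

#include <cmath>
#include <cstdio>
#include <vector>

// Compute the zero-mean NCC from aggregated sums only (Equation 7.1).
double nccFromSums(double sumX, double sumY, double sumXX,
                   double sumYY, double sumXY, int N)
{
    const double varX = sumXX - sumX * sumX / N;   // sigma_X^2 (N-scaled variance)
    const double varY = sumYY - sumY * sumY / N;   // sigma_Y^2
    const double cov  = sumXY - sumX * sumY / N;   // N-scaled covariance
    const double denom = std::sqrt(varX * varY);
    return (denom > 0.0) ? cov / denom : 0.0;      // guard against flat windows
}

int main()
{
    // Two small example windows (e.g. a 3x3 support window, flattened).
    std::vector<double> X = {10, 12, 11, 13, 12, 14, 11, 13, 12};
    std::vector<double> Y = {20, 24, 22, 26, 24, 28, 22, 26, 24}; // Y = 2*X, so NCC = 1
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        sx += X[i]; sy += Y[i];
        sxx += X[i] * X[i]; syy += Y[i] * Y[i]; sxy += X[i] * Y[i];
    }
    std::printf("NCC = %f\n", nccFromSums(sx, sy, sxx, syy, sxy, (int)X.size()));
    return 0;
}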

If multiple sensor images are provided, the total matching cost for a depth hypothesis is the sum of the individual (optionally truncated) matching costs between the key view and each sensor image. Using 8- or 16-bit resolution for the correlation values, this sum can be obtained by utilizing the blending (i.e. in-place accumulation) stage of recent graphics hardware.

7.3.2 Similarity Scores based on Incremental Summation

If one employs a plane-sweep approach combined with a purely local winner-takes-all depth extraction method (see Figure 7.5), spatial aggregation within the support window is easily performed. Warping the sensor images onto the current depth plane and the spatial aggregation can be substantially accelerated by graphics hardware due to its projective texture sampling capabilities (see Chapter 4 and [Yang et al., 2002, Yang and Pollefeys, 2003, Cornelis and Van Gool, 2005]).

On the other hand, if a global depth extraction method is utilized, the matching cost values conceptually comprise a disparity space image (DSI), which stores the matching score for every pixel in the key view and every candidate depth value. Hence, the DSI is a 3D data array with w × h × D elements. When using scanline optimization to find the optimal depth assignments for horizontal scanlines in the key view, the matching costs for every pixel and depth value are accessed only once. Consequently, the matching scores can be calculated on demand for vertical lines in the key view as the algorithm successively updates the $\bar{C}$ array from left to right. Due to this simple observation the memory-consuming construction of the DSI can be avoided. In the following paragraphs we describe this on-the-fly matching cost computation for multiple view configurations in more detail.


Figure 7.5: Plane-sweep approach to multiple view matching (key view and sensor view).

In contrast to plane-sweep approaches, which warp the sensor images onto a plane parallel to the key image plane positioned at a certain depth, we project the sensor images onto a plane induced by a vertical scanline x = const in the key image (Figure 7.6). This plane is formed by all rays $K_0^{-1}(x, y, 1)^\top$ for a fixed x value.

Figure 7.6: Plane sweep from left to right.

If the aggregation (correlation) window size is W × H, then (at least conceptually) W slices around the current x-value must be stored. For image dissimilarity functions which can be computed by appropriate box filters, like the sum of absolute differences (SAD), the sum of squared differences (SSD) and the normalized cross correlation (NCC), the aggregated sums can be maintained incrementally by providing the new incoming slice and the outgoing slice to the updating procedures.

7.3.3 Sensor Image Warping

We assume that the key view has a canonical position, i.e. $P_0 = K_0 (I \,|\, 0)$ with the known camera intrinsic matrix $K_0$. Sensor view i has the projection matrix $P_i = (M_i \,|\, m_i) = K_i (R_i \,|\, t_i)$. Then a 2D point (x, y) in the key view combined with a depth z maps into the sensor images in the following manner:
\[
  q_i \sim z \, A_i \, (x, y, 1)^\top + m_i,
\]
with $A_i = M_i K_0^{-1}$. Here $q_i$ is a homogeneous quantity (a 3-vector). Using projective texture mapping, the correct intensity values from the sensor images can be sampled.
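The mapping above can be written as a few lines of host-side C++. The sketch below is an illustration under the stated assumptions (SensorMapping and warpToSensor are names introduced here, not part of the thesis implementation); it converts a key-view pixel (x, y) with depth hypothesis z into dehomogenized sensor-image coordinates.

#include <cstdio>

// A_i = M_i * K_0^{-1} (3x3, row-major) and the translation part m_i of sensor view i.
struct SensorMapping { double A[3][3]; double m[3]; };

// Map a key-view pixel (x, y) with depth hypothesis z into sensor view i:
//   q_i ~ z * A_i * (x, y, 1)^T + m_i
// and return the dehomogenized 2D sensor-image coordinate (s, t).
void warpToSensor(const SensorMapping& map, double x, double y, double z,
                  double& s, double& t)
{
    double q[3];
    for (int r = 0; r < 3; ++r)
        q[r] = z * (map.A[r][0] * x + map.A[r][1] * y + map.A[r][2]) + map.m[r];
    s = q[0] / q[2];            // perspective division
    t = q[1] / q[2];
}

int main()
{
    // Toy example: identity A_i and a small translation m_i.
    SensorMapping map = { {{1,0,0},{0,1,0},{0,0,1}}, {5.0, 0.0, 0.0} };
    double s, t;
    warpToSensor(map, 100.0, 50.0, 2.0, s, t);
    std::printf("sensor coordinate: (%.2f, %.2f)\n", s, t);   // prints (102.50, 50.00)
    return 0;
}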

Warping the sensor images onto the planar slices as indicated in Figure 7.6 can be performed by rendering an aligned quadrilateral into a buffer of dimensions h × D. In world space the quad is determined by a constant x value and varying y ∈ [1, h] and z ∈ [znear, zfar]. Rasterization of this quadrilateral amounts to sampling the pixels from the sensor images using projective texture mapping. Consequently, the sensor image intensity values for all depth hypotheses of the current vertical scanline can be retrieved easily.

Note that during rendering of this slice additional operations can be performed for higher efficiency. For instance, the corresponding key view pixels (comprising a vertical line at the current x position) can be sampled as well, and a binary operation can be applied to the sampled key image pixel and the sensor image pixel. This feature is utilized as described in the next sections.

Sensor Image Sampling In a plane-sweep approach the rendered quadrilateral corresponding to a depth plane matches the assumed fronto-parallel surface geometry. Consequently, higher quality sensor image sampling using mipmapped trilinear or anisotropic filtering is immediately available. Since our rendered slices do not match the assumed (fronto-parallel) object surface, the texture space to screen space derivatives interpolated by the rasterization hardware from the provided quadrilateral geometry are incorrect. The simplest solution is to revert to basic linear filtering without using derivative information at all. Another solution is to provide derivatives computed in the fragment program to the texture lookup functions, which is possible on newer graphics hardware. If $q_i = (q_i^x, q_i^y, q_i^z)$ is the homogeneous position in the sensor image for a given key image pixel (x, y) and depth z (as described above), then the texture coordinates are $(s, t) = (q_i^x / q_i^z,\; q_i^y / q_i^z)$. Additionally, we have for the texture space derivatives
\[
  \frac{\partial s}{\partial x} = \frac{z \, (A_{11} X_3 - A_{31} X_1)}{(X_3)^2},
\]
with $X = (X_1, X_2, X_3)^\top = z \, A_i \, (x, y, 1)^\top + m_i$, where $A_{kl}$ are the elements of $A_i$. The other derivatives ∂s/∂y, ∂t/∂x and ∂t/∂y are calculated in an analogous manner. Using these derivatives the texture footprint of a fronto-parallel surface can be simulated. The projective texture lookup to sample the sensor images is then replaced by a 2D lookup with supplied texture space derivatives.
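To illustrate the derivative computation, here is a small host-side C++ sketch (not the actual Cg fragment program; the struct and function names are assumptions) that evaluates all four texture-space derivatives for one pixel by applying the quotient rule to (s, t) = (X1/X3, X2/X3).

#include <cstdio>

// Elements of the 3x3 matrix A_i (row-major) and the translation m_i.
struct Mapping { double A[3][3]; double m[3]; };

struct TexDerivs { double ds_dx, ds_dy, dt_dx, dt_dy; };

// Texture-space derivatives of (s, t) = (X1/X3, X2/X3) with
// X = z * A * (x, y, 1)^T + m, derived via the quotient rule.
TexDerivs textureDerivatives(const Mapping& map, double x, double y, double z)
{
    double X[3];
    for (int r = 0; r < 3; ++r)
        X[r] = z * (map.A[r][0] * x + map.A[r][1] * y + map.A[r][2]) + map.m[r];

    const double X3sq = X[2] * X[2];
    TexDerivs d;
    d.ds_dx = z * (map.A[0][0] * X[2] - map.A[2][0] * X[0]) / X3sq;
    d.ds_dy = z * (map.A[0][1] * X[2] - map.A[2][1] * X[0]) / X3sq;
    d.dt_dx = z * (map.A[1][0] * X[2] - map.A[2][0] * X[1]) / X3sq;
    d.dt_dy = z * (map.A[1][1] * X[2] - map.A[2][1] * X[1]) / X3sq;
    return d;
}

int main()
{
    Mapping map = { {{1,0,0},{0,1,0},{0,0,1}}, {0.0, 0.0, 10.0} };
    TexDerivs d = textureDerivatives(map, 4.0, 3.0, 2.0);
    std::printf("ds/dx=%f ds/dy=%f dt/dx=%f dt/dy=%f\n",
                d.ds_dx, d.ds_dy, d.dt_dx, d.dt_dy);
    return 0;
}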

In our evaluated datasets the results using linear and anisotropic texture sampling are effectively indistinguishable due to the small-baseline multiview geometry. If several surface orientations are evaluated to obtain more accurate reconstructions [Akbarzadeh et al., 2006], higher quality sensor image sampling could be beneficial. Enabling fourfold anisotropic texture filtering increased the total runtime by about 5–10% in our experiments.

7.3.4 Slice Management

The scanline optimization procedure stores the epipolar volume slices around the current x position, i.e. the slices corresponding to X ∈ {x − W/2, ..., x + W/2}. When the matching cost computation and the update of $\bar{C}$ for the current x position are finished, the new slice corresponding to x + W/2 + 1 is rendered into a temporary buffer. The matching cost update routines are then invoked with the now obsolete slice at x − W/2 and the newly generated slice at x + W/2 + 1. This allows the cost update functions to perform an incremental update of their stored values. Afterwards, the buffer holding the obsolete slice can be reused as the target slice for x + W/2 + 2 in the next iteration.

Figure 7.7 illustrates the incremental update of the accumulated values. Note that several different accumulation results may be required depending on the employed matching cost function; a minimal sketch of such an incremental update is given after the figure.

Figure 7.7: Spatial aggregation for the correlation window (previous sum, incoming slice, outgoing slice). At first, the pixels are aggregated in the x-direction by incremental summation of multiple slices. The final aggregated value is obtained by vertical summation of these intermediate pixels.
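The following C++ sketch shows the idea of the incremental horizontal aggregation on a single column of sums: the running window sum is updated by adding the incoming slice and subtracting the outgoing one. The buffer layout and function names are illustrative assumptions; the thesis performs this per depth hypothesis in a fragment program.

#include <cstdio>
#include <vector>

// Incrementally maintained horizontal sums over a window of width W.
// sums[y] holds the sum of costs[y][x - W/2 .. x + W/2] for the current x.
void slideWindow(std::vector<double>& sums,
                 const std::vector<double>& outgoingSlice,   // column at x - W/2
                 const std::vector<double>& incomingSlice)   // column at x + W/2 + 1
{
    for (std::size_t y = 0; y < sums.size(); ++y)
        sums[y] += incomingSlice[y] - outgoingSlice[y];
}

int main()
{
    const int h = 4, W = 3;
    // Toy per-pixel cost image (h rows, 6 columns).
    double costs[4][6] = { {1,2,3,4,5,6}, {1,1,1,1,1,1},
                           {2,0,2,0,2,0}, {3,3,3,3,3,3} };
    // Initialize the sums for the window centered at x = 1 (columns 0..2).
    std::vector<double> sums(h, 0.0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < W; ++x) sums[y] += costs[y][x];

    // Advance the window center from x = 1 to x = 2 (drop column 0, add column 3).
    std::vector<double> outgoing(h), incoming(h);
    for (int y = 0; y < h; ++y) { outgoing[y] = costs[y][0]; incoming[y] = costs[y][3]; }
    slideWindow(sums, outgoing, incoming);

    for (int y = 0; y < h; ++y) std::printf("row %d: sum = %g\n", y, sums[y]);
    return 0;
}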

7.3.5 SAD Calculation

If the SAD is chosen as image dissimilarity cost, the incremental update is very simple: when rendering the 3D quadrilateral to sample the sensor images, the absolute differences between the sensor image and the key image pixels are calculated on the fly. The procedure to calculate the SAD matching cost maintains only the horizontal sums of absolute differences for j ∈ {x − W/2, ..., x + W/2}. This is easily achieved, since the update procedure takes the obsolete and the newly generated slice as input. The actual matching score is then obtained by vertical aggregation of H pixels.

7.3.6 Normalized Cross Correlation

The basic method to maintain the sums for the NCC calculation is essentially similar to the SAD version. In this case, three horizontal sums need to be maintained: $\sum_i Y(i, y)$, $\sum_i Y(i, y)^2$, and $\sum_i X(i, y)\,Y(i, y)$, where $X(\cdot)$ denotes key image pixels and $Y(\cdot)$ refers to sampled sensor image pixels. Epipolar volume slice extraction calculates $Y(i, y)$ and the product $X(i, y)\,Y(i, y)$ and stores these values in two of the color channels.

The standard deviation $\sigma_X$ with respect to the aggregation window for every pixel in the key image and the box filtering result $\sum_{i \in W} X_i$ can be precomputed and are immediately available during the iterations at no additional cost.

The calculation of the final correlation score involves vertical aggregation of $\sum_i Y(i, y)$ and $\sum_i X(i, y)\,Y(i, y)$ to obtain the sums over the rectangular window W, i.e. $\sum_{i \in W} Y_i$ and $\sum_{i \in W} X_i Y_i$. The squared sum $\sum_{i \in W} Y_i^2$ can be generated simultaneously while aggregating $\sum_{i \in W} Y_i$. A final fragment program calculates the NCC using Equation 7.1 from these intermediate values.

Note that this approach requires additional buffers to store the appropriate horizontal sums for each sensor image.

In practice we use the square root of the NCC as the employed matching cost for the following reasons. First, discretizing the NCC directly into e.g. 255 different values induces inaccuracies especially for small matching costs, whereas the graph of √NCC has a more linear shape, hence a uniform discretization is feasible. Secondly, the NCC behaves qualitatively like a squared difference between normalized intensities, since
\[
  \sum_{i \in W} \left( \frac{X_i - \bar{X}}{\sigma_X} - \frac{Y_i - \bar{Y}}{\sigma_Y} \right)^2 = 2 - 2\,\mathrm{NCC}(X, Y).
\]
Hence we consider it reasonable to adapt the matching cost to the linear regularization cost model by taking the square root.

7.3.7 Depth Extraction by Scanline Optimization

The matching costs for the currently active vertical scanline are used to update the accumulated cost array $\bar{C}$. In order to have a pure GPU implementation, this step is performed by graphics hardware as described in Section 7.2. Alternatively, reading back the matching scores and CPU-based depth extraction by dynamic programming is possible as well [Wang et al., 2006].

In Section 7.2.3 the vector processing capability of the fragment processor (operating on 4-component vectors simultaneously without additional cost) is utilized by a bidirectional approach: the accumulated costs $\bar{C}$ are calculated in parallel starting from x = 1 in the forward direction and from x = w backwards, meeting in the central position. Backtracking the optimal depth values is subsequently performed towards the left and right borders starting from the central pixel. This approach halves the number of iterations in the multipass method and doubles the employed parallelism in the fragment programs. Additionally, two vertically adjacent pixels are treated within the same fragment, requiring a compact encoding of $\bar{C}$ and the corresponding depth value in one floating point number. We apply the first, bidirectional scanning technique to improve the parallelism in this work as well. This implies that matching costs are computed simultaneously for the vertical scanlines at x1 = x and x2 = w − x. The intermediate values and correlation scores for x1 and x2 are stored in the red and green channel and in the blue and alpha channel, respectively.

We do not utilize the second method, since it limits the image and depth resolution to ensure accurate results. Nevertheless, we substantially improved the performance of the GPU-based scanline optimization method using the following approach: we restrict the precision of $\bar{C}$ stored in GPU memory to 16-bit float values (fp16), which allow accurate representation of integer values in the range [−2047, 2047]. Using fp16 values instead of the full IEEE single precision floating point range halves the memory bandwidth required by the GPU-based scanline optimization method. Since this procedure is bandwidth limited (recall Algorithm 4), the performance of this step is approximately doubled.

In order to maintain the accuracy of the generated depth maps, we assume that the matching cost is an integral value from the range [0, 255] and that λ is integral as well. Hence $\bar{C}$ is an integral quantity, too. In order to avoid overflows of $\bar{C}$, we perform frequent renormalization of $\bar{C}$ using the following update:
\[
  \bar{C}(x, d) \leftarrow \bar{C}(x, d) - \min_{d_1} \bar{C}(x, d_1) - 2047.
\]
We subtract 2047 to exploit the sign bit of the fp16 representation as well. Using $\bar{C}(x + n, d) - \bar{C}(x, d) \le 255\,n$ and $\bar{C}(x, d) - \min_{d_1} \bar{C}(x, d_1) \le \lambda D$, we can calculate the required frequency of updates from
\[
  \bar{C}(x + n, d) - \min_{d_1} \bar{C}(x, d_1) \le \lambda D + 255\,n.
\]
For the fp16 representation we require the right hand side to be at most 4094 (i.e. 2 × 2047), hence
\[
  n \le (4094 - \lambda D) / 255.
\]
This means that n vertical scanlines can be updated without renormalization. For D = 200 and λ = 2 we get n = 14. For the experiments we fixed n = 16 without visible degradation of the obtained depth maps.
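The bound on the renormalization interval can be checked with a few lines of C++; the sketch below simply evaluates n = ⌊(4094 − λD)/255⌋ for a given depth count and smoothness weight (the function name is an illustrative assumption).

#include <cstdio>

// Number of vertical scanlines that can be processed between two renormalizations
// of the accumulated costs stored as fp16 integers, assuming matching costs in
// [0, 255] and an integral smoothness weight lambda.
int renormalizationInterval(int numDepths, int lambda)
{
    const int fp16Range = 2 * 2047;              // usable integer range including the sign bit
    const int slack = fp16Range - lambda * numDepths;
    return (slack > 0) ? slack / 255 : 0;        // 0 means renormalize every scanline
}

int main()
{
    std::printf("D = 200, lambda = 2  ->  n = %d\n", renormalizationInterval(200, 2)); // prints 14
    std::printf("D = 250, lambda = 4  ->  n = %d\n", renormalizationInterval(250, 4));
    return 0;
}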

7.3.8 Memory Requirements

The parallel computing pattern of our approach, treating whole vertical scanlines at once, requires saving the full data needed for the final backtracking procedure. After updating $\bar{C}$, this data is read back from GPU memory into main memory. If the depth range contains fewer than 256 entries, the required memory is w × h × D bytes, which is e.g. less than 190 MB for datasets with 768 × 1024 × 250 resolution.

7.3.9 Results

The timing results reported in this section are obtained on a Linux PC equipped with a Pentium 4 3 GHz main processor and an NVidia GeForce 6800 graphics card with 12 pixel pipelines.

The first dataset, depicted in Figure 7.8, consists of a virtual turntable sequence displaying a simple building model. The synthetically rendered images are resized to 512 × 512 pixels, which is also the resolution of the obtained depth images. Since a turntable is emulated, the scene objects are rotated but the light sources remain fixed; hence the surface shading changes substantially between the views. Consequently, the depth maps calculated with the SAD matching cost function, shown in Figure 7.9(a) and (b), have many significant defects. All these depth maps are computed for a depth range containing 200 equally spaced values. Figure 7.9(c) displays the depth image obtained by a plane-sweep approach using a winner-takes-all depth extraction method (Chapter 4); mismatches in textureless regions are still visible. Finally, Figure 7.9(d) is the result of the proposed NCC + scanline optimization implementation, where the scanline optimization procedure is performed on the GPU as well. In all cases the correlation window is set to 9 × 9 pixels. As an alternative to the pure GPU method, we implemented a mixed CPU/GPU approach: while the GPU calculates the matching cost for the next vertical scanline, the CPU updates $\bar{C}$ for the current vertical scanline in parallel (using a straightforward C++ implementation). The runtime of this mixed approach is almost identical to the GPU method for this dataset.

Figure 7.8: The three input views of the synthetic dataset: (a) left view, (b) center view, (c) right view.

Figure 7.9: The obtained depth maps and timing results for the synthetic dataset: (a) WTA, SAD: 0.82s; (b) SO, SAD: 5.1s; (c) WTA, NCC: 2.86s; (d) SO, NCC: 6.21s. WTA denotes a GPU plane-sweep approach with winner-takes-all depth extraction (Chapter 4). SO designates the scanline optimization implementation proposed in this work.

Table 7.2 displays the runtimes of our implementation at different resolutions. We evaluated pure GPU approaches (GPU-fp32 and GPU-fp16) and mixed implementations utilizing the CPU for the scanline optimization part. GPU-fp32 denotes the pure GPU implementation without the renormalization every 16 scanlines; hence 32-bit floating point values are used to store the accumulated costs $\bar{C}$. GPU-fp16 indicates the pure GPU algorithm using 16-bit values for $\bar{C}$ with frequent renormalization. We give timing results for two mixed CPU/GPU approaches as well: the first is a synchronous approach, where the matching cost calculation on the GPU and the dynamic programming on the CPU are performed sequentially (4th column). These timings allow a direct comparison of the scanline optimization part with the corresponding runtimes on the GPU. The asynchronous version of the mixed approach calculates the matching cost for the next vertical scanline on the GPU while $\bar{C}$ is updated by the CPU (5th column). The runtime of this parallel approach is the fastest of all dynamic programming implementations, since the total runtime is dominated solely by the NCC computation (and the update of $\bar{C}$ is basically free). Finally, WTA denotes the local plane-sweep approach from Chapter 4.

Resolution            GPU-fp32  GPU-fp16  Mixed sync.  Mixed async.  WTA
256 × 256 × 100       0.79s     0.69s     0.66s        0.55s         0.34s
512 × 512 × 200       6.2s      5.1s      5.0s         3.9s          2.7s
512 × 768 × 200       9.2s      7.7s      7.7s         6.0s          4.1s
768 × 1024 × 250      27.1s     21.4s     20.6s        16.5s         10.9s
768 × 1024 × 250 (*)  10.1s     9.4s      9.6s         6.1s          5.0s

Table 7.2: Runtimes of scanline optimization using a 9 × 9 NCC at different resolutions using three views. The last row (*) displays the runtimes on a PC equipped with an Athlon64 X2 4400+ and a GeForce 7800GT.

The comparison of the last two columns (asynchronous CPU/GPU and winner-takes-all depth extraction) reveals the performance penalty induced by the different sweep directions. The main reason for the higher performance of the WTA approach is that this method utilizes all 4 components of the fragment processor, whereas the proposed implementation calculates only two matching scores per pixel.

The pure scanline optimization time for GPU-fp32 is approximately twice the time needed by GPU-fp16, as predicted. To see this, the NCC calculation time given in the next-to-last column must be subtracted from the total time given in the respective columns. Finally, CPU scanline optimization using integer arithmetic is still slightly faster than our GPU-fp16 implementation (columns 3 and 4).

The last row of Table 7.2 depicts the runtimes observed on more recent PC hardware equipped with an Athlon64 X2 4400+ and a GeForce 7800GT. The performance difference between the local approach and the fastest scanline optimization method is smaller than the gap observed on our main PC. Additionally, the performance gain of GPU-fp16 over GPU-fp32 is less pronounced. These partially unexpected, but still preliminary, results on current 3D hardware need further analysis.

Figure 7.10 provides visual results for a dataset consisting of three images showing a wooden Bodhisattva statue. The source images and the depth maps have a resolution of 512 × 768 pixels, and the depth range contains 200 values. The lighting conditions change slightly between the input views (Figure 7.10(a)–(c)). The depth image obtained by a pure winner-takes-all approach using a 9 × 9 NCC is shown in Figure 7.10(d). The result of our multiview scanline optimization method is displayed as a depth map (Figure 7.10(e)) and as a triangulated surface mesh (Figure 7.10(f)). The computation times for the local method and the proposed one are 4.1s and 6s, respectively.

7.4 Discussion

In this chapter we propose a scanline optimization procedure for disparity estimation suitable for stream architectures like modern programmable graphics processing units. Although the direct implementation of scanline optimization using destructive (i.e. in-place) value updates must be replaced by a more expensive recursive approach, the huge computational power of current GPUs turns out to be beneficial for larger image resolutions and disparity ranges. Consequently, the entire disparity estimation pipeline, comprising matching score computation and semi-global disparity extraction, can be performed on graphics hardware, thereby avoiding the relatively costly data transfer between the GPU and the CPU and leaving the CPU free for other tasks.

Additionally, the basic GPU-friendly approach to scanline optimization for a rectified stereo pair is extended to the multiple view case utilizing the more robust cross-correlation matching score. The matching costs are generated on demand as required by the main dynamic programming procedure. When using more complex dissimilarity scores, it turns out to be most efficient to employ the GPU and the CPU in parallel: while the GPU calculates the next set of matching scores, the CPU updates the accumulated costs for the current vertical scanline.

From the timing results presented in Section 7.2.4 it can be concluded that a GPU-based scanline optimization procedure is mostly suitable for larger images and disparity ranges, but not truly appropriate for realtime applications. For small image resolutions the overhead of multipass rendering is still too significant to take advantage of the processing power of modern GPUs. Additionally, a scanline optimization procedure using a linear smoothness cost model is better suited to larger disparity ranges, where a (potentially truncated) linear model is preferable over the Potts model. If the disparity range contains only a few values, enforcing smooth disparity maps is futile, since consecutive values in the disparity range typically correspond to substantial depth discontinuities. Hence, a linear model is not effective in the case of few potential disparities, and a different approach like the near-realtime reliable dynamic programming (RDP) approach [Gong and Yang, 2005b] is better suited. On the other hand, we believe that the Potts model used in the RDP approach is not appropriate for high-quality reconstruction applications.

If object silhouettes are available (e.g. by background segmentation), the quality of the depth map can be improved due to the knowledge of the visual hull. In particular, datasets comprising turntable sequences with a known background (e.g. the reference multiview stereo datasets presented in [Seitz et al., 2006]) allow a simple background segmentation. Additionally, the depth estimation performance can be increased by using the z-buffer test to avoid matching cost calculation for background pixels. Incorporating these improvements in such cases is ongoing work.

In order to obtain better depth maps and to reduce the influence of the actual setting of the smoothness weight, the benefit of an adaptive smoothness weight, based e.g. on the source image gradients [Fua, 1993, Scharstein and Szeliski, 2002], needs to be investigated.


Figure 7.10: The three input views of a wooden Bodhisattva statue (a–c) and the corresponding depth maps obtained with the local depth extraction approach (d, WTA) and the proposed scanline optimization method (e, SO), together with a view of the triangulated mesh (f).


Chapter 8

Volumetric 3D Model Generation

Contents
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2 Selecting the Volume of Interest . . . . . . . . . . . . . . . . . . 120
8.3 Depth Map Conversion . . . . . . . . . . . . . . . . . . . . . . . . 121
8.4 Isosurface Determination and Extraction . . . . . . . . . . . . . 124
8.5 Implementation Remarks . . . . . . . . . . . . . . . . . . . . . . . 126
8.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

8.1 Introduction

With the exception of our voxel coloring approach, all methods presented so far generate a set of depth images, i.e. 2.5D height fields. In order to create true 3D models, this set of depth maps must be combined into a common representation. The method proposed in this chapter to create proper 3D models is based on an implicit volumetric representation, from which the final surface can be extracted by any implicit surface polygonization technique. The principles of robust fusion of several depth maps in the context of laser-scanned data were developed by Hilton et al. [Hilton et al., 1996] and Curless and Levoy [Curless and Levoy, 1996]. We apply essentially the same technique to depth maps obtained by dense depth estimation procedures, but the basic approach needs to be modified to be more robust against outliers occurring in the input depth maps. The basic idea of volumetric depth image integration is the conversion of depth maps to corresponding 3D distance fields and the subsequent robust averaging of these distance fields. The resolution and the accuracy of the final model are determined by the quality of the source depth images and the resolution of the target volume.

Instead of using an implicit representation of the surfaces induced by the depth images, one can merge a set of polygonal models directly [Turk and Levoy, 1994]. Such an approach is sensitive to outliers and mismatches occurring in the depth images. A volumetric approach can combine several surface hypotheses and perform a robust voting in order to extract a more reliable surface. On the other hand, a volumetric range image fusion approach limits the size of 3D features found in the final model depending on the voxel size.

Our implementation of the purely software-based (i.e. unaccelerated) approach, which is based on [Curless and Levoy, 1996], uses compressed volumetric representations of the 3D distance fields and can handle high resolution voxel spaces. Merging (averaging) many distance fields induced by the corresponding depth maps is possible, since it is sufficient to traverse the compressed distance fields on a single voxel basis. Nevertheless, our original implementation has substantial space requirements on external memory and consumes significant time to generate the final surface (usually in the order of several minutes). Hence this approach is not suitable for immediate visual feedback to the user. At least for fast and direct inspection of the 3D model it is reasonable to develop a very efficient volumetric range image integration approach, again accelerated by the computing power of modern graphics hardware. Many steps in the range image integration pipeline are very suitable for processing on graphics hardware, and a significant speedup can be expected.

The overall procedure traverses the voxel space defined by the user slice by slice and generates a section of the final implicit representation and its mesh in every iteration. Consequently, the memory requirements are very low, but immediate postprocessing (e.g. filtering) of the generated slices is limited. Although the general idea is very close to [Curless and Levoy, 1996], several modifications are required to allow an efficient GPU implementation in the first place. More importantly, the sensitivity to gross outliers frequently occurring in the input depth maps is reduced by a robust voting approach. The details of our implementation are given in the next sections.

8.2 Selecting the Volume of Interest

The first step of the proposed volumetric depth image integration pipeline is the specification of the 3D domain for which the volumetric representation of the final model is built. Generally, it is not possible to determine this volume of interest automatically. In case of small objects entirely visible in each of the source images, the intersection of the viewing frusta can serve as an indicator for the volume to be reconstructed. Larger objects only partially visible in the source images (e.g. large buildings) require human interaction to select the reconstruction volume. Consequently, there exists a user interface for manual selection of the reconstructed volume. This application displays, for example, a set of 3D feature points generated by the image orientation procedure or 3D point clouds generated from dense depth maps. The user can select and adjust the 3-dimensional bounding box of the region of interest. Additionally, the user specifies the intended resolution of the voxel space, which is set to 256³ voxels in our experiments.


8.3 Depth Map Conversion

With the knowledge of the volume of interest and its orientation, the voxel space is traversed slice by slice and the values of the depth images are sampled according to the projective transformation induced by the camera parameters and the position of the slice. Since the sampled depth values denote the perpendicular distance of the surface to the camera plane, the distance of a voxel to the surface can be estimated easily as the difference between the depth value and the distance of the voxel to the image plane (see also Figure 8.1). This difference is an estimated signed distance to the surface: positive values indicate voxels in front of the surface and negative values correspond to voxels hidden by the surface. Of course, the accuracy of this approximation depends on the angle between the principal direction of the camera and the normal vector of the surface. Nevertheless, this efficiently computed approximation to the true distance transform gives very good results in practice. Additionally, we experimented with scaling this distance by the angle between the surface normal and the viewing direction, but this modification had no apparent effect on the resulting models.
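A minimal C++ sketch of this signed distance estimate is given below. It assumes a depth map storing perpendicular distances to the image plane and a camera plane given in Hessian normal form; the types and names are illustrative, not the actual fragment program.

#include <cstdio>

struct Vec3 { double x, y, z; };

// Image plane in Hessian normal form: n . p + d = 0, with unit normal n
// pointing along the viewing direction of the camera.
struct ImagePlane { Vec3 n; double d; };

// Distance of a 3D point to the camera image plane.
static double planeDistance(const ImagePlane& plane, const Vec3& p)
{
    return plane.n.x * p.x + plane.n.y * p.y + plane.n.z * p.z + plane.d;
}

// Estimated signed distance of a voxel to the surface:
//   positive -> voxel lies in front of the surface (towards the camera),
//   negative -> voxel is hidden behind the surface.
static double signedSurfaceDistance(double sampledDepth, const ImagePlane& plane,
                                    const Vec3& voxelCenter)
{
    return sampledDepth - planeDistance(plane, voxelCenter);
}

int main()
{
    ImagePlane plane = { {0.0, 0.0, 1.0}, 0.0 };   // camera looking along +z from the origin
    Vec3 voxel = { 0.5, 0.2, 3.0 };
    double depthFromMap = 3.4;                     // depth sampled at the voxel's projection
    std::printf("signed distance = %.2f\n",
                signedSurfaceDistance(depthFromMap, plane, voxel));  // prints 0.40
    return 0;
}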

The source depth maps contain two additional special values: one value (chosen as -1 in our implementation) indicates absent depth values, which may occur due to a depth postprocessing procedure eliminating unreliable matches from the depth map. Another value (0 in our implementation) corresponds to pixels outside some foreground region of interest, which is based on an optional silhouette mask in our workflow [Sormann et al., 2005].

Consequently, the processed voxels fall into one of the following categories:

1. Voxels that are outside the camera frustum are labeled as culled.

2. Voxels whose estimated distance D to the surface is smaller in magnitude than a user-specified threshold Tsurf are labeled as near-surface voxels (|D| ≤ Tsurf).

3. Voxels with a signed distance greater than this threshold are considered as definitely empty (D > Tsurf).

4. The fourth category includes occluded voxels, which have a negative distance with a magnitude larger than the threshold (D < −Tsurf).

5. If the depth value of the back-projected voxel indicates an absent value, the voxel is labeled as unfilled.

6. Voxels back-projecting into pixels outside the foreground regions are considered as empty.

These categories are illustrated in Figure 8.1. The threshold Tsurf essentially specifies the amount of noise that is expected in the depth images.


Figure 8.1: Classification of the voxels according to the depth map and camera parameters. Voxels outside the camera frustum are initially labeled as culled. Voxels close to the surface induced by the depth map are near-surface voxels (on both sides of the surface, indicated by shaded regions). Voxels with a distance larger than the threshold are either empty or occluded, depending on the sign of the distance.

In many reconstruction setups it is possible to classify culled voxels immediately. If the object of interest is visible in all images, culled voxels are outside the region to be reconstructed and can be classified as empty instantly; declaring culled voxels as unfilled may generate unwanted clutter due to outliers in the depth maps. If the object to be reconstructed is only partially visible in the images, voxels outside the viewing frustum of a particular depth map do not contribute information and are therefore labeled as unfilled. The choice between these two policies for handling culled data is specified by the user. Consequently, the 6 branches described above correspond to four voxel categories.

A fragment program determines the status of the voxels and updates an accumulated slice buffer for every given depth image. This buffer consists of four channels in accordance with the categories described above:

1. The first channel accumulates the signed distances, if the voxel is a near-surface voxel.

2. The second channel counts the number of depth images for which the voxel is empty.

3. The third channel tracks the number of depth images for which the voxel is occluded.

4. The fourth channel counts the number of depth images for which the status of the voxel is unfilled.

Thus, a simple but sufficient statistic for every voxel is accumulated, which is the basis for the final isosurface determination. Algorithm 6 outlines the incremental accumulation of the statistic for a voxel, which is executed for every provided depth image. The accumulated statistic for a voxel is a quadruple comprising the components described above. In addition to the user-specified parameter Tsurf, another threshold Tocc can be specified, which determines the border between occluded voxels and unfilled voxels located far behind the surface. This threshold is set to 10 · Tsurf in our experiments.

Algorithm 6 Procedure to accumulate the statistic for a voxel

Procedure stat = AccumulateVoxelStatistic
Input: Camera image plane imagePlane, near-surface threshold Tsurf, Tocc > Tsurf, #Images
Input: depth image D, projective texture coordinate stq, 3D voxel position pos
Input: Voxel statistics: stat = (Σ Di, #Empty, #Occluded, #Unfilled) (a quadruple)

st ← stq.xy / stq.z                          {Perspective division}
if st is inside [0, 1] × [0, 1] then
    depth ← tex2D(D, st)                     {Gather depth from range image}
    if depth > 0 then
        dist ← depth − imagePlane · pos      {Calculate signed distance to the surface}
        if dist > Tsurf then
            increment #Empty                 {Too far in front of the surface}
        else if dist < −Tocc then
            increment #Unfilled              {Very far behind the surface}
        else if dist < −Tsurf then
            increment #Occluded              {Too far behind the surface}
        else
            Σ Di ← Σ Di + dist               {Near-surface voxel}
        end if
    else
        if depth = 0 then
            stat ← (0, #Images + 1, 0, 0)    {Declare voxel definitely as empty}
        else
            increment #Unfilled
        end if
    end if
else
    {Execute one of the following lines, depending on the handling of culled voxels:}
    increment #Empty                         {Handle culled voxel as empty}, or
    increment #Unfilled                      {Alternatively, handle culled voxel as unfilled}
end if
Return stat

This algorithm is very close to the range image integration approach proposed in [Curless and Levoy, 1996]. The main user-given parameter is the threshold Tsurf, which determines the set of near-surface voxels. This parameter is related to the accuracy of the depth maps and should in theory be set to half of the uncertainty interval. Since the uncertainty of depth images generated by dense estimation approaches depends on many parameters like the view geometry, scene content and surface properties, this threshold is determined empirically.

Algorithm 6 differs from the method proposed in [Curless and Levoy, 1996] as follows:

• Culled voxels (i.e. voxels outside the viewing frustum) can be immediately carved away, depending on the user-specified policy.

• Voxels very far behind the estimated surface are considered unreliable and are labeled as unfilled instead of being classified as occluded. A user-specified threshold Tocc is introduced to distinguish between occluded (solid) voxels and unfilled ones. The choice of this parameter does not critically affect the obtained model. We use a default value of Tocc = 10 Tsurf in our experiments.

Weighted Accumulation for Near-Surface Voxels

It is possible to compute a weighted average for the near-surface voxels by accumulating weighted distances. If the signed distance of a voxel for depth image i is Di, and the corresponding weight (or confidence) is Wi, then the averaged distance value is
\[
  \frac{\sum_i W_i D_i}{\sum_i W_i}.
\]
Because the weights do not sum to one, a weighted scheme requires tracking the total sum $\sum_i W_i$ of the weights in addition to the parameters described above. This can be achieved either by writing to a fifth channel, which requires the recent multiple-render-target graphics extension, or alternatively by merging two of the other parameters. Depending on the object to be reconstructed, culled voxels can be counted as empty or occluded without decreasing the accuracy of the final model. For free-standing objects like statues it is reasonable to declare culled voxels as empty, since the object of interest is typically visible in all images. In other cases occluded and culled voxels can be treated equivalently.
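A small C++ sketch of the weighted accumulation is shown below; it keeps the running sums Σ WiDi and Σ Wi for one voxel and produces the weighted average at the end. The struct and function names are illustrative assumptions, not the thesis implementation.

#include <cstdio>
#include <utility>
#include <vector>

// Running sums for the weighted signed-distance average of one voxel.
struct WeightedVoxel {
    double sumWeightedDist = 0.0;   // accumulates W_i * D_i
    double sumWeights      = 0.0;   // accumulates W_i

    void accumulate(double signedDistance, double weight) {
        sumWeightedDist += weight * signedDistance;
        sumWeights      += weight;
    }
    // Weighted average; falls back to 0 if no near-surface observation was made.
    double average() const {
        return (sumWeights > 0.0) ? sumWeightedDist / sumWeights : 0.0;
    }
};

int main() {
    // Signed distances of one voxel seen in three depth maps, with confidences.
    std::vector<std::pair<double, double>> observations = {
        {0.20, 1.0}, {0.10, 0.5}, {-0.05, 2.0} };
    WeightedVoxel v;
    for (const auto& o : observations) v.accumulate(o.first, o.second);
    std::printf("weighted signed distance = %.4f\n", v.average());
    return 0;
}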

8.4 Isosurface Determination and Extraction

After all available depth images are processed, the target buffer holds the coarse statistic for all voxels of the current slice. The classification pass to determine the final status of every voxel is essentially a voting procedure. This step assigns to every voxel the signed distance to the final surface, such that the isosurface at level 0 corresponds to the merged 3D model. For efficiency the voting procedure uses only the statistics acquired for the current voxel and does not inspect neighboring voxels. Algorithm 7 presents the employed averaging procedure to assign the signed distance to the final surface. There is one parameter which must be specified by the user: #RequiredDefinite denotes the minimum number of near-surface entries accumulated in the voxel statistic. This means that at least #RequiredDefinite depth maps must agree that the current voxel is close to the estimated surface. The choice of this parameter depends on the redundancy in the images and on the quality of the provided depth maps. A larger value of #RequiredDefinite reduces the clutter induced by outliers in the input depth maps, but may lead to holes in the final surface if parts of the surface are visible in too few views.

Algorithm 7 Procedure to calculate the final surface distance for a voxel

Procedure result = AverageDistance
Input: User specified constant: #RequiredDefinite
Input: Voxel statistics: Σ Di, #Empty, #Occluded, #Unfilled

#Definite ← #Images − #Occluded − #Unfilled
if #Definite < #RequiredDefinite then
    result ← UnknownLabel (e.g. NaN)
else
    #NearSurface ← #Images − #Empty − #Unfilled
    if #NearSurface ≥ #Empty then
        result ← Σ Di / #NearSurface
    else
        result ← +∞
    end if
end if
Return result
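For illustration, the voting step of Algorithm 7 can be written as plain C++ as follows. The struct layout and the use of NaN/infinity sentinels follow the algorithm above, while the type and function names (and the extra division-by-zero guard) are assumptions of this sketch.

#include <cstdio>
#include <limits>

struct VoxelStatistic {
    double sumDistances;   // accumulated signed distances of near-surface observations
    int numEmpty;          // depth maps voting "empty"
    int numOccluded;       // depth maps voting "occluded"
    int numUnfilled;       // depth maps without usable information
};

// Voting procedure of Algorithm 7: returns the averaged signed distance,
// NaN if too few depth maps provide definite information, or +infinity
// if the empty votes outweigh the near-surface votes.
double averageDistance(const VoxelStatistic& s, int numImages, int requiredDefinite)
{
    const int numDefinite = numImages - s.numOccluded - s.numUnfilled;
    if (numDefinite < requiredDefinite)
        return std::numeric_limits<double>::quiet_NaN();

    const int numNearSurface = numImages - s.numEmpty - s.numUnfilled;
    // The "> 0" check is an extra guard against division by zero, added in this sketch.
    if (numNearSurface >= s.numEmpty && numNearSurface > 0)
        return s.sumDistances / numNearSurface;
    return std::numeric_limits<double>::infinity();
}

int main()
{
    VoxelStatistic s = { -0.12, 2, 1, 3 };             // example accumulated statistic
    double d = averageDistance(s, /*numImages=*/12, /*requiredDefinite=*/7);
    std::printf("voxel distance = %f\n", d);
    return 0;
}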

Up to now the discussed steps in the volumetric range image integration pipeline, depth map conversion and fusion, run entirely on graphics hardware. After the GPU-based computation for one slice of the voxel space is finished, the isovalues of the current slice are transformed into a triangular mesh on the CPU [Lorenson and Cline, 1987] and added to the final surface representation. This mesh can be directly visualized and is ready for additional processing like texture map generation. Instead of generating a surface representation from the individual slices, a 3D texture can alternatively be accumulated, which is suitable for volume rendering techniques. The main portion of this approach is again performed entirely on the GPU and does not involve substantial CPU computations. In contrast to a slice-based incremental isosurface extraction method, this direct approach requires the space for a complete 3D texture in graphics memory. Since modern 3D graphics hardware is equipped with large amounts of video memory, the 16 MB required by a 256³ voxel space are affordable. Rendering an isosurface directly from the volumetric data additionally requires the calculation of surface normals, which are derived directly from the gradients at every voxel. By using a deferred rendering approach, the computation of the gradient can be limited to the actual surface voxels and the additional memory consumption is minimal.
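One common way to obtain such normals is a central-difference gradient of the volume; the following sketch (the dense array layout and accessor are assumptions, not the shader code used in the thesis) shows the computation for a single interior voxel:

#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical dense volume of size N x N x N stored as one byte per voxel.
struct Volume {
    int N;
    const uint8_t* data;
    float at(int x, int y, int z) const { return data[(z * N + y) * N + x] / 255.0f; }
};

// Surface normal at an interior voxel as the normalized central-difference
// gradient of the stored (signed-distance) values.
std::array<float, 3> gradientNormal(const Volume& v, int x, int y, int z)
{
    float gx = v.at(x + 1, y, z) - v.at(x - 1, y, z);
    float gy = v.at(x, y + 1, z) - v.at(x, y - 1, z);
    float gz = v.at(x, y, z + 1) - v.at(x, y, z - 1);
    float len = std::sqrt(gx * gx + gy * gy + gz * gz);
    if (len < 1e-8f)
        return {0.0f, 0.0f, 0.0f};   // flat or constant region: no reliable normal
    return {gx / len, gy / len, gz / len};
}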


8.5 Implementation Remarks

Tracking the statistics for each voxel in the current slice requires a four-channel buffer with floating point precision to accumulate the distance values for near-surface voxels. By normalizing the distance of these voxels from [−T, T] to [−1, 1], a half precision buffer (16 bit floating point format) is usually sufficient. Furthermore, the final voxel values can be transformed to the range [0, 1], in which case a traditional 8-bit fixed-point buffer offers adequate precision. Using low-precision buffers decreases the volume integration time by about 30%.
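The mapping to the low-precision formats is a simple affine rescaling; a minimal sketch (the function names are placeholders, and T denotes the truncation threshold used during integration) could look as follows:

#include <algorithm>
#include <cstdint>

// Map a truncated signed distance d in [-T, T] first to [-1, 1] and then to
// [0, 1], so that it can be stored in an 8-bit fixed-point buffer.
uint8_t encodeDistance8(float d, float T)
{
    float n = std::max(-1.0f, std::min(1.0f, d / T));  // [-T, T] -> [-1, 1]
    float u = 0.5f * (n + 1.0f);                       // [-1, 1] -> [0, 1]
    return static_cast<uint8_t>(u * 255.0f + 0.5f);
}

// Inverse mapping used when reading the buffer back; the zero-level isosurface
// then corresponds to a stored value of roughly 128.
float decodeDistance8(uint8_t v, float T)
{
    return (v / 255.0f * 2.0f - 1.0f) * T;
}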

8.6 Results

This section provides visual and timing results for several real datasets. The timings are given for PC hardware consisting of a Pentium 4 3 GHz processor and an NVidia GeForce 6800 graphics card. All source views are resized to 512 × 512 pixels beforehand, and the obtained depth images have the same resolution (unless noted otherwise). Partially available foreground segmentation data is not used in these experiments.

The first dataset, depicted in Figure 8.2(a), shows one source image (out of 47) displaying a small statue. The images are taken in a roughly circular sequence around the statue. The camera is precalibrated and the relative poses of the images are determined from point correspondences found in adjacent views. From the correspondences and the camera parameters a sparse reconstruction can be triangulated, which is used by a human operator to determine a 3D box enclosing the voxel space of interest. The extent of this box is used to determine the depth range employed in the subsequent plane-sweep step, which took 53s to generate 45 depth images in total (Figure 8.2(b)). In this depth estimation procedure (recall Chapter 4), 200 evenly distributed depth hypotheses are tested using the SAD for a 5 × 5 window. In order to compensate for illumination changes in several view triplets, the source images were normalized by subtracting their local mean images. Black pixels indicate unreliable matches, which are labeled as unfilled before the depth integration procedure. These depth maps are integrated in just over 4 seconds to obtain a 256³ volume dataset as illustrated in Figure 8.2(c). The isosurface displayed in Figure 8.2(d) can be directly extracted using a ray-casting approach on the GPU [Stegmaier et al., 2005]. Almost all of the clutter and artefacts outside the proper statue are eliminated by requiring at least 7 definite values for the statistics of a voxel.

The result for another dataset consisting of 43 images is shown in Figure 8.3(b), for which one source image is depicted in Figure 8.3(a). The same procedure as for the previous dataset is applied, from which a set of 41 depth images is obtained in the first instance. Plane-sweep depth estimation using the ZNCC correlation with 200 depth hypotheses requires 97.7s in total to generate the depth maps. The subsequent depth image fusion step requires 4s to yield the volumetric data illustrated in Figure 8.3(b).

Figure 8.2: Visual results for a small statue dataset generated from a sequence of 47 images: (a) one source image, (b) one depth image, (c) direct volume rendering, (d) shaded isosurface. The total time to generate the depth maps and the final volumetric representation is less than 1 min. Image (a) shows one source view, and the corresponding depth map generated by a plane-sweep approach is illustrated in (b). The 3D volume obtained by depth image integration is displayed using direct volume rendering in (c); the outline of the isosurface corresponding to the integrated model is clearly visible, and the region of near-surface voxels is indicated by the blur next to the surface. Image (d) shows the isosurface extracted from the volume data using GPU-based raycasting. Both images are generated with the volume raycasting software made available by S. Stegmaier et al. [Stegmaier et al., 2005].

Note that these timings reflect the creation time for rather high-resolution models. If all resolutions are halved (256 × 256 × 100 depth images and 128³ volume resolution), the total depth estimation time is 13s and the volumetric integration time is less than 1s for this dataset. We believe that these timing results allow our method to qualify as an interactive modeling approach.

The visual result for another dataset consisting of 16 source views is shown in Figures 8.3(c) and (d). Depth estimation for 14 views took 34.2s using a 5 × 5 ZNCC with a best-half-sequence occlusion strategy (200 tentative depth values). Without an implicit occlusion handling approach, parts of the sword are missing. Volumetric integration requires another 1.8s to generate the isosurface shown in Figure 8.3(d).

8.7 Discussion

In this work we demonstrated that generating proper 3D models from a set of depth images can be achieved at interactive rates using the processing power of modern GPUs. The quality of the obtained 3D models depends on the accuracy of the source depth maps and on the redundancy within the provided data, but the voting scheme is robust against the outliers typically generated by purely local depth estimation procedures.

Figure 8.3: Source views and isosurfaces for two real-world datasets: (a) one source image (of 43), (b) shaded isosurface (102s), (c) one source image (of 16), (d) shaded isosurface (36s).

Although the proposed method is efficient and often provides 3D geometry suitable for visualization and further processing, the results are inferior in many cases with low redundancy in the source depth maps. In these settings, the purely local averaging and voting approach to combine the depth maps is not sufficient. Global surface reconstruction methods resulting in smoother and often watertight 3D geometry were recently proposed. Volumetric graph-cut approaches [Vogiatzis et al., 2005, Tran and Davis, 2006, Hornung and Kobbelt, 2006b, Hornung and Kobbelt, 2006a] appear highly successful in creating smooth models, but they are computationally expensive and provide only limited choices for regularization terms. Moreover, graph-cut methods in general do not benefit much from GPU or SIMD accelerated implementations.

Consequently, future work will likely focus on variational reconstruction approaches. Since determining the surface of an imaged object from multiple depth maps can be seen as a segmentation problem (the separation of empty space and interior volume), variational image segmentation methods (e.g. [Caselles et al., 1997, Westin et al., 2000, Appleton and Talbot, 2006]) could be adapted for multiple-view surface reconstruction tasks. The nature of the underlying implementations enables substantial performance gains by employing graphics processing units for these methods.


Chapter 9

Results

Contents
9.1 Introduction
9.2 Synthetic Sphere Dataset
9.3 Synthetic House Dataset
9.4 Middlebury Multi-View Stereo Temple Dataset
9.5 Statue of Emperor Charles VI
9.6 Bodhisattva Figure

9.1 Introduction

This chapter provides results illustrating the complete GPU-based workflow on several datasets. At first, two synthetic datasets are discussed, which allow a comparison of the purely image-based reconstruction with the known ground truth. Thereafter, several real-world datasets from various domains and the respective generated 3D models are presented. The focus of the discussion of these datasets lies on the comparison between medium resolution and high resolution results. Consequently, the potential gain of more expensive computations at higher resolution is illustrated visually.

The depth maps for the real-world datasets are generated using the plane-sweep (Chapter 4) and scanline optimization approaches (Chapter 7), since these methods are less vulnerable to illumination changes in the images and do not require a suitable initialization, which the iterative methods (Chapters 3 and 6) depend on.

9.2 Synthetic Sphere Dataset

The first presented dataset is a synthetically rendered perfect sphere with radius 1 (see Figure 9.1). The surface is textured using a procedurally generated stone texture. 36 views at 512 × 512 resolution are created using the Persistence of Vision raytracer (www.povray.org). The cameras are placed at even intervals around the sphere center, looking towards the center.

Figure 9.1: Three source views of the synthetic sphere dataset.

Choosing a sphere as the ground truth geometry has the advantage that the comparison of the reconstructed model with the ground truth is extremely simple: the offset of an arbitrary 3D point from the sphere surface is just the difference between the sphere radius and the distance of the point to the center. This allows an easy evaluation of the reconstructed meshes, and the regular structure of the target model allows the identification of systematic errors and biases in the reconstruction methods.
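A minimal sketch of this per-vertex evaluation is given below; the vector type and the helper name are assumptions, and the 0.5% tolerance corresponds to the threshold used later in Table 9.1.

#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Fraction of mesh vertices whose distance to the sphere center deviates from
// the true radius by less than tol * radius (e.g. tol = 0.005 for 0.5%).
float fractionWithinTolerance(const std::vector<Vec3>& vertices,
                              Vec3 center, float radius, float tol)
{
    std::size_t inliers = 0;
    for (const Vec3& p : vertices) {
        float dx = p.x - center.x, dy = p.y - center.y, dz = p.z - center.z;
        float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
        if (std::fabs(dist - radius) < tol * radius)
            ++inliers;
    }
    return vertices.empty() ? 0.0f : static_cast<float>(inliers) / vertices.size();
}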

We compare three depth estimation methods in this section:

1. a plane-sweep approach using a winner-takes-all depth extraction as described in Chapter 4 (denoted by WTA),

2. the GPU-based scanline optimization procedure presented in Chapter 7, and

3. a GPU accelerated variational approach to depth estimation as described in Chapter 6 (indicated by PDE).

All methods take a triplet of images as input, with the central view designated as the key image. The image dissimilarity function is the SAD aggregated in a 5 × 5 window for the first two methods, and the single pixel SSD for the variational approach. The plane-sweep and the scanline optimization procedures evaluate 400 potential depth values for every pixel of the key image. Figure 9.2 displays the result of the three depth estimation methods for one particular key view. The discrete set of depth values can be clearly seen in Figures 9.2(a) and (b).

The three obtained sets, each comprising 36 depth maps, are merged into final 3D models using the procedure described in Chapter 8. We set the main parameters Tsurf and #RequiredDefinite to 0.03 and 7, respectively. This step requires about 5.5s to combine the 36 depth maps. The final meshes for the three depth estimation methods are depicted in Figure 9.3. The visual appearance is quite similar; the staircasing artefacts of the WTA and the scanline optimization approach are removed by the depth integration step. The polar regions of the sphere are not visible in the source views, hence those parts are not reconstructed.

Figure 9.2: Depth estimation results for a view triplet of the sphere dataset: (a) WTA, (b) scanline opt., (c) PDE.

Figure 9.3: Fused 3D models for the sphere dataset with respect to the depth estimation method: (a) WTA, (b) scanline opt., (c) PDE.

In order to provide a quantitative evaluation, the final meshes are compared with the ground truth sphere. In Table 9.1 the total depth estimation runtime for 36 views is given in the second column. The third column reports the average sphere radius induced by the generated final mesh (with respect to the true sphere center). The final column specifies the percentage of vertices of the final meshes which lie within 0.5% of the sphere radius.

Depth est. method    Total runtime   Reported radius   Points within 0.5%
Winner-takes-all     83s             1.0012            97.7%
Scanline opt.        350s            0.9992            97.4%
PDE                  125s            0.9987            95.5%

Table 9.1: Quantitative evaluation of the reconstructed spheres.

Of course, the figures in Table 9.1 indicate the accuracy achievable under ideal conditions.

9.3 Synthetic House Dataset

Another synthetic dataset depicting a simple textured house model is illustrated in Figure 9.4. 36 views of the VRML model were generated, and the source images were resized to 512 × 512 pixels. Since the model house is rotated during the virtual capturing process, but the (virtual) lights remain in a constant position, this dataset simulates a turntable sequence with a moving object and fixed light sources. Consequently, purely intensity based image dissimilarity measures fail in this case. Therefore we excluded the variational approach from the evaluation.

Figure 9.4: Three source views of the synthetic house dataset.

In order to obtain a 3D model, 36 triplets of views were used to create depth images using the plane-sweep approaches with either winner-takes-all or scanline optimization for depth extraction. A 5 × 5 ZNCC image similarity score was employed in the experiments. The purely local approach is further divided into two variants: a plain method taking the depth maps as they are (denoted by WTA (1)) and a conservative method marking unreliable pixels with a low matching score in the depth map as invalid (WTA (2)). Since the difference between these two variants lies only in a depth map post-processing step, the runtimes are equivalent.
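The conservative variant WTA (2) can be sketched as a simple post-processing pass over the winner-takes-all depth map; the threshold value and the invalid marker below are assumptions, since the thesis does not prescribe specific constants here.

#include <cmath>
#include <cstddef>
#include <vector>

// Mark pixels whose best matching score is too poor as invalid, so that the
// subsequent volumetric integration treats them as unfilled.
void invalidateUnreliableDepths(std::vector<float>& depth,
                                const std::vector<float>& bestScore,
                                float minScore /* e.g. a ZNCC threshold */)
{
    const float invalid = std::nanf("");   // hypothetical "unfilled" marker
    for (std::size_t i = 0; i < depth.size(); ++i)
        if (bestScore[i] < minScore)
            depth[i] = invalid;
}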

The depth maps were again combined using the volumetric integration approach, which took 5.2s. The reconstruction volume encloses the house model and its proximity, but does not include the complete ground plane.

Figure 9.5: Fused 3D models for the synthetic house dataset with respect to the depth estimation method: (a) WTA (1), (b) WTA (2), (c) scanline opt.

The purely local methods encounter problems in homogeneous regions, as expected (Figures 9.5(a) and (b)). Surprisingly, employing scanline optimization to fill the depth images in textureless areas does not yield the expected high-quality result. An explanation can be given if the depth maps displayed in Figure 9.6 are examined: the depth maps generated by the local methods contain mismatches or unreliable depth values in textureless regions (Figures 9.6(a) and (b), and recall Figure 9.4(c)).

Figure 9.6: Three generated depth maps of the synthetic house dataset: (a) WTA (1), (b) WTA (2), (c) SO. The results of the local approaches show incorrect depth estimates in textureless regions. Scanline optimization with a linear discontinuity cost fills the pixels in the depth image suboptimally due to the ambiguity of the optimal path.

Scanline optimization (Figure 9.6(c)) fills homogeneous regions with reasonable depth values, but because of the linear discontinuity cost model there is an ambiguity in perfectly homogeneous regions: in such cases, only the smoothness cost ∑ |d(x) − d(x + 1)| is minimized over a set of pixels that does not provide discriminative matching costs. This minimum is not unique, and the method may report any of the optima. Our implementation reports piecewise constant depth maps (as illustrated e.g. in the right section of Figure 9.6(c)) instead of the expected piecewise planar ones.

This surprising behavior is caused by the 1-dimensional depth optimization in combination with the linear discontinuity cost model. If a quadratic smoothness cost model is utilized, the minimum is unique even in textureless regions and yields a planar depth map. Performing a full 2-dimensional depth optimization (e.g. by graph-cut methods) again gives a unique optimum and is not vulnerable to this ambiguity.
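To make the ambiguity explicit, consider a textureless run of pixels x = 0, …, n with fixed boundary depths d(0) = a and d(n) = b and constant matching costs in between (a small constructed example, not taken from the thesis). For the linear discontinuity cost,

\[
\sum_{x=0}^{n-1} \bigl| d(x+1) - d(x) \bigr| \;\ge\; |b - a| ,
\]

with equality for every monotone profile between a and b, so the minimizer is not unique. For a quadratic smoothness cost,

\[
\sum_{x=0}^{n-1} \bigl( d(x+1) - d(x) \bigr)^2 \;\ge\; \frac{(b-a)^2}{n} ,
\]

with equality only for the uniform steps d(x) = a + x (b − a)/n, i.e. the unique linear interpolation, which explains the piecewise planar result expected from the quadratic model.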

In order to evaluate the obtained final 3D models with respect to the ground truth, two measures are employed: the model accuracy specifies the ratio of the model surface which is close to the ground truth model within a given distance threshold, and the model completeness specifies the portion of the ground truth model which is covered by the reconstructed mesh (i.e. where the reconstructed surface is close to the ground truth with respect to a given threshold). For the completeness calculation the extended ground plane is omitted from the reference model, since it is only reconstructed in the proximity of the house. Measuring the completeness of a model accurately is difficult, since small holes may not have any influence and larger holes shrink depending on the tolerated distance. Consequently, we set the distance threshold for the completeness evaluation to the order of the average inlier distance reported by the accuracy evaluation (which is about 0.2% of the diameter of the reconstructed box). The obtained values are still only approximately accurate, but they match the visual appearance of the models. For instance, the conservative winner-takes-all approach has the highest accuracy (since only reliable depth values are retained), but the lowest completeness result (unreliable regions remain unfilled).
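Both measures reduce to nearest-neighbor distance queries between point sets sampled from the two surfaces, as described in the following paragraph. A brute-force sketch of this computation, with hypothetical names and without any spatial acceleration structure, is given below:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { float x, y, z; };

// Distance from p to the closest point of 'cloud' (brute force, O(|cloud|)).
static float nearestDistance(const Point& p, const std::vector<Point>& cloud)
{
    float best = std::numeric_limits<float>::max();
    for (const Point& q : cloud) {
        float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
        best = std::min(best, dx * dx + dy * dy + dz * dz);
    }
    return std::sqrt(best);
}

// Fraction of points in 'from' lying within 'threshold' of the point set 'to'.
// With 'from' sampled from the reconstruction and 'to' from the ground truth
// this approximates the accuracy; swapping the roles approximates completeness.
float coveredFraction(const std::vector<Point>& from,
                      const std::vector<Point>& to, float threshold)
{
    std::size_t inside = 0;
    for (const Point& p : from)
        if (nearestDistance(p, to) <= threshold)
            ++inside;
    return from.empty() ? 0.0f : static_cast<float>(inside) / from.size();
}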

The surface-to-surface distance computations are approximated by converting the triangular mesh models into point sets by uniformly sampling the meshes and calculating the closest point-pairs for these sets. Table 9.2 presents the results of this evaluation. Beside the total runtime, the model accuracy and the completeness are given for two distance thresholds each. These thresholds are indicated as fractions of the diameter of the reconstructed volume.

Depth est. method      Runtime   Accuracy (1%)   Accuracy (0.5%)   Completeness (0.4%)   Completeness (0.2%)
Winner-takes-all (1)   120s      92.54%          83.7%             95.65%                75.99%
Winner-takes-all (2)   120s      99.07%          93.47%            90.78%                63.51%
Scanline opt.          170s      96.27%          90.51%            95.91%                82.30%

Table 9.2: Quantitative evaluation of the reconstructed synthetic house.

9.4 Middlebury Multi-View Stereo Temple Dataset

This dataset is one of the currently two proposed datasets with known ground-truth geometry [Seitz et al., 2006] (http://vision.middlebury.edu/mview/). The images show a replica of an ancient temple (see Figure 9.7). The ground-truth geometry was obtained by laser-scanning the miniature model. There are three variants of the dataset: first, a large set of images is provided, which contains more than 300 source views acquired using a spherical gantry and a moving camera. Additionally, two smaller subsets are supplied: a dense ring set of images consisting of 47 views, and a sparse ring with 16 images. All images have 640 × 480 pixels resolution. We used the medium sized dense ring dataset to generate the results presented below.

We provide two final results for this dataset: the first mesh, displayed in Figures 9.8(a) and (b), is created using the camera matrices and orientations supplied by the originators. Since the authors of this dataset do not claim high accuracy for their camera parameters, we additionally calculated the relative poses between the views from scratch using our multi-view reconstruction pipeline. Two views of the resulting mesh are shown in Figures 9.9(a) and (b). In both cases the same parameters for depth estimation and volumetric integration are used. The initial depth maps are computed employing a 3 × 3 SAD matching score and scanline optimization for depth extraction. 255 potential depth values are evaluated for every pixel. This procedure takes 3m7s to finish. Subsequent fusion of all depth maps into a volumetric model with 288³ voxels resolution requires another 12s to complete.

The surface mesh created with our own calculated camera matrices appears smoother and less noisy than the one based on the supplied camera poses. The drawback of camera poses computed from scratch is that the obtained 3D model is calculated with respect to a local camera coordinate system and cannot be compared with the laser-scanned model directly.


Figure 9.7: Three (out of 47) source images of the temple model dataset. The images are taken approximately evenly spaced on a circular sequence around the model.

9.5 Statue of Emperor Charles VI

Figure 9.10 displays two source views (out of 42) showing a statue of the Austrian Emperor Charles VI inside the state hall of the Austrian National Library. The source images exhibit a significant variation in brightness due to the back light coming through the large windows of the hall.

A set of 40 depth maps is generated, one for each triplet of source images, and these depth maps are subsequently fused using our volumetric depth image integration approach. We calculated the final model at two different resolutions: at first, a medium resolution model is generated from depth images with 336 × 512 pixels, using 256 × 256 × 384 voxels for volumetric integration. Further, a high resolution result at 676 × 1016 pixels and 384 × 384 × 512 voxels is created to evaluate the benefit of increased resolution. Table 9.3 lists the run-times required to generate the 40 depth maps using 250 depth hypotheses at the specified image resolution. Volumetric fusion takes 8.5s at medium resolution and 27s at high resolution, respectively.

Resolution   Depth est. method   Runtime
336 × 512    Winner-takes-all    1m40s
336 × 512    Scanline opt.       2m10s
676 × 1016   Winner-takes-all    5m30s
676 × 1016   Scanline opt.       7m40s

Table 9.3: Timing results for the Emperor Charles dataset. These figures represent the time needed to generate 40 depth maps at the specified resolution; 250 depth hypotheses are evaluated for every pixel.


Figure 9.8: Front and back view of the fused 3D model of the temple dataset based on the original camera matrices (1 095 000 triangles).

The meshes obtained at medium resolution using the winner-takes-all and the scanline optimization depth extraction methods are illustrated in Figures 9.11(a)–(d). The surface mesh generated using the simple winner-takes-all approach is essentially as good as the scanline optimization based result.

Figures 9.12(a)–(f) depict the meshes obtained at the higher resolution. Again, a winner-takes-all and a scanline optimization approach are used for depth extraction. At this resolution the WTA result exhibits more noise, as illustrated in the close-up views of the cloak in Figures 9.12(c) and (f). The corresponding depth maps generated by the WTA and SO approaches can be seen in Figure 9.13. Volumetric fusion evidently removes the mismatches occurring in the WTA-based depth images only partially, which leads to holes in the final mesh.

If one compares the outcomes of the two resolutions directly, e.g. Figure 9.11(c) and Figure 9.12(d), the increased geometric detail of the high resolution result is clearly visible. Nevertheless, the high resolution mesh containing approximately 1 000 000 triangles is too complex for real-time display and requires geometric simplification and other enhancements to be suitable for further visualization.


Figure 9.9: Front and back view of the fused 3D model of the temple dataset based on the newly calculated camera matrices (857 000 triangles).

9.6 Bodhisattva Figure

The final dataset is a set of images displaying a wooden Bodhisattva statue inside a Buddhist stupa building (Figure 9.14). These images were taken with a digital single-lens reflex camera under difficult lighting conditions. Additionally, some of the views are widely separated due to the narrow interior of the stupa. This dataset focuses directly on the digital preservation of cultural heritage, since the wooden statue weathers slowly due to atmospheric conditions. Furthermore, this and similar religious artefacts are highly sought after by collectors and consequently susceptible to theft.

The complete set of images contains 13 views of the statue. Two sequences of depth images (using scanline optimization) are generated: a medium resolution set at 512 × 768 pixels and a high resolution one at 1000 × 1504 pixels, for which a few depth maps are depicted in Figure 9.15. In both cases the number of depth hypotheses is set to 250. The medium resolution result utilized a ZNCC correlation using a 5 × 5 support window. The generation of 11 depth images using triplets of source views needed 1m12s. Volumetric fusion was applied in a 256 × 512 × 512 voxel space, yielding the mesh displayed in Figure 9.16(a). In the high resolution case a 7 × 7 aggregation window was applied for the matching cost computation, and the volumetric fusion is based on a 384 × 768 × 768 voxel space. Depth map generation took 5m to complete. The finally extracted mesh is illustrated in Figure 9.16(b).

Figure 9.10: Two views of the statue showing Emperor Charles VI inside the state hall of the Austrian National Library.

For this dataset the lower resolution mesh appears smoother and less noisy in comparison with the high resolution outcome. There are two reasons for this behavior: first, several depth maps contain a substantial amount of noise and mismatches due to the widely separated views in some triplets (e.g. Figure 9.15(d)). During volumetric fusion this noise is largely suppressed at the medium resolution. Additionally, the lack of a global smoothing term in the "greedy" depth map fusion procedure does not inhibit high variations (i.e. local noise) in the extracted surface mesh. Future work needs to address an efficient depth map integration approach which incorporates some discontinuity cost to prevent unnecessary noise in the final outcome. In any case, a feature preserving mesh simplification procedure is required to enable further processing and visualization.


Figure 9.11: Medium resolution mesh for the Charles VI dataset: (a) WTA, front view, (b) WTA, back view, (c) SO, front view, (d) SO, back view. Figures (a) and (b) show the surface mesh obtained from a winner-takes-all plane-sweep approach to depth map generation. Figures (c) and (d) illustrate the results using scanline optimization.


Figure 9.12: High resolution mesh for the Charles VI dataset. Figures (a) and (b) show the surface mesh obtained from a winner-takes-all plane-sweep approach to depth map generation. (c) displays a close-up view of the cloak revealing substantial noise in the mesh. Figures (d)–(f) illustrate the results using scanline optimization. The cloak in Figure (f) is much smoother in this setting.


Figure 9.13: Two depth maps for the same reference view of the Charles dataset generated by the winner-takes-all (a) and the scanline optimization (b) approach, respectively.

Figure 9.14: Every other one of the 13 source images of the Bodhisattva statue dataset.


Figure 9.15: Several depth images for the Bodhisattva statue.

Figure 9.16: Medium and high resolution results for the Bodhisattva statue images: (a) medium resolution (512 × 768, ≈ 1 million triangles), (b) high resolution (1000 × 1504, ≈ 2.7 million triangles). The depth images for the left model are computed at 512 × 768 pixels resolution, and the subsequent volumetric depth map integration is performed at 256 × 512 × 512 voxels. The depth map and voxel resolutions for the right model are 1000 × 1504 and 384 × 768 × 768, respectively. For this dataset the inherent smoothing induced by the lower resolution yields slightly more appealing results.


Chapter 10

Concluding Remarks

This thesis outlines high-performance approaches to several stages in the reconstruction pipeline concerned with dense depth and mesh generation using modern GPUs. Several approaches for multi-view reconstruction benefit substantially from the data-parallel computing model and the processing power of modern GPUs. The accuracy of arithmetic operations provided by the GPU is sufficient for most image processing and computer vision methods not relying on high-precision computations.

The range of described methods starts with GPU-based correlation calculation followed by a simple winner-takes-all depth extraction procedure, and extends to semi-global methods using dynamic programming and to volumetric methods merging a set of depth images into a final 3D model. So far, several important global methods for depth estimation can only partially benefit from GPUs: graph cut approaches are currently too sophisticated for substantial GPU acceleration, and loopy belief propagation methods have too high memory requirements to be useful for high-resolution reconstructions. Hence, we believe that the methods proposed in this thesis are good candidates for GPU utilization to generate high-resolution models from multiple views.

It is natural to ask whether other steps in the pipeline can be accelerated by graphics hardware as well. Several processing steps early in the pipeline, like distortion correction, basic corner extraction and similar low level image processing tasks, can easily exploit the processing power of modern GPUs (e.g. [Sugita et al., 2003] and [Colantoni et al., 2003]). Other important procedures mostly related to pose estimation, like tracking and matching of correspondences and RANSAC based relative pose estimation, require too sophisticated control flow mechanisms to be rewarding targets for the SIMD processing model offered by current GPUs. There might be the possibility of hybrid approaches for these tasks incorporating CPU and GPU processing power in equal parts. In particular, the estimation of sparse correspondences is still a relatively slow procedure within our current reconstruction pipeline. Accelerating this stage of the pipeline seems to be the most worthwhile goal for the near future. Sinha et al. [Sinha et al., 2006] recently addressed KLT tracking for video streams and SIFT key extraction using the GPU and reported substantial performance gains. Incorporating and extending these techniques is part of future investigations.

With the emergence of more general programming models for graphics hardware, more sophisticated depth estimation and other computer vision methods may become relevant targets for a GPU-based implementation. According to current technical proposals, next-generation graphics hardware will provide a more flexible and dynamic programming approach, which potentially allows more control flow and more dynamic behavior to be assigned to the GPU. Additionally, the strict locality found in our algorithms, induced by the current GPU programming model, might be softened, and more global knowledge of the views and the depth hypotheses could be incorporated into future procedures. In particular, the introduction of geometry shaders as an additional step in the rendering pipeline [Blythe, 2006] adds extended dynamic behavior by allowing vertices to be created and removed by shader programs executed on the GPU. Sophisticated use of this and other currently emerging features may yield interesting and efficient approaches to computer vision problems.

Every long-term prognosis about future graphics hardware and its non-graphical applications is highly speculative. Similar objections apply to the future of CPUs. Nevertheless we outline two recent developments which may provide some insight into future graphics and parallel processing technology in general. First, we mention the highly innovative (and unconventional) design of the Cell microprocessor [Kahle et al., 2005], which essentially consists of a traditional CPU core tightly coupled with eight SIMD co-processors providing the computing power e.g. for multimedia tasks. The most prominent use of the Cell architecture will be a video gaming console still equipped with a dedicated graphics processing unit, but the main goal of substantially enhancing the SIMD capabilities of general purpose processors is obvious. One important application of this design is the physically correct simulation of objects in computer games. Another forthcoming development in SIMD processing hardware is the unification of the previously distinct vertex and fragment shaders on GPUs. This means that the shader pipelines on the GPU can execute either vertex programs or fragment programs as requested by the application or the graphics driver software. Consequently, the shader pipelines closely resemble the SIMD co-processors of the Cell architecture. This evolution of CPUs and GPUs is partially driven by the need for efficient physics simulation engines used in modern computer games. Hence, one can expect arrays of versatile SIMD co-processors in future computer hardware, located either close to the CPU (as in the Cell model) or close to the GPU (in the unified shader case).

These developments will substantially change the programming model used to implement multimedia tasks and related high-performance applications. The current technological trends indicate that CPUs augmented with data-parallel co-processors will be the dominant future computing device. Several techniques developed to utilize the GPU for computer vision tasks can be transferred to this new architecture, whereas other performance optimizations specifically targeted at GPUs (e.g. using the z-buffer for conditional evaluation) have no general SIMD counterpart. Since every new generation of computer hardware, and graphics hardware in particular, provides a set of new features, the required frequent adaptation of GPU-based implementations will likely enable a smooth transition to future computer architectures.

Currently, the programming interface for GPU applications is a graphics library (mainly OpenGL and Direct3D). It is at least counter-intuitive and error-prone to use graphics commands to implement non-graphical methods and computations. Consequently, there are forthcoming proposals to interact with the GPU as a non-graphical device: Accelerator [Tarditi et al., 2005] provides a high-level SIMD programming model and translates the library calls into suitable fragment shaders and graphics commands of the underlying graphics library. Peercy et al. [Peercy et al., 2006] present a library which exposes the data-parallel capabilities of the GPU directly, without invocation of the system's graphics library. These trends illustrate the transition of hardware and software vendors from handling the GPU exclusively as a graphics device towards treating it as a more general parallel computing device.

Nevertheless, the main focus of future work is not the sole acceleration of computer vision methods using off-the-shelf parallel computing devices (most notably the GPU), but the enhancement of the underlying computer vision algorithms. As an example, semantic segmentation of the input images into relevant regions (facades, static objects) and irrelevant ones (sky, vegetation, moving objects) allows the exclusion of undesirable values in the depth map. Consequently, the fusion of the depth images is more robust, and the final model omits unnecessary clutter induced by negligible objects.

The presented volumetric approach to 3D model generation from several depth maps is very efficient, but yields water-tight models only in ideal cases. Additionally, the extracted meshes have poor overall smoothness due to the lack of appropriate neighborhood handling. Recently, volumetric mesh extraction approaches based on graph cuts incorporating global smoothness were proposed (e.g. [Vogiatzis et al., 2005, Hornung and Kobbelt, 2006c]), but these methods have their own difficulties besides the increased computational complexity. For instance, some volumetric graph-cut procedures work best only if a suitable visual hull is available. Furthermore, graph cut solutions prefer minimal surfaces, hence an ad-hoc ballooning term needs to be added to the cost functional. The limitations of current methods imply that there is still room for further research in range image integration.

Finally, there is often the requirement of human interaction in the reconstruction pipeline. In particular, post-processing steps like model trimming and the integration of independently reconstructed objects into one common model commonly depend on a human operator. The topic of providing user interfaces for the efficient execution of such tasks is not a direct target of future research. More promising is the integration of efficient model computation methods with manual interaction schemes in order to intervene in the depth map or 3D model generation procedure: for instance, manual labeling of unmodeled surface properties like specular highlights, combined with a real-time update of the final 3D model, may yield highly effective modeling applications.



Appendix A

Selected Publications

A.1 Publications Related to this Thesis

The original approach to mesh-based stereo reconstruction on the GPU as described in Chapter 3 can be found in [Zach et al., 2003a]. The performance of the proposed method was substantially increased using the techniques presented in [Zach et al., 2003b].

Material from Chapter 4 (plane-sweep depth estimation on the GPU) and Chapter 8 (fast volumetric integration of depth maps) appeared in [Zach et al., 2006a].

The scanline optimization implementation on the GPU (Chapter 7) is published as [Zach et al., 2006b].

A.2 Other Selected Scientific Contributions

Most work in the first half of my time as a PhD student addressed the rendering of large 3D environments, which were typically generated by remote sensing methods (e.g. satellite laser scans) and photogrammetric methods. Hence, early papers covered the task of interactive visualization of such datasets using view-dependent multi-resolution geometry.

In [Zach and Karner, 2003a] an efficient algorithm for selective refinement of view-dependent meshes is presented. View-dependent refinement of meshes typically requires a top-down traversal of a tree-like structure, which affects the obtained frame rate significantly. The proposed method is an event-driven approach to the dynamic mesh refinement procedure, which exploits temporal coherence explicitly and achieves significantly reduced refinement times.

Mapping textures on multiresolution meshes is straightforward if texture coordinates can be interpolated across all levels of detail (e.g. when only one texture is applied to the geometry). If the geometry is texture mapped with several images, the displayed level of detail is constrained, or artifacts occur if no additional processing is performed. [Zach and Bauer, 2002] and [Sormann et al., 2003] generalize clipmap-like approaches for texturing multiresolution heightfields to more general 3D models by generating a texture hierarchy in correspondence with the vertex hierarchy used for view-dependent rendering of multiresolution meshes.

The efficient external encoding of multiresolution meshes suitable for view-dependent access to relevant fractions of the complete 3D model was mainly addressed by M. Grabner [Grabner, 2003]. In [Zach et al., 2004a] we replace the originally proposed topology encoding method for multiresolution meshes with a different encoding scheme. Our new encoding method is superior in worst case examples and on real-world data sets. We prove that two vertices of a triangle can be encoded with 1 bit on average, whereas the third vertex requires O(log n) bits in the worst case.

[Zach and Karner, 2003b] again addresses the compression of model data suitable for efficient transmission over a network. This time, the compressed encoding of precomputed visibility information for walk-through applications is described. It is assumed that the user can navigate in an urban scenario with the virtual camera fixed at a predefined eye height. For every node in the view-dependent mesh hierarchy a conservative estimation of visibility is precomputed using software provided by P. Wonka and M. Wimmer [Wonka et al., 2000]. The result of this calculation is a set of visible nodes for each cell in the maneuverable space. This data essentially comprises a large binary matrix, which is appropriately encoded to be used in remote visualization applications.

Rendering large view-dependent multiresolution models in combination with many view-independent multiresolution objects was addressed in [Zach et al., 2002]. In particular, the real-time rendering of a large digital elevation model augmented with a huge number of trees is discussed. In order to achieve real-time performance, a new level of detail selection procedure is proposed, which is fast enough to assign suitable resolutions to more than 1 million objects. The digital elevation model is represented as a coarse view-dependent hierarchical level of detail, and the tree models are rendered using point-based graphics primitives. An extended version of this paper was recently published as [Zach et al., 2004b].


Bibliography

[Akbarzadeh et al., 2006] Akbarzadeh, A., Frahm, J.-M., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Merrell, P., Phelps, M., Sinha, S., Talton, B., Wang, L., Yang, Q.-X., Stewénius, H., Yang, R., Welch, G., Towles, H., Nistér, D., and Pollefeys, M. (2006). Towards urban 3D reconstruction from video. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Appleton and Talbot, 2006] Appleton, B. and Talbot, H. (2006). Globally minimal surfaces by continuous maximal flows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):106–118.

[Baker and Binford, 1981] Baker, H. H. and Binford, T. (1981). Depth from edge and intensity based stereo. In Proc. 7th Intl. Joint Conf. Artificial Intelligence, pages 631–636.

[Birchfield and Tomasi, 1998] Birchfield, S. and Tomasi, C. (1998). A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–406.

[Blythe, 2006] Blythe, D. (2006). The Direct3D 10 system. In Proceedings of SIGGRAPH 2006, pages 724–734.

[Bolz et al., 2003] Bolz, J., Farmer, I., Grinspun, E., and Schröder, P. (2003). Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In Proceedings of SIGGRAPH 2003, pages 917–924.

[Bornik et al., 2001] Bornik, A., Karner, K., Bauer, J., Leberl, F., and Mayer, H. (2001). High-quality texture reconstruction from multiple views. Journal of Visualization and Computer Animation, 12(5):263–276.

[Boykov et al., 2001] Boykov, Y., Veksler, O., and Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(11):1222–1239.

[Brown et al., 2003] Brown, M. Z., Burschka, D., and Hager, G. D. (2003). Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):993–1008.


[Brox et al., 2004] Brox, T., Bruhn, A., Papenberg, N., and Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision (ECCV), pages 25–36.

[Brunton and Shu, 2006] Brunton, A. and Shu, C. (2006). Belief propagation for panorama generation. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Buck et al., 2004] Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. (2004). Brook for GPUs: Stream computing on graphics hardware. In Proceedings of SIGGRAPH 2004, pages 777–786.

[Canny, 1986] Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698.

[Caselles et al., 1997] Caselles, V., Kimmel, R., and Sapiro, G. (1997). Geodesic active contours. Int. Journal of Computer Vision, 22(1):61–79.

[Chan and Vese, 2002] Chan, T. F. and Vese, L. A. (2002). A multiphase levelset framework for image segmentation using the Mumford and Shah model. Int. Journal of Computer Vision, 50(3):271–293.

[Chefd'Hotel et al., 2001] Chefd'Hotel, C., Hermosillo, G., and Faugeras, O. (2001). A variational approach to multi-modal image matching. In IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 21–28.

[Colantoni et al., 2003] Colantoni, P., Boukala, N., and Rugna, J. D. (2003). Fast and accurate color image processing using 3D graphics cards. In Proc. of Vision, Modeling and Visualization 2002.

[Cornelis and Van Gool, 2005] Cornelis, N. and Van Gool, L. (2005). Real-time connectivity constrained depth map computation using programmable graphics hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1099–1104.

[Criminisi et al., 2005] Criminisi, A., Shotton, J., Blake, A., Rother, C., and Torr, P. (2005). Efficient dense-stereo with occlusions and new view synthesis by four state DP for gaze correction. Technical report, Microsoft Research Cambridge.

[Crow, 1984] Crow, F. C. (1984). Summed-area tables for texture mapping. In Proceedings of SIGGRAPH 84, pages 207–212.

[Culbertson et al., 1999] Culbertson, W. B., Malzbender, T., and Slabaugh, G. (1999). Generalized voxel coloring. In Proc. ICCV Workshop, Vision Algorithms Theory and Practice, pages 100–115.



[Curless and Levoy, 1996] Curless, B. and Levoy, M. (1996). A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH '96, pages 303–312.

[Dally et al., 2003] Dally, W. J., Hanrahan, P., Erez, M., Knight, T. J., Labonté, F., Ahn, J.-H., Jayasena, N., Kapasi, U. J., Das, A., Gummaraju, J., and Buck, I. (2003). Merrimac: Supercomputing with streams. In Proceedings of SC2003.

[Darabiha et al., 2003] Darabiha, A., Rose, J., and MacLean, W. J. (2003). Video-rate stereo depth measurement on programmable hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 203–210.

[Davis et al., 2002] Davis, J., Marschner, S., Garr, M., and Levoy, M. (2002). Filling holes in complex surfaces using volumetric diffusion. In First International Symposium on 3D Data Processing, Visualization, and Transmission.

[Devernay and Faugeras, 2001] Devernay, F. and Faugeras, O. (2001). Straight lines have to be straight. Machine Vision and Applications, 13(1):14–24.

[Dixit et al., 2005] Dixit, N., Keriven, R., and Paragios, N. (2005). GPU-cuts and adaptive object extraction. Technical Report 05-07, CERTIS.

[Dominé et al., 2002] Dominé, S., Rege, A., and Cebenoyan, C. (2002). Real-time hatching. Game Developers Conference.

[Dubois and Rodrigue, 1977] Dubois, P. and Rodrigue, G. H. (1977). An analysis of the recursive doubling algorithm. High Speed Computer and Algorithm Organization, pages 299–307.

[Eisert et al., 1999] Eisert, P., Steinbach, E., and Girod, B. (1999). Multi-hypothesis, volumetric reconstruction of 3-D objects from multiple calibrated camera views. In Proc. of International Conference on Acoustics, Speech and Signal Processing, pages 3509–3512.

[Engel and Ertl, 2002] Engel, K. and Ertl, T. (2002). Interactive high-quality volume rendering with flexible consumer graphics hardware. In STAR – State of the Art Report, Eurographics '02.

[Engel et al., 2001] Engel, K., Kraus, M., and Ertl, T. (2001). High-quality pre-integrated volume rendering using hardware-accelerated pixel shading. In Eurographics / SIGGRAPH Workshop on Graphics Hardware '01, pages 9–16.

[Faugeras et al., 1996] Faugeras, O., Hotz, B., Mathieu, H., Viéville, T., Zhang, Z., Fua, P., Théron, E., Moll, L., Berry, G., Vuillemin, J., Bertin, P., and Proy, C. (1996). Real time correlation based stereo: algorithm implementations and applications. The International Journal of Computer Vision.



[Faugeras and Keriven, 1998] Faugeras, O. and Keriven, R. (1998). Variational principles, surface evolution, PDEs, level set methods, and the stereo problem. IEEE Transactions on Image Processing, 7(3):336–344.

[Faugeras et al., 2002] Faugeras, O., Malik, J., and Ikeuchi, K., editors (2002). Special Issue on Stereo and Multi-Baseline Vision. International Journal of Computer Vision.

[Felzenszwalb and Huttenlocher, 2004] Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient belief propagation for early vision. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 261–268.

[Forstmann et al., 2004] Forstmann, S., Ohya, J., Kanou, Y., Schmitt, A., and Thuering, S. (2004). Real-time stereo by using dynamic programming. In CVPR 2004 Workshop on Real-Time 3D Sensors and Their Use.

[Förstner and Gülch, 1987] Förstner, W. and Gülch, E. (1987). A fast operator for detection and precise location of distinct points, corners and centres of circular features. Proc. of the ISPRS Intercommission Workshop on Fast Processing of Photogrammetric Data, Interlaken, pages 285–301.

[Fua, 1993] Fua, P. (1993). A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications, 6:35–49.

[Garland and Heckbert, 1997] Garland, M. and Heckbert, P. S. (1997). Surface simplification using quadric error metrics. In Proceedings of SIGGRAPH '97, pages 209–216.

[Geiger et al., 1995] Geiger, D., Ladendorf, B., and Yuille, A. (1995). Occlusions and binocular stereo. International Journal of Computer Vision, 14:211–226.

[Goesele et al., 2006] Goesele, M., Curless, B., and Seitz, S. (2006). Multi-view stereo revisited. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2402–2409.

[Gong and Yang, 2005a] Gong, M. and Yang, R. (2005a). Image-gradient-guided real-time stereo on graphics hardware. In Fifth International Conference on 3-D Digital Imaging and Modeling, pages 548–555.

[Gong and Yang, 2005b] Gong, M. and Yang, Y.-H. (2005b). Near real-time reliable stereo matching using programmable graphics hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 924–931.

[Goodnight et al., 2003] Goodnight, N., Woolley, C., Lewin, G., Luebke, D., and Humphreys, G. (2003). A multigrid solver for boundary value problems using programmable graphics hardware. In Eurographics/SIGGRAPH Workshop on Graphics Hardware 2003.



[Grabner, 2003] Grabner, M. (2003). Compressed Adaptive Multiresolution Encoding. PhD thesis, Technical University Graz.

[Hadwiger et al., 2001] Hadwiger, M., Theußl, T., Hauser, H., and Gröller, M. E. (2001). Hardware-accelerated high-quality filtering on PC hardware. In Proc. of Vision, Modeling and Visualization 2001, pages 105–112.

[Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge detector. Proceedings 4th Alvey Vision Conference, pages 189–192.

[Harris and Luebke, 2005] Harris, M. and Luebke, D. (2005). SIGGRAPH 2005 GPGPU course notes.

[Harris et al., 2002] Harris, M. J., Coombe, G., Scheuermann, T., and Lastra, A. (2002). Physically-based visual simulation on graphics hardware. In Eurographics/SIGGRAPH Workshop on Graphics Hardware, pages 109–118.

[Hart and Mitchell, 2002] Hart, E. and Mitchell, J. L. (2002). Hardware shading with EXT vertex shader and ATI fragment shader. ATI Technologies.

[Heckbert, 1986] Heckbert, P. S. (1986). Filtering by repeated integration. In Proceedings of SIGGRAPH '86, pages 315–321.

[Heikkilä, 2000] Heikkilä, J. (2000). Geometric camera calibration using circular control points. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(10):1066–1077.

[Hensley et al., 2005] Hensley, J., Scheuermann, T., Coombe, G., Singh, M., and Lastra, A. (2005). Fast summed-area table generation and its applications. In Proceedings of Eurographics 2005, pages 547–555.

[Hermosillo et al., 2001] Hermosillo, G., Chefd'Hotel, C., and Faugeras, O. (2001). A variational approach to multi-modal image matching. Technical Report RR 4117, INRIA.

[Hilton et al., 1996] Hilton, A., Stoddart, A. J., Illingworth, J., and Windeatt, T. (1996). Reliable surface reconstruction from multiple range images. In European Conference on Computer Vision (ECCV), pages 117–126.

[Hirschmüller, 2005] Hirschmüller, H. (2005). Accurate and efficient stereo processing by semi-global matching and mutual information. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 807–814.

[Hirschmüller, 2006] Hirschmüller, H. (2006). Stereo vision in structured environments by consistent semi-global matching. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2386–2393.



[Hoff III et al., 1999] Hoff III, K. E., Keyser, J., Lin, M., Manocha, D., and Culver, T. (1999). Fast computation of generalized Voronoi diagrams using graphics hardware. In Proceedings of SIGGRAPH '99, pages 277–286.

[Hopf and Ertl, 1999a] Hopf, M. and Ertl, T. (1999a). Accelerating 3D convolution using graphics hardware. In Visualization 1999, pages 471–474.

[Hopf and Ertl, 1999b] Hopf, M. and Ertl, T. (1999b). Hardware-based wavelet transformations. In Workshop of Vision, Modelling, and Visualization (VMV '99), pages 317–328.

[Hornung and Kobbelt, 2006a] Hornung, A. and Kobbelt, L. (2006a). Hierarchical volumetric multi-view stereo reconstruction of manifold surfaces based on dual graph embedding. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 503–510.

[Hornung and Kobbelt, 2006b] Hornung, A. and Kobbelt, L. (2006b). Robust and efficient photo-consistency estimation for volumetric 3D reconstruction. In European Conference on Computer Vision (ECCV), pages 179–190.

[Hornung and Kobbelt, 2006c] Hornung, A. and Kobbelt, L. (2006c). Robust reconstruction of watertight 3D models from non-uniformly sampled point clouds without normal information. In Eurographics Symposium on Geometry Processing, pages 41–50.

[Jia et al., 2003] Jia, Y., Xu, Y., Liu, W., Yang, C., Zhu, Y., Zhang, X., and An, L. (2003). A miniature stereo vision machine for real-time dense depth mapping. In Conference on Computer Vision Systems (ICVS 2003), pages 268–277.

[Jung et al., 2006] Jung, Y. M., Kang, S. H., and Shen, J. (2006). Multiphase image segmentation via Modica-Mortola phase transition. Technical report, Department of Mathematics, University of Kentucky.

[Kahle et al., 2005] Kahle, J. A., Day, M. N., Hofstee, H. P., Johns, C. R., Maeurer, T. R., and Shippy, D. (2005). Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589–604.

[Kanade et al., 1996] Kanade, T., Yoshida, A., Oda, K., Kano, H., and Tanaka, M. (1996). A stereo engine for video-rate dense depth mapping and its new applications. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 196–202.

[Kautz and Seidel, 2001] Kautz, J. and Seidel, H.-P. (2001). Hardware accelerated displacement mapping for image based rendering. In Graphics Interface 2001, pages 61–70.

[Kim and Lin, 2003] Kim, T. and Lin, M. (2003). Visual simulation of ice crystal growth. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation.



[Klaus et al., 2002] Klaus, A., Bauer, J., Karner, K., and Schindler, K. (2002). MetropoGIS: A semi-automatic city documentation system. In Photogrammetric Computer Vision 2002 (PCV'02).

[Kolmogorov and Zabih, 2001] Kolmogorov, V. and Zabih, R. (2001). Computing visual correspondence with occlusions using graph cuts. In IEEE International Conference on Computer Vision (ICCV), pages 508–515.

[Kolmogorov and Zabih, 2002] Kolmogorov, V. and Zabih, R. (2002). Multi-camera scene reconstruction via graph cuts. In European Conference on Computer Vision (ECCV), pages 82–96.

[Kolmogorov and Zabih, 2004] Kolmogorov, V. and Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(2):147–159.

[Kolmogorov et al., 2003] Kolmogorov, V., Zabih, R., and Gortler, S. (2003). Generalized multi-camera scene reconstruction using graph cuts. In Fourth International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR).

[Konolige, 1997] Konolige, K. (1997). Small vision systems: Hardware and implementation. In Proceedings of 8th International Symposium on Robotic Research, pages 203–212.

[Krishnan et al., 2002] Krishnan, S., Mustafa, N., and Venkatasubramanian, S. (2002). Hardware-assisted computation of depth contours. In 13th ACM-SIAM Symposium on Discrete Algorithms.

[Krüger and Westermann, 2003] Krüger, J. and Westermann, R. (2003). Linear algebra operators for GPU implementation of numerical algorithms. In Proceedings of SIGGRAPH 2003, pages 908–916.

[Kutulakos and Seitz, 2000] Kutulakos, K. and Seitz, S. (2000). A theory of shape by space carving. Int. Journal of Computer Vision, 38(3):198–216.

[Labatut et al., 2006] Labatut, P., Keriven, R., and Pons, J.-P. (2006). A GPU implementation of level set multiview stereo. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Lanczos, 1986] Lanczos, C. (1986). The Variational Principles of Mechanics. Dover Publications, fourth edition.

[Laurentini, 1995] Laurentini, A. (1995). How far 3D shapes can be understood from 2D silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 17(2).



[Lefohn et al., 2003] Lefohn, A., Kniss, J. M., Hansen, C. D., and Whitaker, R. T. (2003). Interactive deformation and visualization of level set surfaces using graphics hardware. In Proceedings of IEEE Visualization 2003, pages 75–82.

[Lei et al., 2006] Lei, C., Selzer, J., and Yang, Y. (2006). Region-tree based stereo using dynamic programming optimization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2378–2385.

[Lévy et al., 2002] Lévy, B., Petitjean, S., Ray, N., and Maillot, J. (2002). Least squares conformal maps for automatic texture atlas generation. In Proceedings of SIGGRAPH 2002, pages 362–371.

[Li et al., 2003] Li, M., Magnor, M., and Seidel, H.-P. (2003). Hardware-accelerated visual hull reconstruction and rendering. In Proceedings of Graphics Interface 2003.

[Li et al., 2004] Li, M., Magnor, M., and Seidel, H.-P. (2004). Hardware-accelerated rendering of photo hulls. In Proceedings of Eurographics 2004, pages 635–642.

[Li et al., 2002] Li, M., Schirmacher, H., Magnor, M., and Seidel, H.-P. (2002). Combining stereo and visual hull information for on-line reconstruction and rendering of dynamic scenes. In Proceedings of IEEE 2002 Workshop on Multimedia and Signal Processing, pages 9–12.

[Lindholm et al., 2001] Lindholm, E., Kilgard, M. J., and Moreton, H. (2001). A user-programmable vertex engine. In Proceedings of SIGGRAPH 2001, pages 149–158.

[Lok, 2001] Lok, B. (2001). Online model reconstruction for interactive virtual environments. In Symposium on Interactive 3D Graphics, pages 69–72.

[Lorensen and Cline, 1987] Lorensen, W. and Cline, H. (1987). Marching Cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH '87, pages 163–170.

[Lourakis and Argyros, 2004] Lourakis, M. and Argyros, A. (2004). The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Technical Report 340, Institute of Computer Science - FORTH. Available from http://www.ics.forth.gr/~lourakis/sba.

[Lowe, 1999] Lowe, D. (1999). Object recognition from local scale-invariant features. Proc. of the International Conference on Computer Vision (ICCV), pages 1150–1157.

[Lu et al., 2002] Lu, A., Taylor, J., Hartner, M., Ebert, D., and Hansen, C. (2002). Hardware accelerated interactive stipple drawing of polygonal objects. In Proc. of Vision, Modeling and Visualization 2002, pages 61–68.



[Mairal and Keriven, 2006] Mairal, J. and Keriven, R. (2006). A GPU implementation of variational stereo. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Mark et al., 2003] Mark, W., Glanville, R., Akeley, K., and Kilgard, M. (2003). Cg: A system for programming graphics hardware in a C-like language. In Proceedings of SIGGRAPH 2003, pages 896–907.

[Matas et al., 2002] Matas, J., Chum, O., Urban, M., and Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the 13th British Machine Vision Conference, pages 384–393.

[Matusik et al., 2001] Matusik, W., Buehler, C., and McMillan, L. (2001). Polyhedral visual hulls for real-time rendering. In Proceedings of 12th Eurographics Workshop on Rendering, pages 115–125.

[Mayer et al., 2001] Mayer, H., Bornik, A., Bauer, J., Karner, K., and Leberl, F. (2001). Multiresolution texture for photorealistic rendering. In Proceedings of the Spring Conference on Computer Graphics (SCCG).

[Mendonça and Cipolla, 1999] Mendonça, P. R. S. and Cipolla, R. (1999). A simple technique for self-calibration. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1500–1506.

[Mikolajczyk and Schmid, 2004] Mikolajczyk, K. and Schmid, C. (2004). Scale and affine invariant interest point detectors. Int. Journal of Computer Vision, 60(1):63–86.

[Mitchell, 2002] Mitchell, J. L. (2002). Hardware shading on the Radeon 9700. ATI Technologies.

[Mitchell et al., 2002] Mitchell, J. L., Brennan, C., and Card, D. (2002). Real-time image space outlining for non-photorealistic rendering. In SIGGRAPH 2002. Technical Sketch.

[Moreland and Angel, 2003] Moreland, K. and Angel, E. (2003). The FFT on a GPU. In Eurographics/SIGGRAPH Workshop on Graphics Hardware 2003, pages 112–119.

[Mühlmann et al., 2002] Mühlmann, K., Maier, D., Hesser, J., and Männer, R. (2002). Calculating dense disparity maps from color stereo images, an efficient implementation. IJCV, 47:79–88.

[Mulligan et al., 2002] Mulligan, J., Isler, V., and Daniilidis, K. (2002). Trinocular stereo: a new algorithm and its evaluation. International Journal of Computer Vision, 47:51–61.

[Nagel and Enkelmann, 1986] Nagel, H.-H. and Enkelmann, W. (1986). An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 8:565–593.

[Nistér, 2001] Nistér, D. (2001). Calibration with robust use of cheirality by quasi-affine reconstruction of the set of camera projection centres. In Int. Conference on Computer Vision (ICCV), pages 116–123.

[Nistér, 2004a] Nistér, D. (2004a). An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):756–770.

[Nistér, 2004b] Nistér, D. (2004b). Untwisting a projective reconstruction. Int. Journal of Computer Vision, 60(2):165–183.

[Nistér et al., 2004] Nistér, D., Naroditsky, O., and Bergen, J. (2004). Visual odometry. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–659.

[NVidia Corporation, 2002a] NVidia Corporation (2002a). Cg language specification.

[NVidia Corporation, 2002b] NVidia Corporation (2002b). Developer relations. http://developer.nvidia.com.

[Ohta and Kanade, 1985] Ohta, Y. and Kanade, T. (1985). Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7:139–154.

[Papenberg et al., 2005] Papenberg, N., Bruhn, A., Brox, T., Didas, S., and Weickert, J. (2005). Highly accurate optic flow computation with theoretically justified warping. Technical report, Department of Mathematics, Saarland University.

[Peercy et al., 2006] Peercy, M., Segal, M., and Gerstmann, D. (2006). A performance-oriented data parallel virtual machine for GPUs. In ACM SIGGRAPH Sketches.

[Peercy et al., 2000] Peercy, M. S., Olano, M., Airey, J., and Ungar, P. J. (2000). Interactive multi-pass programmable shading. In Proceedings of SIGGRAPH 2000, pages 425–432.

[Perona and Malik, 1990] Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 12(7):629–639.

[Point Grey Research Inc., 2005] Point Grey Research Inc. (2005). http://www.ptgrey.com.



[Pollefeys et al., 1999] Pollefeys, M., Koch, R., and Van Gool, L. (1999). Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. Int. Journal of Computer Vision, 32(1):7–25.

[Pons et al., 2005] Pons, J.-P., Keriven, R., and Faugeras, O. (2005). Modelling dynamic scenes by registering multi-view image sequences. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 822–827.

[Prock and Dyer, 1998] Prock, A. and Dyer, C. (1998). Towards real-time voxel coloring. In Proc. Image Understanding Workshop, pages 315–321.

[Proudfoot et al., 2001] Proudfoot, K., Mark, W., Tzvetkov, S., and Hanrahan, P. (2001). A real-time procedural shading system for programmable graphics hardware. In Proceedings of SIGGRAPH 2001, pages 159–170.

[Rodrigues and Ramires Fernandes, 2004] Rodrigues, R. and Ramires Fernandes, A. (2004). Accelerated epipolar geometry computation for 3D reconstruction using projective texturing. In Proceedings of Spring Conference on Computer Graphics 2004, pages 208–214.

[Rudin et al., 1992] Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268.

[Sainz et al., 2002] Sainz, M., Bagherzadeh, N., and Susin, A. (2002). Hardware accelerated voxel carving. In 1st Ibero-American Symposium in Computer Graphics (SIACG 2002), pages 289–297.

[Scharstein and Szeliski, 2002] Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. Journal of Computer Vision, 47(1–3):7–42.

[Schmidegg, 2005] Schmidegg, H. (2005). Texturing 3D models from historical images. Master's thesis, Graz University of Technology.

[Seitz et al., 2006] Seitz, S., Curless, B., Diebel, J., Scharstein, D., and Szeliski, R. (2006). A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

[Seitz and Dyer, 1997] Seitz, S. and Dyer, C. (1997). Photorealistic scene reconstruction by voxel coloring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1067–1073.

[Seitz and Dyer, 1999] Seitz, S. and Dyer, C. (1999). Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173.



[Seitz and Kutulakos, 2002] Seitz, S. and Kutulakos, K. (2002). Plenoptic image editing. Int. Journal of Computer Vision, 48(2):115–129.

[Shen, 2006] Shen, J. (2006). A stochastic-variational model for soft Mumford-Shah segmentation. International Journal on Biomedical Imaging, 2006:1–14.

[Sinha et al., 2006] Sinha, S. N., Frahm, J.-M., Pollefeys, M., and Genc, Y. (2006). GPU-based video feature tracking and matching. Technical Report 06-012, Department of Computer Science, UNC Chapel Hill.

[Slabaugh et al., 2001] Slabaugh, G., Culbertson, W. B., and Malzbender, T. (2001). A survey of methods for volumetric scene reconstruction from photographs. In Int. Workshop on Volume Graphics, pages 81–100.

[Slabaugh et al., 2002] Slabaugh, G., Schafer, R., and Hans, M. (2002). Image-based photo hulls. In The 1st International Symposium on 3D Processing, Visualization, and Transmission (3DPVT).

[Slesareva et al., 2005] Slesareva, N., Bruhn, A., and Weickert, J. (2005). Optic flow goes stereo: A variational method for estimating discontinuity-preserving dense disparity maps. In Proc. 27th DAGM Symposium, pages 33–40.

[Sormann et al., 2005] Sormann, M., Zach, C., Bauer, J., Karner, K., and Bischof, H. (2005). Automatic foreground propagation in image sequences for 3D reconstruction. In Proc. 27th DAGM Symposium, pages 93–100.

[Sormann et al., 2003] Sormann, M., Zach, C., and Karner, K. (2003). Texture mapping for view-dependent rendering. In Proceedings of Spring Conference on Computer Graphics 2003, pages 146–155.

[Sormann et al., 2006] Sormann, M., Zach, C., and Karner, K. (2006). Graph cut based multiple view segmentation for 3D reconstruction. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Stegmaier et al., 2005] Stegmaier, S., Strengert, M., Klein, T., and Ertl, T. (2005). A simple and flexible volume rendering framework for graphics-hardware-based raycasting. In Proceedings of Volume Graphics, pages 187–195.

[Stevens et al., 2002] Stevens, M. R., Culbertson, W. B., and Malzbender, T. (2002). A histogram-based color consistency test for voxel coloring. In Intl. Conference on Pattern Recognition, pages 118–121.

[Strecha et al., 2003] Strecha, C., Tuytelaars, T., and Van Gool, L. (2003). Dense matching of multiple wide-baseline views. In Int. Conference on Computer Vision (ICCV), pages 1194–1201.



[Strecha and Van Gool, 2002] Strecha, C. and Van Gool, L. (2002). PDE-based multi-view depth estimation. In 1st International Symposium on 3D Data Processing, Visualization and Transmission, pages 416–425.

[Sugita et al., 2003] Sugita, K., Naemura, T., and Harashima, H. (2003). Performance evaluation of programmable graphics hardware for image filtering and stereo matching. In Proceedings of ACM Symposium on Virtual Reality Software and Technology 2003.

[Sun et al., 2005] Sun, J., Li, Y., Kang, S., and Shum, H.-Y. (2005). Symmetric stereo matching for occlusion handling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 399–406.

[Sun et al., 2003] Sun, J., Shum, H.-Y., and Zheng, N. N. (2003). Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(7):787–800.

[Tappen and Freeman, 2003] Tappen, M. F. and Freeman, W. T. (2003). Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In Int. Conference on Computer Vision (ICCV), pages 900–907.

[Tarditi et al., 2005] Tarditi, D., Puri, S., and Oglesby, J. (2005). Accelerator: simplified programming of graphics processing units for general-purpose uses via data-parallelism. Technical Report MSR-TR-2005-184, Microsoft Research.

[Tell and Carlsson, 2000] Tell, D. and Carlsson, S. (2000). Wide baseline point matching using affine invariants computed from intensity profiles. In European Conference on Computer Vision (ECCV), pages 814–828.

[Thompson et al., 2002] Thompson, C. J., Hahn, S., and Oskin, M. (2002). Using modern graphics architectures for general-purpose computing: A framework and analysis. In 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35).

[Tran and Davis, 2006] Tran, S. and Davis, L. (2006). 3D surface reconstruction using graph cuts with surface constraints. In European Conference on Computer Vision (ECCV), pages 219–231.

[Tsai and Lin, 2003] Tsai, D.-M. and Lin, C.-T. (2003). Fast normalized cross correlation for defect detection. Pattern Recognition Letters, 24(15):2625–2631.

[Turk and Levoy, 1994] Turk, G. and Levoy, M. (1994). Zippered polygon meshes from range images. In Proceedings of SIGGRAPH '94, pages 311–318.

[Veksler, 2003] Veksler, O. (2003). Fast variable window for stereo correspondence using integral images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 556–561.



[Vogiatzis et al., 2005] Vogiatzis, G., Torr, P., and Cipolla, R. (2005). Multi-view stereo via volumetric graph-cuts. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages II: 391–398.

[Wang et al., 2006] Wang, L., Liao, M., Gong, M., Yang, R., and Nistér, D. (2006). High quality real-time stereo using adaptive cost aggregation and dynamic programming. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Weickert and Brox, 2002] Weickert, J. and Brox, T. (2002). Diffusion and regularization of vector- and matrix-valued images. Inverse Problems, Image Analysis and Medical Imaging. Contemporary Mathematics, 313:251–268.

[Weickert et al., 2004] Weickert, J., Bruhn, A., Papenberg, N., and Brox, T. (2004). Variational optic flow computation: From continuous models to algorithms. In International Workshop on Computer Vision and Image Analysis, pages 1–6.

[Weiskopf et al., 2002] Weiskopf, D., Erlebacher, G., Hopf, M., and Ertl, T. (2002). Hardware-accelerated Lagrangian-Eulerian texture advection for 2D flow. In Proc. of Vision, Modeling and Visualization 2002, pages 77–84.

[Weiss and Freeman, 2001] Weiss, Y. and Freeman, W. T. (2001). On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):723–735.

[Westin et al., 2000] Westin, C.-F., Lorigo, L. M., Faugeras, O. D., Grimson, W. E. L., Dawson, S., Norbash, A., and Kikinis, R. (2000). Segmentation by adaptive geodesic active contours. In Proceedings of MICCAI 2000, Third International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 266–275.

[Wheeler et al., 1998] Wheeler, M., Sato, Y., and Ikeuchi, K. (1998). Consensus surfaces for modeling 3D objects from multiple range images. In Proceedings of ICCV '98, pages 917–924.

[Woetzel and Koch, 2004] Woetzel, J. and Koch, R. (2004). Real-time multi-stereo depth estimation on GPU with approximative discontinuity handling. In 1st European Conference on Visual Media Production (CVMP 2004), pages 245–254.

[Wonka et al., 2000] Wonka, P., Wimmer, M., and Schmalstieg, D. (2000). Visibility preprocessing with occluder fusion for urban walkthroughs. In Rendering Techniques 2000 (Proceedings of the Eurographics Workshop 2000), pages 71–82.

[Woodfill and Herzen, 1997] Woodfill, J. and Herzen, B. V. (1997). Real-time stereo vision on the PARTS reconfigurable computer. In IEEE Symposium on FPGAs for Custom Computing Machines.



[Yang et al., 2006] Yang, Q., Wang, L., and Yang, R. (2006). Real-time global stereo matching using hierarchical belief propagation. In Proceedings of the 17th British Machine Vision Conference.

[Yang and Pollefeys, 2003] Yang, R. and Pollefeys, M. (2003). Multi-resolution real-time stereo on commodity graphics hardware. In Conference on Computer Vision and Pattern Recognition (CVPR).

[Yang et al., 2004] Yang, R., Pollefeys, M., and Li, S. (2004). Improved real-time stereo on commodity graphics hardware. In CVPR 2004 Workshop on Real-Time 3D Sensors and Their Use.

[Yang et al., 2003] Yang, R., Pollefeys, M., and Welch, G. (2003). Dealing with textureless regions and specular highlights – a progressive space carving scheme using a novel photo-consistency measure. In Int. Conference on Computer Vision (ICCV), pages 576–584.

[Yang et al., 2002] Yang, R., Welch, G., and Bishop, G. (2002). Real-time consensus based scene reconstruction using commodity graphics hardware. In Proceedings of Pacific Graphics, pages 225–234.

[Yezzi and Soatto, 2003] Yezzi, A. and Soatto, S. (2003). Stereoscopic segmentation. Int. Journal of Computer Vision, 53(1):31–43.

[Zach and Bauer, 2002] Zach, C. and Bauer, J. (2002). Automatic texture hierarchy generation from orthographic facade textures. In 26th Workshop of the Austrian Association for Pattern Recognition (AAPR) 2002.

[Zach et al., 2004a] Zach, C., Grabner, M., and Karner, K. (2004a). Improved compression of topology for view-dependent rendering. In Proceedings of Spring Conference on Computer Graphics 2004, pages 174–182.

[Zach and Karner, 2003a] Zach, C. and Karner, K. (2003a). Fast event-driven refinement of dynamic levels of detail. In Proceedings of Spring Conference on Computer Graphics 2003, pages 65–72.

[Zach and Karner, 2003b] Zach, C. and Karner, K. (2003b). Progressive compression of visibility data for view-dependent multiresolution meshes. Journal of WSCG, 11(3):546–553.

[Zach et al., 2003a] Zach, C., Klaus, A., Hadwiger, M., and Karner, K. (2003a). Accurate dense stereo reconstruction using graphics hardware. In Proc. Eurographics 2003, Short Presentations.

[Zach et al., 2003b] Zach, C., Klaus, A., Reitinger, B., and Karner, K. (2003b). Optimized stereo reconstruction using 3D graphics hardware. In Workshop of Vision, Modelling, and Visualization (VMV 2003), pages 119–126.



[Zach et al., 2002] Zach, C., Mantler, S., and Karner, K. (2002). Time-critical rendering of discrete and continuous levels of detail. In Proceedings of ACM Symposium on Virtual Reality Software and Technology 2002, pages 1–8.

[Zach et al., 2004b] Zach, C., Mantler, S., and Karner, K. (2004b). Time-critical rendering of huge ecosystems using discrete and continuous levels of detail. Presence: Teleoperators and Virtual Environments.

[Zach et al., 2006a] Zach, C., Sormann, M., and Karner, K. (2006a). High-performance multi-view reconstruction. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Zach et al., 2006b] Zach, C., Sormann, M., and Karner, K. (2006b). Scanline optimization for stereo on graphics hardware. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Zebedin, 2005] Zebedin, L. (2005). Texturing complex 3D models. Master's thesis, Technical University Graz.
