
Graz University of Technology
Institute for Computer Graphics and Vision

Dissertation

High-Performance Modeling From Multiple Views Using Graphics Hardware

Christopher Zach

Graz, Austria, February 2007

Thesis supervisors

Prof. Dr. Franz Leberl, Graz University of Technology
Prof. Dr. Horst Bischof, Graz University of Technology


Abstract

Generating 3-dimensional virtual representations of real-world environments is still a challenging scientific and technological objective. Photogrammetric computer vision methods enable the creation of virtual copies from a set of acquired images. These methods are usually based on either off-the-shelf digital cameras or large-scale sensors. High-quality image-based models with minimal human assistance are achieved by ensuring sufficient redundancy in the image content. As a consequence, a large amount of image data needs to be captured and subsequently processed. Recent advances in the computational performance of graphics processing units (GPUs) and in their programmable features make these devices a natural platform for generic high-performance parallel processing. In particular, several fundamental computer vision methods can be successfully accelerated by graphics hardware due to their intrinsic parallelism and the highly efficient filtered pixel access.

The contribution of this thesis is the development of several new 3D vision algorithms intended for efficient execution on current-generation GPUs. All proposed methods address the fully automated creation of dense 2.5D and 3D geometry of objects and environments captured in a sequence of images. The range of presented methods starts with simple and purely local approaches with very efficient implementations. Furthermore, a novel formulation of a semi-global depth estimation approach suitable for fast execution on the GPU is presented. In addition, it is shown that variational methods for depth estimation can benefit significantly from GPU acceleration as well. Finally, highly efficient methods are presented that generate 3D models from the input image set, either directly from the images or indirectly via intermediate 2.5D geometry. The performance of the developed methods and their respective implementations is evaluated on artificial datasets to obtain quantitative results, and demonstrated in real-world applications as well. The proposed methods are incorporated into a complete 3D vision pipeline, which was successfully applied in several research projects.

Keywords. multiple view reconstruction, depth estimation, dynamic programming, variational depth map evolution, space carving, volumetric range image integration, general purpose programming on graphics processing units (GPGPU), GPU acceleration


Acknowledgments

Writing a PhD thesis is a large-scale project. Everybody with a PhD degree knows this simple fact from his or her own experience. Although the primary responsibility for making progress with the thesis lies with oneself, the support of many other people is essential for a successful completion. This section is the place to mention and to thank the people who helped me directly or indirectly in preparing this thesis.

First I need to thank my thesis supervisors, Prof. Franz Leberl and Prof. Horst Bischof from the Institute for Computer Graphics and Vision, for their advice during my time as a PhD student. In those times when Prof. Leberl was engaged with highly ambitious projects, Prof. Bischof provided significant guidance for my scientific work.

During my PhD time I was a researcher at the VRVis Research Center for Virtual Reality and Visualization, and this thesis was largely funded by this research company. I would like to thank my current and former colleagues from VRVis Graz and Vienna for the opportunity of this position and for their collaboration.

In particular, the full reconstruction pipeline creating virtual copies from a set of images contains many more steps than those developed by me during this thesis. Several stages in the pipeline are the work of my colleagues in the “Virtual Habitat” group at VRVis. First I would like to thank Mario, who acquired many of the source images and is mainly responsible for the first steps in the modeling pipeline. The textures for the final 3D models displayed in this thesis were generated by Lukas as part of his master thesis.

I would like to thank Dr. Ivana Kolingerova and her PhD students from Plzen, who invited me to work for several weeks in this really nice town. I spent almost two months there (including the annual WSCG conference).

During my time as a PhD student I advised three master students: Mario, Lukas and Manni, who all did valuable work for their respective projects. Mario and Lukas started working at VRVis after finishing their master theses. Manni began working at the associated computer vision institute, hence I guess I did not discourage those students too much.

Having the office located directly at the Institute for Computer Graphics and Vision proved highly beneficial. Several new ideas were developed during personal talks with the institute members. In particular, I would like to thank the current and former attendees of the espresso club, namely Bernhard, Horst, Martina, Mike, Tom (2x), Pierre and last but not least Roli, whose legendary parties will be remembered for a long, long time. Additionally, I had fruitful and interesting discussions with Peter, Matthias, Suri, Markus, Alex, and especially with Martin, who has shared the office with me for so many years now.

Finishing this thesis would not have been possible without some additional activities freeing the mind and relaxing the body. First I would like to thank all Aikido teachers and fellows on the tatami from Graz, who have worked hard for the last seven years to make my body less stiff.

Furthermore, I would like to thank Vera for persuading me to start dancing lessons with her. She is not only a clever and ambitious person, but also turned out to be a gifted partner in the dance hall.

Graz, January 2007

Christopher Zach

The problem is not that people will steal your ideas. On the contrary,
your job as an academic is to ensure that they do.

Tom's advice, according to Frank Dellaert


Contents

1 Introduction 1
  1.1 Introduction . . . 1
  1.2 Using Graphics Processing Units for Computer Vision . . . 2
  1.3 3D Models from Multiple Images . . . 5
  1.4 Overview of this Thesis and Contributions . . . 10

2 Related Work 15
  2.1 Dense Depth and Model Estimation . . . 15
    2.1.1 Computational Stereo on Rectified Images . . . 15
    2.1.2 Multi-View Depth Estimation . . . 17
    2.1.3 Direct 3D Model Reconstruction . . . 18
  2.2 GPU-based 3D Model Computation . . . 19
    2.2.1 General Purpose Computations on the GPU . . . 19
    2.2.2 Real-time and GPU-Accelerated Dense Reconstruction from Multiple Images . . . 22

3 Mesh-based Stereo Reconstruction Using Graphics Hardware 27
  3.1 Introduction . . . 27
  3.2 Overview of Our Method . . . 28
    3.2.1 Image Warping and Difference Image Computation . . . 29
    3.2.2 Local Error Summation . . . 30
    3.2.3 Determining the Best Local Modification . . . 31
    3.2.4 Hierarchical Matching . . . 31
  3.3 Implementation . . . 33
    3.3.1 Mesh Rendering and Image Warping . . . 33
    3.3.2 Local Error Aggregation . . . 35
    3.3.3 Encoding of Integers in RGB Channels . . . 35
  3.4 Performance Enhancements . . . 36
    3.4.1 Amortized Difference Image Generation . . . 36
    3.4.2 Parallel Image Transforms . . . 36
    3.4.3 Minimum Determination Using the Depth Test . . . 37
  3.5 Results . . . 38
  3.6 Discussion . . . 40

4 GPU-based Depth Map Estimation using Plane Sweeping 43
  4.1 Introduction . . . 43
  4.2 Plane Sweep Depth Estimation . . . 43
    4.2.1 Image Warping . . . 45
    4.2.2 Image Correlation Functions . . . 45
      4.2.2.1 Efficient Summation over Rectangular Regions . . . 46
      4.2.2.2 Normalized Correlation Coefficient . . . 47
    4.2.3 Sum of Absolute Differences and Variants . . . 48
    4.2.4 Depth Extraction . . . 50
  4.3 Sparse Belief Propagation . . . 50
    4.3.1 Sparse Data Structures . . . 51
      4.3.1.1 Sparse Data Cost Volume During Plane-Sweep . . . 51
      4.3.1.2 Sparse Data Cost Volume for Message Passing . . . 52
    4.3.2 Sparse Message Update . . . 52
      4.3.2.1 Sparse 1D Distance Transform . . . 53
  4.4 Depth Map Smoothing . . . 54
  4.5 Timing Results . . . 55
  4.6 Visual Results . . . 58
  4.7 Discussion . . . 58

5 Space Carving on 3D Graphics Hardware 63
  5.1 Introduction . . . 63
  5.2 Volumetric Scene Reconstruction and Space Carving . . . 64
  5.3 Single Sweep Voxel Coloring in 3D Hardware . . . 66
    5.3.1 Initialization . . . 66
    5.3.2 Voxel Layer Generation . . . 67
    5.3.3 Updating the Depth Maps . . . 69
    5.3.4 Immediate Visualization . . . 70
  5.4 Extensions to Multi Sweep Space Carving . . . 70
  5.5 Experimental Results . . . 72
    5.5.1 Performance Results . . . 72
    5.5.2 Visual Results . . . 72
  5.6 Discussion . . . 73

6 PDE-based Depth Estimation on the GPU 79
  6.1 Introduction . . . 79
  6.2 Variational Techniques for Multi-View Depth Estimation . . . 80
    6.2.1 Basic Model . . . 80
    6.2.2 Regularization . . . 82
    6.2.3 Extensions and Variations . . . 83
      6.2.3.1 Back-Matching . . . 83
      6.2.3.2 Local Changes in Illumination . . . 84
      6.2.3.3 Other Variations . . . 84
  6.3 GPU-based Implementation . . . 85
    6.3.1 Image Warping . . . 85
    6.3.2 Regularization Pass . . . 86
    6.3.3 Depth Update Equation . . . 87
      6.3.3.1 Jacobi Iterations . . . 87
      6.3.3.2 Conjugate Gradient Solver . . . 87
    6.3.4 Coarse-to-Fine Approach . . . 88
  6.4 Results . . . 88
    6.4.1 Facade Datasets . . . 88
    6.4.2 Small Statue Dataset . . . 89
    6.4.3 Mirabellstatue Dataset . . . 92
  6.5 Discussion . . . 92

7 Scanline Optimization for Stereo On Graphics Hardware 97
  7.1 Introduction . . . 97
  7.2 Scanline Optimization on the GPU for 2-Frame Stereo . . . 98
    7.2.1 Scanline Optimization and Min-Convolution . . . 98
    7.2.2 Overall Procedure . . . 101
    7.2.3 GPU Implementation Enhancements . . . 101
      7.2.3.1 Fewer Passes Through Bidirectional Approach . . . 101
      7.2.3.2 Disparity Tracking and Improved Parallelism . . . 102
      7.2.3.3 Readback of Tracked Disparities . . . 103
    7.2.4 Results . . . 104
  7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware . . . 105
    7.3.1 Input Data and General Setting . . . 106
    7.3.2 Similarity Scores based on Incremental Summation . . . 107
    7.3.3 Sensor Image Warping . . . 109
    7.3.4 Slice Management . . . 110
    7.3.5 SAD Calculation . . . 110
    7.3.6 Normalized Cross Correlation . . . 111
    7.3.7 Depth Extraction by Scanline Optimization . . . 111
    7.3.8 Memory Requirements . . . 112
    7.3.9 Results . . . 113
  7.4 Discussion . . . 115

8 Volumetric 3D Model Generation 119
  8.1 Introduction . . . 119
  8.2 Selecting the Volume of Interest . . . 120
  8.3 Depth Map Conversion . . . 121
  8.4 Isosurface Determination and Extraction . . . 124
  8.5 Implementation Remarks . . . 126
  8.6 Results . . . 126
  8.7 Discussion . . . 127

9 Results 131
  9.1 Introduction . . . 131
  9.2 Synthetic Sphere Dataset . . . 131
  9.3 Synthetic House Dataset . . . 134
  9.4 Middlebury Multi-View Stereo Temple Dataset . . . 137
  9.5 Statue of Emperor Charles VI . . . 138
  9.6 Bodhisattva Figure . . . 140

10 Concluding Remarks 147

A Selected Publications 151
  A.1 Publications Related to this Thesis . . . 151
  A.2 Other Selected Scientific Contributions . . . 151

Bibliography 153


List of Figures

1.1 Several reconstructed statue models . . . 3
1.2 A possible pipeline to create virtual models from images . . . 5
1.3 The reconstruction pipeline in an example . . . 13
2.1 The stream computation model of a GPU . . . 20
3.1 Mesh reconstruction from a pair of stereo images . . . 29
3.2 The regular grid as seen from the key camera . . . 30
3.3 The neighborhood of a currently evaluated vertex . . . 30
3.4 The correspondence between vertex indices and grid positions . . . 31
3.5 The basic workflow of the matching procedure . . . 32
3.6 The modified pipeline to minimize P-buffer switches . . . 38
3.7 Fragment program to write the depth component . . . 39
3.8 Results for the artificial earth dataset . . . 39
3.9 Results for a dataset showing the yard inside a historic building . . . 40
3.10 Results for a dataset showing an apartment house . . . 41
3.11 Visual results for the Merton college dataset . . . 42
4.1 Plane sweeping principle . . . 44
4.2 NCC images calculated on the CPU (left) and on the GPU (right) . . . 48
4.3 Determining the lower envelope using a sparse 1D distance transform . . . 53
4.4 Sparse belief propagation timing results wrt. the number of heap entries K . . . 57
4.5 Depth images with and without belief propagation . . . 60
4.6 Point models with and without belief propagation . . . 61
4.7 Point models with and without belief propagation . . . 61
4.8 Depth images with and without belief propagation . . . 62
5.1 A possible configuration for plane sweeping through the voxel space . . . 65
5.2 Perspective texture mapping using visibility information . . . 67
5.3 Evolution of depth maps for two views during the sweep process . . . 69
5.4 Plane sweep with partial knowledge from the preceding sweeps . . . 71
5.5 Timing results for the Bowl dataset . . . 74
5.6 Space carving results for the synthetic Dino dataset . . . 75
5.7 Space carving results for the synthetic Bowl dataset . . . 76
5.8 Space carving results for a statue dataset . . . 77
5.9 Voxel coloring results for a statue dataset . . . 78
6.1 Sparse structure of the linear system obtained from the semi-implicit approach . . . 88
6.2 A reconstructed historical statue displayed as colored point set . . . 89
6.3 The depth maps of the embedded statue reconstructed with the numerical schemes . . . 90
6.4 The effect of bidirectional matching on the embedded statue scene . . . 91
6.5 Two views on the colored point set showing the front facade of a church . . . 92
6.6 The three source images and the resulting unsuccessful reconstruction of the statue . . . 94
6.7 Two of the successfully reconstructed point sets using image segmentation to omit the background scenery . . . 95
6.8 An enhanced depth map and 3D point set obtained using the truncated error model . . . 95
6.9 The effect of image-driven anisotropic diffusion . . . 96
7.1 Graphical illustration of the forward pass using a recursive doubling approach . . . 100
7.2 Parallel processing of vertical scanlines using the bidirectional approach for optimal utilization of the four available color channels . . . 103
7.3 Disparity images for the Tsukuba dataset for several horizontal resolutions generated by the GPU-based scanline approach . . . 105
7.4 Disparity images for the Cones and Teddy image pairs from the Middlebury stereo evaluation datasets . . . 106
7.5 Plane-sweep approach to multiple view matching . . . 108
7.6 Plane sweep from left to right . . . 108
7.7 Spatial aggregation for the correlation window using sliding sums . . . 110
7.8 The three input views of the synthetic dataset . . . 113
7.9 The obtained depth maps and timing results for the synthetic dataset using multiview scanline optimization on the GPU . . . 114
7.10 The three input views of a wooden Bodhisattva statue and the corresponding depth maps . . . 117
8.1 Classification of the voxel according to the depth map and camera parameters . . . 122
8.2 Visual results for a small statue dataset generated from a sequence of 47 images . . . 127
8.3 Source views and isosurfaces for two real-world datasets . . . 128
9.1 Three source views of the synthetic sphere dataset . . . 132
9.2 Depth estimation results for a view triplet of the sphere dataset . . . 133
9.3 Fused 3D models for the sphere dataset wrt. the depth estimation method . . . 133
9.4 Three source views of the synthetic house dataset . . . 134
9.5 Fused 3D models for the synthetic house dataset wrt. the depth estimation method . . . 135
9.6 Three generated depth maps of the synthetic house dataset . . . 136
9.7 Three (out of 47) source images of the temple model dataset . . . 138
9.8 Front and back view of the fused 3D model of the temple dataset based on the original camera matrices . . . 139
9.9 Front and back view of the fused 3D model of the temple dataset based on newly calculated camera matrices . . . 140
9.10 Two views of the statue showing Emperor Charles VI inside the state hall of the Austrian National Library . . . 141
9.11 Medium resolution mesh for the Charles VI dataset . . . 142
9.12 High resolution mesh for the Charles VI dataset . . . 143
9.13 Two depth maps for the same reference view of the Charles dataset generated by the WTA and the SO approach . . . 144
9.14 Every other of the 13 source images of the Bodhisattva statue dataset . . . 144
9.15 Several depth images for the Bodhisattva statue . . . 145
9.16 Medium and high resolution results for the Bodhisattva statue images . . . 145


List of Tables

3.1 Timing results for the sphere dataset on two different graphic cards . . . 40
4.1 Timing results for the plane-sweeping approach on the GPU with winner-takes-all depth extraction at different parameter settings and image resolutions . . . 56
6.1 Regularization terms induced by diffusion processes . . . 82
7.1 Average timing result for various dataset sizes in seconds/frame . . . 104
7.2 Runtimes of GPU-scanline optimization using a 9 × 9 NCC at different resolutions using three views . . . 114
9.1 Quantitative evaluation of the reconstructed spheres . . . 134
9.2 Quantitative evaluation of the reconstructed synthetic house . . . 137
9.3 Timing results for the Emperor Charles dataset . . . 138


Chapter 1

Introduction

Contents
  1.1 Introduction . . . 1
  1.2 Using Graphics Processing Units for Computer Vision . . . 2
  1.3 3D Models from Multiple Images . . . 5
  1.4 Overview of this Thesis and Contributions . . . 10

1.1 Introduction

Creating a 3D virtual representation of a real object or scene from images or other sensory data has many important real-world applications, ranging from city planning tasks performed by surveying offices, to virtual conservation of historic buildings and objects, to entertainment and gaming applications creating virtual models of real and well-known locations. Consequently, automated and reliable 3D model generation work-flows for data acquired by active and passive sensors are still an active research topic. In particular, creating 3D representations of real objects solely from multiple images is a challenging task, since the completely automated work-flow is based only on passive sensory data.

The development of suitable algorithms and methods for a multi-view reconstruction pipeline depends substantially on the objects of interest and on the number and quality of the acquired images. In order to enable a fully automated work-flow, the images must contain substantial redundancy, i.e. the same 3D features must appear in several images. Furthermore, static and rigid objects are assumed in this work to make the traditional multiple view approaches for image registration applicable. A further question addresses the intended accuracy of the obtained models. As explained later in more detail, the major objectives of the methods developed in this thesis are achieving high performance for immediate visual feedback to the user and attaining sufficient accuracy for photorealistic visualization of the virtual models. Dense meshes and depth maps generated from multiple views are usually not directly suitable for accurate 3D measurements, since the achievable accuracy, especially in low-textured regions, is limited. Nevertheless, further knowledge about the object of interest enables, e.g., fitting geometric primitives into the dense mesh, potentially yielding higher accuracy.

The methods proposed in our work-flow are mainly designed for typical close-range imagery, but they are not strictly limited to these settings. In order to illustrate the kind of datasets to be reconstructed using our modeling pipeline, we first give a few examples of virtual models generated by employing the proposed work-flow. Figure 1.1 displays three 3D models generated solely from multiple images using the methods proposed in this thesis in several stages. In particular, efficient dense depth estimation methods (Chapters 4 and 7) were applied to obtain 2.5D height-fields, which were subsequently fused into a final 3D model using a volumetric approach (Chapter 8). All procedures in the 3D reconstruction pipeline to create 3D models solely from images are briefly outlined in Section 1.3.

The models displayed in Figure 1.1 are partially used for a historical documentation system∗. The generated models are high-resolution 3D meshes, which are intended for visualization when combined with a photorealistic texture.

∗ www.josefsplatz.info

1.2 Using Graphics Processing Units for Computer Vision

In this thesis we propose employing the computing power of modern programmable graphics processing units (GPUs) for several essential stages in the 3D reconstruction pipeline. One goal of this work is fast visual feedback to the human operator, who can immediately judge the quality of the results and may optionally adjust suitable parameters if necessary. Further, it is indispensable to have a substantial amount of redundancy in the image content when applying the current methods for reconstruction from multiple views in order to achieve high quality models. This implies that full 3D modeling of even a single object typically requires at least tens of images to be processed. Fast processing of these image sets is desirable, since obtaining the final model after two or 20 minutes makes a substantial difference.† If special-purpose hardware (mainly graphics processing units, but also digital signal processors (DSPs) and field-programmable gate arrays (FPGAs)) is employed in computer vision methods, several types of application can be distinguished:

1. The first scenario enforces real-time response within specified temporal limits, and special-purpose hardware provides the required processing power. Much of the initial research on accelerating computer vision methods is driven by the real-time needs of the particular application.

2. The main objective in the second setting is faster (but not necessarily real-time) processing by using special hardware intensively. Since the computational accuracy and the programming model of special-purpose hardware are often limited, the quality of the result may be reduced compared with the outcome of CPU implementations. Finding an appropriate trade-off between higher performance and the resulting quality degradation is the challenge in this setting. Most methods proposed in this thesis fall into this category.

3. Finally, special-purpose hardware can be used purely as an auxiliary processing unit executing only fractions of the overall method. In this case there is typically no degradation in the quality of the result, but the achieved performance gain can be limited. Special-purpose hardware usually performs its computation asynchronously to the main CPU, hence a load-balanced implementation employing both processing units concurrently gives the largest gain. Most computer vision methods must be redesigned in order to benefit from this combined processing power.

† Especially if the outcome is unsatisfying.

Figure 1.1: Several reconstructed statue models generated by our high-performance modeling pipeline. (a) Small statue of St. Barbara; (b) Emperor Joseph, Josephsplatz; (c) Emperor Karl, Josephsplatz. In (a) the model of a small statue depicting St. Barbara is shown. Figure (b) illustrates the model of an outdoor statue of Emperor Joseph. Finally, (c) shows the virtual model of the Emperor Karl statue inside the Austrian National Library. The displayed models are not post-processed (e.g. smoothed or geometrically simplified). In (a) and (c) some noise and clutter can be seen, which can be removed by incorporating silhouette data.

With the general availability of programmable graphics processing units and their large processing power, it is natural that modern graphics hardware also attracts many researchers who want to accelerate their non-graphical applications. We focus on programmable graphics hardware as a computing device for the following reasons:

• Driven by the needs of the gaming industry, graphics hardware currently evolves much faster than traditional CPUs or other processing devices. Selected numerical operations perform almost 10 times faster on high-end graphics hardware than on high-end CPUs.

• A reasonably fast graphics processing unit is nowadays built into many consumer personal computers. Hence, the necessary hardware equipment is available to virtually everyone.

• Standardized programming interfaces working with hardware from different vendors have recently become available. This allows our procedures to execute on a wider range of hardware not limited to a specific vendor. Additionally, the development cycle is eased by multi-vendor programming interfaces and tools.

• While performing non-graphical computations, the GPU can be used directly to display intermediate and final results to the operator, since the necessary data is already stored in GPU memory.

Due to these factors, modern graphics hardware is currently an ideal target platform for high-performance parallel computing.

Note that the rapid development of new features built into every upcoming generation of graphics hardware requires a constant adaptation of GPU-based methods to obtain maximal performance. Consequently, a continuous redesign of GPU-based implementations is still necessary, since new features may enable significant performance improvements, and various techniques to increase the speed on current hardware may become obsolete in the next generation of graphics hardware. Nevertheless, we assume a stabilizing feature set for GPUs in the medium term.

Using the GPU as a major processing unit for non-graphical problems allows direct visualization of intermediate and final results without an additional performance penalty. We employ this feature in most of our proposed reconstruction methods to give the user direct visual feedback showing the progress of the procedure. Whether immediate visual feedback (i.e. after a few seconds at most) is available depends on the reconstruction pipeline as well. Relatively simple methods, e.g. those developed for small-baseline image sets yielding a depth map, allow sequential processing of the whole dataset, and the first depth images are available with little delay. In these cases the provided intermediate results have full resolution, but refer only to a fraction of the final model. Sophisticated multiple-view methods incorporating all images simultaneously often do not have this fine granularity and generally provide no intermediate result (at full resolution) to the human operator. Typically, a coarse-to-fine scheme forms the basis of these methods, and intermediate results at coarser resolutions can be shown to the operator.


In any case, when processing larger datasets with different characteristics and from different sources, the opportunity to evaluate the outcome of the whole modeling pipeline visually at early processing stages proves very useful.

Although graphics processing units have very high computing power, the programming model of graphics hardware is limited. Consequently, the set of computer vision methods suitable for full acceleration by GPUs is restricted. For example, several highly sophisticated dense depth estimation methods are currently beyond the capabilities of programmable graphics hardware, or allow acceleration of only fractions of the whole procedure in the best case. Hence, only relatively simple (but still nontrivial) computer vision methods can fully benefit from graphics processing units so far.

Nevertheless, in many cases the 3D models created by our high-performance work-flow have sufficient quality for further processing and photorealistic display of the virtual models. The main contribution of this thesis consists of the adaptation of several multi-view reconstruction methods to enable an efficient implementation using graphics hardware in the first place. Further, the actual efficiency and the quality of the obtained 3D models are demonstrated on multiple real-world datasets.

1.3 3D Models from Multiple Images

The creation of virtual 3D models of real objects from a set of digital images requires a pipeline of several stages. The set of procedures applied in this pipeline depends on the actual setup and on the intended use of the generated model. The steps performed to create many of the virtual models shown in this thesis are illustrated in Figure 1.2.

[Figure 1.2: A possible pipeline to create virtual models from images. Digital images pass through feature extraction (features, POIs), correspondence estimation (multi-view geometry, sparse model), dense depth estimation (depth images), multi-view depth integration (raw 3D geometry), geometry processing (refined 3D geometry) and texturing, yielding the textured 3D model.]

The steps in this pipeline are suitable for reconstructing a 3D object from many small-baseline images taken with a high-quality and already calibrated digital single-lens reflex camera. If the images are recorded with a digital video camera or a cheap digital consumer camera, several (especially early) stages in the pipeline will be substantially different. We describe the individual processing steps in this pipeline briefly and outline the necessary adaptations for different source material.
adaptions in case of different source material.


Camera Calibration and Self-Calibration

The term camera calibration often refers to two related, but nevertheless distinct steps to obtain several parameters of the employed digital camera and its lens system. The first procedure determines lens distortion parameters to remove the deviations in the image induced by the optical lenses. Knowledge of the lens distortion and subsequent resampling of the source images allow the application of the simple pinhole camera model in the successive processing stages. The second part of the camera calibration step addresses the determination of the main parameters of the now applicable idealized pinhole camera model. These parameters are typically comprised in a 3-by-3 upper triangular matrix

K = \begin{pmatrix} f & s & x_0 \\ 0 & a f & y_0 \\ 0 & 0 & 1 \end{pmatrix}.

Knowledge of this matrix allows the obtained 3D reconstructions to reside in a metric space, i.e. the obtained angles and length ratios correspond to the ones of the true model. Without additional knowledge it is not possible to determine the overall scale (or object size) solely from images.

The most important parameter in this matrix is the focal length f. If the focal length is incorrectly estimated, the resulting 3D model is severely distorted. The skew parameter s is determined by the x- and y-axes of the sensor pixels and is very close to zero for all practical cameras. Many calibration and especially self-calibration techniques assume orthogonal sensor axes and consequently s = 0. The aspect ratio parameter a is one for square-shaped sensor pixels, which is a very common assumption. The intersection of the optical axis with the image plane is called the principal point (x_0, y_0) and is usually close to the image center. Accurate estimation of the principal point is difficult (since moving the principal point can be largely compensated by a world-space translation), but the quality of the 3D model is only weakly affected by an incorrect principal point.
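To make the role of these intrinsic parameters concrete, the following minimal sketch (Python with NumPy; the numerical values are purely illustrative and not taken from any camera used in this thesis) projects a point given in camera coordinates onto the image plane using the pinhole model with the matrix K defined above. Lens distortion is assumed to have been removed beforehand, as described above.

    import numpy as np

    # Illustrative intrinsic parameters (assumed values, not a real calibration):
    f, s, a = 1000.0, 0.0, 1.0      # focal length, skew, aspect ratio
    x0, y0 = 320.0, 240.0           # principal point near the image center

    K = np.array([[f,     s, x0],
                  [0.0, a*f, y0],
                  [0.0, 0.0, 1.0]])

    def project(K, X_cam):
        """Project a 3D point given in camera coordinates to pixel coordinates."""
        x = K @ X_cam               # homogeneous image coordinates
        return x[:2] / x[2]         # perspective division

    print(project(K, np.array([0.1, -0.2, 2.0])))   # -> [370. 140.]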

Since we focus mainly on generating 3D models from images taken with precalibrated cameras, a standard camera calibration procedure [Heikkilä, 2000] using predefined targets is typically employed in our work-flow. Several images of a planar target with known circular control points are taken, and camera matrices and lens distortion parameters are determined using a nonlinear optimization approach. The advantage of using precalibrated cameras is the high accuracy of the estimated intrinsic parameters of the camera. Hence, the subsequently calculated relative orientation and the dense depth estimation are based on reliable camera parameters and yield high quality results.

On the other hand, good calibration results are mainly available for high-quality cameras, and usually fixed lenses set to infinite focus are required. A work-flow based on target calibration is only partially applicable to cheap consumer cameras with zooming and automatic focusing, and it typically fails for video sequences.

Self-calibration methods attempt to recover the intrinsic camera parameters solely from image information like correspondences between multiple views. Radial distortion parameters can be determined even from single images using extracted 2D lines [Devernay and Faugeras, 2001], but for real datasets some manual intervention is often necessary in order to connect short line segments belonging to the same object line [Schmidegg, 2005]. Of course, this approach requires that e.g. a building with dominant feature lines or even a printed page with straight lines is captured by the camera.

During self-calibration the parameters of the pinhole camera model are determined by utilizing certain analytic properties of the epipolar geometry. Several self-calibration methods start with a projective reconstruction based on point correspondences and the induced fundamental matrices between the images. The inherent projective ambiguity can be resolved using algebraic invariants and reasonable assumptions on the camera model (like zero skew and square pixels) [Pollefeys et al., 1999, Nistér, 2001, Nistér, 2004b]. The main difficulty of these approaches is the creation of an initial accurate and outlier-free projective reconstruction, since the self-calibration procedures are very sensitive to incorrect input data. A simple self-calibration method not requiring a projective 3D reconstruction is proposed in [Mendonça and Cipolla, 1999]. This approach refines the intrinsic camera parameters to upgrade the supplied fundamental matrices to essential matrices, which have stronger algebraic properties. The essential matrix encodes the relative pose between two views and has fewer degrees of freedom than the fundamental matrix. In particular, the two non-zero singular values of an essential matrix are equal. This property is utilized in [Mendonça and Cipolla, 1999] to adjust the initially provided camera intrinsic parameters such that the non-zero singular values of the upgraded fundamental matrices are as close as possible. We optionally employ this method even in the calibrated case to refine the camera intrinsic parameters for highest accuracy.
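As an illustration of the criterion exploited by [Mendonça and Cipolla, 1999] (a sketch of the underlying idea only, not their implementation or ours), the following Python/NumPy function measures how far the upgraded matrices deviate from having two equal non-zero singular values; a generic optimizer can then minimize this cost over the unknown intrinsic parameters, starting in our case from the target-calibration estimate of K.

    import numpy as np

    def self_calibration_cost(F_list, K):
        """Normalized gap between the two largest singular values of each
        upgraded matrix E = K^T F K, assuming a single camera with shared
        intrinsics. For correct intrinsics K, every E is an essential matrix
        and the gap vanishes (cf. Mendonca and Cipolla, 1999)."""
        cost = 0.0
        for F in F_list:
            E = K.T @ F @ K
            s = np.linalg.svd(E, compute_uv=False)   # singular values, descending
            cost += (s[0] - s[1]) / s[1]
        return cost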

Feature Extraction

Feature extraction selects image points or regions that provide significant structural information and can be identified in other images showing the same objects of interest. Commonly used point features are Harris corners [Harris and Stephens, 1988] and Förstner points [Förstner and Gülch, 1987]. Point features are well suited for sparse correspondence search, but extracting lines may be beneficial for images showing man-made structures. Instead of extracting isolated corner points, a set of edge elements (edgels for short) is determined [Canny, 1986] and subsequently grouped to obtain geometric line segments.

If the provided images are taken from rather different positions, more advanced features and local image descriptors are required. In particular, the projected size and shape of objects varies substantially in wide-baseline setups, which is addressed by scale- and affine-invariant feature detectors and descriptors, including the scale invariant feature transform [Lowe, 1999], intensity profiles [Tell and Carlsson, 2000], maximally stable extremal regions [Matas et al., 2002] and scale- and affine-invariant Harris points [Mikolajczyk and Schmid, 2004].

In our current work-flow we utilize Harris corners as primary point features, which are extended with either local image patches or intensity profiles as feature descriptors.
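For reference, a minimal CPU sketch of the Harris corner response (Python with NumPy/SciPy; the constant k and the smoothing scale are common textbook defaults, not the settings of our work-flow):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def harris_response(img, k=0.04, sigma=1.0):
        """Harris corner response R = det(M) - k * trace(M)^2, where M is the
        Gaussian-smoothed structure tensor of the image gradients."""
        gy, gx = np.gradient(img.astype(np.float64))
        Ixx = gaussian_filter(gx * gx, sigma)
        Iyy = gaussian_filter(gy * gy, sigma)
        Ixy = gaussian_filter(gx * gy, sigma)
        det = Ixx * Iyy - Ixy ** 2
        trace = Ixx + Iyy
        return det - k * trace ** 2

Corner candidates are then taken as local maxima of the response above a threshold; as noted above, each corner is additionally described by a local image patch or an intensity profile.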

Correspondence and Pose Estimation

In order to relate a set of images geometrically it is necessary to find correspondences, i.e. the images of identical scene points. For the task of calculating the relative orientation between images it is suitable to extract features with good point localization, as provided by the feature extraction step. In a calibrated setting the relative orientation between two views can be calculated from five point correspondences; hence a RANSAC-based approach is used for robust initial estimation of the relative pose between two adjacent views. In order to test many samples, an efficient procedure for relative pose estimation is utilized [Nistér, 2004a]. With the knowledge of the relative poses between all consecutive views and of corresponding point features visible in at least three images, the orientations of all views in the sequence can be upgraded to a common coordinate system. The camera poses and the sparse reconstruction, consisting of 3D points triangulated from point correspondences, are refined using a simple but efficient implementation of sparse bundle adjustment [Lourakis and Argyros, 2004]. This step concludes the pipeline to establish the 3D relationship for a sequence of images. The essential data generated by this pipeline are distortion-free images and the camera matrices relating positions in 3D space with 2D image locations.

In the case of video sequences it is sufficient to track simple point features over time and to apply a RANSAC scheme to obtain the relative poses of the images, which can optionally be accomplished in real time [Nistér et al., 2004]. In our setting, targeted at off-line reconstructions using high-resolution images, real-time behavior for determining the geometric relationship between the views is not necessary. Nevertheless, high processing performance of these early reconstruction stages is relevant due to the number of images taken. Even reconstructing a small, isolated object like a statue easily results in 50 images of that object, which must be integrated into a common coordinate system.
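The robust pose estimation described above follows the generic RANSAC pattern sketched below (Python/NumPy). The functions solve_essential_5pt and count_inliers are placeholders standing in for the minimal five-point solver [Nistér, 2004a] and an inlier test such as a thresholded Sampson error; neither is reproduced here.

    import numpy as np

    def ransac_relative_pose(pts1, pts2, solve_essential_5pt, count_inliers,
                             n_iter=500, seed=0):
        """Generic RANSAC loop for calibrated two-view relative pose.
        `solve_essential_5pt` returns candidate essential matrices for a
        5-point sample; `count_inliers` scores a hypothesis on all points.
        Both are supplied by the caller (placeholders in this sketch)."""
        rng = np.random.default_rng(seed)
        best_E, best_score = None, -1
        for _ in range(n_iter):
            sample = rng.choice(len(pts1), size=5, replace=False)
            for E in solve_essential_5pt(pts1[sample], pts2[sample]):
                score = count_inliers(E, pts1, pts2)
                if score > best_score:
                    best_E, best_score = E, score
        return best_E, best_score

The winning hypothesis is subsequently refined together with the triangulated 3D points by sparse bundle adjustment, as described above.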

Foreground Segmentation

If the 3D reconstruction of individual or free-standing objects is desired, an image segmentation procedure separating foreground objects from the unwanted background is suitable. If the result of this segmentation step accurately represents the silhouette of the object of interest, any shape-from-silhouette technique [Laurentini, 1995, Lok, 2001, Matusik et al., 2001, Li et al., 2003] can be applied to obtain a first coarse 3D model called the visual hull. When confronted with many small-baseline images, manual segmentation of foreground pixels against a complex background is a tedious task. Hence, an automated or semi-automatic approach to generate the object silhouettes is reasonable in these cases. Specifying an initial object silhouette and propagating it through the image sequence is described in [Sormann et al., 2005, Sormann et al., 2006]. Silhouette information is partially used in the subsequent dense matching procedures to suppress unintended fragments in the final model.
matching procedures to suppress unintended fragments in the final model.


Dense Depth Estimation

With the knowledge of the camera parameters and the relative poses between the source views, dense correspondences for all pixels of a particular key view can be estimated. Since the epipolar geometry is already known, this procedure is basically a one-dimensional search along the epipolar line for every pixel. Triangulation of these correspondences results in a dense 3D model, which reflects the true surface geometry of the captured object in ideal settings.

In order to simplify the depth estimation task and to make it more robust, almost all dense depth estimation methods assume that opaque surfaces with diffuse reflection properties are to be reconstructed. In some approaches the lighting conditions and the exposure settings of the camera may change between the captured views to some degree. The depth map for a particular key view is usually estimated from a set of nearby views having a large overlap in their image content.

The major part of this thesis addresses the generation of dense depth maps, in particular Chapters 3, 4, 6 and 7. The main differences between dense depth estimation approaches in general are the utilized image dissimilarity function, which ranks potential correspondences on the epipolar line, and the handling of textureless regions, where the dissimilarity score is ambiguous and unreliable. Both factors influence the range of potential applications for a method and its performance in terms of time and 3D model quality. The main contribution of the chapters discussing dense depth estimation is the efficient generation of depth maps by utilizing the computational power and programming model of modern graphics hardware. The presented methods and implementations include several dissimilarity scores and different approaches to cope with regions containing indiscriminative surface texture.

Multiview Depth Integration The set of depth images obtained from dense depth estimation needs to be combined in order to obtain a consistent final geometric model of the captured scene or object. If we assume a redundancy of depth information, potential outliers generated by the previous depth estimation procedure can be detected and removed at this point. A successful method for multiple depth map fusion is the volumetric range image integration approach [Curless and Levoy, 1996, Wheeler et al., 1998]. Chapter 8 describes our fast depth integration procedure. Alternatively, proper 3D models can be generated directly using voxel coloring methods (see Chapter 5).

Geometry Processing Depending on the actual depth image integration method, the obtained 3D mesh may contain holes and may still appear somewhat noisy. Furthermore, the generated mesh is almost always over-tessellated and is not directly appropriate for further processing or visualization. Consequently, a final geometry processing step may include mesh simplification techniques and other mesh refinement and cleaning procedures. In particular, we apply a mesh simplification tool [Garland and Heckbert, 1997] to reduce the geometric complexity of the model.



Photorealistic Texturing The simplified and enhanced geometry of the imaged object still lacks an appropriate texture for photorealistic display within virtual scenes. Texture map generation for arbitrary 3D shapes requires cutting the original polygonal representation into several disk-like patches. Each of these patches has its own texture coordinate mapping associated with it. In order to obtain few distortions and better visual quality, these patches should preferably be flat. Our implementation [Zebedin, 2005] combines the texture atlas generation procedure described in [Lévy et al., 2002] with robust multi-view texturing techniques in the presence of occlusions [Mayer et al., 2001, Bornik et al., 2001]. If a surface element is visible in several images (which is usually the case), unmodeled occlusions can be detected and removed using a robust color averaging method. Additionally, the orientation of a surface patch with respect to the source images and its projected footprint provide reliability information, which can be used to weight the color contributions from the source images.
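The weighting idea can be sketched for a single surface element as follows. This is only an illustration of orientation- and footprint-based weighting combined with a simple robust rejection step; it is not the procedure of [Zebedin, 2005], and the function name, inputs and outlier tolerance are assumptions.

```python
import numpy as np

def fuse_texel_color(colors, view_dirs, normal, footprints, outlier_tol=30.0):
    """Robustly average per-view color samples for one surface element.

    colors:     Nx3 RGB samples from the views seeing the element.
    view_dirs:  Nx3 unit vectors from the element towards each camera center.
    normal:     unit surface normal (3-vector).
    footprints: length-N projected pixel footprint per view (larger = better sampled).
    """
    colors = np.asarray(colors, dtype=float)
    # Reliability weight: favor head-on views with a large projected footprint.
    w = np.clip(view_dirs @ normal, 0.0, None) * np.asarray(footprints, dtype=float)
    # Robust step: discard samples far from the per-channel median,
    # a simple stand-in for rejecting unmodeled occlusions.
    keep = np.linalg.norm(colors - np.median(colors, axis=0), axis=1) < outlier_tol
    if not np.any(keep):
        keep = np.ones(len(colors), dtype=bool)
    w = w[keep]
    return (w[:, None] * colors[keep]).sum(axis=0) / max(w.sum(), 1e-9)
```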

An Illustrative Example We illustrate various stages of this pipeline with a statue example in Figure 1.3. In addition to two (out of 47) input images, we show two dense depth estimation results based on a GPU-accelerated plane-sweep ((c) and (d)). These small-baseline reconstructions are still noisy and contain outliers. Volumetric depth image integration uses all available depth images to remove the artifacts and creates a suitable geometry representing the statue (images (e) and (f)). Finally, the decimated and textured mesh is illustrated ((g) and (h)).

With this coarse presentation of the modeling pipeline we have also provided a more in-depth description of the various stages in the work-flow which are not directly related to this thesis.

1.4 Overview of this Thesis and Contributions

Chapter 2 presents work and publications related to this thesis. It is divided into two major sections: Section 2.1 presents important approaches and work focusing on dense depth estimation and computational stereo in general. From the vast number of publications in this field only a few seminal ones are briefly presented. Some of these form the basis for our procedures and are described in more detail in the appropriate chapters. Section 2.2 gives a general overview of GPU-accelerated approaches and algorithms that have appeared in recent years. Furthermore, several research lines for real-time and GPU-based methods for computational stereo and multi-view reconstruction are presented.

Our first computational stereo method accelerated by graphics hardware is described in Chapter 3. This dense stereo reconstruction procedure is essentially an iterative local mesh refinement method to generate a surface consistent with the given views. The main motivation for this approach is the fast projective texturing capability provided by graphics hardware since its beginnings. With the emergence of programmable GPUs, it became possible to calculate simple image dissimilarity functions on the GPU as well. CPU intervention is necessary to update the current mesh hypothesis according to the determined best local modifications and to occasionally smooth the mesh. Since this approach works on meshes, it is the only method presented in this thesis making extensive use of vertex programs. The obtained software performs reconstructions at interactive or near real-time rates.

This chapter contains material from two publications ([Zach et al., 2003a] and [Zach et al., 2003b]).

Note that all other procedures presented in the following chapters are performed purely on the graphics hardware, with the CPU only executing the flow control for the GPU routines. Given the source images, the camera parameters and the poses, the full reconstruction pipeline up to the final 3D model visualization runs entirely on the graphics hardware, and no expensive data transfer from GPU memory to main memory is necessary. Consequently, these methods are perfectly suited for fast visual feedback to the human operator.

Plane-sweep methods for depth estimation are still the most suitable approaches for efficient implementation on the GPU. So far, most algorithms presented in the literature require images with exactly the same lighting conditions, since very simple correlation measures like the sum of absolute differences (SAD) or the sum of squared differences (SSD) are utilized. In Chapter 4 we propose an approximated zero-mean normalized sum of absolute differences correlation function, which produces results similar to the widely used NCC function and can be calculated more efficiently on current generation graphics hardware. Using GPU-based summed area tables (also known as integral images), the computation time for this image correlation measure is independent of the template window size. Furthermore, a sparse belief propagation method is proposed to obtain depth maps incorporating smoothness constraints. Material from this chapter can be found in [Zach et al., 2006a].
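The window-size independence stems directly from the summed area table: after a single prefix-sum pass, the cost aggregated over any rectangular window can be obtained with four lookups. The following CPU-side sketch illustrates just this building block (the thesis' GPU version additionally uses the approximated zero-mean normalized SAD; the cost image below is an arbitrary placeholder).

```python
import numpy as np

def summed_area_table(cost):
    """Inclusive 2D prefix sums of a per-pixel cost image."""
    return cost.cumsum(axis=0).cumsum(axis=1)

def box_sum(sat, y0, x0, y1, x1):
    """Sum of the cost over the window [y0, y1] x [x0, x1] (inclusive)
    using four table lookups, independent of the window size."""
    total = sat[y1, x1]
    if y0 > 0:
        total -= sat[y0 - 1, x1]
    if x0 > 0:
        total -= sat[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += sat[y0 - 1, x0 - 1]
    return total

# Example: aggregate an absolute-difference cost over an 11x11 window.
ref = np.random.rand(240, 320).astype(np.float32)      # placeholder images
warped = np.random.rand(240, 320).astype(np.float32)
sat = summed_area_table(np.abs(ref - warped))
aggregated = box_sum(sat, 100, 100, 110, 110)
```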

Chapter 5 describes how a voxel-coloring technique can be executed entirely on graphics hardware by combining plane-sweep approaches with correct visibility handling. Thus, 3D volumetric models from many images can be obtained at interactive rates. Additionally, several voxel-coloring passes can be applied in orthogonal directions to obtain true 3D models from a complete sequence around the object of interest. However, this particular space carving technique on the GPU requires a 3D volume texture to be stored in video memory, thereby limiting the resolution of the voxel space.

A very fast variational approach to depth estimation is presented in Chapter 6. At first sight it seems unlikely that graphics hardware can accelerate the numerical calculations required to solve the partial differential equations derived from variational formulations of depth estimation. However, it turns out that the current programming features of GPUs substantially decrease the run-time of iterative PDE solvers on regular grids. Variational depth estimation methods can provide very high quality models, but they are very sensitive to parameter settings and to the initial depth hypothesis in general, hence immediate feedback is very useful to a human operator.
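As an illustration of why such iterative solvers map well to the GPU, the sketch below performs plain Jacobi sweeps for a discretized equation of the form u - λΔu = f on a regular grid: every pixel is updated independently from its four neighbors, which is exactly the access pattern of a per-pixel fragment program. This is a generic example, not the variational depth method of Chapter 6, and the parameter values are arbitrary.

```python
import numpy as np

def jacobi_sweeps(f, lam=10.0, iterations=200):
    """Solve u - lam * laplace(u) = f on a regular grid with Jacobi iterations."""
    u = f.copy()
    for _ in range(iterations):
        # Neighbor values with replicated boundaries (Neumann-like handling).
        up    = np.vstack([u[:1],  u[:-1]])
        down  = np.vstack([u[1:],  u[-1:]])
        left  = np.hstack([u[:, :1], u[:, :-1]])
        right = np.hstack([u[:, 1:], u[:, -1:]])
        # Per-pixel update derived from the 5-point Laplacian stencil.
        u = (f + lam * (up + down + left + right)) / (1.0 + 4.0 * lam)
    return u

smoothed = jacobi_sweeps(np.random.rand(120, 160).astype(np.float32))
```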


The most versatile method for dense depth estimation which can be performed entirely by the GPU is scanline optimization, as described in Chapter 7. Conceptually, the technique described in this chapter extends the plane-sweep method from Chapter 4 with a semi-global depth extraction technique. The key innovation in this chapter is the formulation of a specific dynamic programming approach to depth estimation in a manner suitable for the programming model of GPUs. Although the time complexity after the transformation is O(N log N) instead of O(N), the observed timing results are promising. The core method from this chapter is presented in [Zach et al., 2006b].

The final algorithmic contribution of this thesis, discussed in Chapter 8, is a volumetric approach to generate proper 3D models from multiple depth maps at interactive rates. The final 3D model is represented implicitly as an isosurface in a scalar volume dataset, and the corresponding mesh geometry can be extracted using marching cubes or marching tetrahedra methods. Alternatively, the isosurface can be directly visualized from the volume data using recent methods of volume visualization. A condensed version of this chapter appeared in [Zach et al., 2006a].
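A minimal CPU sketch of volumetric range image integration in the spirit of [Curless and Levoy, 1996] is given below. It is not the Chapter 8 implementation; the grid layout, truncation distance and weighting are assumptions. Each depth map contributes a truncated signed distance along its viewing rays, and the per-voxel running weighted average defines the final surface as the zero level set of the volume.

```python
import numpy as np

def integrate_depth_map(tsdf, weights, grid_points, depth, K, R, t, trunc=0.05):
    """Fuse one depth map into a truncated signed distance volume.

    tsdf, weights: flat per-voxel accumulators (running weighted average).
    grid_points:   Nx3 voxel centers in world coordinates.
    depth:         HxW depth map of the current view (0 marks invalid pixels).
    K, R, t:       intrinsics and pose such that X_cam = R @ X_world + t.
    """
    cam = grid_points @ R.T + t            # voxel centers in camera coordinates
    z = cam[:, 2]
    proj = cam @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros(len(z))
    d[valid] = depth[v[valid], u[valid]]
    valid &= d > 0
    sdf = d - z                            # positive in front of the observed surface
    valid &= sdf > -trunc                  # ignore voxels far behind the surface
    sdf = np.clip(sdf / trunc, -1.0, 1.0)
    # Running weighted average; outliers of single depth maps average out.
    tsdf[valid] = (tsdf[valid] * weights[valid] + sdf[valid]) / (weights[valid] + 1)
    weights[valid] += 1
```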

Chapter 9 presents several multi-view datasets and the associated depth maps and models generated with the proposed methods. In the few cases where ground truth is available, a quantitative accuracy evaluation is provided as well.

Figure 1.3: Several steps in the reconstruction pipeline illustrated with a statue example. (a) and (b) are two source images out of 47 images in total. The result of GPU-based dense depth estimation for two views is shown in (c) and (d). Two views of the resulting mesh after volumetric depth image integration are given in (e) and (f). The final simplified and textured 3D geometry of the statue is displayed in (g) and (h).


Chapter 2

Related Work

Contents

2.1 Dense Depth and Model Estimation

2.2 GPU-based 3D Model Computation

2.1 Dense Depth and Model Estimation

There is a huge bibliography on the generation of depth images and dense geometry from multiple views, hence we focus on seminal work in this field. We divide the approaches to computational stereo into three subtopics for a better structure: at first, important publications dealing with the classical stereo setup consisting of two images with vertically aligned epipolar geometry are discussed. Subsequently, major approaches to depth estimation from multiple, not necessarily rectified images are presented. Finally, true multi-view methods generating a 3D model (and not just depth images) directly are briefly sketched. Note that computational stereo and depth estimation can be seen as a subtopic of the more general optical flow computation between images. The main difference between the former and optical flow is the reduced (one-dimensional) search space for stereo methods, since knowledge of the epipolar geometry is assumed. In order to obtain metric models the internal camera parameters are required to be known, too.

2.1.1 Computational Stereo on Rectified Images

The minimal requirement to obtain a depth map, or equivalently a 2.5D height field, solely from images is a pair of input images with a typically convergent view on the scene to be reconstructed. Many methods generating depth maps from such input data work on rectified images with aligned epipolar geometry, mostly for efficiency reasons, since vertically aligned epipolar lines allow efficient image dissimilarity calculations and the reuse of already computed values. Recent surveys of computational stereo methods are given in [Scharstein and Szeliski, 2002], [Faugeras et al., 2002] and [Brown et al., 2003]. Additionally, in [Scharstein and Szeliski, 2002] an evaluation framework is proposed, which is still widely used to compare stereo methods in terms of their ability to recover the true geometry.

Many depth estimation methods typically perform the following four subsequent steps to constitute a depth map (after [Scharstein and Szeliski, 2002]); a minimal code sketch of the first three steps follows the list:

1. matching cost (i.e. image dissimilarity score) computation;

2. an aggregation procedure to accumulate the matching costs within some region;

3. depth map extraction;

4. and an optional refinement of the depth map.
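The following sketch runs steps 1 to 3 with a block-wise SAD cost and a purely local winner-takes-all extraction. It is an illustration only, under the assumption of rectified input with horizontal disparities; the window size and the dissimilarity measure are arbitrary choices, not a particular published method.

```python
import numpy as np

def wta_disparity(left, right, max_disp=32, radius=3):
    """Steps 1-3 of the taxonomy: per-pixel absolute-difference cost,
    box aggregation, and winner-takes-all disparity extraction.

    left, right: rectified grayscale images (float arrays of equal shape);
    disparities are assumed to be horizontal shifts for this illustration.
    """
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf, dtype=np.float32)
    kernel = np.ones(2 * radius + 1, dtype=np.float32)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])        # step 1: matching cost
        # step 2: aggregate within a (2*radius+1)^2 window via separable sums
        agg = np.apply_along_axis(np.convolve, 0, diff, kernel, 'same')
        agg = np.apply_along_axis(np.convolve, 1, agg, kernel, 'same')
        cost[d, :, d:] = agg
    return cost.argmin(axis=0)                               # step 3: winner takes all
```

The global methods discussed below replace the final argmin by an optimization over the whole cost volume.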

Often, the first two steps cannot be separated, e.g. if the utilized matching score is already based on some measure involving pixel neighborhoods. The major difference between the various computational stereo approaches lies in the method of depth map extraction given the matching cost data structure. Purely local methods apply a very greedy winner-takes-all approach, which assigns the depth value with the lowest matching cost to each pixel. Global methods for depth map extraction apply an optimization procedure which takes the matching scores and the spatial smoothness of the depth map into account. Smoothness is typically modeled by a regularization function, which takes the depth values assigned to adjacent pixels as input and yields a (positive) penalty value for unequal depths. If smoothness of the depth map is enforced only on vertical scanlines (which coincide with the epipolar lines), very efficient and elegant algorithms based on the dynamic programming principle can be devised. Earlier work includes [Baker and Binford, 1981, Ohta and Kanade, 1985, Geiger et al., 1995, Birchfield and Tomasi, 1998]. Although dynamic programming approaches to stereo have been known for a long time, there is still ongoing research on this topic [Veksler, 2003, Criminisi et al., 2005, Hirschmüller, 2005, Hirschmüller, 2006, Lei et al., 2006]. A more detailed discussion of one employed dynamic programming approach to stereo and its GPU-based implementation is provided in Chapter 7.
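To make the dynamic programming recurrence concrete, the sketch below optimizes the labels of a single scanline with a constant penalty P for every label change. This is a textbook-style illustration rather than the formulation used in Chapter 7; the cost layout and the penalty model are assumptions.

```python
import numpy as np

def scanline_dp(cost, P=0.2):
    """Optimal depth labels along one scanline.

    cost: L x D array; cost[i, d] is the matching cost of label d at position i.
    P:    constant penalty for a label change between neighboring positions.
    Returns the label sequence minimizing sum(cost) + P * (#label changes).
    """
    L, D = cost.shape
    acc = np.array(cost, dtype=float)       # accumulated costs
    back = np.zeros((L, D), dtype=int)      # backpointers to the predecessor label
    for i in range(1, L):
        prev = acc[i - 1]
        best_prev = prev.min()
        # Either keep the previous label (no penalty) or switch (penalty P).
        acc[i] += np.minimum(prev, best_prev + P)
        back[i] = np.where(prev <= best_prev + P, np.arange(D), prev.argmin())
    labels = np.empty(L, dtype=int)
    labels[-1] = acc[-1].argmin()
    for i in range(L - 1, 0, -1):           # backtrack the optimal path
        labels[i - 1] = back[i, labels[i]]
    return labels
```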

More recently, many proposed global methods for stereo focus on enforcing smoothness in both directions, not just within the same scanline. Since finding the true global optimum is not feasible, various approximation schemes have been presented in the literature. Largely, two lines of global optimization procedures have been applied successfully to stereo problems: maximum network flow methods (usually called graph-cut approaches in the computer vision literature [Boykov et al., 2001, Kolmogorov and Zabih, 2001, Kolmogorov and Zabih, 2002]), and Markov random field methods based on iterative belief updating (belief propagation [Sun et al., 2003, Felzenszwalb and Huttenlocher, 2004, Sun et al., 2005]). Although the depth maps obtained from these advanced procedures are generally better than those generated by dynamic programming methods, their time and space complexities are substantially higher than those of 1-dimensional optimization procedures.

Graph-cut methods are iterative procedures which update the current labeling (i.e. depth values in the stereo case) of pixels to obtain a lower total energy value. The initial depth labeling can be computed e.g. by purely local stereo methods. In every iteration a greedy, but large∗ relabeling of pixels is determined, which yields the lowest total energy. A suitable graph network is built in every iteration, and the maximum flow solution corresponds to an optimal greedy relabeling. These iterations are repeated until a (strong) local minimum is reached.

∗ Meaning that the subset of pixels with a newly assigned label is as large as possible.

While dynamic programming, belief propagation and graph cut approaches to computational stereo treat the underlying energy minimization problem as a combinatorial problem with a discrete set of pixels and disparity labels, it is nevertheless possible to employ variational methods, developed to solve problems on a continuous domain, for stereo vision. Since many of the proposed variational approaches for multi-view reconstruction are typically formulated for a general multiple view setup, these methods are discussed below in Section 2.1.2.

The depth maps returned by any of the above-mentioned methods may still contain wrong depth values for certain pixels, e.g. due to occlusions, specular reflections etc. These mismatches can potentially be detected by a very simple left-right consistency check [Fua, 1993] (also called bidirectional matching or back-matching). This technique reverses the roles of the input images and generates two depth maps (one with respect to the first image and one with respect to the second image). Only depth values for pixels which agree in both depth maps (according to some metric) are retained.
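For rectified disparity maps this consistency test can be sketched as follows (an illustration only; the agreement threshold is a hypothetical parameter):

```python
import numpy as np

def left_right_check(disp_left, disp_right, max_diff=1.0):
    """Invalidate pixels whose left and right disparities disagree.

    disp_left[y, x] maps pixel x of the left image to x - d in the right image;
    disp_right is computed with the roles of the images reversed.
    Returns a copy of disp_left with inconsistent pixels set to -1.
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    x_right = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    back = disp_right[ys, x_right]            # disparity seen from the other view
    checked = disp_left.copy()
    checked[np.abs(disp_left - back) > max_diff] = -1
    return checked
```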

2.1.2 Multi-View Depth Estimation

In this section we summarize work on dense depth estimation from multiple, but usually still small-baseline views. In general, more than two views cannot be rectified in order to simplify and accelerate the depth estimation procedure. Since small baselines between the images are assumed, explicit or implicit occlusion detection and handling strategies are possible. Implicit occlusion handling approaches typically use truncated matching scores or multiple scores between pairs of images to reduce the influence of occluded pixels on the estimation procedure (e.g. [Woetzel and Koch, 2004] and Chapter 4).

Several approaches developed for a multi-view setup utilize variational methods to search for a 3D surface or depth map that is color-consistent with the provided input images. A hypothetical surface or depth map (together with the known epipolar geometry between the views) induces a (nonlinear) 2D transfer between the images. If the correct depth map is found, all warped source images are very similar according to a provided image similarity metric. Additionally, surface smoothness is assumed if the image data is ambiguous (i.e. lacking sufficient texture). Variational approaches to multi-view stereo formulate the reconstruction problem as a continuous energy optimization task and apply methods from the variational calculus (most notably the Euler-Lagrange equation) to determine a suitable gradient descent direction in function space. The current mesh (or depth map hypothesis) is updated according to this direction until convergence. All variational methods for stereo employ a coarse-to-fine strategy to avoid reaching a weak local minimum in early stages of the procedure.

If a surface is evolved within a variational framework to obtain a final mesh consistent with the images, an implicit level-set representation of the current mesh hypothesis allows simple handling of topological changes of the mesh [Faugeras and Keriven, 1998, Yezzi and Soatto, 2003, Pons et al., 2005]. Generating depth images instead of meshes from multiple views within a continuous framework yields a set of partial differential equations, which are numerically solved to obtain the final depth map [Strecha and Van Gool, 2002, Strecha et al., 2003, Slesareva et al., 2005]. Chapter 6 describes depth estimation using variational principles more precisely and presents an efficient GPU-based implementation of one particular approach.

Combinatorial and graph optimization methods can be applied in the multi-view stereo case as well: Kolmogorov et al. [Kolmogorov and Zabih, 2002, Kolmogorov et al., 2003] employ graph-cut optimization to obtain a depth map from multiple views. In addition to image similarity and smoothness terms, the energy function is augmented with an explicit visibility term derived from the current depth map.

2.1.3 Direct 3D Model Reconstruction

This section outlines several approaches for multi-view reconstruction targeted at using all available images from different viewpoints simultaneously. Early methods include space carving and its variants, which project 3D voxels into the available images according to the current visibility and calculate an image consistency score from the sampled pixels. If a voxel is declared as inconsistent, it is classified as empty and the current model and visibility information are updated. The variants of the basic space carving principle mostly differ in their employed consistency function and the voxel traversal order [Seitz and Dyer, 1997, Prock and Dyer, 1998, Seitz and Dyer, 1999, Culbertson et al., 1999, Kutulakos and Seitz, 2000, Slabaugh et al., 2001, Sainz et al., 2002, Stevens et al., 2002] (see also Chapter 5). All space carving methods compute the so-called photo hull (the set of image-consistent voxels), which typically contains the true geometry, but in practice the photo hull can be a substantial over-estimate of the true model. Textureless regions in particular yield poor photo hulls because of the absence of a smoothing force.

In order to address the shortcomings of pure space carving methods with their instant classification of voxels, volumetric graph cut extraction of surface voxels incorporating image consistency and smoothness constraints was recently proposed [Vogiatzis et al., 2005, Tran and Davis, 2006, Hornung and Kobbelt, 2006b, Hornung and Kobbelt, 2006a]. Since individual voxels essentially correspond to nodes in the network graph used to determine the maximum flow, these methods still rely on existing object silhouettes in order to consider only voxels close to the visual hull. Additionally, approximate visibility is inferred from the visual hull to determine occluded views for each voxel.

Instead of a direct, one-pass reconstruction approach from multiple views, one can utilize a two-pass method, which at first generates a set of depth images from small-baseline subsets of the provided source views, and subsequently creates a full 3D model by merging the depth maps. Goesele et al. [Goesele et al., 2006] employ a simple plane-sweep based depth estimation approach followed by a volumetric range image integration procedure [Curless and Levoy, 1996] to obtain the final 3D model. Only relatively confident depth values are retained in the depth maps, hence the final model may still contain holes, e.g. in textureless regions. Additionally, the range image integration is based on weighted depth values with the weights induced from the corresponding matching score. This approach is very similar to our purely GPU-based reconstruction pipeline comprising the methods presented in Chapter 4 and Chapter 8 (see also [Zach et al., 2006a]). In contrast to volumetric graph cut methods, which generate watertight surfaces, the result of the purely locally working volumetric range image method may contain holes, which can be geometrically filled e.g. using volumetric diffusion processes [Davis et al., 2002].

2.2 GPU-based 3D Model Computation

2.2.1 General Purpose Computations on the GPU

Because of the rapid development and performance increase of current 3D graphics hardware, the goal of using graphics processing units for non-graphical purposes became appealing. The SIMD design of graphics hardware allows much higher peak performance in certain applications than is achievable with a general purpose CPU. Whereas a traditional CPU like a 3 GHz Pentium 4 achieves a theoretical performance of 6 GFlops and a memory bandwidth of about 6 GByte/sec, a high-end graphics card such as an NVidia GeForce 6800 achieves 53 GFlops at 34 GByte/sec [Harris and Luebke, 2005]. Furthermore, the annual increase in performance of graphics processing units is significantly higher than that of CPUs. In contrast to the MIMD programming model of traditional processing units, the computational model of GPUs is a stream processing approach applying the same instructions to multiple data items. Consequently, existing CPU-based algorithms must be mapped onto this computational model, and not every algorithm can benefit from the processing power of the GPU.

Since the emergence of programmable graphics hardware in the year 2001, a huge number of research papers has addressed the acceleration of known algorithms and numerical methods using the GPU as a specialized but fast coprocessor. In this section we only refer to seminal work in this area.

At first we give a brief overview of the computational model of GPU-based computations (Figure 2.1). The incoming vertex stream with several attributes per vertex (vertex position, color, texture coordinates) is processed by a vertex program and transformed into normalized screen space. A set of three vertices constitutes a triangle, which is prepared for the rasterization step. The rasterizer generates fragments and interpolates vertex attributes. An optional fragment program takes the incoming fragments and may perform additional calculations, thereby modifying the outgoing fragment color and depth. The blending stage performs optional alpha blending and combines several fragment samples into one pixel if multi-sampling based antialiasing is enabled. Fragment programs, and recently vertex programs as well, can perform texture lookups to retrieve arbitrary image data.

Figure 2.1: The stream computation model of a GPU (adapted from [Harris and Luebke, 2005]). The vertex stream is transformed by the vertex program, assembled and clipped into screen-space triangles, rasterized into an unprocessed fragment stream, processed by the fragment program, and blended into framebuffer pixels forming the output image; both the vertex and the fragment program can access textures.

Most applications using the GPU as a general purpose SIMD processor employ the fragment shaders to perform computational tasks, since most of the processing power of modern graphics hardware is concentrated in the fragment units. Additionally, direct and dependent texture lookups provided by fragment shaders constitute a powerful instrument for data array access. Consequently, general purpose computing on the GPU focuses on the second row of the pipeline depicted in Figure 2.1 (notably fragment programs and blending). Textures act as data array sources, on which the same set of instructions is applied. The resulting fragments represent the calculated outcome of these computations. Hence, in most applications a screen-aligned quadrilateral with appropriate texture coordinates is drawn and the requested computation is performed entirely in the fragment processing units.

Vertex and fragment programs are specified in an assembly-like language in the first instance. Several higher level specification languages for vertex and fragment programs were developed to ease the development of GPU programs. A commonly used language for visual effects and general purpose programming on the GPU is Cg [NVidia Corporation, 2002a, Mark et al., 2003], which provides a C-like specification language for GPU programs and a compiler for translation to the native instruction set of graphics hardware. Brook is a language designed specifically for parallel numerical algorithms [Dally et al., 2003], and an implementation is now available for current programmable graphics hardware [Buck et al., 2004]. The two main concepts of Brook (and of parallel numerical approaches in general) are kernels and reductions. A kernel is a procedure applied to a large set of data items and represents a more powerful version of a SIMD instruction. Since the computation of a kernel only depends on the incoming data and a kernel has no additional side-effects, a kernel can be executed for many data values in parallel. Application of a kernel is similar to the higher-order map function found in most functional programming languages. A reduction operation combines the elements of a data array to generate a single result. In functional programming this operation corresponds to the (again higher-order) fold function. On graphics hardware kernels correspond mainly to fragment programs and can be applied in a straightforward manner. Reductions require a rather expensive multipass procedure based on recursive doubling with a logarithmic number of passes.
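The recursive doubling scheme can be sketched on the CPU as follows; this is not Brook or shader code, but on the GPU each halving pass would correspond to rendering a half-sized quad whose fragment program combines two texels of the previous pass.

```python
import numpy as np

def reduce_by_recursive_doubling(values, combine=np.add):
    """Reduce a 1D array with a logarithmic number of halving passes.

    Each pass combines pairs of elements independently of all other pairs,
    mirroring one GPU render pass at half the previous resolution.
    """
    buf = np.asarray(values, dtype=np.float64)
    passes = 0
    while buf.size > 1:
        if buf.size % 2:                       # pad odd-sized buffers
            buf = np.append(buf, 0.0 if combine is np.add else buf[-1])
        buf = combine(buf[0::2], buf[1::2])    # one half-resolution pass
        passes += 1
    return buf[0], passes

total, n_passes = reduce_by_recursive_doubling(np.arange(1000))   # 499500.0 in 10 passes
```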

Because of the close relationship between the computational model of modern GPUs and general stream processing concepts, similar benefits and limitations for algorithm implementations can be found in both models. Nevertheless, there are significant differences between general stream processors and graphics hardware: in contrast to general parallel programming and stream computation models, a GPU only provides very limited support for scatter operations (i.e. indexed array updates) and other general purpose operations (e.g. bit-wise integer manipulation). On the other hand, linearly filtered data access is performed very efficiently by the GPU, since this is an intrinsic feature of the texture units. In spite of these (and many other) differences between stream processing models and modern GPUs, essentially the same set of algorithms can be accelerated by both architectures.

Even before programmable graphics hardware was available, the fixed function pipeline of 3D graphics processors was utilized to accelerate several numerical [Hopf and Ertl, 1999a, Hopf and Ertl, 1999b] and geometric calculations [Hoff III et al., 1999, Krishnan et al., 2002] and even to emulate programmable shading not available at that time [Peercy et al., 2000]. The introduction of a quite general programming model for vertex and pixel processing [Lindholm et al., 2001, Proudfoot et al., 2001] opened a very active research area. The primary application for programmable vertex and fragment processing is the enhancement of photorealism and visual quality in interactive visualization systems (e.g. [Engel et al., 2001, Hadwiger et al., 2001]) and entertainment applications ([Mitchell, 2002, NVidia Corporation, 2002b]). Additionally, several non-photorealistic rendering techniques can be effectively implemented in modern graphics hardware [Lu et al., 2002, Mitchell et al., 2002, Weiskopf et al., 2002, Dominé et al., 2002].

Thompson et al. [Thompson et al., 2002] implemented several non-graphical algorithms to run on programmable graphics hardware and profiled the execution times against CPU-based implementations. They concluded that an efficient memory interface (especially when transferring data from graphics memory into main memory) is still an unsolved issue. For the same reason our implementations are designed to minimize the memory traffic between graphics hardware and main memory.

Naturally, the texture handling capability and especially the free bilinear and accelerated anisotropic texture fetch operations make graphics hardware suitable for image processing tasks, e.g. filtering with linear kernels. Sugita et al. [Sugita et al., 2003] and Colantoni et al. [Colantoni et al., 2003] compared the performance of CPU-based and GPU-based implementations of several image filters and image transforms, and observed substantial performance gains using the GPU over optimized CPU implementations.

Numerical methods and simulations became feasible on the GPU with the emergence of floating point texture capabilities, which enable the specification and handling of floating point values on the GPU (instead of the 8 bit fixed point precision provided before). Numerical solvers for sparse matrix equations were proposed by Bolz et al. [Bolz et al., 2003] and by Krüger and Westermann [Krüger and Westermann, 2003]. Note that the system matrices appearing in variational methods for optical flow and depth estimation are huge, but sparse matrices with usually 4 or 8 off-diagonal bands. Consequently, variational methods exploiting the computational power of modern GPUs are now feasible and substantially outperform CPU-based implementations. Of course, the limited floating point precision of current GPUs (essentially an IEEE 32 bit float format) is an obstacle to high precision numerical computations. Actual numerical or physical simulations are described in [Harris et al., 2002, Kim and Lin, 2003, Lefohn et al., 2003, Goodnight et al., 2003, Moreland and Angel, 2003].

2.2.2 Real-time and GPU-Accelerated Dense Reconstruction from Multiple Images

In this section we focus on multi-view reconstruction methods that are either aimed at real-time execution or use programmable 3D graphics hardware to accelerate the depth estimation procedure.

Vision-based dense depth estimation methods performing at interactive rates or even in real-time were initially implemented using special hardware and digital signal processors [Faugeras et al., 1996, Kanade et al., 1996, Konolige, 1997, Woodfill and Herzen, 1997, Jia et al., 2003, Darabiha et al., 2003]. With the appearance of SIMD instruction sets like MMX and SSE, primarily intended for multimedia applications on general purpose CPUs, several implementations targeted the efficient use of these extensions for computational stereo applications [Mühlmann et al., 2002, Mulligan et al., 2002, Forstmann et al., 2004]. The basic ideas of high performance CPU depth estimation methods include a cache friendly design of the algorithm to minimize CPU pipeline stalls, and exploiting the SIMD functionality e.g. by rating four disparity values simultaneously. All these approaches usually work with very simple image similarity measures like the SSD or SAD.

The Triclops vision system [Point Grey Research Inc., 2005] is a commercially available real-time stereo implementation. Typically the setup consists of two or three cameras and appropriate software for real-time stereo matching. Depending on the image resolution and the disparity range, the system is able to generate depth images at a rate of about 30 Hz for images of 320x240 pixels on current PC hardware. The software exploits the particular L-shaped arrangement of the cameras and the MMX/SSE instructions available on current CPUs.

Probably the first multi-view depth estimation approach executed on programmable graphics hardware was presented by Yang et al. [Yang et al., 2002], who developed a fast stereo reconstruction method performed in 3D hardware by utilizing a plane-sweep approach to find correct depth values. The proposed method uses the projective texturing capabilities of 3D graphics hardware to project the given images onto the reference plane. Furthermore, single pixel error accumulation for all given views is performed on the GPU as well. The number of iterations is linear in the requested resolution of depth values, therefore this method is limited to rather coarse depth estimation in order to fulfill the real-time requirements of their video conferencing application. Further, their approach requires a true multi-camera setup to be robust, since the error function is only aggregated in single pixel windows. Since the application behind this method is a multi-camera teleconferencing system, accuracy is less important than real-time behavior. In later work the method was made more robust using trilinear texture access to accumulate error differences within a window [Yang and Pollefeys, 2003]. These ideas were later reused and improved to obtain a GPU-based dense matching procedure for a rectified stereo setup [Yang et al., 2004].

The basic GPU-based plane-sweep technique for depth estimation can be enhanced with implicit occlusion handling and smoothness constraints to obtain depth maps of higher quality. Woetzel and Koch [Woetzel and Koch, 2004] addressed occlusions occurring in the source images by a best n out of m and by a best half-sequence multi-view selection policy to limit the impact of occlusions on the resulting depth map. In order to obtain sharper depth discontinuities a shiftable correlation window approach was utilized. The employed image similarity measure is a truncated sum of squared differences, which is sensitive to changing lighting conditions.

Cornelis and Van Gool [Cornelis and Van Gool, 2005] proposed several refinement steps performed after a plane-sweep procedure used to obtain an initial depth map with a single pixel truncated SSD correlation measure. Outliers in the initially obtained depth map are removed by a modified median filtering procedure, which may destroy fine 3D structures. These fine details are recovered by a subsequent depth refinement pass. Since this approach is based on single pixel similarity instead of a window-based one, slanted surfaces and depth discontinuities are reconstructed more accurately compared with window-based approaches.


Typically, the correlation windows used in real-time dense matching have a fixed size, which causes inaccuracies close to depth discontinuities. Since large depth changes are often accompanied by color or intensity changes in the corresponding image, adapting the correlation window to extracted edges is a reasonable approach. Gong and Yang [Gong and Yang, 2005a] investigated a GPU-based computational stereo procedure with an additional color segmentation step to increase the quality of the depth map near object borders.

A GPU-based plane-sweeping technique suitable for sparse 3D reconstructions was presented by Rodrigues and Fernandes [Rodrigues and Ramires Fernandes, 2004]. They used projective texturing hardware to map rays going through interest points into the other views according to the epipolar geometry. In contrast to the dense depth plane-sweeping methods, a true multi-view configuration of the cameras can be used. The result of the procedure is a sparse 3D point cloud corresponding to 2D interest points seen in several input images.

For several applications, e.g. video teleconferencing and mixed reality applications, it is sufficient to reconstruct the visual hull, which is the intersection of the generalized cones generated by the silhouette of the object and the optical center of each camera. Even with the non-programmable traditional graphics pipeline, real-time generation and rendering of visual hulls can be accelerated by 3D graphics hardware. Lok [Lok, 2001], Matusik et al. [Matusik et al., 2001] and Li et al. [Li et al., 2003] present on-line visual hull reconstruction systems mostly aimed at video conferencing and mixed reality applications. In order to improve the visual quality of the reconstructed models, the visual hull can be upgraded with depth information generated by computational stereo algorithms [Slabaugh et al., 2002, Li et al., 2002].

Li et al. [Li et al., 2004] present a method for GPU-based photo hull generation used for viewpoint interpolation, which is in some aspects similar to the material presented in Chapter 5. Essentially, their work combines the plane-sweep approach proposed by Yang [Yang et al., 2002] with the visibility handling used in the space carving framework [Seitz and Dyer, 1997, Kutulakos and Seitz, 2000]. In contrast to our approach, only depth maps suitable for view interpolation are generated, whereas our approach creates proper 3D models as obtained by other voxel coloring and space carving techniques.

Recently, Gong and Yang [Gong and Yang, 2005b] implemented a dynamic programming approach to computational stereo with a simple discontinuity cost model on the GPU and achieved at least interactive rates. In contrast to the other GPU-based depth estimation methods, this approach belongs to the category of global matching procedures (as opposed to the winner-takes-all local methods). Although their framework can be implemented entirely on the GPU, they report higher performance using a hybrid CPU/GPU approach, in which the dynamic programming step is performed on the CPU. Currently, GPU-based global methods for disparity assignment are slowly emerging in the literature. Dixit et al. [Dixit et al., 2005] present a GPU implementation of a graph cut optimization method called GPU-cut used for image segmentation. Since graph cut based approaches to computational stereo are highly successful, further investigations of GPU-cut for dense stereo are expected.

Mairal and Keriven [Mairal and Keriven, 2006] propose a GPU-based variational stereo framework, which iteratively refines a 3D mesh hypothesis until convergence. The basic framework and goals are similar to our system presented in Chapter 3. A variational multi-view approach for 3D reconstruction using graphics hardware is proposed by Labatut et al. [Labatut et al., 2006], which uses a level-set approach to deform an initial mesh to match the image similarity constraint. The authors reported a performance speedup by a factor of approximately four compared with their CPU implementation. The overall time required to obtain the final model using a 128³ volumetric grid is about 5 to 7 minutes depending on the dataset.

Loopy belief propagation with its basically parallel message update scheme is ostensibly an ideal candidate for GPU-based methods: Brunton and Shu [Brunton and Shu, 2006] and Yang et al. [Yang et al., 2006] describe implementations utilizing the GPU. The main disadvantage of belief propagation is the huge memory consumption for large images and depth resolutions, requiring either a limited depth range [Brunton and Shu, 2006] or a limited image resolution [Yang et al., 2006]. Additionally, the purely parallel (synchronous) message update feasible on the GPU converges more slowly than the sequential update available on the CPU [Tappen and Freeman, 2003].


Chapter 3

Mesh-based Stereo Reconstruction Using Graphics Hardware

3.1 Introduction

This chapter describes a computational stereo method generating a 2.5D height-field, represented as a triangular mesh, from a pair of images with known relative pose. The key idea is a generate-and-test approach, which successively modifies a mesh hypothesis and evaluates an image correlation measure to rate the refined hypothesis. The current 3D mesh geometry and the relative pose between the images can be used to generate virtual views of the source images with respect to one particular view. The generated images of the virtual views should match closely if the correct 3D geometry is found.

The procedure works iteratively: mesh modifications resulting in better image correlation are kept, whereas mesh variations lowering the image similarity are discarded. These iterations are embedded in a coarse-to-fine framework to avoid convergence to purely local minima. This procedure can be seen as a simple and discrete formulation of a variational, mesh-based dense stereo approach.

The virtual view generation and the subsequent image similarity calculation are performed by programmable graphics processing units. In contrast to several GPU-based 3D reconstruction methods described in the following chapters, the feature set required from the GPU for this method is very small. Consequently, the proposed stereo approach described in this chapter works on early generations of programmable graphics hardware.

Unlike the approaches proposed in later chapters, this approach still uses a mixed computation model, employing the GPU for many portions of the procedure but nevertheless relying on CPU-based computations in some aspects. Essentially, only those parts of the method which can be efficiently implemented on DirectX 8.1 class GPUs are accelerated by graphics hardware.∗ The proposed approach in this chapter substantially exploits the main capabilities of graphics hardware by repeated rendering of multi-textured mesh geometry for virtual view generation. Virtual view creation induces a non-linear deformation of the source image, hence we refer to this operation as the image warping procedure.

∗ DirectX 8.1 class GPUs provide relatively powerful vertex shaders, but only very limited pixel shaders with a small number of instructions are available. Additionally, floating point accuracy for textures and pixel shaders is not supported.

3.2 Overview of Our Method

The input for our procedure consists of two gray-scale images with known relative pose and camera calibration suitable for stereo reconstruction, and a coarse initial mesh to start with. This mesh can be based on a sparse reconstruction obtained by the relative orientation procedure (e.g. a mesh generated from a sparse set of corresponding points by some triangulation). In our experiments we use a planar mesh as the starting point for dense reconstruction. One image of the stereo pair is referred to as the key image, whereas the other one is denoted as the sensor image.† Consequently, the cameras (resp. their positions) are designated as the key camera and the sensor camera.

† There is no unique fixed convention to denote the roles of the two views. Sometimes the images are called master and slave views to indicate the key resp. the sensor view. In medical image processing the notions of template and moving image are very common.

The overall idea of the dense stereo procedure is that, if the current mesh hypothesis corresponds to the true model, the appropriately warped sensor image virtually created for the key camera position resembles the original key image. This similarity is quantified by some suitable error metric on images, which is the sum of absolute difference values in our current implementation. Modifying the current mesh results in different warped sensor images with potentially higher similarity to the key image (see Figure 3.1). The current mesh hypothesis is iteratively refined to generate and evaluate improved hypotheses. The huge space of possible mesh hypotheses can be explored efficiently, since local mesh refinements have only local impact on the warped image; therefore many local modifications can be applied and evaluated in parallel.

The matching procedure consists of three nested loops:<br />

1. The outermost loop determines the mesh <strong>and</strong> image resolutions. In every iteration<br />

the mesh <strong>and</strong> image resolutions are doubled. The refined mesh is obtained by linear<br />

(<strong>and</strong> optionally median) filtering of the coarser one. This loop adds the hierarchical<br />

strategy to our method.<br />

2. The inner loop chooses the set of vertices to be modified <strong>and</strong> updates the depth<br />

values of these vertices after per<strong>for</strong>ming the innermost loop.<br />

3. The innermost loop evaluates depth variations <strong>for</strong> c<strong>and</strong>idate vertices selected in the<br />

enclosing loop. The best depth value is determined by repeated image warping<br />

<strong>and</strong> error calculation wrt. the tested depth hypo<strong>thesis</strong>. The body of this loop runs<br />

entirely on 3D graphics hardware.<br />

with a small number of instructions are available. Additionally, floating point accuracy <strong>for</strong> textures <strong>and</strong><br />

pixel shaders is not supported.<br />

† There is no unique fixed convention to denote the role of the two views. Sometimes the images are<br />

called master <strong>and</strong> slave views to indicate the key resp. the sensor view. In medical image processing the<br />

notion of template <strong>and</strong> moving image are very common.
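The following Python sketch illustrates this loop nesting only structurally; the helper callables (evaluate, upsample, the block partition and the offset schedule) are hypothetical placeholders and not part of the actual GPU implementation.

def hierarchical_matching(mesh, levels, blocks, offsets, evaluate, upsample):
    # mesh: mapping vertex -> depth value
    # evaluate(mesh, block, offset): per-vertex error for the tested depth offset
    # upsample(mesh): mesh at doubled resolution (optionally median filtered)
    for level in range(levels):                      # outermost loop: resolutions
        for block in blocks:                         # middle loop: vertex blocks
            best = {v: (float("inf"), 0.0) for v in block}
            for offset in offsets(level):            # innermost loop: depth offsets
                errors = evaluate(mesh, block, offset)
                for v, err in errors.items():
                    if err < best[v][0]:
                        best[v] = (err, offset)
            for v, (_, offset) in best.items():      # apply the best offset per vertex
                mesh[v] += offset
        if level + 1 < levels:
            mesh = upsample(mesh)
    return mesh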


Figure 3.1: Mesh reconstruction from a pair of stereo images (sketched elements: key camera, secondary camera, camera ray, mesh vertex, tested displacement, mesh to reconstruct). Vertices of the current mesh hypothesis are translated along the back-projected ray of the key camera. The image obtained from the sensor camera is warped onto the mesh and the effect in the local neighborhood of the modified vertex is evaluated.

To perform image warping, the current mesh hypothesis is rendered like a regular height-field as illustrated in Figure 3.2. As can be seen in Figure 3.3, a change of the depth value of one vertex influences only a few adjacent triangles. Therefore one fourth of the vertices can be modified simultaneously without affecting each other. The optimization procedure to minimize the error between the key image and the warped image is a sequence of determining the best depth values for alternating fractions of the mesh vertices. Since the vertices of the grid are numbered such that vertices which are modified and evaluated in the same pass comprise a connected block (Figure 3.4), we denote the fraction of vertices to change as a block.

In every step the depth values of one fourth of the vertices are modified, and the local error between the key image and the warped image in the affected neighborhood of each vertex is evaluated. For every modified vertex the best depth value is determined and the mesh is updated accordingly. The procedure to calculate and update error values for modified vertices is outlined in Figure 3.5.

3.2.1 Image Warping and Difference Image Computation

Since the vertices of the mesh are moved along the back-projected rays of the key camera, the mesh as seen from the first camera is always a regular grid and mesh modifications do not distort the key image. The appearance of the sensor image as seen from the key camera depends on the mesh geometry.


Figure 3.2: The regular grid as seen from the key camera. This grid structure allows fast rendering of the mesh using triangle strips with only one call. The marked vertices comprise one block. These vertices are shifted along the back-projected ray and evaluated simultaneously in every iteration.

Figure 3.3: The neighborhood of a currently evaluated vertex (modified vertex, affected triangles, accumulated neighborhood). Moving this vertex along the back-projected ray affects only the 6 shaded triangles. The actual error for this vertex is calculated over the enclosing rectangle, which is still disjoint from the neighborhoods of all other tested vertices.

From the 3D positions of the current mesh vertices and the known relative orientation between the cameras, it is easy to use automatic texture coordinate generation with appropriate coefficients to perform the image warping step. To minimize updates of mesh geometry we use our own vertex program to calculate texture coordinates for the sensor image. This vertex shader is described in more detail in Section 3.3.1.

3.2.2 Local Error Summation

After the difference between the key image and the warped image is computed and stored in a pixel buffer, we need to accumulate the error in the neighborhoods of the modified vertices. In order to sum the values within a rectangular window, we employ a variant of a recursive doubling scheme. The required modification of the recursive approach concerns the encoding and accumulation of larger integer values when only traditional 8 bit color channels are available (see Section 3.3.3). Essentially, we perform a repeated downsampling procedure, which sums up four adjacent pixels into one resulting pixel. The target pixel buffer has half the resolution of the source buffer in every dimension. If one vertex is located every four pixels, the downsampling is performed three times to sum the error in an 8 by 8 pixel window.

Figure 3.4: The correspondence between vertex indices and grid positions; the vertices are partitioned into Block 0 to Block 3.

Note that only 2^n × 2^n error values are computed for a mesh with (2^n + 1) × (2^n + 1) vertices. Vertices at the right and lower edges of the grid do not have an associated error value. For these vertices we propagate the depth values from the left resp. upper neighbors.
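As an illustration of this accumulation step, the following numpy sketch (CPU-side only, not the GPU implementation) performs the repeated 2x2 downsampling; three passes accumulate the error over an 8x8 window per output cell.

import numpy as np

def downsample_sum(img):
    # sum each 2x2 block of pixels into one output pixel
    h, w = img.shape
    return (img[0:h:2, 0:w:2] + img[1:h:2, 0:w:2] +
            img[0:h:2, 1:w:2] + img[1:h:2, 1:w:2])

def window_error(diff_image, passes=3):
    acc = diff_image.astype(np.int64)
    for _ in range(passes):
        acc = downsample_sum(acc)
    return acc   # one summed error value per 8x8 window for passes=3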

3.2.3 Determining the Best Local Modification

If δ denotes the largest allowed depth change, then the tested depth variations are sampled regularly from the interval [−δ, δ]. To minimize the amount of data that needs to be copied from graphics memory to main memory, we do not directly read back the local errors to determine the best local modification in software. Instead, we store the currently best local error and the corresponding index in a texture and update these values within an additional pass. These values are read back only after all depth variations for one block of vertices are evaluated.

3.2.4 Hierarchical Matching

In order to avoid local optima during dense matching we utilize a hierarchical approach. The coarsest level consists of a mesh with 9 by 9 vertices and an image resolution of 32 by 32 pixels. The initial model comprises a planar mesh with approximately correct depth values known from the points of interest generated by the relative pose estimation procedure. After a fixed number of iterations the mesh calculated at the coarser level is upsampled (using a bilinear filter) and used as input to the next level. A median filter is optionally applied to the mesh to remove potential outliers, which are found especially in homogeneous image regions.

The largest allowed displacement for mesh vertices is decreased for higher levels to enable higher precision. It is assumed that the model generated at the previous level is already a sufficiently accurate approximation of the true model, and only local refinements to the mesh are required at the next level. In the current implementation we halve the largest evaluated depth variation when entering the next hierarchy level. The coarsest level starts with a maximum depth variation roughly equal to the distance of the object to the key camera.

Figure 3.5: The basic workflow of the matching procedure (block labels: key image, sensor image, absolute difference, sum of absolute differences, minimum calculation, old minimal error, new minimal error and optimal depth, update mesh hypothesis, range image). For the current mesh hypothesis a difference image between the key image and the warped sensor image is calculated in hardware. The error in the local neighborhood of the modified vertices is accumulated and compared with the previous minimal error value. The results of these calculations are the minimal error values (stored in the red, green and blue channels) and the index of the best vertex modification so far (stored in the alpha channel). All these steps are executed in graphics hardware and do not require transfer of large datasets between main memory and video memory.


3.3 Implementation

In this section we describe some aspects of our approach in more detail. Our implementation is based on OpenGL extensions available for the ATI Radeon 9700 Pro, namely VERTEX_OBJECT_ATI, ELEMENT_ARRAY_ATI, VERTEX_SHADER_EXT and FRAGMENT_SHADER_ATI [Hart and Mitchell, 2002]. These extensions are available on the Radeon 8500 and 9000 as well, therefore our method can be applied with these older (and cheaper) cards, too. For better readability we sketch the vertex program in Cg notation [NVidia Corporation, 2002a].

The major design criterion is to minimize the amount of data transferred between CPU memory and GPU memory. In particular, reading back data from the graphics card is very slow, therefore only absolutely necessary information is copied from video memory.

3.3.1 Mesh Rendering and Image Warping

For maximum performance we employ the VERTEX_OBJECT_ATI and ELEMENT_ARRAY_ATI OpenGL extensions to store mesh vertices and connectivity information directly in graphics memory. In every iteration one fourth of the vertices needs to be updated to test mesh modifications. In order to reduce memory traffic we update the mesh only after all modifications are evaluated and the best one is determined. The currently tested offset is a parameter of a vertex program that moves vertices along the camera ray as indicated by the given offset.

Additionally, the mesh vertices are ordered such that vertices modified in the same pass comprise a single connected block; therefore only one fourth of the vertex array object stored in video memory needs to be updated.

We sketch the vertex program that calculates the appropriate texture coordinates for the sensor image in Algorithm 1. The vertex attributes consist of the position and the block mask encoded in the primary color attribute. Program parameters common to all vertices are

1. the currently tested depth displacement for the active block,

2. a matrix M1 transforming pixel positions into back-projected rays of the key camera,

3. and a matrix M2 representing the transformation from the key camera into image positions of the sensor camera.

If a vertex belongs to block i, then the i-th component of the block mask attribute of this vertex is set to one; the other components are set to zero. If all vertices of block j are currently evaluated, the displacement, represented as a 4-component vector, has the current offset value at position j and zeros otherwise. Therefore the four-component dot product between the mask and the given displacement is either the displacement or zero, depending on whether the block numbers match.
depending whether the block numbers match.


Algorithm 1 The vertex program responsible for warping the sensor image. This vertex shader calculates appropriate texture coordinates for the second image based on the relative orientation of the cameras and the currently evaluated offset.

Procedure: Vertex program for sensor image warping
Input: Constant parameters: matrices M1 and M2, displacement (a 4-vector)
Input: Vertex attributes: position (homogeneous 3D position), mask (a 4-vector, provided in the associated vertex color)

  depth_old ← position.z
  {Inner product to determine the actual depth displacement}
  delta ← displacement · mask
  depth_new ← depth_old + delta
  {Back-project the pixel to obtain the corresponding ray of the key camera}
  ray ← M1 · position
  position_new ← depth_new · ray
  {Position on the 2D screen, to be transformed by the modelview-projection matrix}
  windowPosition ← (position.x, position.y, 0, 1)
  {Project the perturbed 3D position to obtain the final texture coordinate to sample the sensor image}
  texcoord ← M2 · position_new

If K1 and K2 are the internal parameters of the key resp. the sensor camera (arranged in an upper-triangular matrix) and O is the relative orientation between the cameras, i.e. O is the 4 × 4 matrix

O = \begin{pmatrix} R & t \\ 0^\top & 1 \end{pmatrix},

then M1 and M2 are calculated as follows:

M_1 = \begin{pmatrix} K_1^{-1} & 0 \\ 0^\top & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\quad and \quad
M_2 = \begin{pmatrix} 1/w & 0 & 0 & 0 \\ 0 & 1/h & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} K_2 & 0 \\ 0^\top & 1 \end{pmatrix} O,

where w and h represent the image width and height in pixels. If M1 is applied to a vector (x, y, ·, 1), the result is the direction (∆x, ∆y, 1, 1) of the camera ray going through the pixel at (x, y). This direction is scaled by the target depth value to obtain the vertex in the key camera space. Consequently, the vertex data for mesh points consists of vectors (x, y, z, 1), where (x, y) are the pixel coordinates in the key image and z is the current depth value. The obtained texture coordinates (s, t, q, q) for the sensor image are subject to perspective division prior to texture lookup. On current hardware perspective texture lookup is performed for every texel, hence the correct perspective projection (and warping) is achieved.

Additionally we remark that the texture coordinate transformation from one image to another cannot be accomplished by only one transformation matrix: in that case the depth changes would be applied in screen space, which maps world coordinates non-linearly due to perspective division.

The described image warping transformation can result in texture coordinates lying outside the sensor image. It is possible to explicitly ignore mesh regions outside the sensor image, but in our experience simple clamping of texture coordinates is sufficient in those cases.
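For illustration, the following numpy sketch (CPU side; the names K1, K2, R, t, w, h are assumed inputs, not identifiers from the actual implementation) assembles the two matrices in the form given above.

import numpy as np

def make_M1(K1):
    # select (x, y, 1, 1) from (x, y, ., 1), then back-project with K1^{-1}
    S = np.array([[1., 0., 0., 0.],
                  [0., 1., 0., 0.],
                  [0., 0., 0., 1.],
                  [0., 0., 0., 1.]])
    K1inv = np.eye(4)
    K1inv[:3, :3] = np.linalg.inv(K1)
    return K1inv @ S

def make_M2(K2, R, t, w, h):
    O = np.eye(4)
    O[:3, :3], O[:3, 3] = R, t          # relative orientation [R t; 0 1]
    K2h = np.eye(4)
    K2h[:3, :3] = K2                    # K2 embedded as a 4x4 matrix
    N = np.array([[1. / w, 0., 0., 0.],
                  [0., 1. / h, 0., 0.],
                  [0., 0., 1., 0.],
                  [0., 0., 1., 0.]])    # map to texture coordinates (s, t, q, q)
    return N @ K2h @ O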

3.3.2 Local Error Aggregation

Aggregating the intensity difference values between the key image and the warped sensor image is performed by a recursive doubling approach, which is basically a successive downsampling procedure.

One iteration of the downsampling procedure is quite simple: the input texture is bound to four texture units and a quadrilateral covering the whole viewport is rendered. The texture coordinates for the 4 texturing units are jittered slightly, such that the correct adjacent pixels are accessed for each final fragment. The filtering mode for the source textures is set to GL_NEAREST. Since the aggregation window is fixed to an 8 × 8 rectangle, three iterations are applied.

3.3.3 Encoding of Integers in RGB Channels

Although the input images are grayscale images and one 8 bit gray channel is sufficient to represent the absolute difference image, summation of local errors is likely to generate overflows. Current generations of graphics cards support float textures, but at the time of our first attempts to employ the GPU for computer vision applications no pixel buffer format allowed color channels with floating point precision. Therefore we decided to utilize a slightly more complex method to perform error summation with 8 bit RGB channels. In the proposed implementation floating point textures are not required.

Our integer encoding assigns the least significant 6 bits of a larger integer value to the red channel, the middle 6 bits to the green channel and the remaining bits to the blue channel. The two most significant bits of the red and green channels are always zero. This encoding allows summation of four error values without loss of precision using a fragment program utilizing a dependent texture lookup. After (component-wise) summation of 4 input values the most significant bits of the red and green components of the register storing the sum are possibly set, hence this register requires an additional conversion to obtain the final error value with the desired encoding. This conversion is performed using a 256 by 256 texture map.

If more than four values are summed in one step, the number of spare bits needs to be adjusted, e.g. if 8 values are summed in one pass, the three most significant bits of the red and green channels must be reserved to avoid overflows.
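The following Python sketch illustrates only the arithmetic of this encoding; on the GPU the renormalization step is realized with the 256 by 256 lookup texture.

def encode(value):
    # red and green hold 6 bits each, blue holds the remaining high bits
    return (value & 0x3F, (value >> 6) & 0x3F, value >> 12)

def decode(rgb):
    r, g, b = rgb
    return r + (g << 6) + (b << 12)

def renormalize(rgb):
    # after summing four encoded values the red/green channels may exceed 6 bits;
    # re-encoding pushes the overflowing bits into the next channel
    return encode(decode(rgb))

# summing four encoded values channel-wise stays within 8 bits per channel
vals = [300, 170, 63, 500]
summed = tuple(sum(c) for c in zip(*(encode(v) for v in vals)))
assert decode(renormalize(summed)) == sum(vals)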

3.4 Performance Enhancements

As it turns out, the implementation described above still has performance bottlenecks that can be avoided by a careful design of the particular implementation.

3.4.1 Amortized Difference Image Generation

For larger image resolutions (e.g. 1024 × 1024) rendering of the corresponding mesh generated by the sampling points takes a considerable amount of time. In the 1-megapixel case the mesh consists of approximately 131 000 triangles, which must be rendered for every depth value (several hundred times in total). Especially on mobile graphics boards, mesh processing implies a severe performance penalty: stereo matching of two 256 × 256 pixel images shows similar performance on the evaluated desktop GPU and on the employed mobile GPU of a laptop, but matching 1-megapixel images takes twice as long on the mobile GPU.

In order to reduce the number of mesh drawings, up to four depth values are evaluated in one pass. We use multitexturing facilities to generate four texture coordinates for different depth values within the vertex program. The fragment shader calculates the absolute differences for these deformations simultaneously and stores the results in the four color channels (red, green, blue and alpha). Note that the mesh hypothesis is updated infrequently and the actually evaluated mesh is generated within the vertex shader by deforming the incoming vertices according to the current displacement.

The vertex program now has more work to perform, since four transformations (matrix-vector multiplications) are executed to generate texture coordinates for the right image for each vertex. Nevertheless, the obtained timing results (see Section 3.5) indicate a significant performance improvement by utilizing this approach. Several operations are executed only once for up to 4 mesh hypotheses: transferring vertices and transforming them into window coordinates, triangle rasterization setup and texture access to the left image.

3.4.2 Parallel Image Transforms

In contrast to Yang and Pollefeys [Yang and Pollefeys, 2003] we calculate the error within a window explicitly using multiple passes. In every pass four adjacent pixels are accumulated and the result is written to a temporary off-screen frame buffer (usually called pixel buffer or P-buffer for short). It is possible to set pixel buffers as destination for rendering operations (write access) or to bind a pixel buffer as a texture (read access), but combined read and write access is not available. In the default setting the window size is 8 × 8, therefore 3 passes are required. Note that we use a specific encoding of summed values to avoid overflow due to the limited accuracy of one color channel.

Executing this multipass pipeline to obtain the sum of absolute differences within a window requires several P-buffer activations to select the correct target buffer for writing. These switches turned out to be relatively expensive (about 0.15 ms per switch). In combination with the large number of switches, the total time spent within these operations comprises a significant fraction of the overall matching time (about 50% for 256 × 256 images). If the number of these operations can be reduced, one can expect a substantial increase in the performance of the matching procedure.

Instead of directly executing the pipeline in the innermost loop (requiring 5 P-buffer switches), we reorganize the loops to accumulate several intermediate results in one larger buffer, with temporary results arranged in tiles (see Figure 3.6). Therefore P-buffer switches are amortized over several iterations of the innermost loop. This flexibility in the control flow is completely transparent and does not need to be coded explicitly within the software. Those stages in the pipeline waiting for the input buffer to become ready are skipped automatically.

3.4.3 Minimum Determination Using the Depth Test

We have two procedures available to update the minimal error and optimal depth value: the first approach utilizes a separate pass employing a simple fragment program for the conditional update. This method works on a wider range of graphics cards (on some mobile GPUs in particular), but it is rather slow due to the necessary P-buffer activations (since the minimum computation cannot be done in-place). The alternative implementation employs Z-buffer tests for the conditional updates of the frame buffer in-place, but the range of supported graphics hardware is more limited. In order to utilize this simpler (and faster) method, the GPU must support user-defined assignment of z-values within the fragment shader (e.g. by using the ARB_FRAGMENT_PROGRAM OpenGL extension). Older hardware always interpolates z-values from the given geometry (vertices).

We use the rather simple fragment program shown in Figure 3.7 to obtain one scalar error value from the color coded error and to move this value to the depth register used by graphics hardware to test the incoming depth against the z-buffer. Using the depth test provided by 3D graphics hardware, the given index of the currently evaluated depth variation and the corresponding sum of absolute differences are written into the destination buffer if the incoming error is smaller than the minimum already stored in the buffer. Therefore a point-wise optimum over the evaluated depth values for the mesh vertices can be computed easily and efficiently.
easily <strong>and</strong> efficiently.


Figure 3.6: The modified pipeline to minimize P-buffer switches (diagram labels: difference images 1–4, buffers of size n × n and n/2 × n/2, pixel summation for every iteration, every four iterations, and once per block). Several temporary results are accumulated in larger pixel buffers arranged like tiles. Later passes operate on all those intermediate results and are therefore executed less frequently.

3.5 Results

We tested our hardware-based matching procedure on artificial and on real datasets. In all test cases the source images are grayscale images with a resolution of 1024 by 1024 pixels. For the real datasets the relative orientations between the stereo images are determined using the method described by Klaus et al. [Klaus et al., 2002].

We ran the timing experiments on a desktop PC with an Athlon XP 2700 and an ATI Radeon 9700 and on a laptop PC with a mobile Athlon XP 2200 and an ATI Radeon 9000 Mobility.

The artificial dataset comprises two images of a sphere mapped with an earth texture rendered by the Inventor scene viewer (Figure 3.8). The meshes obtained by our reconstruction method are displayed as point sets for easier visual evaluation.


PARAM depth_index = program.env[0];             # index of the currently tested depth variation
PARAM coeffs = { 1/256, 1/16, 1, 0 };           # weights used to decode the color coded error
TEMP error, col;
TEX col, fragment.texcoord[0], texture[0], 2D;  # fetch the color coded error value
DP3 error, coeffs, col;                         # restore the scalar error value
MOV result.color, depth_index;                  # output the depth index as the fragment color
MOV result.depth, error;                        # route the error to the depth test

Figure 3.7: Fragment program to transfer the incoming, color coded error value to the depth component of the fragment. The dot product (DP3) between the texture element and the coefficient vector restores the scalar error value encoded in the color channels.

Timing statistics for this dataset reconstructed at different resolutions are given in Table 3.1. The matching procedure performs 8 iterations with 7 tested depth variations for each hierarchy level. These values result in high quality reconstructions in reasonable time. Therefore the pipeline shown in Figure 3.5 is executed 56 times for each level. The number of levels varies from 4 to 6 depending on the given image resolution. The total number of evaluated mesh hypotheses is 224 (256x256), 280 (512x512) and 336 (1024x1024). At the highest resolution (1024x1024) each vertex is actually tested with 84 depth values out of a range of approximately 600 possible values. Because of limitations in graphics hardware we are currently restricted to images with power-of-two dimensions.

Figure 3.8: Results for the artificial earth dataset: (a) the key image, (b) the second image, (c) the reconstructed model.

In addition to the timing experiments we applied the proposed procedure to several real-world datasets consisting of stereo image pairs showing various buildings. The source images of these datasets are grayscale images resampled to 1024 × 1024 pixels to meet the power-of-two graphics hardware requirement. The source images and the reconstructed models are visualized in Figures 3.9–3.11. In Figure 3.10 the homogeneously textured regions showing the sky yield particularly poor reconstructions in these areas. The same holds for the repetitive pattern on the foreground lawn in Figure 3.11.


Hardware               Resolution   Matching time
Radeon 9700 Pro        256x256      0.106 s
                       512x512      0.198 s
                       1024x1024    0.501 s
Radeon 9000 Mobility   256x256      0.095 s
                       512x512      0.31 s
                       1024x1024    1.05 s

Table 3.1: Timing results for the sphere dataset on two different graphics cards.

Since the number of iterations is equal to the one chosen for the artificial dataset, the times required for dense reconstruction are similar.

Figure 3.9: Results for a dataset showing the yard inside a historic building: (a) the key image, (b) the second image, (c) the reconstructed model.

3.6 Discussion

This chapter presents a method to reconstruct dense meshes from stereo images with known relative pose, which is performed almost completely in programmable graphics hardware. Dense reconstructions can be generated for pairs of images with one-megapixel resolution in less than one second on the evaluated hardware platforms.

With the emergence of additional features provided by the GPU, the approach proposed in this chapter is extended and enhanced as described in the following chapters. The simple sum of absolute differences image similarity measure can be replaced by a more robust correlation function to achieve better results for real-world datasets. Additionally, the presented method can easily be extended to a multi-view setup at the cost of higher execution times. A true variational multi-view dense depth estimation framework performed by the GPU is presented in Chapter 6.


Figure 3.10: Results for a dataset showing an apartment house: (a) the key image, (b) the second image, (c) the reconstructed model. Unstructured regions showing the sky are poorly reconstructed due to the ambiguity in the local image similarity.

Another straightforward extension of the method described in this chapter addresses the generation of an optical flow field between two views. If no epipolar geometry is known or the static scene assumption is violated, the one-dimensional search along back-projected rays is replaced by a 2D disparity search space. Since a 3D reconstruction from a sole disparity field is not possible, we focused on the setting with known epipolar geometry, which allows 3D models to be generated.

Figure 3.11: Visual results for the Merton College dataset: (a) left image, (b) right image, (c) the depth image, (d) the reconstructed model as a 3D point cloud. The source images have a resolution of 1024 × 1024 pixels.


Chapter 4

GPU-based Depth Map Estimation using Plane Sweeping

Contents
4.1 Introduction
4.2 Plane Sweep Depth Estimation
4.3 Sparse Belief Propagation
4.4 Depth Map Smoothing
4.5 Timing Results
4.6 Visual Results
4.7 Discussion

4.1 Introduction

This chapter describes the implementation of a multi-view depth estimation method based on a plane-sweeping approach, which is accelerated by 3D graphics hardware. The goal of our approach is the generation of depth maps with suitable quality at interactive rates. The final depth extraction can be performed using a fast and simple winner-takes-all approach; alternatively, a time- and memory-efficient variant of belief propagation can be employed to obtain higher quality depth images.

4.2 Plane Sweep Depth Estimation

Plane sweep techniques in computer vision are simple and elegant approaches to image-based reconstruction from multiple views, since a rectification procedure as required by many traditional computational stereo methods is not needed. The 3D space is iteratively traversed by parallel planes, which are usually aligned with a particular key view (Figure 4.1). The plane at a certain depth from the key view induces homographies for all other views, thus the sensor images can be mapped onto this plane easily.

Figure 4.1: Plane sweeping principle (key view and sensor view). For different depths the homography between the reference plane and the sensor view varies. Consequently, the projected image of the sensor view changes with the depth according to the epipolar geometry.

If the plane at a certain depth passes exactly through the surface of the object to be reconstructed, the color values from the key image and from the mapped sensor images should coincide at appropriate positions (assuming constant brightness conditions). Hence, it is reasonable to assign the best matching depth value (according to some image correlation measure) to the pixels of the key view. By sweeping the plane through 3D space (i.e. varying the plane depth with respect to the key view) a 3D volume can be filled with image correlation values, similar to the disparity space image (DSI) in traditional stereo. Therefore the dense depth map can be extracted using global optimization methods if depth continuity or any other constraint on the depth map is required.

Note that a plane sweep technique in a two-frame rectified stereo setup coincides with traditional stereo methods for disparity estimation. In this case the homography between the plane and the sensor view is solely a translation along the X-axis.

There are several techniques to make dense reconstruction approaches more robust in the case of occlusions in a multi-view setup. Typically, occlusions are only modeled implicitly, in contrast to e.g. space carving methods, where the model generated so far directly influences the visibility information. Here we briefly discuss two approaches to implicit occlusion handling:

• Truncated scores: The image correlation measure is calculated between the key view and the sensor view, and the final score for the current depth hypothesis is the accumulated sum of the truncated individual similarities. The reasoning behind this approach is that the effect of occlusions between a pair of views on the total score should be limited, in order to favor good depth hypotheses supported by other image pairs.

• Best half-sequence selection: In many cases the set of images comprises a logical sequence of views, which can be totally ordered (e.g. if the camera positions are approximately on a line). Hence the set of images used to determine the score in terms of the key view can be split into two half-sequences, and the final score is the better score of these subsets. The motivation behind this approach is that occlusions with respect to the key view appear either in the left or in the right half-sequence.

Dense depth estimation using plane sweeping as described in this chapter is restricted to small-baseline setups, since for larger baselines occlusions should be modeled explicitly. Additionally, the inherent fronto-parallel surface assumption of correlation windows yields inferior results in wide-baseline cases.

4.2.1 Image Warping

In the first step, the sensor images are warped onto the current 3D key plane π = (n^⊤, d) using the projective texturing capability of graphics hardware. If we assume the canonical coordinate frame for the key view, the sensor images are transformed by the appropriate homography H with

H = K \left( R - t\, n^\top / d \right) K^{-1}.

K denotes the intrinsic matrix of the camera and (R|t) is the relative pose of the sensor view.

In order to utilize the vector processing capabilities of the fragment pipeline in an optimal manner, the (grayscale) sensor images are warped wrt. four plane offset values d simultaneously. All further processing works on a packed representation, where the four values in the color and alpha channels correspond to four depth hypotheses.
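A minimal numpy sketch of this plane-induced homography (variable names assumed; the intrinsic matrix K is taken to be the same for both views, as in the formula above):

import numpy as np

def plane_homography(K, R, t, n, d):
    # H = K (R - t n^T / d) K^{-1}, mapping key-view pixels to sensor-view pixels
    return K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)

def warp_pixel(H, u, v):
    # apply the homography to a single pixel and dehomogenize
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]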

4.2.2 Image Correlation Functions

After a sensor image is projected onto the current plane hypothesis, a correlation score for the current sensor view is calculated, and the scores for all sensor views are integrated into a final correlation score of the current plane hypothesis. The accumulation of the single image correlation scores depends on the selected occlusion handling policy: simple additive blending operations are sufficient if no implicit occlusion handling is desired. If the best half-sequence policy is employed, additive blending is performed for each individual subsequence and a final minimum-selection blending operation is applied.

To our knowledge, all published GPU-based dense depth estimation methods use the simple sum of absolute differences (SAD) or squared differences (SSD) for image dissimilarity computation (usually for performance reasons). By contrast, we have a set of GPU-based image correlation functions available, including the SAD, the normalized cross correlation (NCC) and the zero-mean NCC (ZNCC) similarity functions. The NCC and ZNCC implementations optionally use sum tables for an efficient implementation [Tsai and Lin, 2003]. Small row and column sums can be generated directly by sampling multiple texture elements within the fragment shader. Summation over larger regions can be performed using a recursive doubling approach similar to the GPU-based generation of integral images [Hensley et al., 2005]. Full integral image generation is also possible, but precision loss is observed for the NCC and ZNCC similarity functions in this case (see Section 4.2.2.2).

For longer image sequences one cannot presume constant brightness conditions across all images, hence an optional prenormalization step is performed, which subtracts the box-filtered image from the original one to compensate for changes in illumination conditions. If this prenormalization is applied, the depth maps obtained using the different correlation functions have similar quality.

4.2.2.1 Efficient Summation over Rectangular Regions

The image similarity functions described in the following section can be implemented efficiently by utilizing integral images (also known as summed-area tables in computer graphics). Integral images allow constant-time box filtering regardless of the window size [Crow, 1984]. Given the integral image of a source image, any box filtering can be performed in constant time using four image accesses (resp. texture lookups). This efficient box filtering approach can be extended to more complex higher-order filtering operations [Heckbert, 1986].

The single-pass procedure to calculate the integral image efficiently on a general purpose processor is slow when mapped onto SIMD architectures. Consequently, a different approach using a logarithmic number of passes to generate the integral image on the GPU is much more efficient [Hensley et al., 2005]. Note that the integral image requires a much higher precision of the color channels than the source image precision. Calculating and using integral images on the GPU has only become feasible with the emergence of floating point support on current graphics hardware.

Note that for very small window sizes the utilization of bilinear texture fetches, which are available on current graphics hardware essentially for free, is usually more efficient than the computation and application of integral images. Bilinear texturing allows the summation of four adjacent pixels with just one texture access, e.g. summing the values inside a 4x4 window can be done using 4 bilinear texture lookups (instead of 16 individual accesses). Consequently, in order to obtain the highest performance, suitably customized procedures are best for very small correlation windows.
best <strong>for</strong> very small correlation windows.


4.2.2.2 Normalized Correlation Coefficient

The widely used (zero-mean) normalized correlation coefficient for window-based local matching of two images X and Y is (where \bar X and \bar Y denote the means inside the rectangular region W)

ZNCC = \frac{\sum_{i \in W} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i \in W} (X_i - \bar X)^2} \, \sqrt{\sum_{i \in W} (Y_i - \bar Y)^2}},

which is invariant under (affine linear) changes of luminance between the images, but relatively costly to calculate. Using integral images the ZNCC can be calculated in constant time regardless of the correlation window size [Tsai and Lin, 2003], since

ZNCC = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/N}{\sqrt{\sum X_i^2 - (\sum X_i)^2/N} \, \sqrt{\sum Y_i^2 - (\sum Y_i)^2/N}},

where all sums range over W and N = |W|. From the above formula it can be seen that five integral images are required to calculate the ZNCC: the integral images for \sum X_i, \sum Y_i, \sum X_i^2, \sum Y_i^2 and finally \sum X_i Y_i. The precision requirement for the higher order sums is 8 + 8 + log2(512) + log2(512) = 34 bit for 512 × 512 source images. The 32 bit floating point format of current GPUs has a mantissa of 23 bit, and artefacts due to precision loss may occur. Figure 4.2 illustrates the reduced precision by depicting a ZNCC error image generated in software on a CPU and another one computed on the GPU. An increasing loss of precision can be seen towards the lower right corner of the image. Since the integral image generation starts from the upper left corner, the lower right portion has the highest precision requirements within the integral image.

Note that the precision requirements for the simple sums \sum X_i and \sum Y_i are 26 bit for 8 bit images with 512 × 512 pixels resolution. By subtracting the image mean in advance from the source image two additional precision bits can be saved: one by halving the magnitude of the source values and another one by exploiting the sign bit in the integral image.

Instead of creating full integral images, which allow box filtering with arbitrary window sizes, it is usually sufficient to sum the values within a given specific window, since we do not vary the aggregation window size during similarity score computation. Accumulation over larger windows can be performed using a recursive doubling scheme similar to the one used for integral image generation. Consequently, the precision requirements on the target buffer storing the aggregated values depend on the window size, and these are substantially lower than the requirements for integral images.
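A minimal numpy sketch of the shifted ZNCC formula above; box() is a placeholder for a window-sum operator (for instance built from the integral image sketch in Section 4.2.2.1), and N is the number of pixels in the window.

import numpy as np

def zncc_from_sums(X, Y, box, N):
    sx, sy = box(X), box(Y)
    sxx, syy, sxy = box(X * X), box(Y * Y), box(X * Y)
    num = sxy - sx * sy / N
    den = np.sqrt(np.maximum((sxx - sx**2 / N) * (syy - sy**2 / N), 1e-12))
    return num / den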


Figure 4.2: NCC images calculated on the CPU (left, (a)) and on the GPU (right, (b)) using integral images. Darker pixels indicate smaller similarity values. The image computed on the GPU has significant deviations, especially in the lower right regions.

4.2.3 Sum of Absolute Differences and Variants

The sum of absolute differences (SAD) is a widely used image similarity function because of its simple computation, its minimal precision requirements and its high performance:

SAD = \sum_{i \in W} |X_i - Y_i|,

where W denotes the aggregation window. It is, however, sensitive to illumination changes, which limits the use of the SAD for real-world applications.

Lighting changes in the scene can be incorporated by subtracting the local mean from the original image values, yielding a zero-mean sum of absolute differences (ZSAD):

ZSAD = \sum_{i \in W} |(X_i - \bar X) - (Y_i - \bar Y)|.

In contrast to the correlation coefficient, the subtracted local means cannot be moved outside the absolute value bars. Hence a technique similar to the shifting theorem for the correlation coefficient is not applicable, and the ZSAD is not directly suitable for efficient computation. In a first step we replace the true zero-mean intensity values X_i - \bar X resp. Y_i - \bar Y by the differences X_i - X^σ_i, where X^σ is a smoothed version of the image X, typically generated by box-filtering the original image. The same applies to Y.


The net effect of this approximation is that the normalization of the images can be performed once for the input images.

Hence, the first step is to calculate images \tilde X and \tilde Y, which are difference images between the original image and the smoothed one (i.e. \tilde X = X - X^σ and \tilde Y = Y - Y^σ). The approximated zero-mean sum of absolute differences then reads as a simple SAD operating on the transformed images:

ZSAD ≈ \sum_{i \in W} |\tilde X_i - \tilde Y_i|.

The SAD (and the approximated ZSAD) can be normalized to the range [0, 1] by appropriate division:

SAD = \frac{1}{|W|} \sum_{i \in W} |X_i - Y_i|,

if X_i ∈ [0, 1] and Y_i ∈ [0, 1] is assumed. An alternative normalized variant of the SAD is known as the Bray-Curtis (respectively Sorensen) distance:

NSAD = \frac{\sum_{i \in W} |X_i - Y_i|}{\sum_{i \in W} |X_i| + \sum_{i \in W} |Y_i|}

and

ZNSAD = \frac{\sum_{i \in W} |\tilde X_i - \tilde Y_i|}{\sum_{i \in W} |\tilde X_i| + \sum_{i \in W} |\tilde Y_i|}.

These similarity scores are between 0 and 1, where 0 indicates a perfect match between the two local windows.

Computing the NSAD (and the ZNSAD) between two images requires three integral images II(·) to be generated for every depth value:

• II(|X_i - Y_i|) to calculate the numerator of the NSAD efficiently,

• II(|X_i|) and II(|Y_i|) to compute the denominator of the NSAD formula.

For the ZNSAD, the integral images are computed for \tilde X_i and \tilde Y_i.

If the plane sweep is performed normal to an input view, II(|X_i|) must be calculated only once before the sweep. In the case of a rectified stereo setup, the integral images (resp. the box filtered images) of the mean-normalized inputs can be precomputed entirely before the sweep. For every depth (resp. disparity) value the integral image of the absolute difference image |X_i - Y_i| between the two views must be calculated.

Of course, the required sums over rectangular regions can also be obtained by direct summation, but such an approach is only suitable and efficient for small support window sizes.
sizes.


4.2.4 Depth Extraction

In order to achieve high performance for depth estimation, we primarily employ a simple winner-takes-all strategy to assign the final depth values. This approach can be implemented easily and efficiently on the GPU using the depth test for a conditional update of the current depth image hypothesis (see [Yang et al., 2002] and Section 3.4.3).

Unreliable depth values can be masked by a subsequent thresholding pass, which removes pixels with low image correlation from the obtained depth map.

If the resulting depth map is converted to 3D geometry, staircasing artefacts are typically visible in the obtained model. In order to reduce these artefacts an optional selective, diffusion-based depth image smoothing step is performed, which preserves true depth discontinuities larger than the steps induced by the discrete set of depth hypotheses (see Section 4.4).

4.3 Sparse Belief Propagation

Belief propagation (e.g. [Weiss and Freeman, 2001]) is an approximation technique for global optimization on graphs, which is based on passing messages along the arcs of the underlying graph structure. The algorithm iteratively refines the estimated probabilities of the hypotheses within the graph structure by updating the probability weighting of neighboring nodes. These updates are referred to as message passing between adjacent nodes. The belief propagation method maintains an array of probabilities called messages for every arc in the graph, hence this method requires substantial memory for larger graphs and hypothesis spaces. We denote the value of a message from node p going to node q for hypothesis d at time t by m^{(t)}_{p→q}(d). Here d ranges over the possible hypotheses at node q. After the belief propagation procedure has converged to a stable solution, the final hypothesis assignment to every node is typically extracted by taking the hypothesis with the maximum estimated a posteriori probability. We refer to Section 4.3.2 for the details on message passing and hypothesis extraction.

In image processing and computer vision applications this graph is usually induced by the rectangular image grid, with nodes representing pixels and arcs connecting adjacent pixels. Depth estimation integrating smoothness weights and occlusion handling can be formulated as a global optimization problem and solved with belief propagation methods [Sun et al., 2003]. Basic belief propagation methods are computationally demanding, but the special structure of the regularization function typically used in computer vision to enforce smooth depth maps can be exploited to obtain more efficient implementations [Felzenszwalb and Huttenlocher, 2004]. In particular, the Potts discontinuity cost function and the optionally truncated linear cost model allow an efficient linear-time message passing method. In the Potts model, equal depth values assigned to adjacent pixels imply no smoothness penalty, whereas any different adjacent depth values result in a constant regularization penalty. More formally, the smoothness cost V(d_p, d_q)
result in a constant regularization penalty. More <strong>for</strong>mally, the smoothness cost V (dp, dq)


4.3. Sparse Belief Propagation 51<br />

is zero, if dp = dq, <strong>and</strong> a constant λ otherwise. In the linear smoothness model we have<br />

V (dp, dq) = λ|dp − dq|.<br />

Our implementation of belief propagation to extract the depth map from image correlation values is based on the work proposed in [Felzenszwalb and Huttenlocher, 2004]. In contrast to previously proposed depth estimation techniques based on belief propagation, we apply the message passing procedure only to a promising subset of depth/disparity values. Consequently, the consumed memory and time are a fraction of those required by the original method.

Consider the following concrete example: a depth map with 512×512 pixels resolution should be extracted from 200 potential depth values. Traditional (dense) belief propagation requires about 4 × 512 × 512 × 200 message components to be stored (the factor 4 results from the utilized 4-neighborhood of pixels), which gives 800MB for 32 bit floating point components. But most of the 200 depth hypotheses per pixel can be rejected immediately because of low image similarities. If on average only 10 tentative depth hypotheses survive for every pixel, only 4 × 512 × 512 × 10 message components need to be stored, which results in 40MB of memory consumption. The actual memory footprint is somewhat larger, since additional data structures are required for sparse belief propagation.

We can adopt two of the three ideas proposed in [Felzenszwalb and Huttenlocher, 2004] for sparse belief propagation:

• The checker-board update pattern for messages can be used directly to halve the memory requirements.

• The two-pass method to compute the message updates in linear time for the Potts and the linear regularization can be modified to work for sparse representations as well (see Section 4.3.2).

Additionally, a coarse-to-fine approach to belief propagation for vision is proposed in [Felzenszwalb and Huttenlocher, 2004] to accelerate the convergence. The basic idea is the hierarchical grouping of pixels into coarser levels and to perform message passing on the reduced graphs. The results from coarser levels are used as initialization values for the next finer level. Since the hypothesis space (i.e. the range of admissible depth values) for a group of pixels in a coarser level consists of the union of all depth hypotheses valid for the individual pixels, the data structures become less sparse. In the example above, starting with 10 tentative depth values for every pixel, the next coarser level is comprised of 2 × 2 pixel blocks associated with up to 40 possible depth values. Hence, there is no direct improvement in the time complexity using a hierarchical approach for our proposed sparse belief propagation method.

4.3.1 Sparse Data Structures

4.3.1.1 Sparse Data Cost Volume During Plane-Sweep

Since belief propagation is a global optimization framework, a data structure similar to the disparity space image must be maintained, which stores the correlation value for every depth hypothesis and pixel. We propose a sparse representation to store tentative depth/correlation value pairs. One simple implementation would store exactly K depth/correlation pairs for every pixel, which is an appropriate approach in practice. In certain situations this uniform choice of the number of hypotheses to be stored for every pixel is not appropriate: in highly textured regions there are possibly very few tentative depth hypotheses, whereas in low textured areas the similarity measure is not discriminative and the choice of K may be too low to include all potential depth candidates. Consequently, we choose a more dynamic data structure, which stores at least K depth hypotheses (together with the corresponding correlation value) and additionally allocates a pool of a user-defined size, which stores the globally next best depth hypotheses.

For efficient update of this data structure after computing the image similarity for a certain depth plane, the K entries associated with every pixel comprise a heap sorted with respect to the correlation value. Maintaining the heaps for every pixel is relatively cheap, since every heap has exactly K elements. The dynamically assigned depth hypotheses are maintained in a heap structure as well. Updating this pool is more costly due to its relatively large size.
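A minimal CPU sketch of such a per-pixel structure is given below; it is not the actual GPU data structure, but illustrates the intended behaviour: each pixel keeps its K best (lowest-dissimilarity) depth/score pairs in a small heap, and a new candidate replaces the currently worst retained entry only if it scores better. The names PixelHeap and push_candidate are illustrative.

#include <vector>
#include <algorithm>
#include <cstddef>

struct DepthSample {
    float depth;   // tentative depth value of the swept plane
    float score;   // dissimilarity score (lower is better)
};

// Keeps the K best depth hypotheses of one pixel as a max-heap on the score,
// so the worst retained candidate is always at the front and cheap to evict.
struct PixelHeap {
    explicit PixelHeap(std::size_t k) : k_(k) { samples_.reserve(k); }

    void push_candidate(float depth, float score) {
        auto worse = [](const DepthSample& a, const DepthSample& b) {
            return a.score < b.score;   // max-heap on dissimilarity
        };
        if (samples_.size() < k_) {
            samples_.push_back({depth, score});
            std::push_heap(samples_.begin(), samples_.end(), worse);
        } else if (score < samples_.front().score) {
            std::pop_heap(samples_.begin(), samples_.end(), worse);
            samples_.back() = {depth, score};
            std::push_heap(samples_.begin(), samples_.end(), worse);
        }
        // Otherwise the candidate is worse than all retained hypotheses and
        // would at best be handed over to the global overflow pool.
    }

    std::size_t k_;
    std::vector<DepthSample> samples_;
};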

4.3.1.2 Sparse Data Cost Volume for Message Passing

After finishing the plane-sweep procedure to generate the data costs associated with every pixel and every tentative depth value, the gathered sparse data cost volume is restructured for efficient access during message passing. Whereas during the plane-sweep the image similarity value serves as primary key for efficient incremental updates, the sparse 1D distance transform performed during message updates requires a depth-sorted list of items. Consequently, the sparse data cost volume used in the message passing stage consists of an array of depth value/similarity value pairs for every pixel. In order to avoid memory fragmentation, a scheme similar to the compressed row storage format for sparse matrix representations is employed.
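One possible layout for this restructured cost volume, analogous to compressed row storage, is sketched below (the field names are illustrative, not taken from the actual implementation): all surviving depth/similarity pairs are packed into one contiguous array, sorted by depth per pixel, and an offset table records where each pixel's entries start.

#include <vector>
#include <cstdint>
#include <cstddef>

// Depth/similarity pair as used during message passing (sorted by depth per pixel).
struct CostEntry {
    float depth;
    float similarity;
};

// CRS-like sparse data cost volume: the entries of pixel i occupy
// entries[offset[i] .. offset[i+1]-1], sorted by increasing depth.
struct SparseCostVolume {
    std::vector<std::uint32_t> offset;   // size: width*height + 1
    std::vector<CostEntry>     entries;  // size: total number of surviving hypotheses

    const CostEntry* begin(std::size_t pixel) const { return entries.data() + offset[pixel]; }
    const CostEntry* end(std::size_t pixel)   const { return entries.data() + offset[pixel + 1]; }
    std::size_t count(std::size_t pixel)      const { return offset[pixel + 1] - offset[pixel]; }
};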

4.3.2 Sparse Message Update

Belief propagation uses repeated communication between adjacent pixels to strengthen or weaken the support of depth hypotheses. The iterative procedure updates the value of a message going from pixel p to its neighbor q at iteration t, m^{(t)}_{p→q}, according to the following rule:

\[ m^{(t)}_{p\to q}(d_q) := \min_{d_p} \Big( V(|d_p - d_q|) + D(d_p) + \sum_{s \in \mathcal{N}(p)\setminus q} m^{(t-1)}_{s\to p}(d_p) \Big), \qquad (4.1) \]

where d_p and d_q are tentative depth values at pixels p and q, respectively. V(·) is the regularization term and D(d_p) is the image similarity value for the depth d_p. The sum Σ_{s∈N(p)\q} m^{(t−1)}_{s→p}(d_p) denotes the incoming messages from the neighborhood of p excluding q. The values from the previous iteration are used to determine the incoming messages (as denoted by the superscript (t − 1)).

We utilize a linear regularization model, i.e.

\[ V(d) = \lambda\, d, \]

or a truncated linear approach with

\[ V(d) = \min \{ V_{\max}, \lambda\, d \}, \]

with a regularization weight λ.

After a user-specified number of iterations T, for each pixel p the depth hypothesis with the highest support (belief) is chosen as the actual depth:

\[ d^{\mathrm{result}}_p = \arg\min_{d_p} \Big\{ D(d_p) + \sum_{s \in \mathcal{N}(p)} m^{T}_{s\to p}(d_p) \Big\}. \]

4.3.2.1 Sparse 1D Distance Transform

For the linear regularization model the quadratic time complexity of message updates can be reduced to linear complexity using a two-pass scheme to calculate the min-convolution [Felzenszwalb and Huttenlocher, 2004]. Computing the min-convolution can be easily extended for sparse belief propagation. The procedure for the sparse 1D distance transform is illustrated in Figure 4.3 and outlined in Algorithm 2.

Figure 4.3: Determining the lower envelope using a sparse 1D distance transform. Solid lines represent given values of h[p_i] = D[p_i] + Σ_{s≠q} m_{s→p}[p_i] and dashed lines indicate inferred values h[q_i] from the distance transform.

The algorithm applies a forward and a backward pass to calculate the lower envelope in essentially the same manner as in the basic belief propagation framework. The main observation for the distance transform in the sparse setting is that only the potential depth hypotheses of the nodes forming the arc p → q of interest need to be considered. Consequently, the lower envelope is derived solely from the potential depth hypotheses associated with pixels p and q. In order to apply the forward and backward pass, these two sets of selected depth values need to be sorted into a common sequence. This is the first step in Algorithm 2.

Subsequently, the procedure embeds the given samples stored in the array h at the corresponding positions in the combined sequence f. The subsequent forward and backward passes propagate the distance values through the sequence. Focusing on the forward pass, the successive element f[i + 1] is updated to

min( f[i + 1], f[i] + λ |mergeddepths[i + 1] − mergeddepths[i]| ).

The backward pass is analogous.

Algorithm 2 Sparse variant of the 1D distance transform

Procedure Sparse-DT-1D
Input: h[], depths_p[], size_p, depths_q[], size_q, result m_{p→q}[]
  Do a merge-sort step to combine depths_p and depths_q to obtain mergeddepths with at most size_p + size_q entries
  Simultaneously, fill a temporary array f such that
    f[j] := h[i], if mergeddepths[j] = depths_p[i]
    f[j] := ∞, otherwise
  Perform the forward pass on f
  Perform the backward pass on f
  Fill in the result array m_{p→q}:
    m_{p→q}[i] = f[j], if mergeddepths[j] = depths_q[i]

The merge sort step stated in Algorithm 2 can be avoided by precomputing suitable arrays, but this approach is only slightly faster than using the inlined merge sort step and requires additional memory.
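The following C++ sketch of Algorithm 2 is a plain CPU version written under the assumption of ascending depth arrays, a pure linear penalty (the truncation against Vmax and any message normalization are omitted), and depth lists that share exact values where they overlap. It illustrates the control flow rather than the GPU implementation.

#include <vector>
#include <limits>
#include <cmath>
#include <cstddef>

// Sparse 1D distance transform (min-convolution with a linear penalty).
// h[i] is the cost of the i-th depth hypothesis of pixel p; the result
// m_pq[i] is the outgoing message for the i-th depth hypothesis of pixel q.
void sparse_dt_1d(const std::vector<float>& h,
                  const std::vector<float>& depths_p,
                  const std::vector<float>& depths_q,
                  float lambda,
                  std::vector<float>& m_pq)
{
    const float INF = std::numeric_limits<float>::infinity();

    // 1. Merge the two sorted depth lists into a common sequence and embed h.
    std::vector<float> merged, f;
    std::size_t i = 0, j = 0;
    while (i < depths_p.size() || j < depths_q.size()) {
        const bool take_p = (j == depths_q.size()) ||
                            (i < depths_p.size() && depths_p[i] <= depths_q[j]);
        if (take_p) {
            merged.push_back(depths_p[i]);
            f.push_back(h[i]);                                  // given sample
            if (j < depths_q.size() && depths_q[j] == depths_p[i]) ++j;  // shared depth
            ++i;
        } else {
            merged.push_back(depths_q[j]);
            f.push_back(INF);                                   // value to be inferred
            ++j;
        }
    }

    // 2. Forward and backward passes propagate the lower envelope.
    for (std::size_t k = 1; k < f.size(); ++k)
        f[k] = std::min(f[k], f[k - 1] + lambda * std::fabs(merged[k] - merged[k - 1]));
    for (std::size_t k = f.size(); k-- > 1; )
        f[k - 1] = std::min(f[k - 1], f[k] + lambda * std::fabs(merged[k] - merged[k - 1]));

    // 3. Read the message values back at the depth positions of pixel q.
    m_pq.assign(depths_q.size(), INF);
    std::size_t k = 0;
    for (std::size_t q = 0; q < depths_q.size(); ++q) {
        while (merged[k] < depths_q[q]) ++k;   // merged contains every depth of q
        m_pq[q] = f[k];
    }
}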

4.4 Depth Map Smoothing

If the 3D models generated by the plane-sweep procedure are visualized directly, staircase artefacts induced by the discrete set of depth hypotheses are often clearly visible. If several individual depth maps resp. the induced 3D meshes are combined into one final model (e.g. as described in Chapter 8), these artefacts are typically removed by suitable averaging of the single models, and the smoothing procedure proposed in this section is not necessary. Otherwise, a depth smoothing approach as described in this section, which selectively removes the staircase effects without filtering larger depth discontinuities, can be applied.

In the following we assume that the tentative depth values of every pixel are evenly spaced in a user-specified interval, and successive depth values vary by a constant depth difference T. Hence, depth variations between neighboring pixels in the magnitude of T (or a small multiple of T) indicate potential regions for depth map smoothing. We perform

this selective filtering approach by applying a diffusion procedure to minimize

\[ \min_d \int \Big( (d - d_0)^2 + \mu \, \| W(p) \cdot \nabla d \|^2 \Big) \, dp . \]

In this term d_0(·) denotes the depth map (a function of the pixel position p) generated by the plane-sweeping method in the first place, d(·) is the final smoothed depth map, and W(·) is a weighting vector described below. µ is a user-specified weight to balance the data term (d − d_0)^2 and the regularization term ‖W(p) · ∇d‖^2.

In order to define the weight W(p) at pixel position p, the original depth map d_0 is sampled at position p and at its four neighbors, comprising a vector N = (d_0^E, d_0^W, d_0^N, d_0^S). If the depth difference |d_0 − d_0^{(·)}| is smaller than T (or another user-given threshold), the diffusion process is allowed in the corresponding direction and the appropriate component of W(p) is set to one. All other components are set to zero to inhibit the diffusion.

In addition to the directional gradient (i.e. the finite differences) in the source depth map, confidence information can be incorporated into W as well. Depth values for pixels with low confidence (e.g. detected by low image similarity) result in directional diffusion from confident pixels to unconfident ones by an appropriate update of W. We build a confidence map by assigning one to pixels with confident depths and zero otherwise. This map is based on hard-thresholding of the employed image similarity measure for the extracted depth value. The corresponding component of W is multiplied by the confidence map entries sampled for the neighboring pixels.
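To make the selective diffusion concrete, the sketch below performs one Jacobi-style relaxation step of the discretized energy on the CPU; depth0 and conf stand for the initial depth map and the binary confidence map, thresh for the user-given threshold (e.g. a small multiple of the depth step T), and the whole routine only illustrates the weighting logic, not the GPU solver described in Chapter 6.

#include <vector>
#include <cmath>

// One Jacobi-style relaxation step for the selective depth diffusion.
// depth0: initial depth map from plane sweeping, conf: binary confidence map (0 or 1),
// depth_in/depth_out: current and updated smoothed depth maps; all arrays have size w*h.
// thresh: maximum depth step across which diffusion is allowed, mu: smoothness weight.
void diffuse_step(const std::vector<float>& depth0,
                  const std::vector<float>& conf,
                  const std::vector<float>& depth_in,
                  std::vector<float>& depth_out,
                  int w, int h, float thresh, float mu)
{
    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const int idx = y * w + x;
            float num = depth0[idx];   // the data term pulls towards the original depth
            float den = 1.0f;
            for (int k = 0; k < 4; ++k) {
                const int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                const int nidx = ny * w + nx;
                // Diffusion is enabled only across small (staircase-sized) steps of the
                // original depth map and only towards confident neighboring pixels.
                const float wgt = (std::fabs(depth0[idx] - depth0[nidx]) < thresh)
                                  ? conf[nidx] : 0.0f;
                num += mu * wgt * depth_in[nidx];
                den += mu * wgt;
            }
            depth_out[idx] = num / den;
        }
    }
}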

This diffusion procedure can again be executed by graphics hardware to increase the performance. Since Chapter 6 is entirely dedicated to variational methods for multi-view vision, we postpone the detailed description of the GPU-based implementation of diffusion processes and variational approaches in general to that chapter.

4.5 Timing Results

In this section we provide more detailed timing results for GPU-based depth estimation using the plane-sweeping approach. The benchmarking platform uses a P4 3GHz CPU and a NVidia GeForce 6800GTO GPU. Since the adjustable parameters of our implementation have many degrees of freedom (image similarity score, aggregation window dimensions, number of used source images, etc.), a tabular representation of the obtained timing results, given in Table 4.1, is preferred over a graphical representation. The input for the depth estimation method consists of three grayscale source images at the resolution specified in the appropriate column (512 × 512 or 1024 × 1024). The use of power-of-two image dimensions is caused by the partial support of graphics hardware for non-power-of-two textures. The timing results given in this table essentially reflect the performance of applying the homography to the sensor images and calculating the stated dissimilarity score, since the time used for the actual depth extraction is negligible. Note that these timings are mostly insensitive to the provided image content.

Resolution    #depth planes  Aggr. window  Dissimilarity score  Time
512 × 512     200            5 × 5         SAD                   0.918s
                                           ZNSAD                 1.573s
                                           NCC                   1.647s
                                           ZNCC                  2.344s
                             9 × 9         SAD                   1.362s
                                           ZNSAD                 2.426s
                                           NCC                   2.481s
                                           ZNCC                  3.591s
              400            5 × 5         SAD                   1.699s
                                           ZNSAD                 3.058s
                                           NCC                   3.188s
                                           ZNCC                  4.611s
                             9 × 9         SAD                   2.579s
                                           ZNSAD                 4.774s
                                           NCC                   4.855s
                                           ZNCC                  7.103s
1024 × 1024   200            5 × 5         SAD                   3.772s
                                           ZNSAD                 7.096s
                                           NCC                   7.402s
                                           ZNCC                 10.861s
                             9 × 9         SAD                   6.059s
                                           ZNSAD                11.446s
                                           NCC                  11.656s
                                           ZNCC                 17.206s
              400            5 × 5         SAD                   7.540s
                                           ZNSAD                14.146s
                                           NCC                  14.842s
                                           ZNCC                 21.684s
                             9 × 9         SAD                  11.973s
                                           ZNSAD                22.863s
                                           NCC                  23.281s
                                           ZNCC                 34.379s

Table 4.1: Timing results for the plane-sweeping approach on the GPU with winner-takes-all depth extraction at different parameter settings and image resolutions.

At higher resolutions, the expected theoretical ratios between the run-times of the various similarity scores are attained. Every score uses one or several accumulation passes to calculate Σ_{i∈W} op(X_i, Y_i), which comprises the dominant fraction of the total run-time. The SAD requires only one accumulation pass (Σ_{i∈W} |X_i − Y_i|), whereas the ZNSAD resp. the NCC need two passes, and finally the ZNCC performs three invocations of the accumulation procedure.∗ Hence, the observed ratios of approximately 1:2:2:3 for the run-times of the evaluated correlation scores can be explained.

Sparse belief propagation for the final depth extraction is much more costly in terms of computation time, as illustrated in Figure 4.4. The solid graph displays the required total run-time against the number of maintained heap entries for sparse belief propagation. This graph shows essentially linear behavior, since the linear-time message passing dominates the heap construction with its O(K log K) time complexity. For comparison, the dashed line depicts the runtime of the pure winner-takes-all approach. Sparse belief propagation with just one heap entry requires about 5.8s, whereas the equivalent winner-takes-all method needs approximately 3s for these settings. The corresponding depth images obtained for the utilized dataset are shown later in Section 4.6.

Figure 4.4: Sparse belief propagation timing results wrt. the number of heap entries K. The image and depth map resolution is 512 × 512 pixels and 200 depth hypotheses are evaluated using a 7 × 7 ZNCC image similarity score. (Plot axes: time in sec vs. number of sparse BP entries; curves: BP times and WTA time.)

∗ Recall Section 4.2.2. Additionally, the summations involving only the key image can be precomputed.



4.6 Visual Results

In this section we provide depth maps and 3D models for real datasets in order to demonstrate the performance of our GPU-based depth estimation procedure and to indicate the differences between the winner-takes-all (WTA) depth extraction approach and the sparse belief propagation method. All source images are resampled to a resolution of 512 × 512 pixels, since images with power-of-two dimensions are still better supported on graphics hardware.

The Landhaus dataset shown in Figures 4.5 and 4.6 represents a historical statue embedded into a building facade. Three grayscale images with small baselines are used for depth estimation. At first, Figure 4.5 shows depth images generated by the winner-takes-all and by the sparse belief propagation approach at different numbers of maintained heap entries K. 200 potential depth values are examined in all cases. The reported timings correspond to the values displayed in Figure 4.4. Most notably, belief propagation enhances the depth maps in the textureless wall regions on either side of the statue itself. Additionally, Figure 4.6 shows two 3D models represented as colored point sets, obtained by a WTA depth extraction step and by a sparse belief propagation procedure using 20 surviving depth entries. Both models look relatively similar and only a closer inspection reveals the outliers. If the models are rendered as shaded triangular meshes as in Figure 4.7, the noisy structure of the WTA result is clearly manifested. Note that many outliers found in the initial depth maps can be removed by the subsequent depth image fusion procedure, which generates a proper 3D model from a set of depth maps.

Three source images of another statue dataset and the respective depth results are shown in Figure 4.8. 400 tentative depth planes are evaluated on three adjacent images with small baselines. Since the dark background scenery to the left and right of the statue is out of the plane-sweep range, the depth image has poor quality in these regions. Belief propagation significantly smooths the depth map, especially near depth discontinuities.

4.7 Discussion

GPU-based plane-sweeping procedures allow the efficient generation of depth images from multiple small-baseline images. Several image dissimilarity measures are available in our implementation, which are efficiently calculated on graphics hardware and give good results even for varying lighting conditions.

In case of highly textured scenes a final winner-takes-all depth extraction method is sufficient and fast enough to allow almost interactive feedback to the user. Optionally, a sparse belief propagation method is proposed, which significantly enhances the depth map in ambiguous regions.

Future work needs to address a qualitative and quantitative comparison of traditional belief propagation and our proposed sparse counterpart. The question whether the early rejection of unpromising depth values can have a negative impact on the extracted depth maps is still unresolved. Additionally, even sparse belief propagation is 5 to 10 times slower than the (fully hardware accelerated) winner-takes-all strategy, which opens the question whether further performance enhancements are possible for sparse BP.

In Chapter 7 a GPU-based one-dimensional energy minimization approach based on the dynamic programming principle is presented.



(a) Sensor image. (b) Without BP (WTA); 3s. (c) BP, K = 10; 16.5s. (d) BP, K = 20; 29.5s. (e) BP, K = 30; 40.3s. (f) BP, K = 40; 50.1s.

Figure 4.5: Depth images with and without belief propagation for the Landhaus dataset. With more allowed heap entries K, the amount of noisy pixels in textureless regions is reduced, but the runtime increases accordingly.



(a) Without BP (WTA). (b) With BP (K = 20).

Figure 4.6: Point models with and without belief propagation.

(a) Without BP (WTA). (b) With BP (K = 20).

Figure 4.7: Shaded triangular mesh models with and without belief propagation.



(a) Left image (b) Middle (sensor) image (c) Right image<br />

(d) Without BP (WTA), 6.7s (e) With BP, 37s<br />

Figure 4.8: Depth images with <strong>and</strong> without belief propagation


Chapter 5

Space Carving on 3D Graphics Hardware

Contents

5.1 Introduction . . . 63
5.2 Volumetric Scene Reconstruction and Space Carving . . . 64
5.3 Single Sweep Voxel Coloring in 3D Hardware . . . 66
5.4 Extensions to Multi Sweep Space Carving . . . 70
5.5 Experimental Results . . . 72
5.6 Discussion . . . 73

5.1 Introduction<br />

This chapter presents a direct scene reconstruction approach fully accelerated by graphics<br />

hardware. It shares the plane-sweep principle to obtain a model from multiple images with<br />

the method discussed in the previous chapter. In contrast to the plane sweep based depth<br />

estimation approach, the voxel coloring <strong>and</strong> space carving implementations proposed in<br />

this chapter generate a true 3D model from a large set of input views directly.<br />

Voxel coloring [Seitz <strong>and</strong> Dyer, 1997] <strong>and</strong> its derivatives incorporate multiple, optionally<br />

wide-baseline views simultaneously, <strong>and</strong> produce directly volumetric 3D models.<br />

Methods derived from the voxel coloring approach test a large number of voxels <strong>for</strong> photoconsistency<br />

<strong>and</strong> are there<strong>for</strong>e rather slow. Reported calculation times <strong>for</strong> voxel coloring<br />

range from several seconds <strong>for</strong> low resolutions up to hours <strong>for</strong> high quality models.<br />

In this chapter we address efficient implementations <strong>for</strong> voxel coloring <strong>and</strong> space carving<br />

exploiting commodity 3D graphics cards. Our current implementation is based on OpenGL<br />

using the fragment shader extension (ATI fragment shader in particular). The hardware requirements are rather modest; in particular any ATI Radeon 8500 or better is supported by our implementation. Medium resolution models are generated at interactive rates on

present-day graphics hardware, whereas high resolution models are typically obtained after<br />

a few seconds. There are at least two application scenarios, which can benefit from a<br />

fast voxel coloring implementation: at first, our implementation provides a fast preview<br />

<strong>for</strong> more highly sophisticated algorithms. The second scenario addresses improved functionality<br />

of plenoptic image editing: modifications in one or several images can be used<br />

to update the 3D model instantly. After recalculating the new model, these changes are<br />

propagated to the remaining images as well. Thus, specular highlights on surfaces <strong>and</strong><br />

similar flaws can be removed interactively to improve the quality of the generated 3D<br />

model.<br />

5.2 Volumetric Scene Reconstruction <strong>and</strong> Space Carving<br />

Voxel coloring [Seitz <strong>and</strong> Dyer, 1997] generates a volumetric model by analyzing the consistency<br />

of scene voxels. As the voxel space is traversed using a plane sweeping approach,<br />

the state of each voxel is determined. For scenes without translucent objects a voxel can<br />

be classified either as empty or opaque. During the voxel coloring procedure voxels are<br />

projected into the input images <strong>and</strong> the distribution of the corresponding pixel values<br />

is used to determine the state of each voxel. A so-called photo-consistency (or color-consistency) measure decides whether a voxel is on the surface of a scene object, i.e. the

voxel is opaque. This method is conservative in the sense that only assured inconsistent<br />

voxels are labeled as empty. There<strong>for</strong>e already processed voxels can be used to determine<br />

visibility of voxels with respect to the input views.<br />

In order to traverse the voxels in correct depth by a simple plane sweep, the placement<br />

of cameras is restricted by the so called ordinal visibility constraint. This constraint<br />

ensures, that voxels are visited prior to voxels they occlude. In [Seitz <strong>and</strong> Dyer, 1999] it is<br />

shown, that this visibility constraint is satisfied if the scene to be reconstructed is outside<br />

the convex hull of the camera centers. One typical camera configuration suitable <strong>for</strong> voxel<br />

coloring <strong>and</strong> possible slices used <strong>for</strong> reconstruction are shown in Figure 5.1.<br />

Several extensions of voxel coloring were proposed to allow more general<br />

camera placements. Space carving [Kutulakos <strong>and</strong> Seitz, 2000], generalized voxel<br />

coloring [Culbertson et al., 1999] <strong>and</strong> multi-hypo<strong>thesis</strong> voxel coloring [Eisert et al., 1999]<br />

remove the limitations on camera positions. Space carving per<strong>for</strong>ms multiple iterations<br />

of voxel coloring <strong>for</strong> different sweep directions. Only a suitable subset of all input views<br />

is used <strong>for</strong> each sweep.<br />

A crucial question is how to measure color consistency: the original voxel coloring<br />

approach utilized the variance of colors from projected voxels to determine consistency.<br />

Stevens et al. [Stevens et al., 2002] propose a histogram-based consistency metric. In their<br />

approach the footprint of a voxel in an image contains several pixels, which are organized<br />

in a histogram. A voxel is consistent, if the histograms of the footprints are not pairwise<br />

disjoint. The consistency measure presented by Yang et al. [Yang et al., 2003] handles non-Lambertian, specular surfaces explicitly.

Figure 5.1: A possible configuration for plane sweeping through the voxel space. The camera positions are restricted, such that voxels in subsequent layers can only be occluded by already processed voxels. (The sketch shows the voxel volume traversed as parallel layers with depth indices 1, 2, 3, …, 8.)

Voxel coloring is a computationally expensive procedure, which typically requires<br />

at least tens of seconds up to tens of minutes to compute the reconstruction. Several<br />

researchers proposed improved implementations <strong>for</strong> voxel coloring, e.g. Prock <strong>and</strong><br />

Dyer [Prock <strong>and</strong> Dyer, 1998] primarily utilize a hierarchical oct-tree representation to<br />

speed up voxel coloring. Additionally, they use graphics hardware to speed up certain calculations.<br />

Their multi-resolution voxel coloring method needs about 15s to generate a reconstruction with 256³ voxels. However, a hierarchical, multi-resolution approach to volumetric 3D reconstruction can potentially miss scene details. Sainz et al. [Sainz et al., 2002] use texture mapping features of 3D graphics hardware to accelerate the computations. Nevertheless, a 256³ voxel model requires several minutes to be computed even on recent hardware.

Seitz <strong>and</strong> Kutulakos [Seitz <strong>and</strong> Kutulakos, 2002] present an image editing approach<br />

<strong>for</strong> multiple images of a 3D scene. Changes in one image are propagated to other images<br />

by using an initially generated voxel model of the scene. There<strong>for</strong>e direct manipulation of<br />

surface textures <strong>and</strong> other image editing operations are possible. Image editing is limited to<br />

methods, which do not require a complete volumetric reconstruction step to propagate the<br />

modifications. With our efficient space carving implementation, it is possible to allow more<br />

general editing methods useful <strong>for</strong> a user-driven interactive refinement of voxel models,<br />

since the volumetric reconstruction can be generated almost instantly from altered input<br />

images.



5.3 Single Sweep Voxel Coloring in 3D Hardware<br />

In this section we describe the hardware based implementation of voxel coloring. This<br />

description applies to the case of a single sweep <strong>for</strong> camera configurations satisfying the<br />

ordinal visibility constraints [Seitz <strong>and</strong> Dyer, 1997]; we will discuss the extensions required<br />

<strong>for</strong> the multi sweep case in Section 5.4.<br />

The input <strong>for</strong> our method consists of N resampled color images <strong>and</strong> the corresponding<br />

projection matrices, <strong>and</strong> a bounding box denoting the space volume to be reconstructed.<br />

The bounding box of the volume to be reconstructed is organized as a stack of parallel planes. These planes are traversed in a front-to-back ordering during the reconstruction procedure. The algorithm maintains a depth map for every camera, which stores the depth with respect to the camera position of the model reconstructed so far. For each plane the algorithm executes the following steps:

1. The images of the camera views are projected onto the current plane and a consistency measure is evaluated.

2. Surface pixels (voxels) are determined by thresholding the consistency map.

3. For each camera view the associated depth map is updated by rendering the currently reconstructed voxel layer according to the input views.

At the end of each iteration a layer of voxels is obtained and can be used for further processing.
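To make the control flow of one sweep explicit, the loop below iterates over the depth planes and performs the three steps listed above; the three passes are supplied as callbacks, since they correspond to the GPU render passes described in the following subsections, and all names here are illustrative rather than part of the actual implementation.

#include <vector>
#include <functional>

// One consistency/opacity layer of the voxel volume.
struct Layer { std::vector<float> consistency; std::vector<unsigned char> opaque; };

// One front-to-back sweep through the voxel volume. The three GPU passes of
// Section 5.3 are supplied as callbacks, so this sketch only fixes the control flow.
void single_sweep(int num_planes,
                  const std::function<Layer(int)>& project_and_score,              // step 1
                  const std::function<void(Layer&)>& threshold_layer,              // step 2
                  const std::function<void(const Layer&, int)>& update_depth_maps, // step 3
                  std::vector<Layer>& model)
{
    for (int plane = 0; plane < num_planes; ++plane) {
        Layer layer = project_and_score(plane);  // project views onto the plane, score consistency
        threshold_layer(layer);                  // keep only photo-consistent, non-background voxels
        update_depth_maps(layer, plane);         // record new opaque voxels in every view's depth map
        model.push_back(layer);                  // finished layer, e.g. copied into a 3D texture
    }
}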

Figure 5.2 illustrates the first step in the procedure to obtain the color of a voxel<br />

with respect to a particular input view. Perspective texture mapping is combined with<br />

a depth test against the so far available depth map to sieve unoccluded voxels. This<br />

procedure is repeated <strong>for</strong> every input view to accumulate the necessary in<strong>for</strong>mation <strong>for</strong><br />

color consistency calculation.<br />

The following sections describe the steps per<strong>for</strong>med in our implementation in more<br />

detail.<br />

5.3.1 Initialization<br />

In addition to the currently calculated voxel layer, the algorithm maintains a depth map for every input view to test the visibility of voxels. Since voxel layers are processed in a front-to-back ordering, it is sufficient to use bitmaps to represent the depth map (pixels with value 1 indicate empty space along the line-of-sight, whereas value 0 denotes rays with already processed opaque voxels). In this work we use range images for the depth maps, with gray levels indicating the depth of the voxel layer, for better visual feedback.

At the beginning of the sweep these depth maps are cleared with a value indicating empty voxels (i.e. 1). Additionally, we need to handle voxels that are outside the viewing volume of a camera as well (since other cameras can possibly see these voxels). We set the texture coordinate wrapping mode to GL_CLAMP to handle voxels outside the frustum correctly. Whenever a depth outside the frustum is accessed, a minimal depth value (0) is returned. Note that only voxels in front of the camera can be culled against the viewing frustum, therefore all camera positions must be entirely outside the reconstructed volume.

Figure 5.2: Perspective texture mapping using visibility information. The original input image (depicted on the leftmost quad) is filtered using the depth map (in the middle), and only unoccluded pixels are rendered on the current voxel layer.

5.3.2 Voxel Layer Generation<br />

With the knowledge of the depth maps generated <strong>for</strong> every view so far, an estimate <strong>for</strong><br />

photo-consistency can be calculated. We accumulate the consistency value very similar<br />

to the method proposed by Yang et al. [Yang et al., 2002]. In order to obtain the color<br />

of a voxel as seen from a particular input view, projective texture mapping is applied to<br />

determine the color hypo<strong>thesis</strong> <strong>for</strong> every voxel in the current layer. The color hypotheses<br />

<strong>for</strong> all visible views are accumulated to obtain a consistency score <strong>for</strong> each voxel.<br />

Using the color variance as the consistency function is suboptimal on graphics hardware. First, a significant number of passes is needed to calculate the variance∗, and the squaring operation causes numerical problems due to the limited precision available on the GPU.

∗ One sweep over all input views is required to count the number of visible views for every voxel; another sweep is required to calculate the mean, and a third sweep is required to obtain the variance.

A simple consistency measure is the length of the interval generated by the color hypotheses for a voxel, which can be easily computed on graphics hardware and turned out to result in reasonable reconstructions. More formally, the consistency value c of a voxel projected to the pixel with color c_i = (c_i.r, c_i.g, c_i.b) in input view i is assigned to

\[ c = \max_{j \in \{r,g,b\}} \big( \max_i \, c_i.j - \min_i \, c_i.j \big) . \]

If the color hypotheses have a significant disparity, then the interval is too large and the voxel is labeled as inconsistent. Calculation of the interval length can be done with two complete sweeps over the input views: the first sweep uses a blending equation set to GL_MIN and the second sweep sets the blending equation to GL_MAX. A final pass calculates the length of the interval, but this step can be integrated into the thresholding step to determine consistent voxels.
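For clarity, the interval-length consistency measure can also be stated as the following CPU reference; the GPU implementation accumulates the same per-channel minima and maxima over all visible views using the GL_MIN/GL_MAX blending passes mentioned above, and the threshold is user-defined as in the text.

#include <vector>
#include <algorithm>

struct Rgb { float r, g, b; };

// Interval-length consistency: the largest per-channel spread of the color
// hypotheses gathered from all views that see the voxel (smaller is more consistent).
float interval_consistency(const std::vector<Rgb>& hypotheses)
{
    if (hypotheses.empty()) return 0.0f;
    Rgb lo = hypotheses.front(), hi = hypotheses.front();
    for (const Rgb& c : hypotheses) {          // corresponds to the GL_MIN and GL_MAX passes
        lo.r = std::min(lo.r, c.r); hi.r = std::max(hi.r, c.r);
        lo.g = std::min(lo.g, c.g); hi.g = std::max(hi.g, c.g);
        lo.b = std::min(lo.b, c.b); hi.b = std::max(hi.b, c.b);
    }
    return std::max({hi.r - lo.r, hi.g - lo.g, hi.b - lo.b});
}

// A voxel is labeled consistent (opaque) if the spread stays below the threshold.
bool is_consistent(const std::vector<Rgb>& hypotheses, float threshold)
{
    return interval_consistency(hypotheses) < threshold;
}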

The final result of this step is an opacity bitmap (stored in an off-screen pixel buffer)<br />

indicating consistent voxels of the currently processed layer. This binary image constitutes<br />

one slice of the final volumetric model <strong>and</strong> is used to update the visibility in<strong>for</strong>mation<br />

(Section 5.3.3). In our implementation the opacity of a voxel is stored in the alpha channel<br />

<strong>and</strong> the mean color of the voxel is stored in the remaining channels.<br />

In order to achieve high per<strong>for</strong>mance we exploit several features of graphics hardware:<br />

Visibility Determination Only views that are actually able to see a voxel contribute to<br />

the consistency value <strong>and</strong> image pixels from occluded cameras should be ignored. We employ<br />

the alpha test functionality <strong>for</strong> visibility calculation. The depth index of the current<br />

voxel layer is compared with the value stored in the depth map <strong>for</strong> the appropriate view.<br />

Pixels that fail the alpha test are discarded <strong>and</strong> are there<strong>for</strong>e ignored during consistency<br />

calculation.<br />

Note that it is possible to count the number of visible cameras <strong>for</strong> a voxel efficiently<br />

using the stencil buffer. Using this count it is easily possible to extract only surface voxels<br />

of the model.<br />

Selection of Consistent Voxels Voxels of the current layer are labeled as opaque if<br />

they are photo-consistent <strong>and</strong> if they are not part of the background. In our implementation<br />

dark pixels with an intensity value below some user-defined threshold are treated as<br />

background pixels <strong>and</strong> the state of the voxels is set to empty.<br />

Additional Processing At this stage of the procedure, additional processing of the voxel bit-plane can be applied. In particular, prior knowledge from previous sweeps (see Section

5.4) can be used to refine the generated slice. Furthermore, the generated voxel slice<br />

can be copied into a 3D texture used <strong>for</strong> direct visualization of the obtained volumetric<br />

model.



5.3.3 Updating the Depth Maps<br />

After determining filled voxels in the current layer, the depth maps must be updated to<br />

reflect occlusions of the additional solid voxels. For each input view the depth map is<br />

selected as rendering target <strong>and</strong> the corresponding camera matrix is used <strong>for</strong> projection.<br />

The blending mode is set to GL_MIN to achieve a conditional update of depth values. We apply a small fragment program to filter empty voxels by assigning a maximum depth value to these pixels. Consequently, transparent voxels do not affect the depth map.
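The effect of this update can be written as a small CPU reference (array names are illustrative): opaque voxels conditionally lower the stored depth, while empty voxels are mapped to the maximum depth, which is exactly why they cannot affect the depth map under GL_MIN blending.

#include <vector>
#include <algorithm>
#include <cstddef>

// Update one view's depth map after a voxel layer has been classified.
// opaque[i] marks filled voxels of the current layer projected into this view,
// layer_depth is the depth index of the layer, and max_depth plays the role
// assigned to empty voxels by the fragment program.
void update_depth_map(std::vector<float>& depth_map,
                      const std::vector<unsigned char>& opaque,
                      float layer_depth, float max_depth)
{
    for (std::size_t i = 0; i < depth_map.size(); ++i) {
        // Empty voxels contribute max_depth and therefore never change the map;
        // opaque voxels conditionally lower the stored depth (GL_MIN behaviour).
        const float candidate = opaque[i] ? layer_depth : max_depth;
        depth_map[i] = std::min(depth_map[i], candidate);
    }
}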

Figure 5.3 shows the successive update of depth maps <strong>for</strong> two input views. Snapshots<br />

of the depth map were taken after 25%, 50% <strong>and</strong> 100% of the reconstruction process.<br />

(a) view 1: 25% (b) 50% (c) 100%<br />

(d) view 2: 25% (e) 50% (f) 100%<br />

Figure 5.3: Evolution of depth maps <strong>for</strong> two views during the sweep process. Darker<br />

regions are closer to the camera. The images show depth maps obtained after processing<br />

25%, 50% <strong>and</strong> 100% of the reconstructed volume.



5.3.4 Immediate Visualization<br />

Immediate visual feedback is necessary to evaluate the quality of the reconstructed model<br />

rapidly. Reading back the voxel model from graphics memory into main memory to<br />

generate a surface representation is expensive <strong>and</strong> time-consuming, there<strong>for</strong>e direct volume<br />

rendering methods [Engel <strong>and</strong> Ertl, 2002] are more appropriate. The individual slices<br />

obtained by voxel coloring can be copied into a 3D texture <strong>and</strong> visualized immediately.<br />

Alternatively, the depth images generated <strong>for</strong> the input views can be displayed as displacement<br />

map [Kautz <strong>and</strong> Seidel, 2001], which allows the height-field stored in a texture<br />

to be rendered from novel views <strong>for</strong> visual inspection.<br />

5.4 Extensions to Multi Sweep Space Carving<br />

The procedure described in Section 5.3 is limited to cameras fulfilling the ordinal visibility<br />

constraints. In order to obtain reconstructions <strong>for</strong> more general camera setups, the plane<br />

sweep procedure is repeated several times <strong>for</strong> different sweep directions. Only a compatible<br />

set of cameras is used in each iteration. The difference to the single sweep approach lies<br />

in the amount of knowledge from the prior sweeps used in the current sweep. We have<br />

tested three alternatives:<br />

Independent Sweeps All sweeps are per<strong>for</strong>med independently <strong>and</strong> no prior in<strong>for</strong>mation<br />

is used in the current sweep. The reconstructed volumetric model is the intersection of<br />

the models generated by the independent sweeps. The intersection of the obtained voxel<br />

models is per<strong>for</strong>med by the main CPU. This approach has no restriction on the resolution<br />

of the voxel space, but the frequent transfer of voxel data from graphics memory imposes a<br />

severe per<strong>for</strong>mance penalty. In our experiments we observed significantly longer running<br />

times, when voxel data is read back into main memory. Copying image data from the<br />

frame buffer or texture memory into main memory is a rather slow operation (in contrast<br />

to the reverse direction). This performance penalty depends on the resolution, and results in more than doubled execution time, e.g. at 256³ scene resolution.

Complete Prior Knowledge The opacity value of the voxels generated in the previous<br />

sweep is stored in a 3D texture, which is used in the subsequent sweep to determine already<br />

carved voxels. The need <strong>for</strong> a 3D texture residing on graphics memory limits the maximum<br />

resolution of the voxel space. On consumer level graphics hardware the resolution of the voxel space is typically bounded by 256³. Two 3D textures are required simultaneously;

one texture represents the previous model <strong>and</strong> the other one serves as destination <strong>for</strong> the<br />

model generated in the current sweep. Additionally, the continuous access of a 3D texture<br />

lowers the runtime per<strong>for</strong>mance of the implementation. A significant advantage of this<br />

approach is the opportunity to visualize the generated model immediately using direct<br />

volume rendering methods.



Partial Prior Knowledge In order to avoid the expensive 3D texture representing<br />

complete prior knowledge, a height field can be used as a trade-off between the <strong>for</strong>mer two<br />

alternatives. In the following we assume orthogonal sweep directions along the major axis<br />

of the voxel space. In addition to the depth maps <strong>for</strong> the input views, the preceding sweep<br />

maintains a depth map in the sweep direction. This height-field is used to inhibit already<br />

carved voxels from being classified as opaque in the current sweep. This can be achieved<br />

by comparing the appropriate component of the voxel position with the value stored in<br />

the height field (see Figure 5.4).<br />

Figure 5.4: Plane sweep with partial knowledge from the preceding sweeps. Carved voxels remain unfilled by using a depth image. The shaded region is known to be empty from the previous sweep, therefore filling voxels inside this region is prohibited. (The sketch shows the current sweep direction with depth indices 1, 2, 3, …, 8, orthogonal to the previous sweep direction.)

The final model is again the intersection of the volumetric models generated by the<br />

sweeps, since the incoming knowledge <strong>for</strong> each sweep is only a partial model. In order to<br />

avoid the expensive transfer of data from graphics memory to per<strong>for</strong>m this intersection<br />

in software, we display the result of the final sweep to the user. Additionally, we use the<br />

height-fields of all prior sweeps to approximate the volumetric model.<br />

In this approach the available graphics memory does not limit the voxel space resolution,<br />

but the depth of the color channel is a restricting factor, if high precision depth<br />

buffers are not available.
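A minimal sketch of the height-field test used with partial prior knowledge is given below, assuming orthogonal sweeps along the major axes as in the text; the struct layout and function name are illustrative only.

#include <vector>

// Height-field recorded along the previous sweep direction: for every ray it stores
// the first depth index at which an opaque voxel was found (or a sentinel beyond the
// volume if the whole ray was carved).
struct HeightField {
    int width, height;
    std::vector<int> first_opaque;   // size: width * height
};

// A voxel may only be filled in the current sweep if the previous sweep did not
// already carve it, i.e. if it does not lie in front of the recorded surface.
// (u, v, d) are the voxel coordinates expressed in the previous sweep's frame,
// with d along the previous sweep direction.
bool allowed_by_previous_sweep(const HeightField& hf, int u, int v, int d)
{
    return d >= hf.first_opaque[v * hf.width + u];
}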



5.5 Experimental Results<br />

5.5.1 Per<strong>for</strong>mance Results<br />

We have implemented voxel coloring <strong>and</strong> space carving as described in Sections 5.3<br />

<strong>and</strong> 5.4. Our implementation is based on fragment shader features as exposed by the<br />

ATI fragment shader OpenGL extension. Hence it is possible to per<strong>for</strong>m hardware<br />

accelerated voxel coloring <strong>and</strong> space carving on low-end or mobile graphics hardware as<br />

well.<br />

At first we give per<strong>for</strong>mance results obtained by our implementation. The benchmarking<br />

system is equipped with an AMD Athlon XP2000 as CPU <strong>and</strong> an ATI Radeon 9700<br />

Pro as graphics hardware. The per<strong>for</strong>mance plots are created <strong>for</strong> the synthetic “Bowl”<br />

dataset (see Figure 5.7). 36 views of the model were captured using a virtual turntable<br />

software. Each sweep uses 9 views <strong>for</strong> reconstruction. Figure 5.5(a) presents timing results<br />

<strong>for</strong> the voxel coloring implementation at different resolutions. The required time <strong>for</strong><br />

voxel coloring is approximately linear in the depth resolution (i.e. the number of generated slices). Surprisingly, the time needed for resolutions from 32 × 32 × d up to 128 × 128 × d is close to the time required for 256 × 256 × d. The runtime for lower resolutions is dominated

by the expensive pixel buffer switches (which is linear in the number of slices, but<br />

independent of the resolution). At higher resolutions the fill rate of the graphics hardware<br />

becomes more dominant. For 256 × 256 × d scene resolutions our implementation of the<br />

voxel coloring approach generates 3D models at interactive rates.<br />

Figure 5.5(b) compares the observed timings <strong>for</strong> the proposed space carving methods.<br />

The final 3D model was generated using four sweeps in order to utilize all 36 captured<br />

views. The timings <strong>for</strong> single sweep voxel coloring are displayed <strong>for</strong> comparison, too. For<br />

resolutions up to 128 3 space carving is slightly more expensive than per<strong>for</strong>ming four voxel<br />

coloring sweeps, since some time is required to merge the individual sweeps. At 256 3<br />

resolution, space carving maintaining the full voxel model in graphics memory runs out of<br />

memory <strong>and</strong> requires substantially more time.<br />

5.5.2 Visual Results<br />

In this section we illustrate the visual quality of the obtained reconstructions. At first we<br />

demonstrate our implementation on a synthetic dataset obtained by off-screen rendering<br />

<strong>and</strong> capturing a 3D dinosaur model. The resolution of the input images is 256 × 256.<br />

Several input images are shown in Figure 5.6(a)–(c). The volumetric texture directly<br />

obtained by the space carving procedure is shown in Figure 5.6(d). In order to reduce the<br />

size of the 3D texture, only luminance values instead of colors are stored in the texture.<br />

Figure 5.6(e) <strong>and</strong> (f) are snapshots showing the 3D model as a point cloud within a VRML<br />

viewer.<br />

Another synthetic dataset, the “Bowl” dataset, is shown in Figure 5.7. The images<br />

were obtained under the same conditions as the Dino dataset. In Figure 5.7(d) complete prior knowledge stored in a 3D texture is used, whereas in Figure 5.7(e) the already carved model is approximated by height-fields. The latter model contains more outliers and noise, but the memory requirement is substantially reduced.

The real dataset consists of images showing a historic statue (Figure 5.8(a)–(c)). In Figure 5.8(d) the surface voxels of the reconstructed model generated from 7 input views are shown as a point cloud. The number of voxels is 1024 × 1024 × 250 and the pure voxel coloring took about 4.8s. Reading the voxels back into main memory and generating the VRML file requires an additional 40s. A lower resolution version (256³) of the same dataset, generated in 0.77s, is shown in Figure 5.9.

5.6 Discussion<br />

This chapter described a hardware accelerated approach <strong>for</strong> voxel coloring <strong>and</strong> space carving<br />

scene reconstruction methods. Voxel coloring can be per<strong>for</strong>med at interactive rates<br />

<strong>for</strong> medium scene resolutions, <strong>and</strong> volumetric models can be obtained with space carving<br />

very quickly (in the order of seconds). Despite the simple consistency measure used in<br />

our implementation, the obtained 3D models are suitable <strong>for</strong> visual feedback to the user<br />

to estimate the parameters used <strong>for</strong> the final high-quality, software-based reconstruction.<br />

With new features provided by modern graphics processors, more sophisticated consistency<br />

measures can be implemented. In particular, a histogram-based consistency measure<br />

[Stevens et al., 2002] is a potential c<strong>and</strong>idate <strong>for</strong> efficient implementation in graphics<br />

hardware.<br />

At low resolution the performance of our implementation is dominated by the multi-pass rendering overhead. Consequently, reducing the number of passes, especially at coarse resolutions, may yield near real-time generation of volumetric models. Such improvements need further investigation.


Figure 5.5: Timing results for the Bowl dataset. Each sweep used 9 views to calculate the consistency of voxels. (a) shows timing results for voxel coloring using a single plane sweep at different resolutions. (b) illustrates timing results for space carving using multiple sweeps at various voxel space resolutions. With the exception of voxel coloring, which is depicted for comparison, four sweeps are performed to obtain the final model. Space carving with complete prior knowledge requires almost 33s at 256³ resolution; this behavior is caused by a shortage of graphics memory. (Plot axes: time in millisecs vs. depth resolution in (a), with curves for 256×256×d, 512×512×d and 1024×1024×d; time in millisecs vs. voxel space resolution from 32³ to 256³ in (b), with curves for partial knowledge, independent sweeps, complete knowledge and voxel coloring.)


Figure 5.6: (a)–(c) Three input views (of 36) from the synthetic Dino dataset. (d) The obtained volumetric model visualized with a 3D texture. We use only luminance and alpha channels for the texture to reduce the memory footprint of the 3D texture. (e) and (f) show the 3D model rendered as a point cloud. In our current implementation, colors for surface voxels are assigned in the final sweep, hence surface voxels not seen in the final sweep have a default color.


Figure 5.7: (a)–(c) Three input views (of 36) from the synthetic Bowl dataset. (d) The obtained volumetric model visualized with a 3D texture. The model was generated in 1.4s. (e) is generated by approximating the result of previous sweeps with height-fields instead of a full 3D texture.


Figure 5.8: (a)–(c) Three input views from an image sequence showing a statue. (d) shows a high resolution reconstruction generated by carving 250 million initial voxels. Pure voxel coloring done in graphics hardware required less than 5s. Only surface voxels are shown as a point cloud.


Figure 5.9: (a) A 3D reconstruction generated by single sweep voxel coloring using a space of 256 × 256 × 250 voxels. 7 input views are used for the reconstruction. Voxel coloring and VRML generation required about 3s. The displayed geometry consists of surface voxels rendered as points, hence several holes are apparent. (b) A depth image for the same dataset generated in 0.77s.


Chapter 6

PDE-based Depth Estimation on the GPU

Contents

6.1 Introduction
6.2 Variational Techniques for Multi-View Depth Estimation
6.3 GPU-based Implementation
6.4 Results
6.5 Discussion

6.1 Introduction<br />

This chapter describes a variational approach to multi-view depth estimation, which is accelerated by 3D graphics hardware. Variational methods for multi-view depth estimation have their foundations in variational calculus and numerical analysis. The result of these procedures is a depth image which minimizes an energy functional incorporating image similarity and smoothness regularization terms. In contrast to many window-based dense matching approaches favoring fronto-parallel surfaces, the utilized variational depth estimation method is based on per-pixel image similarities and works well for slanted surfaces. Depth values interact with the surrounding depth hypotheses through the regularization term.

Energy-based approaches to dense correspondence estimation incorporate image similarity and smoothness constraints into the objective function and search for an appropriate minimum. Consequently, these methods allow the propagation of depth values into textureless regions, where no robust correspondences are available. Variational techniques express the discrete energy function in continuous terms and solve the corresponding Euler-Lagrange partial differential equation numerically.




In contrast to energy-based methods for image restoration and segmentation, variational techniques for multi-view depth require successive deformation (warping) of the sensor images according to the current depth map hypothesis. In particular, this step can be significantly accelerated by the texture units of graphics hardware, which offer the necessary image interpolation virtually for free. Furthermore, the numerical procedures to solve variational problems are typically algorithms with high parallelism and can be transferred to current generation graphics hardware for optimal performance.

This chapter outlines our implementation of the hardware-accelerated approach to variational depth estimation and presents the obtained results. We demonstrate that a substantial performance gain is achieved by our approach. Additionally, difficult settings for variational stereo methods resulting in incorrect 3D models are discussed and possible solutions proposed. Notice that very fast numerical solvers allow the convenient investigation of potentially more complex and robust image similarity measures and other extensions to the basic model of variational depth estimation.

6.2 Variational Techniques for Multi-View Depth Estimation

6.2.1 Basic Model

This section describes a variational approach to depth estimation following mostly [Strecha and Van Gool, 2002, Strecha et al., 2003]. In order to allow a one-dimensional search for a depth value at every pixel, the camera calibration matrices and the external orientations are assumed to be known. In order to utilize a true multi-view setup, pixels in one image are transferred by the epipolar geometry (as described below), and an image rectification procedure is not required. In the set of employed images one image Ii represents the key image, for which the depth map is generated. The other images, Ij, j ≠ i, are sensor images. The camera imaging Ii is assumed to be in canonical position (Pi = Ki [I|0]), the external orientation for Ij is [Rj|tj] and the camera calibration matrix is Kj. The depth map is calculated with respect to Ii, and depth values assigned to pixels in Ii transfer to the other images as follows: the corresponding pixel qij for a pixel pi in Ii with associated depth di is given by
$$ q_{ij}(p_i) = H_{ij}\, p_i + T_j / d_i, $$
where $H_{ij} = K_j R_j^t K_i^{-1}$ and $T_j = K_j t_j$. Note that pi and qij refer to homogeneous pixel positions and qij must be normalized by its third component.
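For illustration only, the following minimal C++ sketch evaluates this pixel transfer for a single pixel on the CPU. The matrix/vector types, the function name, and the toy values are chosen here for convenience and are not part of the thesis implementation.

```cpp
#include <array>
#include <cstdio>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// q_ij = H_ij * p_i + T_j / d_i, followed by normalization by the third component.
Vec3 transferPixel(const Mat3& H, const Vec3& T, const Vec3& p, double depth)
{
    Vec3 q{};
    for (int r = 0; r < 3; ++r)
        q[r] = H[r][0] * p[0] + H[r][1] * p[1] + H[r][2] * p[2] + T[r] / depth;
    // Convert the homogeneous result to pixel coordinates.
    return Vec3{q[0] / q[2], q[1] / q[2], 1.0};
}

int main()
{
    // Identity homography and a purely horizontal translation component (toy values).
    Mat3 H = {{ {{1, 0, 0}}, {{0, 1, 0}}, {{0, 0, 1}} }};
    Vec3 T = {100.0, 0.0, 0.0};
    Vec3 p = {320.0, 240.0, 1.0};   // homogeneous pixel position in the key image
    Vec3 q = transferPixel(H, T, p, 2.5);
    std::printf("q = (%.2f, %.2f)\n", q[0], q[1]);
    return 0;
}
```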

The primary goal of depth estimation is the assignment of depth values to every pixel of Ii, such that a cost function incorporating image similarity terms and smoothness terms is minimized. In particular, the following objective function is often used in variational stereo methods:



$$ E(d_i) = \sum_{p_i} \Big( \sum_j \big( I_j(q_{ij}(d_i(p_i))) - I_i(p_i) \big)^2 + \lambda\, \|\nabla d_i(p_i)\|^2 \Big) \;\rightarrow\; \min \qquad (6.1) $$

Since the depth map di is defined on a grid, ∇di refers to a suitable finite difference scheme to calculate the gradient. We omit the explicit dependence of di on the pixel pi and abbreviate Ij(qij(di(pi))) as Ij(di). Minimizing Eq. 6.1 using discrete (non-continuous) methods can be achieved using e.g. graph cut methods [Boykov et al., 2001, Kolmogorov and Zabih, 2001, Kolmogorov and Zabih, 2002, Kolmogorov and Zabih, 2004]. Alternatively, Eq. 6.1 can be seen as a discrete approximation to a continuous minimization problem, and techniques from variational calculus can be applied. The continuous formulation of Eq. 6.1 is

$$ S(d_i) = \int_p \Big( \sum_j \big( I_j(d_i) - I_i \big)^2 + \lambda\, \|\nabla d_i\|^2 \Big)\, dp \;\rightarrow\; \min \qquad (6.2) $$

The Euler-Lagrange equation states a necessary condition for the function di to be a stationary value with respect to S [Lanczos, 1986]:
$$ \frac{\delta S}{\delta d_i} = \sum_j \frac{\partial I_j}{\partial d_i} \big( I_j(d_i) - I_i \big) - \lambda \nabla^2 d_i \overset{!}{=} 0 \qquad (6.3) $$

Note that this equation holds for every pixel p in Ii. The spatial derivative ∂Ij/∂di is the intensity change along the epipolar line in image Ij. By discretizing Eq. 6.3 one can solve the associated partial differential equation using a numerical scheme on the grid of pixels. We describe a particular approach, which is very suitable for a GPU-based implementation.

At first, the image intensities Ij(di) are locally linearized around $d_i^0$ using the first order Taylor expansion:
$$ I_j(d_i) = I_j(d_i^0 + \Delta d_i) \approx I_j(d_i^0) + \frac{\partial I_j(d_i^0)}{\partial d_i}\, \Delta d_i. $$

Applying this expansion to the Euler-Lagrange equation yields
$$ \sum_j \frac{\partial I_j}{\partial d_i} \Big( I_j(d_i^0) + \frac{\partial I_j(d_i^0)}{\partial d_i}\, \Delta d_i - I_i \Big) - \lambda \nabla^2 d_i = 0. \qquad (6.4) $$

In combination with a (linear) finite differencing scheme for ∇²di, the equation above results in a huge but sparse linear system to solve for di. This scheme iteratively refines the estimate of the depth map di given its previous estimate.

In order to prevent the scheme from converging to a suboptimal local minimum, a coarse-to-fine approach is mandatory.



Diffusion type | Term in S | Derivative
Homogeneous diffusion | $\nabla^t d\, \nabla d = \|\nabla d\|^2$ | $\nabla^2 d$
Image-driven isotropic diffusion | $\nabla^t d\; g(\|\nabla I\|^2)\, \nabla d$ | $\mathrm{div}(g(\|\nabla I\|^2)\, \nabla d)$
Image-driven anisotropic diffusion | $\nabla^t d\; D(\nabla I)\, \nabla d$ | $\mathrm{div}(D(\nabla I)\, \nabla d)$
Flow-driven isotropic diffusion | $\nabla^t d\; g(\|\nabla d\|^2)\, \nabla d$ | $\mathrm{div}(g(\|\nabla d\|^2)\, \nabla d)$
Flow-driven anisotropic diffusion | $\nabla^t d\; D(\nabla d)\, \nabla d$ | $\mathrm{div}(D(\nabla d)\, \nabla d)$

Table 6.1: Regularization terms induced by diffusion processes

6.2.2 Regularization<br />

Taking the Laplacian of the depth map, ∇²di, to guide the regularization usually gives too smooth results, and the obtained depth maps lack sharp depth discontinuities. Table 6.1 lists several regularization functions based on diffusion processes, mostly in accordance with the taxonomy of Weickert et al. [Weickert et al., 2004]. In this table the function g(s²) is a decreasing scalar function based solely on the magnitude of the gradient, e.g. g(s²) = exp(−Ks²) (for a user specified K). D(∇c) denotes the diffusion tensor

$$ D(\nabla c) = \frac{1}{\|\nabla c\|^2 + 2\nu^2} \left( \begin{pmatrix} \frac{\partial c}{\partial y} \\ -\frac{\partial c}{\partial x} \end{pmatrix} \begin{pmatrix} \frac{\partial c}{\partial y} \\ -\frac{\partial c}{\partial x} \end{pmatrix}^{\!t} + \nu^2 I \right). $$

ν is a small constant to prevent singularities in perfectly homogeneous regions; setting ν to 0.001 is a common choice. Note that D(∇c) is very similar to the structure tensor used to detect image corners. If for example |∂c/∂x| ≫ |∂c/∂y| (a vertical edge in the image), the diffusion is inhibited in the x-direction.
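As a plain illustration of this tensor, the following C++ sketch evaluates D(∇c) for a single gradient value; the function name is ours and the gradient would in practice come from finite differences on the image.

```cpp
#include <array>
#include <cstdio>

// 2x2 matrix stored row-major.
using Mat2 = std::array<double, 4>;

// Diffusion tensor as defined above:
// D(grad c) = ( (c_y, -c_x)(c_y, -c_x)^t + nu^2 I ) / ( |grad c|^2 + 2 nu^2 ).
Mat2 diffusionTensor(double cx, double cy, double nu)
{
    const double px = cy, py = -cx;   // direction perpendicular to the gradient
    const double denom = cx * cx + cy * cy + 2.0 * nu * nu;
    return Mat2{ (px * px + nu * nu) / denom, (px * py) / denom,
                 (py * px) / denom,           (py * py + nu * nu) / denom };
}

int main()
{
    // Strong vertical edge: |dc/dx| >> |dc/dy|, so diffusion across the edge is inhibited.
    Mat2 D = diffusionTensor(10.0, 0.5, 0.001);
    std::printf("D = [%g %g; %g %g]\n", D[0], D[1], D[2], D[3]);
    return 0;
}
```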

Isotropic diffusion inhibits diffusion at discontinuities regardless of the direction of the gradient, whereas anisotropic regularization allows diffusion parallel to edge discontinuities. Image-driven regularization is based solely on the gradients calculated in the source data (images), and the numerical scheme results in linear expressions. Hence, image-driven diffusion is also called linear diffusion [Weickert and Brox, 2002]. In flow-based regularization the diffusion stops at discontinuities of the current flow or depth map, respectively. Consequently, the equation system derived from finite differencing is a nonlinear system and requires e.g. fixed-point iterations to be solved.

Note that the terminology is not uniform in the literature: flow-driven isotropic diffusion is often referred to as nonlinear anisotropic diffusion [Perona and Malik, 1990]. In addition to homogeneous diffusion we employ an image-driven (linear) anisotropic regularization approach [Nagel and Enkelmann, 1986] for the following reasons:

• The anisotropy of this regularization adapts very well to homogeneous image region boundaries and allows smoothing along image edges.

• The linear nature of the numerical scheme allows efficient sparse matrix solvers to be utilized.



Pure image-driven diffusion as employed for image smoothing and denoising will fail in highly textured regions, but in this case the discriminative image data will result in a correct determination of the final depth map.

6.2.3 Extensions and Variations

In the literature several extensions and enhancements have been proposed to increase the quality and reliability of variational approaches to depth estimation. We summarize a few important concepts in this section.

6.2.3.1 Back-Matching<br />

In order to increase the robustness of the variational depth estimation method and to detect mismatches, a back-matching scheme can be utilized to assign confidence values to the depth values. Confident depth estimates should have a higher influence in the regularization term for adjacent pixels with lower confidence.

In a back-matching setting, every image Ii takes the role of a key image and a dense depth map is computed with several Ij, j ≠ i, as sensor images. If di denotes the depth map computed for Ii and qij(p, di) represents the transfer of a pixel p in image Ii with the associated depth into Ij, then the forward-backward error is
$$ e_{ij} = \| p - q_{ji}(q_{ij}(p, d_i),\, d_j) \|. $$

The confidence cij is now a function of eij, e.g.
$$ c_{ij} = \frac{1}{1 + k\, e_{ij}} \qquad\text{or}\qquad c_{ij} = \exp\!\Big(-\frac{e_{ij}^2}{k}\Big). $$

If cij is close to 1, the depth value is highly confident; values of cij close to zero indicate unreliable depth values. In [Strecha et al., 2003] the following energy functional is proposed:
$$ S(d_i) = \int_p \Big( \sum_j c_{ij}\, \big( I_j(d_i) - I_i \big)^2 + \lambda\, \nabla^t d_i\, D(\nabla C_i)\, \nabla d_i \Big)\, dp \;\rightarrow\; \min, $$
where Ci = maxj(cij) and D(∇Ci) is an anisotropic diffusion operator. The corresponding Euler-Lagrange equation reads
$$ \frac{\delta S}{\delta d_i} = \sum_j c_{ij}\, \frac{\partial I_j}{\partial d_i}\, \big( I_j(d_i) - I_i \big) - \lambda\, \mathrm{div}\big( D(\nabla C_i)\, \nabla d_i \big) \overset{!}{=} 0. $$
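A minimal CPU-side sketch of the back-matching confidence for one pixel is given below. The transfer functions passed in stand for the epipolar transfers qij and qji; the struct, names, and toy mappings in main are our own and only illustrate the forward-backward consistency check.

```cpp
#include <cmath>
#include <cstdio>
#include <functional>

// 2D pixel position.
struct Pixel { double x, y; };

// transferIJ(p, d) maps a pixel of image I_i with depth d into image I_j,
// transferJI is the reverse mapping (both stand in for q_ij / q_ji).
double backMatchingConfidence(const Pixel& p,
                              double depthInI, double depthInJ, double k,
                              const std::function<Pixel(const Pixel&, double)>& transferIJ,
                              const std::function<Pixel(const Pixel&, double)>& transferJI)
{
    // Forward transfer into I_j, then back into I_i using the depth of I_j.
    const Pixel q  = transferIJ(p, depthInI);
    const Pixel pp = transferJI(q, depthInJ);
    // Forward-backward error e_ij and confidence c_ij = 1 / (1 + k e_ij).
    const double e = std::hypot(pp.x - p.x, pp.y - p.y);
    return 1.0 / (1.0 + k * e);
}

int main()
{
    auto shiftRight = [](const Pixel& p, double d) { return Pixel{p.x + 100.0 / d, p.y}; };
    auto shiftLeft  = [](const Pixel& p, double d) { return Pixel{p.x - 100.0 / d, p.y}; };
    // Consistent depths give zero error and confidence 1; inconsistent depths lower it.
    std::printf("consistent:   %.3f\n", backMatchingConfidence({50, 60}, 2.0, 2.0, 0.5, shiftRight, shiftLeft));
    std::printf("inconsistent: %.3f\n", backMatchingConfidence({50, 60}, 2.0, 4.0, 0.5, shiftRight, shiftLeft));
    return 0;
}
```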



6.2.3.2 Local Changes in Illumination<br />

If the scene to be reconstructed does not consist solely of purely Lambertian surfaces with diffuse reflection behavior, illumination changes appear between the images. These local lighting changes can be modeled by an additional intensity scaling function κij, which scales the intensity values of Ii to match the intensities in Ij. The extended energy function is
$$ S(d_i, \kappa_{ij}) = \int_p \Big( \sum_j \big( I_j(d_i) - \kappa_{ij} I_i \big)^2 + \lambda \|\nabla d_i\|^2 + \lambda_2 \|\nabla \kappa_{ij}\|^2 \Big)\, dp \;\rightarrow\; \min, $$

since both di and κij are assumed to change smoothly over the image domain. The corresponding Euler-Lagrange equations for di and κij are now:
$$ \frac{\delta S}{\delta d_i} = \sum_j \frac{\partial I_j}{\partial d_i}\, \big( I_j(d_i) - \kappa_{ij} I_i \big) - \lambda \nabla^2 d_i $$
$$ \frac{\delta S}{\delta \kappa_{ij}} = I_i \big( I_j(d_i) - \kappa_{ij} I_i \big) - \lambda_2 \nabla^2 \kappa_{ij}. $$

Of course, confidence evaluation using back-matching and the estimation of local lighting changes can be combined into one framework.

In case of local illumination changes both the intensity scaling and the depth map will be affected. It is impossible to correctly estimate the depth from the available local information only, since both the depth and the intensity scaling process will adapt to match the pixel intensity values.

6.2.3.3 Other Variations<br />

The energy functional presented in Eq. 6.2 and used in the previous sections can be modified in various ways. At first, the L2 data term (Ij(di) − Ii)² can be replaced by a suitable function Ψ on the intensity differences, e.g.
$$ \Psi\big( I_j(d_i) - I_i \big) = \sqrt{ \big( I_j(d_i) - I_i \big)^2 + \varepsilon^2 } $$
for small ε [Brox et al., 2004, Slesareva et al., 2005]. This choice of Ψ is a smooth, differentiable L1 norm. Additionally, the data term may incorporate intensity gradients and other higher order information as well [Papenberg et al., 2005].

If the L1 image data term is utilized, it is common to employ a total variation regularization [Rudin et al., 1992], ‖∇d‖, instead of the quadratic one. In general, the choice of the regularization significantly affects the results, especially close to discontinuities.



6.3 GPU-based Implementation<br />

This section describes our implementation of the variational depth estimation technique on a GPU. Depth estimation in our application is performed on a set of three images (one key image plus two sensor images). In general, three passes are performed in every iteration of depth refinement:

1. In the first pass the sensor images Ij are warped according to the current depth map hypothesis and the spatial derivatives ∂Ij/∂di are calculated.

2. Expressions used in the regularization term are precomputed, e.g. the Laplacian or the anisotropic flow used in the subsequent semi-implicit solvers.

3. Finally, the depth estimates are updated using a semi-implicit strategy derived from Eq. 6.4.

The next sections describe each pass in more detail. These iterations are embedded in a coarse-to-fine framework using a Gaussian image pyramid to avoid immediate convergence to a local minimum. The depth map obtained after convergence at the coarser level is used as the initial depth map at the next finer level.

6.3.1 Image Warping<br />

The first pass of the GPU-based depth estimation implementation consists of warping the sensor images Ij according to the depth map di. The lookup in image Ij is performed using the epipolar parametrization
$$ q_{ij} = (x, y, 1)^t = H_{ij}\, p_i + T_{ij}/d_i. $$
Consequently, the warped image according to the current depth hypothesis can be obtained by dependent texture lookups. The required spatial derivative ∂Ij(di)/∂di can be efficiently calculated by the chain rule:
$$ \frac{\partial I_j(d_i)}{\partial d_i} = \frac{\partial I_j(q_{ij})}{\partial q_{ij}}\, \frac{\partial q_{ij}}{\partial d_i} = \begin{pmatrix} \partial I_j/\partial x \\ \partial I_j/\partial y \end{pmatrix}^{\!t} \begin{pmatrix} \partial x/\partial d_i \\ \partial y/\partial d_i \end{pmatrix}. $$



If we define $X = (X^{(1)}, X^{(2)}, X^{(3)})^t = H_{ij}\, p_i + T_{ij}/d_i$ and $T_{ij} = (T_{ij}^{(1)}, T_{ij}^{(2)}, T_{ij}^{(3)})^t$, then we have
$$ \frac{\partial x}{\partial d_i} = \frac{T_{ij}^{(1)} X^{(3)} - T_{ij}^{(3)} X^{(1)}}{(X^{(3)})^2\, d_i^2}, \qquad \frac{\partial y}{\partial d_i} = \frac{T_{ij}^{(2)} X^{(3)} - T_{ij}^{(3)} X^{(2)}}{(X^{(3)})^2\, d_i^2}. $$

The advantage of this scheme is that, with precomputed gradient images ∇Ij, the spatial derivative along the epipolar line, ∂Ij(di)/∂di, can be easily calculated, and the computation of X = Hij pi + Tij/di can be shared if Ij(di) and its derivative are calculated in the same fragment program. In our implementation, a texture representing Ij holds the intensity value and the horizontal and vertical gradients in its three channels. Image warping assigns Ij(di) and its derivative to the two channels of the target buffer.

Note that Hij pi need not be calculated for every pixel, but can be linearly interpolated by the GPU rasterizer like any other texture coordinate. On our hardware the performance gain was rather minimal, since the matrix-vector multiplication in the fragment program is mostly hidden by the required texture fetches.

The GPU version of this step performs approximately 100 times faster than a straightforward, but otherwise completely equivalent, software implementation.
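For reference, a CPU-side sketch of this warping step for a single pixel might look as follows. The image struct, the bilinear sampling helper, and all names are our own choices and the derivative expression simply mirrors the formulas above; this is not the fragment program used in the thesis.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Grayscale image with precomputed gradients, stored row-major.
struct GradImage {
    int w, h;
    std::vector<float> I, Ix, Iy;   // intensity and horizontal/vertical gradients
    float sample(const std::vector<float>& c, float x, float y) const {
        // Bilinear lookup with clamping, mimicking the GPU texture unit.
        x = std::fmin(std::fmax(x, 0.0f), float(w - 1));
        y = std::fmin(std::fmax(y, 0.0f), float(h - 1));
        int x0 = int(x), y0 = int(y);
        int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
        float fx = x - x0, fy = y - y0;
        float top = (1 - fx) * c[y0 * w + x0] + fx * c[y0 * w + x1];
        float bot = (1 - fx) * c[y1 * w + x0] + fx * c[y1 * w + x1];
        return (1 - fy) * top + fy * bot;
    }
};

// Warp one pixel of the sensor image I_j and return the warped intensity and
// its derivative along the epipolar line, dI_j(d_i)/dd_i.
void warpWithDerivative(const GradImage& Ij,
                        const double H[3][3], const double T[3],
                        double px, double py, double depth,
                        float& warped, float& dI_dd)
{
    double X[3];
    for (int r = 0; r < 3; ++r)
        X[r] = H[r][0] * px + H[r][1] * py + H[r][2] + T[r] / depth;

    const double x = X[0] / X[2], y = X[1] / X[2];
    // Derivatives of the normalized coordinates with respect to the depth
    // (same expressions as in the text above).
    const double denom = X[2] * X[2] * depth * depth;
    const double dx_dd = (T[0] * X[2] - T[2] * X[0]) / denom;
    const double dy_dd = (T[1] * X[2] - T[2] * X[1]) / denom;

    warped = Ij.sample(Ij.I, float(x), float(y));
    dI_dd  = float(Ij.sample(Ij.Ix, float(x), float(y)) * dx_dd +
                   Ij.sample(Ij.Iy, float(x), float(y)) * dy_dd);
}
```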

6.3.2 Regularization Pass<br />

If Laplacian regularization is employed, a simple fragment program is sufficient to calculate ∇²di. The more interesting case is the utilization of image-based or confidence-based anisotropic diffusion to control the depth map regularization. Both regularization approaches yield linear numerical schemes, since the diffusion weights remain constant for the current level in the image pyramid.

Confidence images are created as follows: after determining the depth maps at the next-coarser resolution, a confidence map cij between views i and j is generated with cij = 1/(1 + k eij), where eij = ‖p − qji(qij(p, di), dj)‖ is the back-matching error. This confidence map remains constant for the current resolution level. The confidence values cij adjacent to a pixel are normalized such that their sum is one. For every pixel this results in a weight vector W with four components. The regularization term is calculated as

$$ \begin{pmatrix} W^{[x-1]} \\ W^{[x+1]} \\ W^{[y-1]} \\ W^{[y+1]} \end{pmatrix}^{\!t} \begin{pmatrix} d_i^{[x-1]} - d_i \\ d_i^{[x+1]} - d_i \\ d_i^{[y-1]} - d_i \\ d_i^{[y+1]} - d_i \end{pmatrix}. $$
This is proportional to the standard Laplacian if W is set to $(1/4, 1/4, 1/4, 1/4)^t$.
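A short CPU sketch of this confidence-weighted regularization term at one interior pixel is shown below; the array layout (four normalized weights per pixel, neighbours in the order left, right, up, down) is our own convention for illustration.

```cpp
#include <vector>

// Depth map d and per-pixel neighbour weights W, both stored row-major.
// W holds the four (already normalized) weights per pixel in the order
// left, right, up, down. Border pixels are not handled here for brevity.
float weightedRegularizer(const std::vector<float>& d,
                          const std::vector<float>& W,
                          int width, int x, int y)
{
    const int i = y * width + x;
    const float di = d[i];
    const float* w = &W[4 * i];
    // Weighted sum of neighbour differences; with w = (1/4, 1/4, 1/4, 1/4)
    // this is proportional to the standard 4-neighbour Laplacian.
    return w[0] * (d[i - 1]     - di) +
           w[1] * (d[i + 1]     - di) +
           w[2] * (d[i - width] - di) +
           w[3] * (d[i + width] - di);
}
```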



6.3.3 Depth Update Equation<br />

The finite difference scheme of equation 6.4 (respectively one of its extensions) is a large system of equations in the unknowns ∆di for every pixel:
$$ \sum_j \frac{\partial I_j}{\partial d_i} \Big( I_j(d_i) + \frac{\partial I_j(d_i)}{\partial d_i}\, \Delta d_i - I_0 \Big) - \lambda \nabla^2 (d_i + \Delta d_i) = 0 \qquad (6.5) $$
Approximating the Laplacian (resp. the employed diffusion term) by a linear operator, the system becomes a sparse one, and the unknowns ∆di are coupled only for adjacent pixels through the regularization term, yielding a sparse system matrix.

Using the standard 4-star scheme to calculate the Laplacian, the matrix of the sparse linear system obtained from the above equation has a special structure containing 5 diagonal bands (Figure 6.1). Two iterative numerical schemes to solve sparse linear systems are currently applicable for the GPU: the Jacobi method and the conjugate gradient method.

6.3.3.1 Jacobi Iterations<br />

In order to solve a linear system Ax = b with diagonally dominant matrix A, the Jacobi method performs the following iteration:
$$ x^{(n+1)} = D^{-1}\big( (D - A)\, x^{(n)} + b \big), $$
where D is the diagonal part of A. Consequently, the new components of x^{(n+1)} depend only on the old values of x^{(n)}. The update procedure for every pixel according to Eq. 6.5 is now
$$ \Delta d_i^{(n+1)} = \frac{ \lambda \Big( \nabla^2 d_i + \tfrac{1}{4} \sum_{p \in N} \Delta d_p^{(n)} \Big) - \sum_j \frac{\partial I_j(d_i)}{\partial d_i} \big( I_j(d_i) - I_0 \big) }{ \lambda + \sum_j \Big( \frac{\partial I_j(d_i)}{\partial d_i} \Big)^2 }, $$
where p ∈ N runs over the four pixels adjacent to the current pixel. After several iterations of this inner loop to obtain a converged $\Delta d_i^{\mathrm{final}}$, the depth map is updated as $d_i^{(k+1)} = d_i^{(k)} + \Delta d_i^{\mathrm{final}}$.
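The following CPU sketch performs one such Jacobi sweep over the depth-update field; the precomputed input arrays and their names are our own packaging of the terms appearing in the update formula, not the GPU data layout of the thesis implementation.

```cpp
#include <vector>

// One Jacobi sweep for the depth update dd (all sensor images already warped).
// lapD     : per pixel, the (scaled) Laplacian of the current depth map
// sumGradI : per pixel, sum_j (dI_j/dd)^2
// sumRes   : per pixel, sum_j (dI_j/dd) * (I_j(d) - I_0)
void jacobiSweep(const std::vector<float>& lapD,
                 const std::vector<float>& sumGradI,
                 const std::vector<float>& sumRes,
                 const std::vector<float>& ddOld,
                 std::vector<float>& ddNew,
                 int width, int height, float lambda)
{
    for (int y = 1; y < height - 1; ++y)
        for (int x = 1; x < width - 1; ++x) {
            const int i = y * width + x;
            // Average of the previous updates at the four neighbours.
            const float nbr = 0.25f * (ddOld[i - 1] + ddOld[i + 1] +
                                       ddOld[i - width] + ddOld[i + width]);
            // Update formula from the text (border pixels skipped for brevity).
            ddNew[i] = (lambda * (lapD[i] + nbr) - sumRes[i]) /
                       (lambda + sumGradI[i]);
        }
}
```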

6.3.3.2 Conjugate Gradient Solver

In addition to the Jacobi method we implemented a conjugate gradient procedure on the GPU to solve the sparse linear system. This implementation is based on the ideas presented by Krüger and Westermann [Krüger and Westermann, 2003].

On the GPU the system matrix with five diagonal bands is stored in two textures: the off-diagonal bands are stored in a four-component texture image, which remains constant. The main diagonal is represented as a single-component render target, since it must be updated after every warping pass. Analogous to the Jacobi method, the result of the conjugate gradient approach is a stabilized depth update ∆di.
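The core operation of such a conjugate gradient solver is the product of the five-band system matrix with a vector. A CPU sketch using the same main-diagonal/off-diagonal split as described above (with the off-diagonal entries in the order left, right, up, down, a layout we choose here for illustration) could look as follows.

```cpp
#include <vector>

// y = A x for a matrix with five diagonal bands: the main diagonal plus the
// four bands coupling each pixel to its left/right/upper/lower neighbour.
// diag holds one entry per pixel, offDiag four entries per pixel.
void bandedMatVec(const std::vector<float>& diag,
                  const std::vector<float>& offDiag,
                  const std::vector<float>& x,
                  std::vector<float>& y,
                  int width, int height)
{
    for (int r = 1; r < height - 1; ++r)
        for (int c = 1; c < width - 1; ++c) {
            const int i = r * width + c;
            const float* a = &offDiag[4 * i];
            y[i] = diag[i] * x[i]
                 + a[0] * x[i - 1] + a[1] * x[i + 1]
                 + a[2] * x[i - width] + a[3] * x[i + width];
        }
}
```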



Figure 6.1: The sparse structure of the linear system obtained from the semi-implicit<br />

approach. Dark pixels indicate non-zero entries.<br />

6.3.4 Coarse-to-Fine Approach<br />

In order to avoid reaching a local minimum immediately, we utilize a coarse-to-fine scheme. We chose a usual image pyramid, which halves the image dimensions at every level. After downsampling the image of the next finer level, the obtained image was additionally smoothed. When going to the next coarser level, the regularization weight λ should be halved as well, but in practice scaling λ by a factor of $\sqrt{1/2}$ gave better results.
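As a small illustration of this per-level weighting, the sketch below derives the λ values for a pyramid; the function name, the level ordering (index 0 = finest), and the example weight are our own assumptions.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Regularization weights for an L-level pyramid (level 0 = finest):
// each coarser level scales lambda by sqrt(1/2), as found to work well in practice.
std::vector<double> pyramidLambdas(double lambdaFinest, int numLevels)
{
    std::vector<double> lambdas(numLevels);
    const double scale = std::sqrt(0.5);
    lambdas[0] = lambdaFinest;
    for (int l = 1; l < numLevels; ++l)
        lambdas[l] = lambdas[l - 1] * scale;
    return lambdas;
}

int main()
{
    for (double l : pyramidLambdas(10.0, 6))
        std::printf("%.3f ", l);   // 10.000 7.071 5.000 3.536 2.500 1.768
    std::printf("\n");
    return 0;
}
```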

6.4 Results<br />

This section presents several depth maps <strong>and</strong> 3D models to illustrate the benefits <strong>and</strong><br />

possible shortcomings of the variational depth estimation method.<br />

6.4.1 Facade Datasets<br />

The first dataset depicts a historical statue embedded in a facade. The resolution of the grayscale source images and the resulting depth map is 512 × 512 pixels. Figure 6.2 illustrates the obtained range map, based on three small-baseline source images, as a colored 3D point set. Figure 6.3 shows the corresponding depth images using the implemented numerical solvers and gives timing information. Six pyramid levels are generated for the coarse-to-fine approach. The Jacobi and the CG solvers execute 50 iterations in the outer loop (image warping) and 3 iterations in the inner loop to calculate the actual depth update. The Jacobi solver runs fastest with 1.15s, whereas the conjugate gradient solver requires significantly more time. The obtained depth maps are almost identical for both approaches.



Figure 6.2: A reconstructed historical statue displayed as a colored point set with a resolution of 512 × 512 points. Three small-baseline images are used to generate the model.

Figure 6.4 shows the consequences of back-matching. Without back-matching a severe mismatch appears near the feet of the statue (Figure 6.4(a)). Back-matching uses a larger sequence of images to mutually verify the depth maps, as described in Section 6.2.3.1. Figure 6.4(b) shows the same close-up view of the feet with a significantly better geometry.

Another result of the variational depth estimation approach is shown in Figure 6.5. The resolution of the depth map for this dataset is 1024 × 640.

6.4.2 Small Statue Dataset<br />

This section addresses the reconstruction of another dataset, which requires additional methods to obtain a suitable model. The object to be reconstructed is a small statue, for which more than 40 images were taken on a circular path around the statue.

Using the source images directly to generate the depth maps is not successful, as can be seen in Figure 6.6. Even including the back-matching approach does not improve the result. The reason for this failure is the very large depth discontinuities between the foreground statue and the background scenery. Consequently, the smoothness and ordering constraints are violated in these images (see Figure 6.6(a–c)).

The first approach to obtain better reconstructions is to perform an image segmentation procedure to separate foreground and background regions. The initial manual segmentation for one image is propagated through the complete sequence, such that only little further manual interaction is necessary [Sormann et al., 2005]. Background pixels are set to a uniform color before applying the depth estimation procedure. Two of the obtained point sets are shown in Figure 6.7.

(a) Jacobi (n=3), 1.15s  (b) CG (n=3), 3.15s

Figure 6.3: The depth maps of the embedded statue reconstructed with both numerical schemes. The two solvers yield almost identical results, with the Jacobi solver being faster.

Alternatively we introduced a more robust image intensity error term in order to handle the changing background and occlusions. The energy function to be optimized includes a truncated intensity difference:
$$ S(d_i) = \int_p \Big( \sum_j \min\big( T,\, (I_j(d_i) - I_i)^2 \big) + \lambda \|\nabla d_i\|^2 \Big)\, dp \;\rightarrow\; \min, \qquad (6.6) $$

with a thresholding parameter T. Instead of replacing the thresholding operator by a differentiable soft-min function, we chose a very different approach: since we have two sensor images, Ij1 and Ij2, zero, one, or both data terms may be saturated, and the corresponding term is then missing from the Euler-Lagrange equation. Consequently, the new depth is taken from the following set of decoupled solutions:

$$ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \sum_j \frac{\partial I_j(d_i^{(k)})}{\partial d_i} \big( I_j(d_i^{(k)}) - I_0 \big) }{ \lambda + \sum_j \Big( \frac{\partial I_j(d_i^{(k)})}{\partial d_i} \Big)^2 } $$
$$ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \frac{\partial I_{j_1}(d_i^{(k)})}{\partial d_i} \big( I_{j_1}(d_i^{(k)}) - I_0 \big) }{ \lambda + \Big( \frac{\partial I_{j_1}(d_i^{(k)})}{\partial d_i} \Big)^2 } $$
$$ \Delta d_i^{(k+1)} = \frac{ \lambda \nabla^2 d_i^{(k)} - \frac{\partial I_{j_2}(d_i^{(k)})}{\partial d_i} \big( I_{j_2}(d_i^{(k)}) - I_0 \big) }{ \lambda + \Big( \frac{\partial I_{j_2}(d_i^{(k)})}{\partial d_i} \Big)^2 } $$
$$ \Delta d_i^{(k+1)} = \nabla^2 d_i^{(k)} $$

Note that the three lower equations are obtained by removing one or both image terms from the first equation. In case of truncation of the intensity error, the derivative of the constant threshold is zero. The depth value with the lowest actual error term is selected as the result for this iteration. In Figure 6.8 the resulting enhanced depth map and 3D model are illustrated. Although the depth image and the reconstructed model are far superior to the original model depicted in Figure 6.6, the obtained statue model still has some flaws, and a more refined approach requires further investigation.

(a) Without back-matching  (b) With back-matching

Figure 6.4: The effect of bidirectional matching on the embedded statue scene.



Figure 6.5: Two views of the colored point set showing the front facade of a church.

6.4.3 Mirabellstatue Dataset<br />

The source images of this dataset display an outdoor statue (see Figure 6.9(a)). Depth map generation is restricted to the statue using silhouette masks to separate the foreground statue object from the background scenery. Three images with 512 × 512 pixel resolution are used to compute the depth maps illustrated in Figure 6.9(b)–(d). The differences between the displayed meshes come from the employed regularization approach. The first two meshes are acquired using homogeneous regularization with different values for the weight λ. The third mesh is obtained utilizing image-driven anisotropic diffusion for a selective regularization in textureless image regions, as discussed in Section 6.2.2.

The mesh shown in Figure 6.9(b) uses a small value for λ, which results in noisy mesh geometry, especially in textureless regions. The mesh displayed in Figure 6.9(c) is obtained by using a larger value for λ and appears clearly smoother, but sharp creases at depth discontinuities are missing. Image-driven anisotropic diffusion yields a generally smooth mesh, but includes sharp edges at depth discontinuities.

6.5 Discussion<br />

Variational approaches to depth estimation provide a mathematically sound tool for generating 3D models from multiple images. These methods work best for images with constant lighting conditions and if only few occlusions and depth discontinuities are present in the imaged scene. Under these requirements high-quality depth maps can be generated at interactive rates.



Nevertheless, there are several issues that must be addressed. At first, scenes with large depth discontinuities and violated ordering constraints must be handled in a more robust manner. The approach presented in Section 6.4.2 is only a first step in this direction, since the results are still not completely satisfying. Incorporating segmentation information to detect piecewise connected objects can be based on color clustering, as partially employed in Section 6.4.2. Alternatively, combining a segmentation procedure based on initial and coarser depth hypotheses with the described variational approach appears to be promising. Variational multi-phase approaches (e.g. [Chan and Vese, 2002, Shen, 2006, Jung et al., 2006]) are potential candidates to generate the combined initial depth and segmentation hypothesis.

Incorporating lighting changes into a variational framework for optical flow and depth estimation can be accomplished using techniques proposed by Hermosillo et al. [Hermosillo et al., 2001, Chefd'Hotel et al., 2001]. Whether such approaches are suitable for 3D modeling at interactive rates is an open question.

Another item which needs to be addressed is the image smoothing used in the coarse-to-fine hierarchy. In a multi-view setup the epipolar lines run arbitrarily through the source images, and the usual Gaussian smoothing possibly moves corresponding features away from the appropriate epipolar line. Consequently, the recovered geometry at a coarser scale is not a smoothed version of the true geometry, but only loosely coupled with the true underlying model. In a rectified stereo setup pure horizontal blurring has the advantage that features are smoothed along the epipolar lines, but not in their orthogonal direction. Extending this approach to a multi-view setting is a topic for future research.



Figure 6.6: The three source images and the resulting unsuccessful reconstruction of the statue.



Figure 6.7: Two of the successfully reconstructed point sets using image segmentation to omit the background scenery.

Figure 6.8: An enhanced depth map and 3D point set obtained using the truncated error model.



(a) One source view  (b) Homogeneous, λ = 3
(c) Homogeneous, λ = 10  (d) Image-driven anisotropic, λ = 10

Figure 6.9: The effect of image-driven anisotropic diffusion. Two meshes generated using homogeneous regularization with different values of λ are shown in (b) and (c). The choice of λ = 3 in (b) yields a noisy result, whereas setting λ = 10 in (c) gives a significantly better geometry. Employing image-driven anisotropic diffusion yields the visually most appealing mesh, with sharp creases but without noise in textureless regions (d).


Chapter 7

Scanline Optimization for Stereo on Graphics Hardware

Contents

7.1 Introduction
7.2 Scanline Optimization on the GPU for 2-Frame Stereo
7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware
7.4 Discussion

7.1 Introduction<br />

In this chapter we propose a GPU-based computational stereo approach using scanline optimization to achieve optimal intra-scanline disparity maps. Since we employ a linear discontinuity cost model, the central part of the procedure is the calculation of the appropriate min-convolution, which is usually implemented as a two-pass method using destructive array updates. We replace these in-place updates by a recursive doubling scheme better suited for stream programming models. Consequently, the entire dense estimation pipeline, from matching cost computation to global optimization, to obtain the disparity resp. depth map is performed by the GPU, and only the control flow is maintained by the CPU.

Since the material of this chapter is rather technical, it is divided into two parts: the first section (Section 7.2) focuses on the details of a GPU-based scanline optimization procedure for the rectified stereo setup employing very simple image matching scores. The second section (Section 7.3) addresses the incorporation of the GPU-based scanline optimization implementation in a multiview setup. The focus of that section lies in particular on the efficient utilization of 'sliding' sums to calculate the zero mean normalized cross-correlation score.




7.2 Scanline Optimization on the GPU for 2-Frame Stereo

This section describes the core of the GPU implementation of scanline optimization. The main idea is the transformation of the main dynamic programming step (which has linear time complexity on sequential processors) into an equivalent procedure suitable for parallel computing (with O(N log N) time complexity). Additionally, several techniques to exploit the parallelism within the fragment processor to its full extent are presented. Not all of these methods are applicable for high-resolution depth maps (see Section 7.3.7 for one approach to overcome this limitation).

7.2.1 Scanline Optimization and Min-Convolution

Scanline optimization [Scharstein and Szeliski, 2002] searches for a globally optimal assignment of disparity values to pixels in the current (horizontal) scanline, i.e. it finds
$$ \arg\min_{d_x} \sum_{x=1}^{W} \big( D(x, d_x) + \lambda\, V(d_x, d_{x-1}) \big), $$
where D(x, d) is the image dissimilarity cost and V(d, d′) is the regularization cost. As in all dynamic programming approaches to stereo, different scanlines are treated independently of the neighboring ones (which may result in vertical streaks visible in the disparity image).

The optimal assignment can be efficiently found using a dynamic programming approach maintaining the minimal accumulated costs ¯C(x, d) up to the current position x:
$$ \bar C(x+1, d) = D(x+1, d) + \min_{d_1} \big( \bar C(x, d_1) + V(d, d_1) \big). $$
In a linear discontinuity cost model we have V(d, d1) = λ|d − d1|, and the calculation of
$$ \min_{d_1} \big( \bar C(x, d_1) + \lambda\, |d - d_1| \big) $$
for every d can be performed in linear time using a forward and a backward pass to compute the lower envelope [Felzenszwalb and Huttenlocher, 2004]. The linear-time procedure to calculate the min-convolution is given in Algorithm 3.

This procedure is not directly suitable for GPU implementation, since, first, it relies on in-place array updates and, second, a linear number of passes is required to update the entire array h.*

* Using the depth test with the same depth buffer as texture source and target buffer would allow a direct implementation, but this approach results in undefined behavior according to the specifications. Such an approach would have additional disadvantages, mainly the reduced ability to utilize the parallelism of the GPU.



Algorithm 3 Procedure to calculate the lower envelope efficiently
Procedure Min-Convolution
Input: ¯C(x, ·); Output: h[ ]
for d = 1 . . . k do
    h[d] ← ¯C(x, d)
end for
{Forward pass}
for d = 2 . . . k do
    h[d] ← min(h[d], h[d − 1] + λ)
end for
{Backward pass}
for d = k − 1 . . . 1 do
    h[d] ← min(h[d], h[d + 1] + λ)
end for

The basic idea to enable a GPU implementation of the min-convolution is to utilize a recursive doubling approach, which is outlined in Algorithm 4. Recursive doubling [Dubois and Rodrigue, 1977] is a common technique in high-performance computing to enable parallelized implementations of sequential algorithms. This technique is frequently used in GPU-based applications to perform stream reduction operations like accumulating all values of a texture image [Hensley et al., 2005].

If we focus on the forward pass in Algorithm 4, the procedure calculates the result of the forward pass for subsequently longer sequences ending in d. Initially, h⁺₀[d] contains the min-convolution of the single element sequence [d, d]. In every outer iteration with index L the handled sequence is extended to [d − 2^L, d] and its length is doubled. Note that h⁺[d] is defined to be ∞ (i.e. a large constant) if d is outside the valid range [1 . . . k]. After all iterations, h⁺[d] contains the correct result of the forward pass, which can easily be shown by induction. The same argument applies to the backward pass, hence this procedure yields the desired result. In addition to the lower envelope h, the disparity values for which the minimum is attained are tracked in the array disp[].

Note that the updates in the loops over d are independent and can be performed as a parallel loop. In GPGPU terminology, the bodies of these loops are computational kernels [Buck et al., 2004]. Additionally, the scanlines of the images are treated independently, therefore the min-convolution can be performed for all scanlines in parallel.

Figure 7.1 gives an illustration of the first few iterations in the forward pass of Algorithm 4. Since the next iteration of the outer loops in the min-convolution algorithm refers only to values generated in the previous iteration, only two arrays must be maintained (instead of a logarithmic number of arrays). The roles of these two arrays are swapped after every iteration; the destination array becomes the new source and vice versa. In GPU terminology, these arrays correspond to render-to-texture targets, and alternating the roles of these textures is referred to as ping-pong rendering.



Algorithm 4 Procedure to calculate the lower envelope using recursive doubling
Procedure Min-Convolution using Recursive Doubling
{Forward pass}
for d = 1 . . . k do
    h⁺₀[d] ← ¯C(x, d)
    disp[d] ← d
end for
for L = 0 . . . ⌈log₂(k − 1)⌉ do
    for d = 1 . . . k do
        d1 ← d − 2^L
        h⁺_L[d] ← min(h⁺_{L−1}[d], h⁺_{L−1}[d1] + λ 2^L)
        disp[d] ← arg min(h⁺_{L−1}[d], h⁺_{L−1}[d1] + λ 2^L)
    end for
end for
{Backward pass}
for d = 1 . . . k do
    h⁻₀[d] ← h⁺_{⌈log₂(k−1)⌉}[d]
end for
for L = 0 . . . ⌈log₂(k − 1)⌉ do
    for d = 1 . . . k do
        d1 ← d + 2^L
        h⁻_L[d] ← min(h⁻_{L−1}[d], h⁻_{L−1}[d1] + λ 2^L)
        disp[d] ← arg min(h⁻_{L−1}[d], h⁻_{L−1}[d1] + λ 2^L)
    end for
end for
Return h⁻_{⌈log₂(k−1)⌉} and disp
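For illustration, the following CPU sketch mimics the recursive doubling min-convolution with two ping-pong arrays; on the GPU the inner loop would be a fragment program over render targets, and the disparity tracking of Algorithm 4 is omitted here for brevity. The function name and structure are our own.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Min-convolution with linear cost lambda*|d - d1| via recursive doubling.
// Entries outside [0, k) are treated as +infinity.
std::vector<float> minConvDoubling(const std::vector<float>& cost, float lambda)
{
    const int k = static_cast<int>(cost.size());
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> src(cost), dst(k);

    auto sweep = [&](int dir) {                        // dir = -1: forward, +1: backward
        for (int step = 1; step < k; step *= 2) {      // step = 2^L
            for (int d = 0; d < k; ++d) {
                const int d1 = d + dir * step;
                const float other = (d1 >= 0 && d1 < k) ? src[d1] + lambda * step : INF;
                dst[d] = std::min(src[d], other);
            }
            std::swap(src, dst);                       // ping-pong between the two arrays
        }
    };
    sweep(-1);                                         // forward pass
    sweep(+1);                                         // backward pass
    return src;
}
```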

[Figure 7.1 shows entries A–E with λ = 1: after the first iteration e.g. B' = min(A + 1, B) and C' = min(B + 1, C); after the second iteration C'' = min(A' + 2, C'), D'' = min(B' + 2, D'), and so on.]

Figure 7.1: Graphical illustration of the forward pass using a recursive doubling approach.

The full linear discontinuity cost model is often not appropriate, and a truncated linear cost model with V(d, d1) = λ min(T, |d − d1|) is preferable. If T is chosen to be a power of two, the truncated cost model can be incorporated without an additional performance penalty into Algorithm 4 by replacing the λ 2^L smoothness cost term in the



min-convolution algorithm by λ min(T, 2^L). For other values of T an additional pass over the ¯C(x, ·) array is required [Felzenszwalb and Huttenlocher, 2004]. For optimal performance we restrict our implementation to the pure linear model resp. to the truncated model with power-of-two thresholds.

7.2.2 Overall Procedure<br />

This section describes the basic procedure for scanline optimization on the GPU, which consists of several steps. The outline of the overall procedure is presented in Algorithm 5. The input consists of two rectified images with resolution W × H. The range of potential disparity values is [dmin, dmax] with k elements.

The procedure traverses vertical scanlines positioned at x from left to right. At first the dissimilarity of the current scanline at x in the left image with the set of vertical scanlines [x + dmin, x + dmax] is calculated, resulting in a texture image with dimensions H and k. The dissimilarity is either a sum of absolute differences aggregated in a rectangular window or the sampling-insensitive pixel dissimilarity score proposed in [Birchfield and Tomasi, 1998].

If the first scanline is processed, the texture storing ¯C is initialized with the dissimilarity score. For all subsequent scanlines the lower envelope of ¯C is computed using Algorithm 4 to obtain min_{d1}(¯C(x − 1, d1) + λ|d − d1|) for every row y and disparity value d. The computation of the lower envelope keeps track of the disparity value where the minimum is attained (we refer to Section 7.2.3.2 for a detailed description of the efficient disparity tracking). These tracked disparities are read back into main memory for the subsequent optimal disparity map extraction. Afterwards, the ¯C array is incremented by the dissimilarity score of the current vertical scanline.

If the final scanline is reached, the total accumulated ¯C is read back in order to determine the optimal disparities for the last column, given by arg min_d ¯C(W, d). With the knowledge of the disparities for the final column, the disparities for previous columns can be assigned by a backtracking procedure.

7.2.3 GPU Implementation Enhancements<br />

The basic method outlined in the last section does not utilize the free parallelism of fragment program operations, which work on four-component vectors simultaneously. Consequently, the performance of the method can be substantially improved if this inherent parallelism is taken into account.

7.2.3.1 Fewer Passes Through Bidirectional Approach<br />

Essentially, W passes of the min-convolution procedure are required to obtain the final ¯C values and the corresponding disparity map. This number can be effectively halved if scanline optimization is applied at two opposing horizontal positions simultaneously, finally meeting in the central position.



Algorithm 5 Outline of the scanline optimization procedure on the GPU
Procedure Scanline optimization on the GPU
for x = 1 . . . W do
    Compute the image dissimilarity for the vertical scanline at x and all possible disparities, resulting in scoreTex
    if x = 1 then
        sumCostTex := scoreTex
    else
        Calculate the lower envelope h of sumCostTex, resulting in lowerEnvTex.
        Read back tracked disparities from lowerEnvTex.
        sumCostTex := lowerEnvTex + scoreTex
    end if
    if x = W then
        Read back the accumulated cost for the final column from sumCostTex.
    end if
end for
Extract final disparity map by backtracking

More formally, let ¯Cfw(x, d) be the accumulated cost starting from x = 1 and ¯Cbw the cost beginning at x = W, both computed simultaneously using parallel fragment operations. If we assume even W, in every iteration the values for ¯Cfw(x, d) and ¯Cbw(W − x + 1, d) are determined. The iterations stop at x_{1/2} := W/2 + 1, and the total cost for optimal paths with disparity d at position x_{1/2} is
$$ \bar C_{\mathrm{fw}}(x_{1/2}, d) + \bar C_{\mathrm{bw}}(x_{1/2}, d) - D(x_{1/2}, d). $$
Hence the initial disparity assigned to x_{1/2} is the disparity attaining the minimum of this sum, and the complete disparity map can be extracted by the backtracking procedure as already outlined. This approach better utilizes the essentially free vector processing capabilities, and this modification reduces the total runtime by approximately 45% for 384 × 288 images.

7.2.3.2 Disparity Tracking and Improved Parallelism

Using a bidirectional approach does not only reduce the number of passes, but the parallelism of the fragment processor is also employed to some extent – two ¯C values are handled in parallel (¯Cfw and ¯Cbw). Since GPUs are designed to operate on vector values with four components, an additional performance gain can be expected if four ¯C values are stored in the color channels for every pixel.

Note that the calculation of the lower envelope for ¯C is not enough, since the disparity values attaining the minimum must be stored as well in order to enable an efficient backtracking phase. If one assumes integral disparity values, integral image dissimilarity scores, and an integral smoothness weight λ, then ¯C and h are integer numbers as well. Hence, the associated disparity can be encoded in the fractional part of h. Furthermore, no additional operations are needed to track the disparities attaining the minimal accumulated costs. Of course, in case of ties in the min-convolution procedure, disparities with smaller encoded fractions are preferred (which is as good as any other strategy).

Encoding the disparity value in the fractional part of floating point numbers limits the image resolution in order to avoid precision loss. If the dissimilarity score is an integer from the interval [0, T], then the total accumulated cost is at most (W/2 + 1) × T, where W is the source image width. If the dissimilarity score is discretized into the range [0, 255], 16 bits of the mantissa are required to encode ¯C for half PAL resolution (W = 384), which leaves enough accuracy to encode the disparities in the fractional part. The sign bit of the floating point representation can additionally be incorporated by centering the range of dissimilarity scores around 0.

Utilizing this compact representation for accumulated cost/disparity pairs allows us to handle two horizontal scanlines in parallel, thereby reducing the effective image height to half for the min-convolution. Figure 7.2 illustrates the parallel processing of two vertical scanlines in the bidirectional approach, and the assignment of the RGBA channels to pixel positions.
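A minimal sketch of this cost/disparity packing is given below; the function names and the normalization of the disparity by its maximum value are our own illustrative choices, not the exact encoding used in the fragment programs.

```cpp
#include <cmath>
#include <cstdio>

// Pack an integral accumulated cost and a tracked disparity d in [0, maxDisp)
// into a single float: the integer part carries the cost, the fraction
// d / maxDisp carries the disparity.
float packCostDisparity(int cost, int disparity, int maxDisp)
{
    return static_cast<float>(cost) + static_cast<float>(disparity) / maxDisp;
}

int unpackDisparity(float packed, int maxDisp)
{
    const float frac = packed - std::floor(packed);     // fractional part
    return static_cast<int>(frac * maxDisp + 0.5f);     // round back to an integer
}

int main()
{
    const int maxDisp = 16;
    // Adding an integral smoothness penalty during min-convolution keeps the
    // fraction (and thus the tracked disparity) intact.
    float h = packCostDisparity(12345, 7, maxDisp);
    std::printf("cost = %d, disparity = %d\n",
                int(std::floor(h)), unpackDisparity(h, maxDisp));
    return 0;
}
```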


Figure 7.2: Parallel processing of vertical scanlines using the bidirectional approach for optimal utilization of the four available color channels. The arrows indicate the progression of the processed scanlines in consecutive passes.

7.2.3.3 Readback of Tracked Disparities<br />

After the lower envelope is computed, the encoded tracked disparities are read back into main memory to be available for the final backtracking procedure. The tracked disparity



values encoded in the fractional part of the lower envelope are extracted directly on the GPU into an 8-bit framebuffer (which is efficient, since fragment programs on NVidia hardware support native instructions to obtain the fractional part of a floating point number). The tracked disparities are then read back as byte channels. We found that this approach is the fastest, since the usually expensive conversion from floating point numbers to integers is performed on the GPU without a performance penalty, and the amount of data to be read back is substantially reduced.

7.2.4 Results

First we give timing results for the CPU and GPU implementations of the scanline optimization software. The CPU version is a straightforward C++ implementation using the min-convolution as described in Algorithm 3. The disparity map is determined for successive scanlines, and code optimization is left to the compiler. The GPU implementation is based on OpenGL using the frame buffer extension and the Cg language.

The timing tests are performed on two hardware platforms: the first platform is a PC with a 3 GHz Pentium 4 CPU (CPUA) and an NVidia GeForce 6800 graphics board (GPUA) running Linux. The C++ source is compiled with gcc 3.4.3 and -O2 optimization. The second system is a PC with an AMD Athlon64 X2 4400+ CPU (CPUB) and a GeForce 7800GT graphics board (GPUB). The employed compiler is gcc 4.0.1, again with -O2 optimization.

Table 7.1 displays the obtained timing results. Tsukuba 1x denotes the original well-known dataset with 384 × 288 image resolution and 15 possible disparity values. Tsukuba 2x and 4x denote the same dataset resized to 768 × 288 and 1536 × 288 pixels, respectively; the possible disparity range then consists of 30 and 60 values, respectively. We use horizontal stretching of the images to simulate sub-pixel disparity estimation.

The Pentagon dataset is another common stereo dataset with 512 × 512 pixels resolution and 16 potential disparity values (Pentagon 1x). Resizing the images to 1024 × 1024 resolution yields the Pentagon 2x dataset (32 disparities). The image similarity function in all datasets is the SAD using a 3 × 1 window calculated on grayscale images. In order to avoid the memory-consuming 3D disparity space image, the image dissimilarity is calculated on demand for the current vertical scanline.

             CPUA    GPUA    CPUB    GPUB
Tsukuba 1x   0.0462  0.1180  0.0373  0.0678
Tsukuba 2x   0.1891  0.2911  0.1387  0.1565
Tsukuba 4x   0.7257  1.0082  0.5655  0.4566
Pentagon 1x  0.1261  0.1877  0.0953  0.1165
Pentagon 2x  0.9458  1.0381  0.7065  0.4930

Table 7.1: Average timing results for various dataset sizes in seconds per frame.


The results in Table 7.1 clearly indicate that the multi-pass GPU method is significantly slower than the CPU version for small image resolutions. For higher resolutions the speed is roughly equal, or the GPU version shows better performance, depending on the hardware. Note that most of the time is actually spent in the scanline optimization procedure itself; only about 15–20% of the frame time is spent on calculating this particularly simple image dissimilarity. Additionally, we observed that the CPU-based backtracking part extracting the optimal disparities has a negligible impact on the total runtime.

On the CPU the required time grows almost linearly with increasing resolution, which is in contrast to the GPU curve. In theory, the 4-times stretched Tsukuba dataset should require 16-fold runtime (fourfold number of disparities and of horizontal pixels). The CPU version largely matches this expectation (15.1- and 15.7-fold runtime), whereas the GPU shows a sublinear behavior (8.5- and 6.7-fold runtime, respectively). At low resolutions the setup times for frame buffers etc. become a more dominant fraction of the total runtime.

In order to provide a visual proof of the correctness of the proposed GPU implementation, the disparity maps for several standard stereo datasets are shown in Figure 7.3 and Figure 7.4. Additionally, the depth maps obtained using subpixel disparity estimation for the Tsukuba images are displayed in Figure 7.3(b) and (c).

Figure 7.3: Disparity images for the Tsukuba dataset for several horizontal resolutions, generated by the GPU-based scanline approach: (a) 1x, (b) 2x, (c) 4x.

7.3 Cross-Correlation based Multiview Scanline Optimization on Graphics Hardware

This section extends and modifies the approach to depth estimation using scanline optimization on the GPU presented in Section 7.2. The value of the formerly presented method is increased by enabling multiple views to be handled. Additionally, the SAD matching cost function can be replaced by the usually more robust cross-correlation similarity score.


Figure 7.4: Disparity images for the Cones (a) and Teddy (b) image pairs from the Middlebury stereo evaluation datasets. These disparity images merely illustrate the correctness of the GPU implementation; they are not intended to indicate superior matching performance.

7.3.1 Input Data and General Setting

The input data for this method consists of n ≥ 2 grayscale source images of dimension w × h with lens distortion already removed. Additionally, the camera intrinsic parameters and the relative poses between the views are known. One source image plays the particular role of a key view, for which the depth map is calculated. The other views are used to evaluate the depth hypotheses and are called sensor images. The depth image assigns one depth value from the range [znear, zfar], with D possible values from that range. In our implementation the potential depth values are taken equally spaced from this interval.

The viewing frustum induced by the key view, limited to the depth range [znear, zfar], comprises a 3D volume which encloses the feasible surface to be reconstructed. Plane-sweep methods and our approach traverse this volume using a sequence of 3D planes and warp the sensor images onto each plane (or rather the corresponding quadrilateral formed by intersection with the view frustum). Plane-sweep methods typically use 3D planes parallel to the key image plane, whereas our method uses planes induced by vertical scanlines in the key image.

In the later sections we describe the implementation of several image dissimilarity functions, which are calculated for a user-specified aggregation (support) window of W × H pixels. The sum of absolute differences (SAD) between two rectangular sets of pixels is defined as
\[
  \mathrm{SAD} = \sum_{i \in W} |X_i - Y_i|,
\]


where i ∈ W denotes the set of pixels in the rectangular support window W. The zero-mean normalized cross correlation is defined as follows:
\[
  \mathrm{NCC} = \frac{\sum_{i \in W} (X_i - \bar{X})(Y_i - \bar{Y})}
                      {\sqrt{\sum_{i \in W} (X_i - \bar{X})^2 \; \sum_{i \in W} (Y_i - \bar{Y})^2}}
              = \frac{\sum_{i \in W} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sigma_X^2 \, \sigma_Y^2}}.
\]
By the shifting property one gets
\[
  \mathrm{NCC} = \frac{\sum_i X_i Y_i - \tfrac{1}{N} \left(\sum_i X_i\right)\left(\sum_i Y_i\right)}{\sqrt{\sigma_X^2 \, \sigma_Y^2}}, \tag{7.1}
\]
with $\sigma_X^2 = \sum_i X_i^2 - (\sum_i X_i)^2 / N$ and $\sigma_Y^2 = \sum_i Y_i^2 - (\sum_i Y_i)^2 / N$. Hence, it is possible to compute the cross correlation solely from several sums aggregated within the support window.
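As a concrete illustration of Equation 7.1, the following C++ sketch computes the NCC of two pixel windows purely from the five aggregated sums (ΣXi, ΣYi, ΣXi², ΣYi², ΣXiYi). It is a plain restatement of the formula above, not the thesis' fragment program.

#include <cmath>
#include <cstdio>
#include <vector>

// Compute the zero-mean NCC from aggregated sums only (Equation 7.1).
double nccFromSums(double sumX, double sumY, double sumXX,
                   double sumYY, double sumXY, int N)
{
    const double varX = sumXX - sumX * sumX / N;   // sigma_X^2 (N-scaled variance)
    const double varY = sumYY - sumY * sumY / N;   // sigma_Y^2
    const double cov  = sumXY - sumX * sumY / N;   // N-scaled covariance
    const double denom = std::sqrt(varX * varY);
    return (denom > 0.0) ? cov / denom : 0.0;      // guard against flat windows
}

int main()
{
    // Two small example windows (e.g. a 3x3 support window, flattened).
    std::vector<double> X = {10, 12, 11, 13, 12, 14, 11, 13, 12};
    std::vector<double> Y = {20, 24, 22, 26, 24, 28, 22, 26, 24}; // Y = 2*X, so NCC = 1
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        sx += X[i]; sy += Y[i];
        sxx += X[i] * X[i]; syy += Y[i] * Y[i]; sxy += X[i] * Y[i];
    }
    std::printf("NCC = %f\n", nccFromSums(sx, sy, sxx, syy, sxy, (int)X.size()));
    return 0;
}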

If multiple sensor images are provided, the total matching cost for a depth hypothesis is the sum of the individual (optionally truncated) matching costs between the key view and each sensor image. Using 8- or 16-bit resolution for the correlation values, this sum can be obtained by utilizing the blending (i.e. in-place accumulation) stage of recent graphics hardware.

7.3.2 Similarity Scores based on Incremental Summation

If one employs a plane-sweep approach combined with a purely local winner-takes-all depth extraction method (see Figure 7.5), spatial aggregation within the support window is easily performed. Warping the sensor images onto the current depth plane and the spatial aggregation can be substantially accelerated by graphics hardware due to its projective texture sampling capabilities (see Chapter 4 and [Yang et al., 2002, Yang and Pollefeys, 2003, Cornelis and Van Gool, 2005]).

On the other hand, if a global depth extraction method is utilized, the matching cost values conceptually comprise a disparity space image (DSI), which stores the matching score for every pixel in the key view and every candidate depth value. Hence, the DSI is a 3D data array with w × h × D elements. When using scanline optimization to find the optimal depth assignments for horizontal scanlines in the key view, the matching costs for every pixel and depth value are accessed only once. Consequently, the matching scores can be calculated on demand for vertical lines in the key view as the algorithm successively updates the $\bar{C}$ array from left to right. Due to this simple observation the memory-consuming construction of the DSI can be avoided. In the following paragraphs we describe this on-the-fly matching cost computation for multiple view configurations in more detail.


Figure 7.5: Plane-sweep approach to multiple view matching (key view and sensor view).

In contrast to plane-sweep approaches, which warp the sensor images onto a plane parallel to the key image plane positioned at a certain depth, we project the sensor images onto a plane induced by a vertical scanline x = const in the key image (Figure 7.6). This plane is formed by all rays $K_0^{-1}(x, y, 1)^\top$ for a fixed x value.

Figure 7.6: Plane sweep from left to right.

If the aggregation (correlation) window size is W × H, then (at least conceptually) W slices around the current x-value must be stored. For image dissimilarity functions which can be computed by appropriate box filters, like the sum of absolute differences (SAD), the sum of squared differences (SSD) and the normalized cross correlation (NCC), the aggregated sums can be maintained incrementally by providing the new incoming slice and the outgoing slice to the updating procedures.

7.3.3 Sensor Image Warping

We assume that the key view has a canonical position, i.e. $P_0 = K_0 (I \,|\, 0)$ with the known camera intrinsic matrix $K_0$. Sensor view i has the projection matrix $P_i = (M_i \,|\, m_i) = K_i (R_i \,|\, t_i)$. Then a 2D point (x, y) in the key view combined with a depth z maps into the sensor images in the following manner:
\[
  q_i \sim z \, A_i \, (x, y, 1)^\top + m_i,
\]
with $A_i = M_i K_0^{-1}$. Here $q_i$ is a homogeneous quantity (a 3-vector). Using projective texture mapping, the correct intensity values from the sensor images can be sampled.
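The mapping above can be written as a few lines of host-side C++. The sketch below is an illustration under the stated assumptions (SensorMapping and warpToSensor are names introduced here, not part of the thesis implementation); it converts a key-view pixel (x, y) with depth hypothesis z into dehomogenized sensor-image coordinates.

#include <cstdio>

// A_i = M_i * K_0^{-1} (3x3, row-major) and the translation part m_i of sensor view i.
struct SensorMapping { double A[3][3]; double m[3]; };

// Map a key-view pixel (x, y) with depth hypothesis z into sensor view i:
//   q_i ~ z * A_i * (x, y, 1)^T + m_i
// and return the dehomogenized 2D sensor-image coordinate (s, t).
void warpToSensor(const SensorMapping& map, double x, double y, double z,
                  double& s, double& t)
{
    double q[3];
    for (int r = 0; r < 3; ++r)
        q[r] = z * (map.A[r][0] * x + map.A[r][1] * y + map.A[r][2]) + map.m[r];
    s = q[0] / q[2];            // perspective division
    t = q[1] / q[2];
}

int main()
{
    // Toy example: identity A_i and a small translation m_i.
    SensorMapping map = { {{1,0,0},{0,1,0},{0,0,1}}, {5.0, 0.0, 0.0} };
    double s, t;
    warpToSensor(map, 100.0, 50.0, 2.0, s, t);
    std::printf("sensor coordinate: (%.2f, %.2f)\n", s, t);   // prints (102.50, 50.00)
    return 0;
}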

Warping the sensor images onto the planar slices as indicated in Figure 7.6 can be performed by rendering an aligned quadrilateral into a buffer of dimensions h × D. In world space the quad is determined by a constant x value and varying y ∈ [1, h] and z ∈ [znear, zfar]. Rasterization of this quadrilateral amounts to sampling the pixels from the sensor images using projective texture mapping. Consequently, the sensor image intensity values for all depth hypotheses of the current vertical scanline can be retrieved easily.

Note that during rendering of this slice additional operations can be performed for higher efficiency. For instance, the corresponding key view pixels (comprising a vertical line at the current x position) can be sampled as well, and a binary operation can be applied to the sampled key image pixel and the sensor image pixel. This feature is utilized as described in the next sections.

Sensor Image Sampling In a plane-sweep approach the rendered quadrilateral corresponding to a depth plane matches the assumed fronto-parallel surface geometry. Consequently, higher quality sensor image sampling using mipmapped trilinear or anisotropic filtering is immediately available. Since our rendered slices do not match the assumed (fronto-parallel) object surface, the texture space to screen space derivatives interpolated by the rasterization hardware from the provided quadrilateral geometry are incorrect. The simplest solution is to revert to basic linear filtering without using derivative information at all. Another solution is to provide derivatives computed in the fragment program to the texture lookup functions, which is possible on newer graphics hardware. If $q_i = (q_i^x, q_i^y, q_i^z)$ is the homogeneous position in the sensor image for a given key image pixel (x, y) and depth z (as described above), then the texture coordinates are $(s, t) = (q_i^x / q_i^z,\; q_i^y / q_i^z)$. Additionally, we have for the texture space derivatives
\[
  \frac{\partial s}{\partial x} = \frac{z \, (A_{11} X_3 - A_{31} X_1)}{(X_3)^2},
\]
with $X = (X_1, X_2, X_3)^\top = z \, A_i \, (x, y, 1)^\top + m_i$, where $A_{kl}$ are the elements of $A_i$. The other derivatives ∂s/∂y, ∂t/∂x and ∂t/∂y are calculated in an analogous manner. Using these derivatives the texture footprint of a fronto-parallel surface can be simulated. The projective texture lookup to sample the sensor images is then replaced by a 2D lookup with supplied texture space derivatives.
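To illustrate the derivative computation, here is a small host-side C++ sketch (not the actual Cg fragment program; the struct and function names are assumptions) that evaluates all four texture-space derivatives for one pixel by applying the quotient rule to (s, t) = (X1/X3, X2/X3).

#include <cstdio>

// Elements of the 3x3 matrix A_i (row-major) and the translation m_i.
struct Mapping { double A[3][3]; double m[3]; };

struct TexDerivs { double ds_dx, ds_dy, dt_dx, dt_dy; };

// Texture-space derivatives of (s, t) = (X1/X3, X2/X3) with
// X = z * A * (x, y, 1)^T + m, derived via the quotient rule.
TexDerivs textureDerivatives(const Mapping& map, double x, double y, double z)
{
    double X[3];
    for (int r = 0; r < 3; ++r)
        X[r] = z * (map.A[r][0] * x + map.A[r][1] * y + map.A[r][2]) + map.m[r];

    const double X3sq = X[2] * X[2];
    TexDerivs d;
    d.ds_dx = z * (map.A[0][0] * X[2] - map.A[2][0] * X[0]) / X3sq;
    d.ds_dy = z * (map.A[0][1] * X[2] - map.A[2][1] * X[0]) / X3sq;
    d.dt_dx = z * (map.A[1][0] * X[2] - map.A[2][0] * X[1]) / X3sq;
    d.dt_dy = z * (map.A[1][1] * X[2] - map.A[2][1] * X[1]) / X3sq;
    return d;
}

int main()
{
    Mapping map = { {{1,0,0},{0,1,0},{0,0,1}}, {0.0, 0.0, 10.0} };
    TexDerivs d = textureDerivatives(map, 4.0, 3.0, 2.0);
    std::printf("ds/dx=%f ds/dy=%f dt/dx=%f dt/dy=%f\n",
                d.ds_dx, d.ds_dy, d.dt_dx, d.dt_dy);
    return 0;
}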

In our evaluated datasets the results using linear and anisotropic texture sampling are effectively indistinguishable due to the small-baseline multiview geometry. If several surface orientations are evaluated to obtain more accurate reconstructions [Akbarzadeh et al., 2006], higher quality sensor image sampling could be beneficial. Enabling fourfold anisotropic texture filtering increased the total runtime by about 5–10% in our experiments.

7.3.4 Slice Management

The scanline optimization procedure stores the epipolar volume slices around the current x position, i.e. the slices corresponding to X ∈ {x − W/2, ..., x + W/2}. When the matching cost computation and the update of $\bar{C}$ for the current x position are finished, the new slice corresponding to x + W/2 + 1 is rendered into a temporary buffer. The matching cost update routines are then invoked with the now obsolete slice at x − W/2 and the newly generated slice at x + W/2 + 1. This allows the cost update functions to perform an incremental update of their stored values. Afterwards, the buffer holding the obsolete slice can be reused as the target slice for x + W/2 + 2 in the next iteration.

Figure 7.7 illustrates the incremental update of the accumulated values. Note that several different accumulation results may be required depending on the employed matching cost function; a minimal sketch of such an incremental update is given after the figure.

Figure 7.7: Spatial aggregation for the correlation window (previous sum, incoming slice, outgoing slice). At first, the pixels are aggregated in the x-direction by incremental summation of multiple slices. The final aggregated value is obtained by vertical summation of these intermediate pixels.
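The following C++ sketch shows the idea of the incremental horizontal aggregation on a single column of sums: the running window sum is updated by adding the incoming slice and subtracting the outgoing one. The buffer layout and function names are illustrative assumptions; the thesis performs this per depth hypothesis in a fragment program.

#include <cstdio>
#include <vector>

// Incrementally maintained horizontal sums over a window of width W.
// sums[y] holds the sum of costs[y][x - W/2 .. x + W/2] for the current x.
void slideWindow(std::vector<double>& sums,
                 const std::vector<double>& outgoingSlice,   // column at x - W/2
                 const std::vector<double>& incomingSlice)   // column at x + W/2 + 1
{
    for (std::size_t y = 0; y < sums.size(); ++y)
        sums[y] += incomingSlice[y] - outgoingSlice[y];
}

int main()
{
    const int h = 4, W = 3;
    // Toy per-pixel cost image (h rows, 6 columns).
    double costs[4][6] = { {1,2,3,4,5,6}, {1,1,1,1,1,1},
                           {2,0,2,0,2,0}, {3,3,3,3,3,3} };
    // Initialize the sums for the window centered at x = 1 (columns 0..2).
    std::vector<double> sums(h, 0.0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < W; ++x) sums[y] += costs[y][x];

    // Advance the window center from x = 1 to x = 2 (drop column 0, add column 3).
    std::vector<double> outgoing(h), incoming(h);
    for (int y = 0; y < h; ++y) { outgoing[y] = costs[y][0]; incoming[y] = costs[y][3]; }
    slideWindow(sums, outgoing, incoming);

    for (int y = 0; y < h; ++y) std::printf("row %d: sum = %g\n", y, sums[y]);
    return 0;
}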

7.3.5 SAD Calculation

If the SAD is chosen as image dissimilarity cost, the incremental update is very simple: when rendering the 3D quadrilateral to sample the sensor images, the absolute differences between the sensor image and the key image pixels are calculated on the fly. The procedure to calculate the SAD matching cost maintains only the horizontal sums of absolute differences for j ∈ {x − W/2, ..., x + W/2}. This is easily achieved, since the update procedure takes the obsolete and the newly generated slice as input. The actual matching score is then obtained by vertical aggregation of H pixels.

7.3.6 Normalized Cross Correlation

The basic method to maintain the sums for the NCC calculation is essentially similar to the SAD version. In this case, three horizontal sums need to be maintained: $\sum_i Y(i, y)$, $\sum_i Y(i, y)^2$, and $\sum_i X(i, y)\,Y(i, y)$, where $X(\cdot)$ denotes key image pixels and $Y(\cdot)$ refers to sampled sensor image pixels. Epipolar volume slice extraction calculates $Y(i, y)$ and the product $X(i, y)\,Y(i, y)$ and stores these values in two of the color channels.

The standard deviation $\sigma_X$ with respect to the aggregation window for every pixel in the key image and the box filtering result $\sum_{i \in W} X_i$ can be precomputed and are immediately available during the iterations at no additional cost.

The calculation of the final correlation score involves vertical aggregation of $\sum_i Y(i, y)$ and $\sum_i X(i, y)\,Y(i, y)$ to obtain the sums over the rectangular window W, i.e. $\sum_{i \in W} Y_i$ and $\sum_{i \in W} X_i Y_i$. The squared sum $\sum_{i \in W} Y_i^2$ can be generated simultaneously while aggregating $\sum_{i \in W} Y_i$. A final fragment program calculates the NCC using Equation 7.1 from these intermediate values.

Note that this approach requires additional buffers to store the appropriate horizontal sums for each sensor image.

In practice we use the square root of the NCC as the employed matching cost for the following reasons. First, discretizing the NCC directly into e.g. 255 different values induces inaccuracies especially for small matching costs, whereas the graph of √NCC has a more linear shape, hence a uniform discretization is feasible. Secondly, the NCC behaves qualitatively like a squared difference between normalized intensities, since
\[
  \sum_{i \in W} \left( \frac{X_i - \bar{X}}{\sigma_X} - \frac{Y_i - \bar{Y}}{\sigma_Y} \right)^2 = 2 - 2\,\mathrm{NCC}(X, Y).
\]
Hence we consider it reasonable to adapt the matching cost to the linear regularization cost model by taking the square root.

7.3.7 Depth Extraction by Scanline Optimization

The matching costs for the currently active vertical scanline are used to update the accumulated cost array $\bar{C}$. In order to have a pure GPU implementation, this step is performed by graphics hardware as described in Section 7.2. Alternatively, reading back the matching scores and CPU-based depth extraction by dynamic programming is possible as well [Wang et al., 2006].

In Section 7.2.3 the vector processing capability of the fragment processor (operating on 4-component vectors simultaneously without additional cost) is utilized by a bidirectional approach: the accumulated costs $\bar{C}$ are calculated in parallel starting from x = 1 in the forward direction and from x = w backwards, meeting in the central position. Backtracking the optimal depth values is subsequently performed towards the left and right borders starting from the central pixel. This approach halves the number of iterations in the multipass method and doubles the employed parallelism in the fragment programs. Additionally, two vertically adjacent pixels are treated within the same fragment, requiring a compact encoding of $\bar{C}$ and the corresponding depth value in one floating point number. We apply the first, bidirectional scanning technique to improve the parallelism in this work as well. This implies that matching costs are computed simultaneously for the vertical scanlines at x1 = x and x2 = w − x. The intermediate values and correlation scores for x1 and x2 are stored in the red and green channel and in the blue and alpha channel, respectively.

We do not utilize the second method, since it limits the image and depth resolution to ensure accurate results. Nevertheless, we substantially improved the performance of the GPU-based scanline optimization method using the following approach: we restrict the precision of $\bar{C}$ stored in GPU memory to 16-bit float values (fp16), which allow accurate representation of integer values in the range [−2047, 2047]. Using fp16 values instead of the full IEEE single precision floating point range halves the memory bandwidth required by the GPU-based scanline optimization method. Since this procedure is bandwidth limited (recall Algorithm 4), the performance of this step is approximately doubled.

In order to maintain the accuracy of the generated depth maps, we assume that the matching cost is an integral value from the range [0, 255] and that λ is integral as well. Hence $\bar{C}$ is an integral quantity, too. In order to avoid overflows of $\bar{C}$, we perform frequent renormalization of $\bar{C}$ using the following update:
\[
  \bar{C}(x, d) \leftarrow \bar{C}(x, d) - \min_{d_1} \bar{C}(x, d_1) - 2047.
\]
We subtract 2047 to exploit the sign bit of the fp16 representation as well. Using $\bar{C}(x + n, d) - \bar{C}(x, d) \le 255\,n$ and $\bar{C}(x, d) - \min_{d_1} \bar{C}(x, d_1) \le \lambda D$, we can calculate the required frequency of updates from
\[
  \bar{C}(x + n, d) - \min_{d_1} \bar{C}(x, d_1) \le \lambda D + 255\,n.
\]
For the fp16 representation we require the right hand side to be at most 4094 (i.e. 2 × 2047), hence
\[
  n \le (4094 - \lambda D) / 255.
\]
This means that n vertical scanlines can be updated without renormalization. For D = 200 and λ = 2 we get n = 14. For the experiments we fixed n = 16 without visible degradation of the obtained depth maps.
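The bound on the renormalization interval can be checked with a few lines of C++; the sketch below simply evaluates n = ⌊(4094 − λD)/255⌋ for a given depth count and smoothness weight (the function name is an illustrative assumption).

#include <cstdio>

// Number of vertical scanlines that can be processed between two renormalizations
// of the accumulated costs stored as fp16 integers, assuming matching costs in
// [0, 255] and an integral smoothness weight lambda.
int renormalizationInterval(int numDepths, int lambda)
{
    const int fp16Range = 2 * 2047;              // usable integer range including the sign bit
    const int slack = fp16Range - lambda * numDepths;
    return (slack > 0) ? slack / 255 : 0;        // 0 means renormalize every scanline
}

int main()
{
    std::printf("D = 200, lambda = 2  ->  n = %d\n", renormalizationInterval(200, 2)); // prints 14
    std::printf("D = 250, lambda = 4  ->  n = %d\n", renormalizationInterval(250, 4));
    return 0;
}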

7.3.8 Memory Requirements

The parallel computing pattern of our approach, treating whole vertical scanlines at once, requires saving the full data needed for the final backtracking procedure. After updating $\bar{C}$, this data is read back from GPU memory into main memory. If the depth range contains fewer than 256 entries, the required memory is w × h × D bytes, which is e.g. less than 190 MB for datasets with 768 × 1024 × 250 resolution.

7.3.9 Results

The timing results reported in this section are obtained on a Linux PC equipped with a Pentium 4 3 GHz main processor and an NVidia GeForce 6800 graphics card with 12 pixel pipelines.

The first dataset, depicted in Figure 7.8, consists of a virtual turntable sequence displaying a simple building model. The synthetically rendered images are resized to 512 × 512 pixels, which is also the resolution of the obtained depth images. Since a turntable is emulated, the scene objects are rotated but the light sources remain fixed; hence the surface shading changes substantially between the views. Consequently, the depth maps calculated with the SAD matching cost function, shown in Figure 7.9(a) and (b), have many significant defects. All these depth maps are computed for a depth range containing 200 equally spaced values. Figure 7.9(c) displays the depth image obtained by a plane-sweep approach using a winner-takes-all depth extraction method (Chapter 4); mismatches in textureless regions are still visible. Finally, Figure 7.9(d) is the result of the proposed NCC + scanline optimization implementation, where the scanline optimization procedure is performed on the GPU as well. In all cases the correlation window is set to 9 × 9 pixels. As an alternative to the pure GPU method, we implemented a mixed CPU/GPU approach: while the GPU calculates the matching cost for the next vertical scanline, the CPU updates $\bar{C}$ for the current vertical scanline in parallel (using a straightforward C++ implementation). The runtime of this mixed approach is almost identical to the GPU method for this dataset.

Figure 7.8: The three input views of the synthetic dataset: (a) left view, (b) center view, (c) right view.

Figure 7.9: The obtained depth maps and timing results for the synthetic dataset: (a) WTA, SAD: 0.82s; (b) SO, SAD: 5.1s; (c) WTA, NCC: 2.86s; (d) SO, NCC: 6.21s. WTA denotes a GPU plane-sweep approach with winner-takes-all depth extraction (Chapter 4). SO designates the scanline optimization implementation proposed in this work.

Table 7.2 displays the runtimes of our implementation at different resolutions. We evaluated pure GPU approaches (GPU-fp32 and GPU-fp16) and mixed implementations utilizing the CPU for the scanline optimization part. GPU-fp32 denotes the pure GPU implementation without the renormalization every 16 scanlines; hence 32-bit floating point values are used to store the accumulated costs $\bar{C}$. GPU-fp16 indicates the pure GPU algorithm using 16-bit values for $\bar{C}$ with frequent renormalization. We give timing results for two mixed CPU/GPU approaches as well: the first is a synchronous approach, where the matching cost calculation on the GPU and the dynamic programming on the CPU are performed sequentially (4th column). These timings allow a direct comparison of the scanline optimization part with the corresponding runtimes on the GPU. The asynchronous version of the mixed approach calculates the matching cost for the next vertical scanline on the GPU while $\bar{C}$ is updated by the CPU (5th column). The runtime of this parallel approach is the fastest of all dynamic programming implementations, since the total runtime is dominated solely by the NCC computation (and the update of $\bar{C}$ is basically free). Finally, WTA denotes the local plane-sweep approach from Chapter 4.

Resolution            GPU-fp32  GPU-fp16  Mixed sync.  Mixed async.  WTA
256 × 256 × 100       0.79s     0.69s     0.66s        0.55s         0.34s
512 × 512 × 200       6.2s      5.1s      5.0s         3.9s          2.7s
512 × 768 × 200       9.2s      7.7s      7.7s         6.0s          4.1s
768 × 1024 × 250      27.1s     21.4s     20.6s        16.5s         10.9s
768 × 1024 × 250 (*)  10.1s     9.4s      9.6s         6.1s          5.0s

Table 7.2: Runtimes of scanline optimization using a 9 × 9 NCC at different resolutions using three views. The last row (*) displays the runtimes on a PC equipped with an Athlon64 X2 4400+ and a GeForce 7800GT.

The comparison of the last two columns (asynchronous CPU/GPU and winner-takes-all depth extraction) reveals the performance penalty induced by the different sweep directions. The main reason for the higher performance of the WTA approach is that this method utilizes all 4 components of the fragment processor, whereas the proposed implementation calculates only two matching scores per pixel.

The pure scanline optimization time for GPU-fp32 is approximately twice the time needed by GPU-fp16, as predicted. To see this, the NCC calculation time given in the next-to-last column must be subtracted from the total time given in the respective columns. Finally, CPU scanline optimization using integer arithmetic is still slightly faster than our GPU-fp16 implementation (columns 3 and 4).

The last row of Table 7.2 depicts the runtimes observed on more recent PC hardware equipped with an Athlon64 X2 4400+ and a GeForce 7800GT. The performance difference between the local approach and the fastest scanline optimization method is smaller than the gap observed on our main PC. Additionally, the performance gain of GPU-fp16 over GPU-fp32 is less pronounced. These partially unexpected, but still preliminary, results on current 3D hardware need further analysis.

Figure 7.10 provides visual results for a dataset consisting of three images showing a wooden Bodhisattva statue. The source images and the depth maps have a resolution of 512 × 768 pixels, and the depth range contains 200 values. The lighting conditions change slightly between the input views (Figure 7.10(a)–(c)). The depth image obtained by a pure winner-takes-all approach using a 9 × 9 NCC is shown in Figure 7.10(d). The result of our multiview scanline optimization method is displayed as a depth map (Figure 7.10(e)) and as a triangulated surface mesh (Figure 7.10(f)). The computation times for the local method and the proposed one are 4.1s and 6s, respectively.

7.4 Discussion

In this chapter we propose a scanline optimization procedure for disparity estimation suitable for stream architectures like modern programmable graphics processing units. Although the direct implementation of scanline optimization using destructive (i.e. in-place) value updates must be replaced by a more expensive recursive approach, the huge computational power of current GPUs turns out to be beneficial for larger image resolutions and disparity ranges. Consequently, the entire disparity estimation pipeline, comprising matching score computation and semi-global disparity extraction, can be performed on graphics hardware, thereby avoiding the relatively costly data transfer between the GPU and the CPU and leaving the CPU free for other tasks.

Additionally, the basic GPU-friendly approach to scanline optimization for a rectified stereo pair is extended to the multiple view case utilizing the more robust cross-correlation matching score. The matching costs are generated on demand as required by the main dynamic programming procedure. When using more complex dissimilarity scores, it turns out to be most efficient to employ the GPU and the CPU in parallel: while the GPU calculates the next set of matching scores, the CPU updates the accumulated costs for the current vertical scanline.

From the timing results presented in Section 7.2.4 it can be concluded that a GPU-based scanline optimization procedure is mostly suitable for larger images and disparity ranges, but not truly appropriate for realtime applications. For small image resolutions the overhead of multipass rendering is still too significant to take advantage of the processing power of modern GPUs. Additionally, a scanline optimization procedure using a linear smoothness cost model is better suited to larger disparity ranges, where a (potentially truncated) linear model is preferable over the Potts model. If the disparity range contains only a few values, enforcing smooth disparity maps is futile, since consecutive values in the disparity range typically correspond to substantial depth discontinuities. Hence, a linear model is not effective in the case of few potential disparities, and a different approach like the near-realtime reliable dynamic programming (RDP) approach [Gong and Yang, 2005b] is better suited. On the other hand, we believe that the Potts model used in the RDP approach is not appropriate for high-quality reconstruction applications.

If object silhouettes are available (e.g. by background segmentation), the quality of the depth map can be improved due to the knowledge of the visual hull. In particular, datasets comprising turntable sequences with a known background (e.g. the reference multiview stereo datasets presented in [Seitz et al., 2006]) allow a simple background segmentation. Additionally, the depth estimation performance can be increased by using the z-buffer test to avoid matching cost calculation for background pixels. Incorporating these improvements in such cases is ongoing work.

In order to obtain better depth maps and to reduce the influence of the actual setting of the smoothness weight, the benefit of an adaptive smoothness weight, based e.g. on the source image gradients [Fua, 1993, Scharstein and Szeliski, 2002], needs to be investigated.


Figure 7.10: The three input views of a wooden Bodhisattva statue (a–c) and the corresponding depth maps obtained with the local depth extraction approach (d, WTA) and the proposed scanline optimization method (e, SO), together with a view of the triangulated mesh (f).


Chapter 8

Volumetric 3D Model Generation

Contents
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2 Selecting the Volume of Interest . . . . . . . . . . . . . . . . . . 120
8.3 Depth Map Conversion . . . . . . . . . . . . . . . . . . . . . . . . 121
8.4 Isosurface Determination and Extraction . . . . . . . . . . . . . 124
8.5 Implementation Remarks . . . . . . . . . . . . . . . . . . . . . . . 126
8.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

8.1 Introduction

With the exception of our voxel coloring approach, all methods presented so far generate a set of depth images, i.e. 2.5D height fields. In order to create true 3D models, this set of depth maps must be combined into a common representation. The method proposed in this chapter to create proper 3D models is based on an implicit volumetric representation, from which the final surface can be extracted by any implicit surface polygonization technique. The principles of robust fusion of several depth maps in the context of laser-scanned data were developed by Hilton et al. [Hilton et al., 1996] and Curless and Levoy [Curless and Levoy, 1996]. We apply essentially the same technique to depth maps obtained by dense depth estimation procedures, but the basic approach needs to be modified to be more robust against outliers occurring in the input depth maps. The basic idea of volumetric depth image integration is the conversion of depth maps to corresponding 3D distance fields and the subsequent robust averaging of these distance fields. The resolution and the accuracy of the final model are determined by the quality of the source depth images and the resolution of the target volume.

Instead of using an implicit representation of the surfaces induced by the depth images, one can merge a set of polygonal models directly [Turk and Levoy, 1994]. Such an approach is sensitive to outliers and mismatches occurring in the depth images. A volumetric approach can combine several surface hypotheses and perform a robust voting in order to extract a more reliable surface. On the other hand, a volumetric range image fusion approach limits the size of 3D features found in the final model depending on the voxel size.

Our implementation of the purely software-based (i.e. unaccelerated) approach, which is based on [Curless and Levoy, 1996], uses compressed volumetric representations of the 3D distance fields and can handle high resolution voxel spaces. Merging (averaging) many distance fields induced by the corresponding depth maps is possible, since it is sufficient to traverse the compressed distance fields on a single voxel basis. Nevertheless, our original implementation has substantial space requirements on external memory and consumes significant time to generate the final surface (usually in the order of several minutes). Hence this approach is not suitable for immediate visual feedback to the user. At least for fast and direct inspection of the 3D model it is reasonable to develop a very efficient volumetric range image integration approach, again accelerated by the computing power of modern graphics hardware. Many steps in the range image integration pipeline are very suitable for processing on graphics hardware, and a significant speedup can be expected.

The overall procedure traverses the voxel space defined by the user slice by slice and generates a section of the final implicit representation and its mesh in every iteration. Consequently, the memory requirements are very low, but immediate postprocessing (e.g. filtering) of the generated slices is limited. Although the general idea is very close to [Curless and Levoy, 1996], several modifications are required to allow an efficient GPU implementation in the first place. More importantly, the sensitivity to gross outliers frequently occurring in the input depth maps is reduced by a robust voting approach. The details of our implementation are given in the next sections.

8.2 Selecting the Volume of Interest

The first step of the proposed volumetric depth image integration pipeline is the specification of the 3D domain for which the volumetric representation of the final model is built. Generally, it is not possible to determine this volume of interest automatically. In case of small objects entirely visible in each of the source images, the intersection of the viewing frusta can serve as an indicator for the volume to be reconstructed. Larger objects only partially visible in the source images (e.g. large buildings) require human interaction to select the reconstruction volume. Consequently, there exists a user interface for manual selection of the reconstructed volume. This application displays, for example, a set of 3D feature points generated by the image orientation procedure or 3D point clouds generated from dense depth maps. The user can select and adjust the 3-dimensional bounding box of the region of interest. Additionally, the user specifies the intended resolution of the voxel space, which is set to 256³ voxels in our experiments.


8.3 Depth Map Conversion

With the knowledge of the volume of interest and its orientation, the voxel space is traversed slice by slice and the values of the depth images are sampled according to the projective transformation induced by the camera parameters and the position of the slice. Since the sampled depth values denote the perpendicular distance of the surface to the camera plane, the distance of a voxel to the surface can be estimated easily as the difference between the depth value and the distance of the voxel to the image plane (see also Figure 8.1). This difference is an estimated signed distance to the surface: positive values indicate voxels in front of the surface and negative values correspond to voxels hidden by the surface. Of course, the accuracy of this approximation depends on the angle between the principal direction of the camera and the normal vector of the surface. Nevertheless, this efficiently computed approximation to the true distance transform gives very good results in practice. Additionally, we experimented with scaling this distance by the angle between the surface normal and the viewing direction, but this modification had no apparent effect on the resulting models.
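A minimal C++ sketch of this signed distance estimate is given below. It assumes a depth map storing perpendicular distances to the image plane and a camera plane given in Hessian normal form; the types and names are illustrative, not the actual fragment program.

#include <cstdio>

struct Vec3 { double x, y, z; };

// Image plane in Hessian normal form: n . p + d = 0, with unit normal n
// pointing along the viewing direction of the camera.
struct ImagePlane { Vec3 n; double d; };

// Distance of a 3D point to the camera image plane.
static double planeDistance(const ImagePlane& plane, const Vec3& p)
{
    return plane.n.x * p.x + plane.n.y * p.y + plane.n.z * p.z + plane.d;
}

// Estimated signed distance of a voxel to the surface:
//   positive -> voxel lies in front of the surface (towards the camera),
//   negative -> voxel is hidden behind the surface.
static double signedSurfaceDistance(double sampledDepth, const ImagePlane& plane,
                                    const Vec3& voxelCenter)
{
    return sampledDepth - planeDistance(plane, voxelCenter);
}

int main()
{
    ImagePlane plane = { {0.0, 0.0, 1.0}, 0.0 };   // camera looking along +z from the origin
    Vec3 voxel = { 0.5, 0.2, 3.0 };
    double depthFromMap = 3.4;                     // depth sampled at the voxel's projection
    std::printf("signed distance = %.2f\n",
                signedSurfaceDistance(depthFromMap, plane, voxel));  // prints 0.40
    return 0;
}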

The source depth maps contain two additional special values: one value (chosen as -1 in our implementation) indicates absent depth values, which may occur due to a depth postprocessing procedure eliminating unreliable matches from the depth map. Another value (0 in our implementation) corresponds to pixels outside some foreground region of interest, which is based on an optional silhouette mask in our workflow [Sormann et al., 2005].

Consequently, the processed voxels fall into one of the following categories:

1. Voxels that are outside the camera frustum are labeled as culled.

2. Voxels whose estimated distance D to the surface is smaller in magnitude than a user-specified threshold Tsurf are labeled as near-surface voxels (|D| ≤ Tsurf).

3. Voxels with a signed distance greater than this threshold are considered as definitely empty (D > Tsurf).

4. The fourth category includes occluded voxels, which have a negative distance with a magnitude larger than the threshold (D < −Tsurf).

5. If the depth value of the back-projected voxel indicates an absent value, the voxel is labeled as unfilled.

6. Voxels back-projecting into pixels outside the foreground regions are considered as empty.

These categories are illustrated in Figure 8.1. The threshold Tsurf essentially specifies the amount of noise that is expected in the depth images.


Figure 8.1: Classification of the voxels according to the depth map and camera parameters. Voxels outside the camera frustum are initially labeled as culled. Voxels close to the surface induced by the depth map are near-surface voxels (on both sides of the surface, indicated by shaded regions). Voxels with a distance larger than the threshold are either empty or occluded, depending on the sign of the distance.

In many reconstruction setups it is possible to classify culled voxels immediately. If the object of interest is visible in all images, culled voxels are outside the region to be reconstructed and can be classified as empty instantly; declaring culled voxels as unfilled may generate unwanted clutter due to outliers in the depth maps. If the object to be reconstructed is only partially visible in the images, voxels outside the viewing frustum of a particular depth map do not contribute information and are therefore labeled as unfilled. The choice between these two policies for handling culled data is specified by the user. Consequently, the 6 branches described above correspond to four voxel categories.

A fragment program determines the status of the voxels and updates an accumulated slice buffer for every given depth image. This buffer consists of four channels in accordance with the categories described above:

1. The first channel accumulates the signed distances, if the voxel is a near-surface voxel.

2. The second channel counts the number of depth images for which the voxel is empty.

3. The third channel tracks the number of depth images for which the voxel is occluded.

4. The fourth channel counts the number of depth images for which the status of the voxel is unfilled.

Thus, a simple but sufficient statistic for every voxel is accumulated, which is the basis for the final isosurface determination. Algorithm 6 outlines the incremental accumulation of the statistic for a voxel, which is executed for every provided depth image. The accumulated statistic for a voxel is a quadruple comprising the components described above. In addition to the user-specified parameter Tsurf, another threshold Tocc can be specified, which determines the border between occluded voxels and unfilled voxels located far behind the surface. This threshold is set to 10 · Tsurf in our experiments.

Algorithm 6 Procedure to accumulate the statistic for a voxel

Procedure stat = AccumulateVoxelStatistic
Input: Camera image plane imagePlane, near-surface threshold Tsurf, Tocc > Tsurf, #Images
Input: depth image D, projective texture coordinate stq, 3D voxel position pos
Input: Voxel statistics: stat = (Σ Di, #Empty, #Occluded, #Unfilled) (a quadruple)

st ← stq.xy / stq.z                          {Perspective division}
if st is inside [0, 1] × [0, 1] then
    depth ← tex2D(D, st)                     {Gather depth from range image}
    if depth > 0 then
        dist ← depth − imagePlane · pos      {Calculate signed distance to the surface}
        if dist > Tsurf then
            increment #Empty                 {Too far in front of the surface}
        else if dist < −Tocc then
            increment #Unfilled              {Very far behind the surface}
        else if dist < −Tsurf then
            increment #Occluded              {Too far behind the surface}
        else
            Σ Di ← Σ Di + dist               {Near-surface voxel}
        end if
    else
        if depth = 0 then
            stat ← (0, #Images + 1, 0, 0)    {Declare voxel definitely as empty}
        else
            increment #Unfilled
        end if
    end if
else
    {Execute one of the following lines, depending on the handling of culled voxels:}
    increment #Empty                         {Handle culled voxel as empty}, or
    increment #Unfilled                      {Alternatively, handle culled voxel as unfilled}
end if
Return stat

This algorithm is very close to the range image integration approach proposed in [Curless and Levoy, 1996]. The main user-given parameter is the threshold Tsurf, which determines the set of near-surface voxels. This parameter is related to the accuracy of the depth maps and should in theory be set to half of the uncertainty interval. Since the uncertainty of depth images generated by dense estimation approaches depends on many parameters like the view geometry, scene content and surface properties, this threshold is determined empirically.

Algorithm 6 differs from the method proposed in [Curless and Levoy, 1996] as follows:

• Culled voxels (i.e. voxels outside the viewing frustum) can be immediately carved away, depending on the user-specified policy.

• Voxels very far behind the estimated surface are considered unreliable and are labeled as unfilled instead of being classified as occluded. A user-specified threshold Tocc is introduced to distinguish between occluded (solid) voxels and unfilled ones. The choice of this parameter does not critically affect the obtained model. We use a default value of Tocc = 10 Tsurf in our experiments.

Weighted Accumulation for Near-Surface Voxels

It is possible to compute a weighted average for the near-surface voxels by accumulating weighted distances. If the signed distance of a voxel for depth image i is Di, and the corresponding weight (or confidence) is Wi, then the averaged distance value is
\[
  \frac{\sum_i W_i D_i}{\sum_i W_i}.
\]
Because the weights do not sum to one, a weighted scheme requires tracking the total sum $\sum_i W_i$ of the weights in addition to the parameters described above. This can be achieved either by writing to a fifth channel, which requires the recent multiple-render-target graphics extension, or alternatively by merging two of the other parameters. Depending on the object to be reconstructed, culled voxels can be counted as empty or occluded without decreasing the accuracy of the final model. For free-standing objects like statues it is reasonable to declare culled voxels as empty, since the object of interest is typically visible in all images. In other cases occluded and culled voxels can be treated equivalently.
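A small C++ sketch of the weighted accumulation is shown below; it keeps the running sums Σ WiDi and Σ Wi for one voxel and produces the weighted average at the end. The struct and function names are illustrative assumptions, not the thesis implementation.

#include <cstdio>
#include <utility>
#include <vector>

// Running sums for the weighted signed-distance average of one voxel.
struct WeightedVoxel {
    double sumWeightedDist = 0.0;   // accumulates W_i * D_i
    double sumWeights      = 0.0;   // accumulates W_i

    void accumulate(double signedDistance, double weight) {
        sumWeightedDist += weight * signedDistance;
        sumWeights      += weight;
    }
    // Weighted average; falls back to 0 if no near-surface observation was made.
    double average() const {
        return (sumWeights > 0.0) ? sumWeightedDist / sumWeights : 0.0;
    }
};

int main() {
    // Signed distances of one voxel seen in three depth maps, with confidences.
    std::vector<std::pair<double, double>> observations = {
        {0.20, 1.0}, {0.10, 0.5}, {-0.05, 2.0} };
    WeightedVoxel v;
    for (const auto& o : observations) v.accumulate(o.first, o.second);
    std::printf("weighted signed distance = %.4f\n", v.average());
    return 0;
}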

8.4 Isosurface Determination and Extraction

After all available depth images are processed, the target buffer holds the coarse statistic for all voxels of the current slice. The classification pass to determine the final status of every voxel is essentially a voting procedure. This step assigns to every voxel the signed distance to the final surface, such that the isosurface at level 0 corresponds to the merged 3D model. For efficiency the voting procedure uses only the statistics acquired for the current voxel and does not inspect neighboring voxels. Algorithm 7 presents the employed averaging procedure to assign the signed distance to the final surface. There is one parameter which must be specified by the user: #RequiredDefinite denotes the minimum number of near-surface entries accumulated in the voxel statistic. This means that at least #RequiredDefinite depth maps must agree that the current voxel is close to the estimated surface. The choice of this parameter depends on the redundancy in the images and on the quality of the provided depth maps. A larger value of #RequiredDefinite reduces the clutter induced by outliers in the input depth maps, but may lead to holes in the final surface if parts of the surface are visible in too few views.

Algorithm 7 Procedure to calculate the final surface distance for a voxel

Procedure result = AverageDistance
Input: User specified constant: #RequiredDefinite
Input: Voxel statistics: Σ Di, #Empty, #Occluded, #Unfilled

#Definite ← #Images − #Occluded − #Unfilled
if #Definite < #RequiredDefinite then
    result ← UnknownLabel (e.g. NaN)
else
    #NearSurface ← #Images − #Empty − #Unfilled
    if #NearSurface ≥ #Empty then
        result ← Σ Di / #NearSurface
    else
        result ← +∞
    end if
end if
Return result
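For illustration, the voting step of Algorithm 7 can be written as plain C++ as follows. The struct layout and the use of NaN/infinity sentinels follow the algorithm above, while the type and function names (and the extra division-by-zero guard) are assumptions of this sketch.

#include <cstdio>
#include <limits>

struct VoxelStatistic {
    double sumDistances;   // accumulated signed distances of near-surface observations
    int numEmpty;          // depth maps voting "empty"
    int numOccluded;       // depth maps voting "occluded"
    int numUnfilled;       // depth maps without usable information
};

// Voting procedure of Algorithm 7: returns the averaged signed distance,
// NaN if too few depth maps provide definite information, or +infinity
// if the empty votes outweigh the near-surface votes.
double averageDistance(const VoxelStatistic& s, int numImages, int requiredDefinite)
{
    const int numDefinite = numImages - s.numOccluded - s.numUnfilled;
    if (numDefinite < requiredDefinite)
        return std::numeric_limits<double>::quiet_NaN();

    const int numNearSurface = numImages - s.numEmpty - s.numUnfilled;
    // The "> 0" check is an extra guard against division by zero, added in this sketch.
    if (numNearSurface >= s.numEmpty && numNearSurface > 0)
        return s.sumDistances / numNearSurface;
    return std::numeric_limits<double>::infinity();
}

int main()
{
    VoxelStatistic s = { -0.12, 2, 1, 3 };             // example accumulated statistic
    double d = averageDistance(s, /*numImages=*/12, /*requiredDefinite=*/7);
    std::printf("voxel distance = %f\n", d);
    return 0;
}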

Up to now the discussed steps in the volumetric range image integration pipeline, depth map conversion and fusion, run entirely on graphics hardware. After the GPU-based computation for one slice of the voxel space is finished, the isovalues of the current slice are transformed into a triangular mesh on the CPU [Lorenson and Cline, 1987] and added to the final surface representation. This mesh can be directly visualized and is ready for additional processing like texture map generation. Instead of generating a surface representation from the individual slices, a 3D texture can alternatively be accumulated, which is suitable for volume rendering techniques. The main portion of this approach is again performed entirely on the GPU and does not involve substantial CPU computations. In contrast to a slice-based incremental isosurface extraction method, this direct approach requires the space for a complete 3D texture in graphics memory. Since modern 3D graphics hardware is equipped with large amounts of video memory, the 16 MB required by a 256³ voxel space are affordable. Rendering an isosurface directly from the volumetric data additionally requires the calculation of surface normals, which are derived directly from the gradients at every voxel. By using a deferred rendering approach, the computation of the gradient can be limited to the actual surface voxels and the additional memory consumption is minimal.
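One common way to obtain such normals is a central-difference gradient of the volume; the following sketch (the dense array layout and accessor are assumptions, not the shader code used in the thesis) shows the computation for a single interior voxel:

#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical dense volume of size N x N x N stored as one byte per voxel.
struct Volume {
    int N;
    const uint8_t* data;
    float at(int x, int y, int z) const { return data[(z * N + y) * N + x] / 255.0f; }
};

// Surface normal at an interior voxel as the normalized central-difference
// gradient of the stored (signed-distance) values.
std::array<float, 3> gradientNormal(const Volume& v, int x, int y, int z)
{
    float gx = v.at(x + 1, y, z) - v.at(x - 1, y, z);
    float gy = v.at(x, y + 1, z) - v.at(x, y - 1, z);
    float gz = v.at(x, y, z + 1) - v.at(x, y, z - 1);
    float len = std::sqrt(gx * gx + gy * gy + gz * gz);
    if (len < 1e-8f)
        return {0.0f, 0.0f, 0.0f};   // flat or constant region: no reliable normal
    return {gx / len, gy / len, gz / len};
}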


8.5 Implementation Remarks

Tracking the statistics for each voxel in the current slice requires a four-channel buffer with floating point precision to accumulate the distance values for near-surface voxels. By normalizing the distance of these voxels from [−T, T] to [−1, 1], a half precision buffer (16 bit floating point format) is usually sufficient. Furthermore, the final voxel values can be transformed to the range [0, 1], in which case a traditional 8-bit fixed-point buffer offers adequate precision. Using low-precision buffers decreases the volume integration time by about 30%.
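The mapping to the low-precision formats is a simple affine rescaling; a minimal sketch (the function names are placeholders, and T denotes the truncation threshold used during integration) could look as follows:

#include <algorithm>
#include <cstdint>

// Map a truncated signed distance d in [-T, T] first to [-1, 1] and then to
// [0, 1], so that it can be stored in an 8-bit fixed-point buffer.
uint8_t encodeDistance8(float d, float T)
{
    float n = std::max(-1.0f, std::min(1.0f, d / T));  // [-T, T] -> [-1, 1]
    float u = 0.5f * (n + 1.0f);                       // [-1, 1] -> [0, 1]
    return static_cast<uint8_t>(u * 255.0f + 0.5f);
}

// Inverse mapping used when reading the buffer back; the zero-level isosurface
// then corresponds to a stored value of roughly 128.
float decodeDistance8(uint8_t v, float T)
{
    return (v / 255.0f * 2.0f - 1.0f) * T;
}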

8.6 Results

This section provides visual and timing results for several real datasets. The timings are given for PC hardware consisting of a Pentium 4 3 GHz processor and an NVidia GeForce 6800 graphics card. All source views are resized to 512 × 512 pixels beforehand, and the obtained depth images have the same resolution (unless noted otherwise). Partially available foreground segmentation data is not used in these experiments.

The first dataset, depicted in Figure 8.2(a), shows one source image (out of 47) displaying a small statue. The images are taken in a roughly circular sequence around the statue. The camera is precalibrated and the relative poses of the images are determined from point correspondences found in adjacent views. From the correspondences and the camera parameters a sparse reconstruction can be triangulated, which is used by a human operator to determine a 3D box enclosing the voxel space of interest. The extent of this box is used to determine the depth range employed in the subsequent plane-sweep step, which took 53s to generate 45 depth images in total (Figure 8.2(b)). In this depth estimation procedure (recall Chapter 4), 200 evenly distributed depth hypotheses are tested using the SAD for a 5 × 5 window. In order to compensate for illumination changes in several view triplets, the source images were normalized by subtracting their local mean images. Black pixels indicate unreliable matches, which are labeled as unfilled before the depth integration procedure. These depth maps are integrated in just over 4 seconds to obtain a 256³ volume dataset as illustrated in Figure 8.2(c). The isosurface displayed in Figure 8.2(d) can be directly extracted using a ray-casting approach on the GPU [Stegmaier et al., 2005]. Almost all of the clutter and artefacts outside the proper statue are eliminated by requiring at least 7 definite values for the statistics of a voxel.

The result for another dataset consisting of 43 images is shown in Figure 8.3(b), for which one source image is depicted in Figure 8.3(a). The same procedure as for the previous dataset is applied, from which a set of 41 depth images is obtained in the first instance. Plane-sweep depth estimation using the ZNCC correlation with 200 depth hypotheses requires 97.7s in total to generate the depth maps. The subsequent depth image fusion step requires 4s to yield the volumetric data illustrated in Figure 8.3(b).

Figure 8.2: Visual results for a small statue dataset generated from a sequence of 47 images: (a) one source image, (b) one depth image, (c) direct volume rendering, (d) shaded isosurface. The total time to generate the depth maps and the final volumetric representation is less than 1 min. Image (a) shows one source view, and the corresponding depth map generated by a plane-sweep approach is illustrated in (b). The 3D volume obtained by depth image integration is displayed using direct volume rendering in (c); the outline of the isosurface corresponding to the integrated model is clearly visible, and the region of near-surface voxels is indicated by the blur next to the surface. Image (d) shows the isosurface extracted from the volume data using GPU-based raycasting. Both images are generated with the volume raycasting software made available by S. Stegmaier et al. [Stegmaier et al., 2005].

Note that these timings reflect the creation time for rather high-resolution models. If all resolutions are halved (256 × 256 × 100 depth images and 128³ volume resolution), the total depth estimation time is 13s and the volumetric integration time is less than 1s for this dataset. We believe that these timing results allow our method to qualify as an interactive modeling approach.

The visual result for another dataset consisting of 16 source views is shown in Figures 8.3(c) and (d). Depth estimation for 14 views took 34.2s using a 5 × 5 ZNCC with a best-half-sequence occlusion strategy (200 tentative depth values). Without an implicit occlusion handling approach, parts of the sword are missing. Volumetric integration requires another 1.8s to generate the isosurface shown in Figure 8.3(d).

8.7 Discussion

In this work we demonstrated that generating proper 3D models from a set of depth images can be achieved at interactive rates using the processing power of modern GPUs. The quality of the obtained 3D models depends on the accuracy of the source depth maps and on the redundancy within the provided data, but the voting scheme is robust against the outliers typically generated by purely local depth estimation procedures.

Figure 8.3: Source views and isosurfaces for two real-world datasets: (a) one source image (of 43), (b) shaded isosurface (102s), (c) one source image (of 16), (d) shaded isosurface (36s).

Although the proposed method is efficient and often provides 3D geometry suitable for visualization and further processing, the results are inferior in many cases with low redundancy in the source depth maps. In these settings, the purely local averaging and voting approach to combine the depth maps is not sufficient. Global surface reconstruction methods resulting in smoother and often watertight 3D geometry were recently proposed. Volumetric graph-cut approaches [Vogiatzis et al., 2005, Tran and Davis, 2006, Hornung and Kobbelt, 2006b, Hornung and Kobbelt, 2006a] appear highly successful in creating smooth models, but they are computationally expensive and provide only limited choices for regularization terms. Moreover, graph-cut methods in general do not benefit much from GPU or SIMD accelerated implementations.

Consequently, future work will likely focus on variational reconstruction approaches. Since determining the surface of an imaged object from multiple depth maps can be seen as a segmentation problem (the separation of empty space and interior volume), variational image segmentation methods (e.g. [Caselles et al., 1997, Westin et al., 2000, Appleton and Talbot, 2006]) could be adapted for multiple-view surface reconstruction tasks. The nature of the underlying implementations enables substantial performance gains by employing graphics processing units for these methods.


Chapter 9

Results

Contents
9.1 Introduction
9.2 Synthetic Sphere Dataset
9.3 Synthetic House Dataset
9.4 Middlebury Multi-View Stereo Temple Dataset
9.5 Statue of Emperor Charles VI
9.6 Bodhisattva Figure

9.1 Introduction

This chapter provides results illustrating the complete GPU-based workflow on several datasets. At first, two synthetic datasets are discussed, which allow a comparison of the purely image-based reconstruction with the known ground truth. Thereafter, several real-world datasets from various domains and the respective generated 3D models are presented. The focus of the discussion of these datasets lies on the comparison between medium resolution and high resolution results. Consequently, the potential gain of more expensive computations at higher resolution is illustrated visually.

The depth maps for the real-world datasets are generated using the plane-sweep (Chapter 4) and scanline optimization approaches (Chapter 7), since these methods are less vulnerable to illumination changes in the images and do not require a suitable initialization, which the iterative methods (Chapters 3 and 6) depend on.

9.2 Synthetic Sphere Dataset

The first presented dataset is a synthetically rendered perfect sphere with radius 1 (see Figure 9.1). The surface is textured using a procedurally generated stone texture. 36 views at 512 × 512 resolution are created using the Persistence of Vision raytracer (www.povray.org). The cameras are placed at even intervals around the sphere center, looking towards the center.

Figure 9.1: Three source views of the synthetic sphere dataset.

Choosing a sphere as the ground truth geometry has the advantage that the comparison of the reconstructed model with the ground truth is extremely simple: the offset of an arbitrary 3D point from the sphere surface is just the difference between the sphere radius and the distance of the point to the center. This allows an easy evaluation of the reconstructed meshes, and the regular structure of the target model allows the identification of systematic errors and biases in the reconstruction methods.
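A minimal sketch of this per-vertex evaluation is given below; the vector type and the helper name are assumptions, and the 0.5% tolerance corresponds to the threshold used later in Table 9.1.

#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Fraction of mesh vertices whose distance to the sphere center deviates from
// the true radius by less than tol * radius (e.g. tol = 0.005 for 0.5%).
float fractionWithinTolerance(const std::vector<Vec3>& vertices,
                              Vec3 center, float radius, float tol)
{
    std::size_t inliers = 0;
    for (const Vec3& p : vertices) {
        float dx = p.x - center.x, dy = p.y - center.y, dz = p.z - center.z;
        float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
        if (std::fabs(dist - radius) < tol * radius)
            ++inliers;
    }
    return vertices.empty() ? 0.0f : static_cast<float>(inliers) / vertices.size();
}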

We compare three depth estimation methods in this section:

1. a plane-sweep approach using a winner-takes-all depth extraction as described in Chapter 4 (denoted by WTA),

2. the GPU-based scanline optimization procedure presented in Chapter 7, and

3. a GPU accelerated variational approach to depth estimation as described in Chapter 6 (indicated by PDE).

All methods take a triplet of images as input, with the central view designated as the key image. The image dissimilarity function is the SAD aggregated in a 5 × 5 window for the first two methods, and the single pixel SSD for the variational approach. The plane-sweep and the scanline optimization procedures evaluate 400 potential depth values for every pixel of the key image. Figure 9.2 displays the result of the three depth estimation methods for one particular key view. The discrete set of depth values can be clearly seen in Figures 9.2(a) and (b).

The three obtained sets, each comprising 36 depth maps, are merged into final 3D models using the procedure described in Chapter 8. We set the main parameters Tsurf and #RequiredDefinite to 0.03 and 7, respectively. This step requires about 5.5s to combine the 36 depth maps. The final meshes for the three depth estimation methods are depicted in Figure 9.3. The visual appearance is quite similar; the staircasing artefacts of the WTA and the scanline optimization approach are removed by the depth integration step. The polar regions of the sphere are not visible in the source views, hence those parts are not reconstructed.

Figure 9.2: Depth estimation results for a view triplet of the sphere dataset: (a) WTA, (b) scanline opt., (c) PDE.

Figure 9.3: Fused 3D models for the sphere dataset with respect to the depth estimation method: (a) WTA, (b) scanline opt., (c) PDE.

In order to provide a quantitative evaluation, the final meshes are compared with the ground truth sphere. In Table 9.1 the total depth estimation runtime for 36 views is given in the second column. The third column reports the average sphere radius induced by the generated final mesh (with respect to the true sphere center). The final column specifies the percentage of vertices of the final meshes which lie within 0.5% of the sphere radius.

Depth est. method    Total runtime   Reported radius   Points within 0.5%
Winner-takes-all     83s             1.0012            97.7%
Scanline opt.        350s            0.9992            97.4%
PDE                  125s            0.9987            95.5%

Table 9.1: Quantitative evaluation of the reconstructed spheres.

Of course, the figures in Table 9.1 indicate the accuracy achievable under ideal conditions.

9.3 Synthetic House Dataset

Another synthetic dataset depicting a simple textured house model is illustrated in Figure 9.4. 36 views of the VRML model were generated, and the source images were resized to 512 × 512 pixels. Since the model house is rotated during the virtual capturing process, but the (virtual) lights remain in a constant position, this dataset simulates a turntable sequence with a moving object and fixed light sources. Consequently, purely intensity based image dissimilarity measures fail in this case. Therefore we excluded the variational approach from the evaluation.

Figure 9.4: Three source views of the synthetic house dataset.

In order to obtain a 3D model, 36 triplets of views were used to create depth images using the plane-sweep approaches with either winner-takes-all or scanline optimization for depth extraction. A 5 × 5 ZNCC image similarity score was employed in the experiments. The purely local approach is further divided into two variants: a plain method taking the depth maps as they are (denoted by WTA (1)) and a conservative method marking unreliable pixels with a low matching score in the depth map as invalid (WTA (2)). Since the difference between these two variants lies only in a depth map post-processing step, the runtimes are equivalent.
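The conservative variant WTA (2) can be sketched as a simple post-processing pass over the winner-takes-all depth map; the threshold value and the invalid marker below are assumptions, since the thesis does not prescribe specific constants here.

#include <cmath>
#include <cstddef>
#include <vector>

// Mark pixels whose best matching score is too poor as invalid, so that the
// subsequent volumetric integration treats them as unfilled.
void invalidateUnreliableDepths(std::vector<float>& depth,
                                const std::vector<float>& bestScore,
                                float minScore /* e.g. a ZNCC threshold */)
{
    const float invalid = std::nanf("");   // hypothetical "unfilled" marker
    for (std::size_t i = 0; i < depth.size(); ++i)
        if (bestScore[i] < minScore)
            depth[i] = invalid;
}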

The depth maps were again combined using the volumetric integration approach, which took 5.2s. The reconstruction volume encloses the house model and its proximity, but does not include the complete ground plane.

Figure 9.5: Fused 3D models for the synthetic house dataset with respect to the depth estimation method: (a) WTA (1), (b) WTA (2), (c) scanline opt.

The purely local methods encounter problems in homogeneous regions, as expected (Figures 9.5(a) and (b)). Surprisingly, employing scanline optimization to fill the depth images in textureless areas does not yield the expected high-quality result. An explanation can be given if the depth maps displayed in Figure 9.6 are examined: the depth maps generated by the local methods contain mismatches or unreliable depth values in textureless regions (Figures 9.6(a) and (b), and recall Figure 9.4(c)).

Figure 9.6: Three generated depth maps of the synthetic house dataset: (a) WTA (1), (b) WTA (2), (c) SO. The results of the local approaches show incorrect depth estimates in textureless regions. Scanline optimization with a linear discontinuity cost fills the pixels in the depth image suboptimally due to the ambiguity of the optimal path.

Scanline optimization (Figure 9.6(c)) fills homogeneous regions with reasonable depth values, but because of the linear discontinuity cost model there is an ambiguity in perfectly homogeneous regions: in such cases, only the smoothness cost ∑ |d(x) − d(x + 1)| is minimized over a set of pixels that does not provide discriminative matching costs. This minimum is not unique, and the method may report any of the optima. Our implementation reports piecewise constant depth maps (as illustrated e.g. in the right section of Figure 9.6(c)) instead of the expected piecewise planar ones.

This surprising behavior is caused by the 1-dimensional depth optimization in combination with the linear discontinuity cost model. If a quadratic smoothness cost model is utilized, the minimum is unique even in textureless regions and yields a planar depth map. Performing a full 2-dimensional depth optimization (e.g. by graph-cut methods) again gives a unique optimum and is not vulnerable to this ambiguity.
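To make the ambiguity explicit, consider a textureless run of pixels x = 0, …, n with fixed boundary depths d(0) = a and d(n) = b and constant matching costs in between (a small constructed example, not taken from the thesis). For the linear discontinuity cost,

\[
\sum_{x=0}^{n-1} \bigl| d(x+1) - d(x) \bigr| \;\ge\; |b - a| ,
\]

with equality for every monotone profile between a and b, so the minimizer is not unique. For a quadratic smoothness cost,

\[
\sum_{x=0}^{n-1} \bigl( d(x+1) - d(x) \bigr)^2 \;\ge\; \frac{(b-a)^2}{n} ,
\]

with equality only for the uniform steps d(x) = a + x (b − a)/n, i.e. the unique linear interpolation, which explains the piecewise planar result expected from the quadratic model.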

In order to evaluate the obtained final 3D models with respect to the ground truth, two measures are employed: the model accuracy specifies the ratio of the model surface which is close to the ground truth model within a given distance threshold, and the model completeness specifies the portion of the ground truth model which is covered by the reconstructed mesh (i.e. where the reconstructed surface is close to the ground truth with respect to a given threshold). For the completeness calculation the extended ground plane is omitted from the reference model, since it is only reconstructed in the proximity of the house. Measuring the completeness of a model accurately is difficult, since small holes may not have any influence and larger holes shrink depending on the tolerated distance. Consequently, we set the distance threshold for the completeness evaluation to the order of the average inlier distance reported by the accuracy evaluation (which is about 0.2% of the diameter of the reconstructed box). The obtained values are still only approximately accurate, but they match the visual appearance of the models. For instance, the conservative winner-takes-all approach has the highest accuracy (since only reliable depth values are retained), but the lowest completeness result (unreliable regions remain unfilled).
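Both measures reduce to nearest-neighbor distance queries between point sets sampled from the two surfaces, as described in the following paragraph. A brute-force sketch of this computation, with hypothetical names and without any spatial acceleration structure, is given below:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { float x, y, z; };

// Distance from p to the closest point of 'cloud' (brute force, O(|cloud|)).
static float nearestDistance(const Point& p, const std::vector<Point>& cloud)
{
    float best = std::numeric_limits<float>::max();
    for (const Point& q : cloud) {
        float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
        best = std::min(best, dx * dx + dy * dy + dz * dz);
    }
    return std::sqrt(best);
}

// Fraction of points in 'from' lying within 'threshold' of the point set 'to'.
// With 'from' sampled from the reconstruction and 'to' from the ground truth
// this approximates the accuracy; swapping the roles approximates completeness.
float coveredFraction(const std::vector<Point>& from,
                      const std::vector<Point>& to, float threshold)
{
    std::size_t inside = 0;
    for (const Point& p : from)
        if (nearestDistance(p, to) <= threshold)
            ++inside;
    return from.empty() ? 0.0f : static_cast<float>(inside) / from.size();
}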

The surface-to-surface distance computations are approximated by converting the triangular mesh models into point sets by uniformly sampling the meshes and calculating the closest point-pairs for these sets. Table 9.2 presents the results of this evaluation. Beside the total runtime, the model accuracy and the completeness are given for two distance thresholds each. These thresholds are indicated as fractions of the diameter of the reconstructed volume.

Depth est. method      Runtime   Accuracy (1%)   Accuracy (0.5%)   Completeness (0.4%)   Completeness (0.2%)
Winner-takes-all (1)   120s      92.54%          83.7%             95.65%                75.99%
Winner-takes-all (2)   120s      99.07%          93.47%            90.78%                63.51%
Scanline opt.          170s      96.27%          90.51%            95.91%                82.30%

Table 9.2: Quantitative evaluation of the reconstructed synthetic house.

9.4 Middlebury Multi-View Stereo Temple Dataset

This dataset is one of the currently two proposed datasets with known ground-truth geometry [Seitz et al., 2006] (http://vision.middlebury.edu/mview/). The images show a replica of an ancient temple (see Figure 9.7). The ground-truth geometry was obtained by laser-scanning the miniature model. There are three variants of the dataset: first, a large set of images is provided, which contains more than 300 source views acquired using a spherical gantry and a moving camera. Additionally, two smaller subsets are supplied: a dense ring set of images consisting of 47 views, and a sparse ring with 16 images. All images have 640 × 480 pixels resolution. We used the medium sized dense ring dataset to generate the results presented below.

We provide two final results for this dataset: the first mesh, displayed in Figures 9.8(a) and (b), is created using the camera matrices and orientations supplied by the originators. Since the authors of this dataset do not claim high accuracy for their camera parameters, we additionally calculated the relative poses between the views from scratch using our multi-view reconstruction pipeline. Two views of the resulting mesh are shown in Figures 9.9(a) and (b). In both cases the same parameters for depth estimation and volumetric integration are used. The initial depth maps are computed employing a 3 × 3 SAD matching score and scanline optimization for depth extraction. 255 potential depth values are evaluated for every pixel. This procedure takes 3m7s to finish. Subsequent fusion of all depth maps into a volumetric model with 288³ voxels resolution requires another 12s to complete.

The surface mesh created with our own calculated camera matrices appears smoother and less noisy than the one based on the supplied camera poses. The drawback of camera poses computed from scratch is that the obtained 3D model is calculated with respect to a local camera coordinate system and cannot be compared with the laser-scanned model directly.


Figure 9.7: Three (out of 47) source images of the temple model dataset. The images are taken approximately evenly spaced on a circular sequence around the model.

9.5 Statue of Emperor Charles VI

Figure 9.10 displays two source views (out of 42) showing a statue of the Austrian Emperor Charles VI inside the state hall of the Austrian National Library. The source images exhibit a significant variation in brightness due to the back light coming through the large windows of the hall.

A set of 40 depth maps is generated, one for each triplet of source images, and these depth maps are subsequently fused using our volumetric depth image integration approach. We calculated the final model at two different resolutions: at first, a medium resolution model is generated from depth images with 336 × 512 pixels, using 256 × 256 × 384 voxels for volumetric integration. Further, a high resolution result at 676 × 1016 pixels and 384 × 384 × 512 voxels is created to evaluate the benefit of increased resolution. Table 9.3 lists the run-times required to generate the 40 depth maps using 250 depth hypotheses at the specified image resolution. Volumetric fusion takes 8.5s at medium resolution and 27s at high resolution, respectively.

Resolution   Depth est. method   Runtime
336 × 512    Winner-takes-all    1m40s
336 × 512    Scanline opt.       2m10s
676 × 1016   Winner-takes-all    5m30s
676 × 1016   Scanline opt.       7m40s

Table 9.3: Timing results for the Emperor Charles dataset. These figures represent the time needed to generate 40 depth maps at the specified resolution; 250 depth hypotheses are evaluated for every pixel.


Figure 9.8: Front and back view of the fused 3D model of the temple dataset based on the original camera matrices (1 095 000 triangles).

The meshes obtained at medium resolution using the winner-takes-all and the scanline optimization depth extraction methods are illustrated in Figures 9.11(a)–(d). The surface mesh generated using the simple winner-takes-all approach is essentially as good as the scanline optimization based result.

Figures 9.12(a)–(f) depict the meshes obtained at the higher resolution. Again, a winner-takes-all and a scanline optimization approach are used for depth extraction. At this resolution the WTA result exhibits more noise, as illustrated in the close-up views of the cloak in Figures 9.12(c) and (f). The corresponding depth maps generated by the WTA and SO approaches can be seen in Figure 9.13. Volumetric fusion evidently removes the mismatches occurring in the WTA-based depth images only partially, which leads to holes in the final mesh.

If one compares the outcomes of the two resolutions directly, e.g. Figure 9.11(c) and Figure 9.12(d), the increased geometric detail of the high resolution result is clearly visible. Nevertheless, the high resolution mesh containing approximately 1 000 000 triangles is too complex for real-time display and requires geometric simplification and other enhancements to be suitable for further visualization.


Figure 9.9: Front and back view of the fused 3D model of the temple dataset based on the newly calculated camera matrices (857 000 triangles).

9.6 Bodhisattva Figure

The final dataset is a set of images displaying a wooden Bodhisattva statue inside a Buddhist stupa building (Figure 9.14). These images were taken with a digital single-lens reflex camera under difficult lighting conditions. Additionally, some of the views are widely separated due to the narrow interior of the stupa. This dataset focuses directly on the digital preservation of cultural heritage, since the wooden statue weathers slowly due to atmospheric conditions. Furthermore, this and similar religious artefacts are highly sought after by collectors and consequently susceptible to theft.

The complete set of images contains 13 views of the statue. Two sequences of depth images (using scanline optimization) are generated: a medium resolution set at 512 × 768 pixels and a high resolution one at 1000 × 1504 pixels, for which a few depth maps are depicted in Figure 9.15. In both cases the number of depth hypotheses is set to 250. The medium resolution result utilized a ZNCC correlation using a 5 × 5 support window. The generation of 11 depth images using triplets of source views needed 1m12s. Volumetric fusion was applied in a 256 × 512 × 512 voxel space, yielding the mesh displayed in Figure 9.16(a). In the high resolution case a 7 × 7 aggregation window was applied for the matching cost computation, and the volumetric fusion is based on a 384 × 768 × 768 voxel space. Depth map generation took 5m to complete. The finally extracted mesh is illustrated in Figure 9.16(b).

Figure 9.10: Two views of the statue showing Emperor Charles VI inside the state hall of the Austrian National Library.

For this dataset the lower resolution mesh appears smoother and less noisy in comparison with the high resolution outcome. There are two reasons for this behavior: first, several depth maps contain a substantial amount of noise and mismatches due to the widely separated views in some triplets (e.g. Figure 9.15(d)). During volumetric fusion this noise is largely suppressed at the medium resolution. Additionally, the lack of a global smoothing term in the "greedy" depth map fusion procedure does not inhibit high variations (i.e. local noise) in the extracted surface mesh. Future work needs to address an efficient depth map integration approach which incorporates some discontinuity cost to prevent unnecessary noise in the final outcome. In any case, a feature preserving mesh simplification procedure is required to enable further processing and visualization.


Figure 9.11: Medium resolution mesh for the Charles VI dataset: (a) WTA, front view, (b) WTA, back view, (c) SO, front view, (d) SO, back view. Figures (a) and (b) show the surface mesh obtained from a winner-takes-all plane-sweep approach to depth map generation. Figures (c) and (d) illustrate the results using scanline optimization.


Figure 9.12: High resolution mesh for the Charles VI dataset. Figures (a) and (b) show the surface mesh obtained from a winner-takes-all plane-sweep approach to depth map generation. (c) displays a close-up view of the cloak revealing substantial noise in the mesh. Figures (d)–(f) illustrate the results using scanline optimization. The cloak in Figure (f) is much smoother in this setting.


Figure 9.13: Two depth maps for the same reference view of the Charles dataset generated by the winner-takes-all (a) and the scanline optimization (b) approach, respectively.

Figure 9.14: Every other one of the 13 source images of the Bodhisattva statue dataset.


Figure 9.15: Several depth images for the Bodhisattva statue.

Figure 9.16: Medium and high resolution results for the Bodhisattva statue images: (a) medium resolution (512 × 768, ≈ 1 million triangles), (b) high resolution (1000 × 1504, ≈ 2.7 million triangles). The depth images for the left model are computed at 512 × 768 pixels resolution, and the subsequent volumetric depth map integration is performed at 256 × 512 × 512 voxels. The depth map and voxel resolutions for the right model are 1000 × 1504 and 384 × 768 × 768, respectively. For this dataset the inherent smoothing induced by the lower resolution yields slightly more appealing results.


Chapter 10

Concluding Remarks

This thesis outlines high-performance approaches to several stages in the reconstruction pipeline concerned with dense depth and mesh generation using modern GPUs. Several approaches for multi-view reconstruction benefit substantially from the data-parallel computing model and the processing power of modern GPUs. The accuracy of arithmetic operations provided by the GPU is sufficient for most image processing and computer vision methods not relying on high-precision computations.

The range of described methods starts with GPU-based correlation calculation followed by a simple winner-takes-all depth extraction procedure, and extends to semi-global methods using dynamic programming and to volumetric methods merging a set of depth images into a final 3D model. So far, several important global methods for depth estimation can only partially benefit from GPUs: graph cut approaches are currently too sophisticated for substantial GPU acceleration, and loopy belief propagation methods have too high memory requirements to be useful for high-resolution reconstructions. Hence, we believe that the methods proposed in this thesis are good candidates for GPU utilization to generate high-resolution models from multiple views.

It is natural to ask whether other steps in the pipeline can be accelerated by graphics hardware as well. Several processing steps early in the pipeline, like distortion correction, basic corner extraction and similar low level image processing tasks, can easily exploit the processing power of modern GPUs (e.g. [Sugita et al., 2003] and [Colantoni et al., 2003]). Other important procedures mostly related to pose estimation, like tracking and matching of correspondences and RANSAC based relative pose estimation, require too sophisticated control flow mechanisms to be rewarding targets for the SIMD processing model offered by current GPUs. There might be the possibility of hybrid approaches for these tasks incorporating CPU and GPU processing power in equal parts. In particular, the estimation of sparse correspondences is still a relatively slow procedure within our current reconstruction pipeline. Accelerating this stage of the pipeline seems to be the most worthwhile goal for the near future. Sinha et al. [Sinha et al., 2006] recently addressed KLT tracking for video streams and SIFT key extraction using the GPU and reported substantial performance gains. Incorporating and extending these techniques is part of future investigations.

With the emergence of more general programming models for graphics hardware, more sophisticated depth estimation and other computer vision methods may become relevant targets for a GPU-based implementation. According to current technical proposals, next-generation graphics hardware will provide a more flexible and dynamic programming approach, which potentially allows more control flow and more dynamic behavior to be assigned to the GPU. Additionally, the strict locality found in our algorithms, induced by the current GPU programming model, might be softened, and more global knowledge of the views and the depth hypotheses could be incorporated into future procedures. In particular, the introduction of geometry shaders as an additional step in the rendering pipeline [Blythe, 2006] adds extended dynamic behavior by allowing vertices to be created and removed by shader programs executed on the GPU. Sophisticated use of this and other currently emerging features may yield interesting and efficient approaches to computer vision problems.

Every long-term prognosis about future graphics hardware and its non-graphical applications is highly speculative. Similar objections apply to the future of CPUs. Nevertheless we outline two recent developments which may provide some insight into future graphics and parallel processing technology in general. First, we mention the highly innovative (and unconventional) design of the Cell microprocessor [Kahle et al., 2005], which essentially consists of a traditional CPU core tightly coupled with eight SIMD co-processors providing the computing power e.g. for multimedia tasks. The most prominent use of the Cell architecture will be a video gaming console still equipped with a dedicated graphics processing unit, but the main goal of substantially enhancing the SIMD capabilities of general purpose processors is obvious. One important application of this design is the physically correct simulation of objects in computer games. Another forthcoming development in SIMD processing hardware is the unification of the previously distinct vertex and fragment shaders on GPUs. This means that the shader pipelines on the GPU can execute either vertex programs or fragment programs as requested by the application or the graphics driver software. Consequently, the shader pipelines closely resemble the SIMD co-processors of the Cell architecture. This evolution of CPUs and GPUs is partially driven by the need for efficient physics simulation engines used in modern computer games. Hence, one can expect arrays of versatile SIMD co-processors in future computer hardware, located either close to the CPU (as in the Cell model) or close to the GPU (in the unified shader case).

These developments will substantially change the programming model used to implement multimedia tasks and related high-performance applications. The current technological trends indicate that CPUs augmented with data-parallel co-processors will be the dominant future computing device. Several techniques developed to utilize the GPU for computer vision tasks can be transferred to this new architecture, whereas other performance optimizations specifically targeted at GPUs (e.g. using the z-buffer for conditional evaluation) have no general SIMD counterpart. Since every new generation of computer hardware, and graphics hardware in particular, provides a set of new features, the required frequent adaptation of GPU-based implementations will likely enable a smooth transition to future computer architectures.

Currently, the programming interface for GPU applications is a graphics library (mainly OpenGL and Direct3D). It is at least counter-intuitive and error-prone to use graphics commands to implement non-graphical methods and computations. Consequently, there are forthcoming proposals to interact with the GPU as a non-graphical device: Accelerator [Tarditi et al., 2005] provides a high-level SIMD programming model and translates the library calls into suitable fragment shaders and graphics commands of the underlying graphics library. Peercy et al. [Peercy et al., 2006] present a library which exposes the data-parallel capabilities of the GPU directly, without invocation of the system's graphics library. These trends illustrate the transition of hardware and software vendors from handling the GPU exclusively as a graphics device towards treating it as a more general parallel computing device.

Nevertheless, the main focus of future work is not the sole acceleration of computer vision methods using off-the-shelf parallel computing devices (most notably the GPU), but the enhancement of the underlying computer vision algorithms. As an example, semantic segmentation of the input images into relevant regions (facades, static objects) and irrelevant ones (sky, vegetation, moving objects) allows the exclusion of undesirable values in the depth map. Consequently, the fusion of the depth images is more robust, and the final model omits unnecessary clutter induced by negligible objects.

The presented volumetric approach to 3D model generation from several depth maps is very efficient, but yields water-tight models only in ideal cases. Additionally, the extracted meshes have poor overall smoothness due to the lack of appropriate neighborhood handling. Recently, volumetric mesh extraction approaches based on graph cuts incorporating global smoothness were proposed (e.g. [Vogiatzis et al., 2005, Hornung and Kobbelt, 2006c]), but these methods have their own difficulties besides the increased computational complexity. For instance, some volumetric graph-cut procedures work best only if a suitable visual hull is available. Furthermore, graph cut solutions prefer minimal surfaces, hence an ad-hoc ballooning term needs to be added to the cost functional. The limitations of current methods imply that there is still room for further research in range image integration.

Finally, there is often the requirement of human interaction in the reconstruction pipeline. In particular, post-processing steps like model trimming and the integration of independently reconstructed objects into one common model commonly depend on a human operator. The topic of providing user interfaces for the efficient execution of such tasks is not a direct target of future research. More promising is the integration of efficient model computation methods with manual interaction schemes in order to intervene in the depth map or 3D model generation procedure: for instance, manual labeling of unmodeled surface properties like specular highlights, combined with a real-time update of the final 3D model, may yield highly effective modeling applications.



Appendix A

Selected Publications

A.1 Publications Related to this Thesis

The original approach to mesh-based stereo reconstruction on the GPU as described in Chapter 3 can be found in [Zach et al., 2003a]. The performance of the proposed method was substantially increased using the techniques presented in [Zach et al., 2003b].

Material from Chapter 4 (plane-sweep depth estimation on the GPU) and Chapter 8 (fast volumetric integration of depth maps) appeared in [Zach et al., 2006a].

The scanline optimization implementation on the GPU (Chapter 7) is published as [Zach et al., 2006b].

A.2 Other Selected Scientific Contributions

Most work in the first half of my time as a PhD student addressed the rendering of large 3D environments, which were typically generated by remote sensing methods (e.g. satellite laser scans) and photogrammetric methods. Hence, early papers covered the task of interactive visualization of such datasets using view-dependent multi-resolution geometry.

In [Zach and Karner, 2003a] an efficient algorithm for selective refinement of view-dependent meshes is presented. View-dependent refinement of meshes typically requires a top-down traversal of a tree-like structure, which affects the obtained frame rate significantly. The proposed method is an event-driven approach to the dynamic mesh refinement procedure, which exploits temporal coherence explicitly and achieves significantly reduced refinement times.

Mapping textures on multiresolution meshes is straightforward if texture coordinates can be interpolated across all levels of detail (e.g. when only one texture is applied to the geometry). If the geometry is texture mapped with several images, the displayed level of detail is constrained, or artifacts occur if no additional processing is performed. [Zach and Bauer, 2002] and [Sormann et al., 2003] generalize clipmap-like approaches for texturing multiresolution heightfields to more general 3D models by generating a texture hierarchy in correspondence with the vertex hierarchy used for view-dependent rendering of multiresolution meshes.

The efficient external encoding of multiresolution meshes suitable for view-dependent access to relevant fractions of the complete 3D model was mainly addressed by M. Grabner [Grabner, 2003]. In [Zach et al., 2004a] we replace the originally proposed topology encoding method for multiresolution meshes with a different encoding scheme. Our new encoding method is superior in worst case examples and on real-world data sets. We prove that two vertices of a triangle can be encoded with 1 bit on average, whereas the third vertex requires O(log n) bits in the worst case.

[Zach and Karner, 2003b] again addresses the compression of model data suitable for efficient transmission over a network. This time, the compressed encoding of precomputed visibility information for walk-through applications is described. It is assumed that the user can navigate in an urban scenario with the virtual camera fixed at a predefined eye height. For every node in the view-dependent mesh hierarchy a conservative estimation of visibility is precomputed using software provided by P. Wonka and M. Wimmer [Wonka et al., 2000]. The result of this calculation is a set of visible nodes for each cell in the maneuverable space. This data essentially comprises a large binary matrix, which is appropriately encoded to be used in remote visualization applications.

Rendering large view-dependent multiresolution models in combination with many view-independent multiresolution objects was addressed in [Zach et al., 2002]. In particular, the real-time rendering of a large digital elevation model augmented with a huge number of trees is discussed. In order to achieve real-time performance, a new level of detail selection procedure is proposed, which is fast enough to assign suitable resolutions to more than 1 million objects. The digital elevation model is represented as a coarse view-dependent hierarchical level of detail, and the tree models are rendered using point-based graphics primitives. An extended version of this paper was recently published as [Zach et al., 2004b].


Bibliography

[Akbarzadeh et al., 2006] Akbarzadeh, A., Frahm, J.-M., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Merrell, P., Phelps, M., Sinha, S., Talton, B., Wang, L., Yang, Q.-X., Stewénius, H., Yang, R., Welch, G., Towles, H., Nistér, D., and Pollefeys, M. (2006). Towards urban 3D reconstruction from video. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Appleton and Talbot, 2006] Appleton, B. and Talbot, H. (2006). Globally minimal surfaces by continuous maximal flows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):106–118.

[Baker and Binford, 1981] Baker, H. H. and Binford, T. (1981). Depth from edge and intensity based stereo. In Proc. 7th Intl. Joint Conf. Artificial Intelligence, pages 631–636.

[Birchfield and Tomasi, 1998] Birchfield, S. and Tomasi, C. (1998). A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–406.

[Blythe, 2006] Blythe, D. (2006). The Direct3D 10 system. In Proceedings of SIGGRAPH 2006, pages 724–734.

[Bolz et al., 2003] Bolz, J., Farmer, I., Grinspun, E., and Schröder, P. (2003). Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In Proceedings of SIGGRAPH 2003, pages 917–924.

[Bornik et al., 2001] Bornik, A., Karner, K., Bauer, J., Leberl, F., and Mayer, H. (2001). High-quality texture reconstruction from multiple views. Journal of Visualization and Computer Animation, 12(5):263–276.

[Boykov et al., 2001] Boykov, Y., Veksler, O., and Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(11):1222–1239.

[Brown et al., 2003] Brown, M. Z., Burschka, D., and Hager, G. D. (2003). Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):993–1008.


[Brox et al., 2004] Brox, T., Bruhn, A., Papenberg, N., and Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision (ECCV), pages 25–36.

[Brunton and Shu, 2006] Brunton, A. and Shu, C. (2006). Belief propagation for panorama generation. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Buck et al., 2004] Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. (2004). Brook for GPUs: Stream computing on graphics hardware. In Proceedings of SIGGRAPH 2004, pages 777–786.

[Canny, 1986] Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698.

[Caselles et al., 1997] Caselles, V., Kimmel, R., and Sapiro, G. (1997). Geodesic active contours. Int. Journal of Computer Vision, 22(1):61–79.

[Chan and Vese, 2002] Chan, T. F. and Vese, L. A. (2002). A multiphase levelset framework for image segmentation using the Mumford and Shah model. Int. Journal of Computer Vision, 50(3):271–293.

[Chefd'Hotel et al., 2001] Chefd'Hotel, C., Hermosillo, G., and Faugeras, O. (2001). A variational approach to multi-modal image matching. In IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 21–28.

[Colantoni et al., 2003] Colantoni, P., Boukala, N., and Rugna, J. D. (2003). Fast and accurate color image processing using 3D graphics cards. In Proc. of Vision, Modeling and Visualization 2002.

[Cornelis and Van Gool, 2005] Cornelis, N. and Van Gool, L. (2005). Real-time connectivity constrained depth map computation using programmable graphics hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1099–1104.

[Criminisi et al., 2005] Criminisi, A., Shotton, J., Blake, A., Rother, C., and Torr, P. (2005). Efficient dense-stereo with occlusions and new view synthesis by four state DP for gaze correction. Technical report, Microsoft Research Cambridge.

[Crow, 1984] Crow, F. C. (1984). Summed-area tables for texture mapping. In Proceedings of SIGGRAPH 84, pages 207–212.

[Culbertson et al., 1999] Culbertson, W. B., Malzbender, T., and Slabaugh, G. (1999). Generalized voxel coloring. In Proc. ICCV Workshop, Vision Algorithms Theory and Practice, pages 100–115.



[Curless and Levoy, 1996] Curless, B. and Levoy, M. (1996). A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH '96, pages 303–312.

[Dally et al., 2003] Dally, W. J., Hanrahan, P., Erez, M., Knight, T. J., Labonté, F., Ahn, J.-H., Jayasena, N., Kapasi, U. J., Das, A., Gummaraju, J., and Buck, I. (2003). Merrimac: Supercomputing with streams. In Proceedings of SC2003.

[Darabiha et al., 2003] Darabiha, A., Rose, J., and MacLean, W. J. (2003). Video-rate stereo depth measurement on programmable hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 203–210.

[Davis et al., 2002] Davis, J., Marschner, S., Garr, M., and Levoy, M. (2002). Filling holes in complex surfaces using volumetric diffusion. In First International Symposium on 3D Data Processing, Visualization, and Transmission.

[Devernay and Faugeras, 2001] Devernay, F. and Faugeras, O. (2001). Straight lines have to be straight. Machine Vision and Applications, 13(1):14–24.

[Dixit et al., 2005] Dixit, N., Keriven, R., and Paragios, N. (2005). GPU-cuts and adaptive object extraction. Technical Report 05-07, CERTIS.

[Dominé et al., 2002] Dominé, S., Rege, A., and Cebenoyan, C. (2002). Real-time hatching. Game Developers Conference.

[Dubois and Rodrigue, 1977] Dubois, P. and Rodrigue, G. H. (1977). An analysis of the recursive doubling algorithm. High Speed Computer and Algorithm Organization, pages 299–307.

[Eisert et al., 1999] Eisert, P., Steinbach, E., and Girod, B. (1999). Multi-hypothesis, volumetric reconstruction of 3-D objects from multiple calibrated camera views. In Proc. of International Conference on Acoustics, Speech and Signal Processing, pages 3509–3512.

[Engel and Ertl, 2002] Engel, K. and Ertl, T. (2002). Interactive high-quality volume rendering with flexible consumer graphics hardware. In STAR – State of the Art Report, Eurographics '02.

[Engel et al., 2001] Engel, K., Kraus, M., and Ertl, T. (2001). High-quality pre-integrated volume rendering using hardware-accelerated pixel shading. In Eurographics / SIGGRAPH Workshop on Graphics Hardware '01, pages 9–16.

[Faugeras et al., 1996] Faugeras, O., Hotz, B., Mathieu, H., Viéville, T., Zhang, Z., Fua, P., Théron, E., Moll, L., Berry, G., Vuillemin, J., Bertin, P., and Proy, C. (1996). Real time correlation based stereo: algorithm implementations and applications. The International Journal of Computer Vision.



[Faugeras and Keriven, 1998] Faugeras, O. and Keriven, R. (1998). Variational principles, surface evolution, PDEs, level set methods, and the stereo problem. IEEE Transactions on Image Processing, 7(3):336–344.

[Faugeras et al., 2002] Faugeras, O., Malik, J., and Ikeuchi, K., editors (2002). Special Issue on Stereo and Multi-Baseline Vision. International Journal of Computer Vision.

[Felzenszwalb and Huttenlocher, 2004] Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient belief propagation for early vision. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 261–268.

[Forstmann et al., 2004] Forstmann, S., Ohya, J., Kanou, Y., Schmitt, A., and Thuering, S. (2004). Real-time stereo by using dynamic programming. In CVPR 2004 Workshop on Real-Time 3D Sensors and Their Use.

[Förstner and Gülch, 1987] Förstner, W. and Gülch, E. (1987). A fast operator for detection and precise location of distinct points, corners and centres of circular features. Proc. of the ISPRS Intercommission Workshop on Fast Processing of Photogrammetric Data, Interlaken, pages 285–301.

[Fua, 1993] Fua, P. (1993). A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications, 6:35–49.

[Garland and Heckbert, 1997] Garland, M. and Heckbert, P. S. (1997). Surface simplification using quadric error metrics. In Proceedings of SIGGRAPH '97, pages 209–216.

[Geiger et al., 1995] Geiger, D., Ladendorf, B., and Yuille, A. (1995). Occlusions and binocular stereo. International Journal of Computer Vision, 14:211–226.

[Goesele et al., 2006] Goesele, M., Curless, B., and Seitz, S. (2006). Multi-view stereo revisited. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2402–2409.

[Gong and Yang, 2005a] Gong, M. and Yang, R. (2005a). Image-gradient-guided real-time stereo on graphics hardware. In Fifth International Conference on 3-D Digital Imaging and Modeling, pages 548–555.

[Gong and Yang, 2005b] Gong, M. and Yang, Y.-H. (2005b). Near real-time reliable stereo matching using programmable graphics hardware. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 924–931.

[Goodnight et al., 2003] Goodnight, N., Woolley, C., Lewin, G., Luebke, D., and Humphreys, G. (2003). A multigrid solver for boundary value problems using programmable graphics hardware. In Eurographics/SIGGRAPH Workshop on Graphics Hardware 2003.



[Grabner, 2003] Grabner, M. (2003). Compressed Adaptive Multiresolution Encoding. PhD thesis, Technical University Graz.

[Hadwiger et al., 2001] Hadwiger, M., Theußl, T., Hauser, H., and Gröller, M. E. (2001). Hardware-accelerated high-quality filtering on PC hardware. In Proc. of Vision, Modeling and Visualization 2001, pages 105–112.

[Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge detector. Proceedings 4th Alvey Vision Conference, pages 189–192.

[Harris and Luebke, 2005] Harris, M. and Luebke, D. (2005). SIGGRAPH 2005 GPGPU course notes.

[Harris et al., 2002] Harris, M. J., Coombe, G., Scheuermann, T., and Lastra, A. (2002). Physically-based visual simulation on graphics hardware. In Eurographics/SIGGRAPH Workshop on Graphics Hardware, pages 109–118.

[Hart and Mitchell, 2002] Hart, E. and Mitchell, J. L. (2002). Hardware shading with EXT vertex shader and ATI fragment shader. ATI Technologies.

[Heckbert, 1986] Heckbert, P. S. (1986). Filtering by repeated integration. In Proceedings of SIGGRAPH '86, pages 315–321.

[Heikkilä, 2000] Heikkilä, J. (2000). Geometric camera calibration using circular control points. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(10):1066–1077.

[Hensley et al., 2005] Hensley, J., Scheuermann, T., Coombe, G., Singh, M., and Lastra, A. (2005). Fast summed-area table generation and its applications. In Proceedings of Eurographics 2005, pages 547–555.

[Hermosillo et al., 2001] Hermosillo, G., Chefd'Hotel, C., and Faugeras, O. (2001). A variational approach to multi-modal image matching. Technical Report RR 4117, INRIA.

[Hilton et al., 1996] Hilton, A., Stoddart, A. J., Illingworth, J., and Windeatt, T. (1996). Reliable surface reconstruction from multiple range images. In European Conference on Computer Vision (ECCV), pages 117–126.

[Hirschmüller, 2005] Hirschmüller, H. (2005). Accurate and efficient stereo processing by semi-global matching and mutual information. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 807–814.

[Hirschmüller, 2006] Hirschmüller, H. (2006). Stereo vision in structured environments by consistent semi-global matching. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2386–2393.



[Hoff III et al., 1999] Hoff III, K. E., Keyser, J., Lin, M., Manocha, D., and Culver, T. (1999). Fast computation of generalized Voronoi diagrams using graphics hardware. In Proceedings of SIGGRAPH '99, pages 277–286.

[Hopf and Ertl, 1999a] Hopf, M. and Ertl, T. (1999a). Accelerating 3D convolution using graphics hardware. In Visualization 1999, pages 471–474.

[Hopf and Ertl, 1999b] Hopf, M. and Ertl, T. (1999b). Hardware-based wavelet transformations. In Workshop of Vision, Modelling, and Visualization (VMV '99), pages 317–328.

[Hornung and Kobbelt, 2006a] Hornung, A. and Kobbelt, L. (2006a). Hierarchical volumetric multi-view stereo reconstruction of manifold surfaces based on dual graph embedding. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 503–510.

[Hornung and Kobbelt, 2006b] Hornung, A. and Kobbelt, L. (2006b). Robust and efficient photo-consistency estimation for volumetric 3D reconstruction. In European Conference on Computer Vision (ECCV), pages 179–190.

[Hornung and Kobbelt, 2006c] Hornung, A. and Kobbelt, L. (2006c). Robust reconstruction of watertight 3D models from non-uniformly sampled point clouds without normal information. In Eurographics Symposium on Geometry Processing, pages 41–50.

[Jia et al., 2003] Jia, Y., Xu, Y., Liu, W., Yang, C., Zhu, Y., Zhang, X., and An, L. (2003). A miniature stereo vision machine for real-time dense depth mapping. In Conference on Computer Vision Systems (ICVS 2003), pages 268–277.

[Jung et al., 2006] Jung, Y. M., Kang, S. H., and Shen, J. (2006). Multiphase image segmentation via Modica-Mortola phase transition. Technical report, Department of Mathematics, University of Kentucky.

[Kahle et al., 2005] Kahle, J. A., Day, M. N., Hofstee, H. P., Johns, C. R., Maeurer, T. R., and Shippy, D. (2005). Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589–604.

[Kanade et al., 1996] Kanade, T., Yoshida, A., Oda, K., Kano, H., and Tanaka, M. (1996). A stereo engine for video-rate dense depth mapping and its new applications. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 196–202.

[Kautz and Seidel, 2001] Kautz, J. and Seidel, H.-P. (2001). Hardware accelerated displacement mapping for image based rendering. In Graphics Interface 2001, pages 61–70.

[Kim and Lin, 2003] Kim, T. and Lin, M. (2003). Visual simulation of ice crystal growth. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation.



[Klaus et al., 2002] Klaus, A., Bauer, J., Karner, K., and Schindler, K. (2002). MetropoGIS: A semi-automatic city documentation system. In Photogrammetric Computer Vision 2002 (PCV'02).

[Kolmogorov and Zabih, 2001] Kolmogorov, V. and Zabih, R. (2001). Computing visual correspondence with occlusions using graph cuts. In IEEE International Conference on Computer Vision (ICCV), pages 508–515.

[Kolmogorov and Zabih, 2002] Kolmogorov, V. and Zabih, R. (2002). Multi-camera scene reconstruction via graph cuts. In European Conference on Computer Vision (ECCV), pages 82–96.

[Kolmogorov and Zabih, 2004] Kolmogorov, V. and Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(2):147–159.

[Kolmogorov et al., 2003] Kolmogorov, V., Zabih, R., and Gortler, S. (2003). Generalized multi-camera scene reconstruction using graph cuts. In Fourth International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR).

[Konolige, 1997] Konolige, K. (1997). Small vision systems: Hardware and implementation. In Proceedings of 8th International Symposium on Robotic Research, pages 203–212.

[Krishnan et al., 2002] Krishnan, S., Mustafa, N., and Venkatasubramanian, S. (2002). Hardware-assisted computation of depth contours. In 13th ACM-SIAM Symposium on Discrete Algorithms.

[Krüger and Westermann, 2003] Krüger, J. and Westermann, R. (2003). Linear algebra operators for GPU implementation of numerical algorithms. In Proceedings of SIGGRAPH 2003, pages 908–916.

[Kutulakos and Seitz, 2000] Kutulakos, K. and Seitz, S. (2000). A theory of shape by space carving. Int. Journal of Computer Vision, 38(3):198–216.

[Labatut et al., 2006] Labatut, P., Keriven, R., and Pons, J.-P. (2006). A GPU implementation of level set multiview stereo. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Lanczos, 1986] Lanczos, C. (1986). The Variational Principles of Mechanics. Dover Publications, fourth edition.

[Laurentini, 1995] Laurentini, A. (1995). How far 3D shapes can be understood from 2D silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 17(2).



[Lefohn et al., 2003] Lefohn, A., Kniss, J. M., Hansen, C. D., and Whitaker, R. T. (2003). Interactive deformation and visualization of level set surfaces using graphics hardware. In Proceedings of IEEE Visualization 2003, pages 75–82.

[Lei et al., 2006] Lei, C., Selzer, J., and Yang, Y. (2006). Region-tree based stereo using dynamic programming optimization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2378–2385.

[Lévy et al., 2002] Lévy, B., Petitjean, S., Ray, N., and Maillot, J. (2002). Least squares conformal maps for automatic texture atlas generation. In Proceedings of SIGGRAPH 2002, pages 362–371.

[Li et al., 2003] Li, M., Magnor, M., and Seidel, H.-P. (2003). Hardware-accelerated visual hull reconstruction and rendering. In Proceedings of Graphics Interface 2003.

[Li et al., 2004] Li, M., Magnor, M., and Seidel, H.-P. (2004). Hardware-accelerated rendering of photo hulls. In Proceedings of Eurographics 2004, pages 635–642.

[Li et al., 2002] Li, M., Schirmacher, H., Magnor, M., and Seidel, H.-P. (2002). Combining stereo and visual hull information for on-line reconstruction and rendering of dynamic scenes. In Proceedings of IEEE 2002 Workshop on Multimedia and Signal Processing, pages 9–12.

[Lindholm et al., 2001] Lindholm, E., Kilgard, M. J., and Moreton, H. (2001). A user-programmable vertex engine. In Proceedings of SIGGRAPH 2001, pages 149–158.

[Lok, 2001] Lok, B. (2001). Online model reconstruction for interactive virtual environments. In Symposium on Interactive 3D Graphics, pages 69–72.

[Lorensen and Cline, 1987] Lorensen, W. and Cline, H. (1987). Marching Cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH '87, pages 163–170.

[Lourakis and Argyros, 2004] Lourakis, M. and Argyros, A. (2004). The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Technical Report 340, Institute of Computer Science - FORTH. Available from http://www.ics.forth.gr/~lourakis/sba.

[Lowe, 1999] Lowe, D. (1999). Object recognition from local scale-invariant features. Proc. of the International Conference on Computer Vision (ICCV), pages 1150–1157.

[Lu et al., 2002] Lu, A., Taylor, J., Hartner, M., Ebert, D., and Hansen, C. (2002). Hardware accelerated interactive stipple drawing of polygonal objects. In Proc. of Vision, Modeling and Visualization 2002, pages 61–68.



[Mairal and Keriven, 2006] Mairal, J. and Keriven, R. (2006). A GPU implementation of variational stereo. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Mark et al., 2003] Mark, W., Glanville, R., Akeley, K., and Kilgard, M. (2003). Cg: A system for programming graphics hardware in a C-like language. In Proceedings of SIGGRAPH 2003, pages 896–907.

[Matas et al., 2002] Matas, J., Chum, O., Urban, M., and Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the 13th British Machine Vision Conference, pages 384–393.

[Matusik et al., 2001] Matusik, W., Buehler, C., and McMillan, L. (2001). Polyhedral visual hulls for real-time rendering. In Proceedings of 12th Eurographics Workshop on Rendering, pages 115–125.

[Mayer et al., 2001] Mayer, H., Bornik, A., Bauer, J., Karner, K., and Leberl, F. (2001). Multiresolution texture for photorealistic rendering. In Proceedings of the Spring Conference on Computer Graphics (SCCG).

[Mendonça and Cipolla, 1999] Mendonça, P. R. S. and Cipolla, R. (1999). A simple technique for self-calibration. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1500–1506.

[Mikolajczyk and Schmid, 2004] Mikolajczyk, K. and Schmid, C. (2004). Scale and affine invariant interest point detectors. Int. Journal of Computer Vision, 60(1):63–86.

[Mitchell, 2002] Mitchell, J. L. (2002). Hardware shading on the Radeon 9700. ATI Technologies.

[Mitchell et al., 2002] Mitchell, J. L., Brennan, C., and Card, D. (2002). Real-time image space outlining for non-photorealistic rendering. In SIGGRAPH 2002. Technical Sketch.

[Moreland and Angel, 2003] Moreland, K. and Angel, E. (2003). The FFT on a GPU. In Eurographics/SIGGRAPH Workshop on Graphics Hardware 2003, pages 112–119.

[Mühlmann et al., 2002] Mühlmann, K., Maier, D., Hesser, J., and Männer, R. (2002). Calculating dense disparity maps from color stereo images, an efficient implementation. IJCV, 47:79–88.

[Mulligan et al., 2002] Mulligan, J., Isler, V., and Daniilidis, K. (2002). Trinocular stereo: a new algorithm and its evaluation. International Journal of Computer Vision, 47:51–61.

[Nagel and Enkelmann, 1986] Nagel, H.-H. and Enkelmann, W. (1986). An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 8:565–593.

[Nistér, 2001] Nistér, D. (2001). Calibration with robust use of cheirality by quasi-affine reconstruction of the set of camera projection centres. In Int. Conference on Computer Vision (ICCV), pages 116–123.

[Nistér, 2004a] Nistér, D. (2004a). An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):756–770.

[Nistér, 2004b] Nistér, D. (2004b). Untwisting a projective reconstruction. Int. Journal of Computer Vision, 60(2):165–183.

[Nistér et al., 2004] Nistér, D., Naroditsky, O., and Bergen, J. (2004). Visual odometry. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–659.

[NVidia Corporation, 2002a] NVidia Corporation (2002a). Cg language specification.

[NVidia Corporation, 2002b] NVidia Corporation (2002b). Developer relations. http://developer.nvidia.com.

[Ohta and Kanade, 1985] Ohta, Y. and Kanade, T. (1985). Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7:139–154.

[Papenberg et al., 2005] Papenberg, N., Bruhn, A., Brox, T., Didas, S., and Weickert, J. (2005). Highly accurate optic flow computation with theoretically justified warping. Technical report, Department of Mathematics, Saarland University.

[Peercy et al., 2006] Peercy, M., Segal, M., and Gerstmann, D. (2006). A performance-oriented data parallel virtual machine for GPUs. In ACM SIGGRAPH Sketches.

[Peercy et al., 2000] Peercy, M. S., Olano, M., Airey, J., and Ungar, P. J. (2000). Interactive multi-pass programmable shading. In Proceedings of SIGGRAPH 2000, pages 425–432.

[Perona and Malik, 1990] Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 12(7):629–639.

[Point Grey Research Inc., 2005] Point Grey Research Inc. (2005). http://www.ptgrey.com.



[Pollefeys et al., 1999] Pollefeys, M., Koch, R., and Van Gool, L. (1999). Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. Int. Journal of Computer Vision, 32(1):7–25.

[Pons et al., 2005] Pons, J.-P., Keriven, R., and Faugeras, O. (2005). Modelling dynamic scenes by registering multi-view image sequences. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 822–827.

[Prock and Dyer, 1998] Prock, A. and Dyer, C. (1998). Towards real-time voxel coloring. In Proc. Image Understanding Workshop, pages 315–321.

[Proudfoot et al., 2001] Proudfoot, K., Mark, W., Tzvetkov, S., and Hanrahan, P. (2001). A real-time procedural shading system for programmable graphics hardware. In Proceedings of SIGGRAPH 2001, pages 159–170.

[Rodrigues and Ramires Fernandes, 2004] Rodrigues, R. and Ramires Fernandes, A. (2004). Accelerated epipolar geometry computation for 3D reconstruction using projective texturing. In Proceedings of Spring Conference on Computer Graphics 2004, pages 208–214.

[Rudin et al., 1992] Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268.

[Sainz et al., 2002] Sainz, M., Bagherzadeh, N., and Susin, A. (2002). Hardware accelerated voxel carving. In 1st Ibero-American Symposium in Computer Graphics (SIACG 2002), pages 289–297.

[Scharstein and Szeliski, 2002] Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. Journal of Computer Vision, 47(1–3):7–42.

[Schmidegg, 2005] Schmidegg, H. (2005). Texturing 3D models from historical images. Master's thesis, Graz University of Technology.

[Seitz et al., 2006] Seitz, S., Curless, B., Diebel, J., Scharstein, D., and Szeliski, R. (2006). A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

[Seitz and Dyer, 1997] Seitz, S. and Dyer, C. (1997). Photorealistic scene reconstruction by voxel coloring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1067–1073.

[Seitz and Dyer, 1999] Seitz, S. and Dyer, C. (1999). Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173.



[Seitz and Kutulakos, 2002] Seitz, S. and Kutulakos, K. (2002). Plenoptic image editing. Int. Journal of Computer Vision, 48(2):115–129.

[Shen, 2006] Shen, J. (2006). A stochastic-variational model for soft Mumford-Shah segmentation. International Journal on Biomedical Imaging, 2006:1–14.

[Sinha et al., 2006] Sinha, S. N., Frahm, J.-M., Pollefeys, M., and Genc, Y. (2006). GPU-based video feature tracking and matching. Technical Report 06-012, Department of Computer Science, UNC Chapel Hill.

[Slabaugh et al., 2001] Slabaugh, G., Culbertson, W. B., and Malzbender, T. (2001). A survey of methods for volumetric scene reconstruction from photographs. In Int. Workshop on Volume Graphics, pages 81–100.

[Slabaugh et al., 2002] Slabaugh, G., Schafer, R., and Hans, M. (2002). Image-based photo hulls. In The 1st International Symposium on 3D Processing, Visualization, and Transmission (3DPVT).

[Slesareva et al., 2005] Slesareva, N., Bruhn, A., and Weickert, J. (2005). Optic flow goes stereo: A variational method for estimating discontinuity-preserving dense disparity maps. In Proc. 27th DAGM Symposium, pages 33–40.

[Sormann et al., 2005] Sormann, M., Zach, C., Bauer, J., Karner, K., and Bischof, H. (2005). Automatic foreground propagation in image sequences for 3D reconstruction. In Proc. 27th DAGM Symposium, pages 93–100.

[Sormann et al., 2003] Sormann, M., Zach, C., and Karner, K. (2003). Texture mapping for view-dependent rendering. In Proceedings of Spring Conference on Computer Graphics 2003, pages 146–155.

[Sormann et al., 2006] Sormann, M., Zach, C., and Karner, K. (2006). Graph cut based multiple view segmentation for 3D reconstruction. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Stegmaier et al., 2005] Stegmaier, S., Strengert, M., Klein, T., and Ertl, T. (2005). A simple and flexible volume rendering framework for graphics-hardware-based raycasting. In Proceedings of Volume Graphics, pages 187–195.

[Stevens et al., 2002] Stevens, M. R., Culbertson, W. B., and Malzbender, T. (2002). A histogram-based color consistency test for voxel coloring. In Intl. Conference on Pattern Recognition, pages 118–121.

[Strecha et al., 2003] Strecha, C., Tuytelaars, T., and Van Gool, L. (2003). Dense matching of multiple wide-baseline views. In Int. Conference on Computer Vision (ICCV), pages 1194–1201.



[Strecha and Van Gool, 2002] Strecha, C. and Van Gool, L. (2002). PDE-based multi-view depth estimation. In 1st International Symposium on 3D Data Processing, Visualization and Transmission, pages 416–425.

[Sugita et al., 2003] Sugita, K., Naemura, T., and Harashima, H. (2003). Performance evaluation of programmable graphics hardware for image filtering and stereo matching. In Proceedings of ACM Symposium on Virtual Reality Software and Technology 2003.

[Sun et al., 2005] Sun, J., Li, Y., Kang, S., and Shum, H.-Y. (2005). Symmetric stereo matching for occlusion handling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 399–406.

[Sun et al., 2003] Sun, J., Shum, H.-Y., and Zheng, N. N. (2003). Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(7):787–800.

[Tappen and Freeman, 2003] Tappen, M. F. and Freeman, W. T. (2003). Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In Int. Conference on Computer Vision (ICCV), pages 900–907.

[Tarditi et al., 2005] Tarditi, D., Puri, S., and Oglesby, J. (2005). Accelerator: simplified programming of graphics processing units for general-purpose uses via data-parallelism. Technical Report MSR-TR-2005-184, Microsoft Research.

[Tell and Carlsson, 2000] Tell, D. and Carlsson, S. (2000). Wide baseline point matching using affine invariants computed from intensity profiles. In European Conference on Computer Vision (ECCV), pages 814–828.

[Thompson et al., 2002] Thompson, C. J., Hahn, S., and Oskin, M. (2002). Using modern graphics architectures for general-purpose computing: A framework and analysis. In 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35).

[Tran and Davis, 2006] Tran, S. and Davis, L. (2006). 3D surface reconstruction using graph cuts with surface constraints. In European Conference on Computer Vision (ECCV), pages 219–231.

[Tsai and Lin, 2003] Tsai, D.-M. and Lin, C.-T. (2003). Fast normalized cross correlation for defect detection. Pattern Recognition Letters, 24(15):2625–2631.

[Turk and Levoy, 1994] Turk, G. and Levoy, M. (1994). Zippered polygon meshes from range images. In Proceedings of SIGGRAPH '94, pages 311–318.

[Veksler, 2003] Veksler, O. (2003). Fast variable window for stereo correspondence using integral images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 556–561.



[Vogiatzis et al., 2005] Vogiatzis, G., Torr, P., and Cipolla, R. (2005). Multi-view stereo via volumetric graph-cuts. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages II: 391–398.

[Wang et al., 2006] Wang, L., Liao, M., Gong, M., Yang, R., and Nistér, D. (2006). High quality real-time stereo using adaptive cost aggregation and dynamic programming. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Weickert and Brox, 2002] Weickert, J. and Brox, T. (2002). Diffusion and regularization of vector- and matrix-valued images. Inverse Problems, Image Analysis and Medical Imaging. Contemporary Mathematics, 313:251–268.

[Weickert et al., 2004] Weickert, J., Bruhn, A., Papenberg, N., and Brox, T. (2004). Variational optic flow computation: From continuous models to algorithms. In International Workshop on Computer Vision and Image Analysis, pages 1–6.

[Weiskopf et al., 2002] Weiskopf, D., Erlebacher, G., Hopf, M., and Ertl, T. (2002). Hardware-accelerated Lagrangian-Eulerian texture advection for 2D flow. In Proc. of Vision, Modeling and Visualization 2002, pages 77–84.

[Weiss and Freeman, 2001] Weiss, Y. and Freeman, W. T. (2001). On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):723–735.

[Westin et al., 2000] Westin, C.-F., Lorigo, L. M., Faugeras, O. D., Grimson, W. E. L., Dawson, S., Norbash, A., and Kikinis, R. (2000). Segmentation by adaptive geodesic active contours. In Proceedings of MICCAI 2000, Third International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 266–275.

[Wheeler et al., 1998] Wheeler, M., Sato, Y., and Ikeuchi, K. (1998). Consensus surfaces for modeling 3D objects from multiple range images. In Proceedings of ICCV '98, pages 917–924.

[Woetzel and Koch, 2004] Woetzel, J. and Koch, R. (2004). Real-time multi-stereo depth estimation on GPU with approximative discontinuity handling. In 1st European Conference on Visual Media Production (CVMP 2004), pages 245–254.

[Wonka et al., 2000] Wonka, P., Wimmer, M., and Schmalstieg, D. (2000). Visibility preprocessing with occluder fusion for urban walkthroughs. In Rendering Techniques 2000 (Proceedings of the Eurographics Workshop 2000), pages 71–82.

[Woodfill and Herzen, 1997] Woodfill, J. and Herzen, B. V. (1997). Real-time stereo vision on the PARTS reconfigurable computer. In IEEE Symposium on FPGAs for Custom Computing Machines.



[Yang et al., 2006] Yang, Q., Wang, L., and Yang, R. (2006). Real-time global stereo matching using hierarchical belief propagation. In Proceedings of the 17th British Machine Vision Conference.

[Yang and Pollefeys, 2003] Yang, R. and Pollefeys, M. (2003). Multi-resolution real-time stereo on commodity graphics hardware. In Conference on Computer Vision and Pattern Recognition (CVPR).

[Yang et al., 2004] Yang, R., Pollefeys, M., and Li, S. (2004). Improved real-time stereo on commodity graphics hardware. In CVPR 2004 Workshop on Real-Time 3D Sensors and Their Use.

[Yang et al., 2003] Yang, R., Pollefeys, M., and Welch, G. (2003). Dealing with textureless regions and specular highlights – a progressive space carving scheme using a novel photo-consistency measure. In Int. Conference on Computer Vision (ICCV), pages 576–584.

[Yang et al., 2002] Yang, R., Welch, G., and Bishop, G. (2002). Real-time consensus based scene reconstruction using commodity graphics hardware. In Proceedings of Pacific Graphics, pages 225–234.

[Yezzi and Soatto, 2003] Yezzi, A. and Soatto, S. (2003). Stereoscopic segmentation. Int. Journal of Computer Vision, 53(1):31–43.

[Zach and Bauer, 2002] Zach, C. and Bauer, J. (2002). Automatic texture hierarchy generation from orthographic facade textures. In 26th Workshop of the Austrian Association for Pattern Recognition (AAPR) 2002.

[Zach et al., 2004a] Zach, C., Grabner, M., and Karner, K. (2004a). Improved compression of topology for view-dependent rendering. In Proceedings of Spring Conference on Computer Graphics 2004, pages 174–182.

[Zach and Karner, 2003a] Zach, C. and Karner, K. (2003a). Fast event-driven refinement of dynamic levels of detail. In Proceedings of Spring Conference on Computer Graphics 2003, pages 65–72.

[Zach and Karner, 2003b] Zach, C. and Karner, K. (2003b). Progressive compression of visibility data for view-dependent multiresolution meshes. Journal of WSCG, 11(3):546–553.

[Zach et al., 2003a] Zach, C., Klaus, A., Hadwiger, M., and Karner, K. (2003a). Accurate dense stereo reconstruction using graphics hardware. In Proc. Eurographics 2003, Short Presentations.

[Zach et al., 2003b] Zach, C., Klaus, A., Reitinger, B., and Karner, K. (2003b). Optimized stereo reconstruction using 3D graphics hardware. In Workshop of Vision, Modelling, and Visualization (VMV 2003), pages 119–126.



[Zach et al., 2002] Zach, C., Mantler, S., and Karner, K. (2002). Time-critical rendering of discrete and continuous levels of detail. In Proceedings of ACM Symposium on Virtual Reality Software and Technology 2002, pages 1–8.

[Zach et al., 2004b] Zach, C., Mantler, S., and Karner, K. (2004b). Time-critical rendering of huge ecosystems using discrete and continuous levels of detail. Presence: Teleoperators and Virtual Environments.

[Zach et al., 2006a] Zach, C., Sormann, M., and Karner, K. (2006a). High-performance multi-view reconstruction. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Zach et al., 2006b] Zach, C., Sormann, M., and Karner, K. (2006b). Scanline optimization for stereo on graphics hardware. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

[Zebedin, 2005] Zebedin, L. (2005). Texturing complex 3D models. Master's thesis, Technical University Graz.
