Towards Wiki-based Dense City Modeling

Arnold Irschara
Institute for Computer Graphics and Vision, TU Graz
irschara@icg.tugraz.at

Horst Bischof
Institute for Computer Graphics and Vision, TU Graz
bischof@icg.tugraz.at

Christopher Zach
VRVis Research Center
zach@vrvis.at

Abstract
This work reports on the advances and on the current status of a terrestrial city modeling approach which uses images contributed by end-users as input. Hence, the Wiki principle well known from textual knowledge databases is transferred to the goal of incrementally building a virtual representation of the occupied habitat. In order to achieve this objective, many state-of-the-art computer vision methods must be applied and adapted to this task. We describe the utilized 3D vision methods and show initial results obtained from the current image database acquired by in-house participants.
1. Introduction

Recently, 3D vision methods have demonstrated increased robustness and yield high-quality models; hence, multi-view modeling appears more often in industry projects targeted at large-scale modeling of environments. In particular, efforts like Google Earth and Virtual Earth aim at the systematic creation of virtual models from aerial images. In this work we focus on the uncoordinated generation of digital copies of urban habitats from community-supplied images. Hence, more 3D vision methods will become known to a larger audience. Panorama creation tools like Autostitch [4] are already well established with the public. Now, even end-user applications for the more sensitive structure-from-motion determination exist, at least as technology previews.
Currently, the Photo Tourism software [19] (and the related PhotoSynth application) is the best-known application for automatic structure and motion computation from a large set of images. A collection of supplied images is analyzed and correspondences are established, from which a relevant subset of views and the respective 3D structure are determined. Photo Tourism does not explicitly incorporate calibrated cameras, but relies partially on the focal length specification found in the image meta-data to obtain the initial metric structure. Images with incorrect or missing meta-data can be registered by DLT pose estimation.
In contrast to Photo Tourism, we employ calibrated cameras, thereby substantially easing the structure and motion computation. The calibration procedure proposed in our framework can easily be performed by end-users. One additional advantage of using calibrated cameras is the higher accuracy of the computed poses, which enables the subsequent generation of dense geometry using multi-view stereo techniques. In the first instance, we aim at textured dense models of quality similar to the results presented in [1]. These raw models need to be post-processed in subsequent steps to allow an efficient Internet-based visualization.
The majority of 3D modeling approaches is intended for decentralized use on personal computers. Vergauwen and Van Gool [23] present a Web-based interface to their 3D modeling engine, again working with uncalibrated cameras [14]. Registered users can upload their images and subsequently receive the resulting depth maps. Their proposed system is targeted at reconstructing individual sites, but is not aimed at building and maintaining a global image and 3D model database.
The goal of our work is the creation of a 3D virtual model representing an urban environment from captured digital images. Instead of using publicly available photo collections (as done in [19]), we rely on images submitted by interested users, since the participants additionally need to provide the camera calibration data. By adding more images, the virtual representation of an urban habitat can be incrementally maintained and gradually refined. Hence, we apply the well-known and effective Wiki principle to the objective of creating a photorealistic 3D city model. The following sections describe the utilized methods and the results available so far in more detail.
Figure 1. 96 calibration markers arranged in a 4 by 4 layout on the floor.
2. Camera Calibration

In order to avoid distinguishing between scenes with and without dominant planar structures (e.g. [15]), we require the cameras to be calibrated. Additionally, our experience with self-calibration techniques indicates that the accuracy of structure and motion computation is expected to be higher with a calibrated setup. In particular, accurate pose is essential to obtain a faithful reconstruction using dense depth techniques.
To ease the calibration effort for the end-user, we employ a procedure aiming for the accuracy of target calibration techniques without the need for a precise calibration pattern. The approach is based on simple printed markers imaged in several views. Using specific markers makes it possible to establish robust and correct correspondences between the views. Image-based city modeling generally requires an infinite-focus and wide-angle setup, hence the calibration pattern needs to be sufficiently large. Thus, the marker patterns are printed on several sheets of paper and are typically arranged on the floor (see Figures 1(a) and (b)). These pages can be laid out arbitrarily, hence the well-known method of Zhang [27] is not applicable. It is not necessary to have all markers visible in the captured images, but for good calibration results most markers should be visible and well distributed in the images.
The first step in the calibration procedure is the detection of the circular markers in the images and the extraction of the unique marker ID. The 2D feature point associated with a marker is initially the center of the extracted ellipse, which only approximates the true marker center. Additionally, the 2D feature position can be refined using the central checkerboard pattern. In this case a non-linear search for the correct center is performed by aligning a synthetic checkerboard pattern with a suitable section of the marker image.
Matching feature points across multiple views is trivial, since unique and easily extractable IDs are available. Of course, the uniqueness of extracted markers in every image needs to be checked to avoid incorrect detections in case of blurred or otherwise low-quality images.
Since the marker images are laid out on a planar surface, corresponding feature points are related by a homography.
Hence, the first estimation of the lens distortion parameters attempts to minimize the reprojection error between extracted feature points with free homography and lens distortion parameters [13]. More formally, if $x_i^k$ denotes the position of marker $k$ in the $i$-th image, the initial distortion estimation determines

$$\arg\min_{H_{ij},\,\theta} \sum_{i,j} \left\| \tilde{D}(x_j^k, \theta) - \tilde{D}(H_{ij}\, x_i^k, \theta) \right\|^2, \qquad (1)$$

where $H_{ij}$ denotes the image homography from view $i$ to view $j$ and $\tilde{D}(x, \theta)$ is the inverse distortion function with coefficients $\theta$. The distortion model is

$$\tilde{D}(x, \theta) = \left(x - (u_0, v_0)^T\right) \cdot \left(1 + k_1 r^2 + k_2 r^4\right), \qquad (2)$$

with $r = \|x - (u_0, v_0)^T\|$. $\theta$ is the vector $(u_0, v_0, k_1, k_2)$ consisting of the distortion center $(u_0, v_0)$ and the coefficients $k_1$ and $k_2$.
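As an illustration, the function $\tilde{D}$ of Eq. (2) can be written in a few lines; a minimal NumPy sketch (the function name and parameter layout are our own, not taken from the paper's implementation):

```python
import numpy as np

def undistort(x, theta):
    """Inverse distortion function D~(x, theta) of Eq. (2).

    x     -- 2D image point as array [u, v]
    theta -- parameter vector (u0, v0, k1, k2): distortion center
             and radial coefficients
    Returns the corrected point relative to the distortion center,
    exactly as the formula is stated.
    """
    u0, v0, k1, k2 = theta
    d = x - np.array([u0, v0])     # offset from the distortion center
    r2 = d @ d                     # r^2 = ||x - (u0, v0)||^2
    return d * (1.0 + k1 * r2 + k2 * r2 ** 2)
```

With all coefficients zero the function reduces to the identity (up to the recentering), which is a convenient sanity check.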
The center of radial distortion $(u_0, v_0)$ is independent of the optical principal point, thus essentially removing the need for decentering distortion parameters [22]. The initial homographies are set to the gold standard results, and the distortion parameters are initialized with the image center and 0 for the coefficients $k_1$ and $k_2$, respectively. The non-linear minimization is performed with a (sparse) Levenberg-Marquardt method. Note that the homographies are not independent: a consistent set of inter-image homographies should satisfy $H_{ij} = H_{lj} H_{il}$ for all $l$. This can be enforced in our implementation by using a minimal parametrization solely based on homographies between adjacent views, $H_{i,i+1}$, and representing $H_{ij} = \prod_{j > l \ge i} H_{l,l+1}$.
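The minimal parametrization above can be sketched as follows (a NumPy illustration under our own naming; `H_adj[l]` stands for the adjacent-view homography $H_{l,l+1}$):

```python
import numpy as np

def chained_homography(H_adj, i, j):
    """Compose H_ij from adjacent-view homographies H_adj[l] = H_{l,l+1}.

    The consistency constraint H_ij = H_lj H_il holds by construction,
    since every H_ij is a product of the minimal parameters H_{l,l+1}.
    Assumes i <= j; H_ii is the identity.
    """
    H = np.eye(3)
    for l in range(i, j):
        H = H_adj[l] @ H   # left-multiply: H_ij = H_{j-1,j} ... H_{i,i+1}
    return H
```

Composing from the left matches the convention that $H_{i,i+1}$ maps points of view $i$ into view $i+1$.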
After determining the initial estimate for the lens distortion, the focal length of the camera is estimated from the set of homographies. Both [21] and [10] employ a non-linear minimization technique for intrinsic parameter estimation, and an initial estimate is required. We utilize a much simpler search technique to quickly determine the camera intrinsics: at first, we assume that the principal point is close to the image center and that the aspect ratio and skew are one and zero, respectively. Hence, we search for a constant but unknown focal length $f$ determining the calibration matrix $K$. If the correct intrinsic matrix $K$ is known, the image-based homographies $H_{ij}$ can be updated to homographies between metric image planes, $\tilde{H}_{ij} = K^{-1} H_{ij} K$. For a particular view $i$ assumed to be in canonical pose, $\tilde{H}_{ij}$ can be decomposed as $\tilde{H}_{ij} = R_{ij} - t_{ij} n_i^T / d_i$, where $(R_{ij}, t_{ij})$ denotes the relative pose, and $n_i$ and $d_i$ denote the plane normal and distance (with respect to the coordinate frame of view $i$), respectively. Note that each $\tilde{H}_{ij}$ provides its own estimate of $n_i = n_i(\tilde{H}_{ij})$.
For the true calibration matrix $K$, the extracted normals $n_i(\tilde{H}_{ij})$ should coincide in one common estimate of the plane normal. Hence, a low variance of the set $\{n_i\}$ indicates approximately correct calibration parameters. A slight complication is induced by the fact that decomposing $\tilde{H}_{ij}$ results in two possible relative poses and plane parameters (denoted by $n_i^+$ and $n_i^-$). Let $(n_0^+, n_0^-)$ be the most separated pair of normals from all pairs $(n_i^+(\tilde{H}_{ij}), n_i^-(\tilde{H}_{ij}))$. We use $n_0^+$ and $n_0^-$ as the estimates for the mean of the set $\{n_i\}$. Now, the score for $K$ is the minimum of

$$\sum_{i,j} \min\left( \angle(n_i^+(\tilde{H}_{ij}), n_0^+),\; \angle(n_i^-(\tilde{H}_{ij}), n_0^+) \right) \qquad (3)$$

and

$$\sum_{i,j} \min\left( \angle(n_i^+(\tilde{H}_{ij}), n_0^-),\; \angle(n_i^-(\tilde{H}_{ij}), n_0^-) \right). \qquad (4)$$

This score is evaluated for potential choices of $f$, e.g. $f \in [0.3, 3]$ in terms of normalized pixel coordinates. The value of $f$ with the lowest score is used as the initial estimate for the focal length. This procedure is both simple and very fast, and yields sufficiently accurate focal lengths in our experiments.
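Assuming the normal pairs $(n_i^+, n_i^-)$ have already been extracted from the homography decompositions, the score of Eqs. (3)/(4) for one candidate $K$ might be evaluated as follows (our own sketch, not the paper's code):

```python
import numpy as np

def angle(a, b):
    """Angle between two unit vectors, clipped for numerical safety."""
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

def calibration_score(normal_pairs):
    """Score of Eqs. (3)/(4) for one candidate calibration matrix K.

    normal_pairs -- list of (n_plus, n_minus) unit normals obtained by
    decomposing each metric homography ~H_ij. A low score means the
    normals agree, i.e. the candidate focal length is about right.
    """
    # reference pair: the decomposition whose two normals are most separated
    n0p, n0m = max(normal_pairs, key=lambda p: angle(p[0], p[1]))
    s_plus = sum(min(angle(p, n0p), angle(m, n0p)) for p, m in normal_pairs)
    s_minus = sum(min(angle(p, n0m), angle(m, n0m)) for p, m in normal_pairs)
    return min(s_plus, s_minus)
```

The grid search over $f$ then simply evaluates this score per candidate focal length and keeps the minimizer.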
With the (approximate) knowledge of the focal length, an initial metric reconstruction based on two appropriate views is generated. The remaining views are added by estimating their absolute poses. The final bundle adjustment procedure optimizes for the parameters of the forward distortion function, hence the inverse of the originally obtained distortion parameters is required. Since the employed polynomial distortion model is not closed under function inversion, the initial forward distortion parameters are determined by a least-squares approach. A final bundle adjustment procedure is applied to refine the camera intrinsics and distortion parameters and to improve the only approximately planar 3D structure and the camera poses.
3. Structure and Motion Computation

3.1. Preprocessing

The first step after image uploading is resampling the image according to the obtained lens distortion. In order to avoid frequent recomputation or retrieval of the corresponding lookup table, this procedure is run as a daemon process caching the several least recently used distortion lookup tables. Afterwards, feature points and their respective descriptors are extracted. The current implementation uses SIFT features [9] because of their reported success in the vision community.
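The caching behaviour of such a daemon can be imitated with a least-recently-used cache; a minimal sketch (the cache key and table layout are illustrative assumptions; the radial model shown is the one from Section 2):

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=8)   # keep only the most recently used tables
def undistortion_lut(width, height, u0, v0, k1, k2):
    """Per-camera undistortion lookup table, cached LRU-style.

    Maps every target pixel to a source position in the distorted
    image using the radial model of Section 2. Returns an array of
    shape (2, height, width): source u and v coordinates.
    """
    u, v = np.meshgrid(np.arange(width, dtype=float),
                       np.arange(height, dtype=float))
    du, dv = u - u0, v - v0
    r2 = du ** 2 + dv ** 2
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2
    return np.stack([u0 + du * scale, v0 + dv * scale])
```

Repeated uploads from the same camera then reuse the cached table; only the least recently used entries are evicted, mirroring the daemon described above.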
3.2. Image Matching

Retrieving images similar to a given one is currently a very active research topic (e.g. [16, 12, 7, 17]). We employ a visual vocabulary tree approach similar to [12] to retrieve a set of potentially similar views from a large collection of images. Additional image-related data that limits the candidate images for matching, such as a GPS position, is therefore helpful but not strictly necessary. The vocabulary tree enables us to efficiently match a single image against a database containing thousands or even millions of images. Images found in the database with a high score according to the vocabulary tree are investigated further by a more discriminant matching procedure. Since the score induced by the vocabulary tree may miss relevant images, the set of candidate images used for further matching is augmented with the neighbors of highly ranked images with respect to the already constructed view network.
In our system the vocabulary tree is trained in an unsupervised manner with a subset of 2 × 10^6 SIFT feature vectors randomly taken from 2500 street-side images. The descriptor vectors are then hierarchically quantized into clusters using a k-means algorithm. We set the branch factor to 10 and allow up to 7 tree levels. For each level the k-means algorithm is initialized with different seed clusters, and the result producing the lowest Euclidean distance error is retained. Once the vocabulary tree is trained, searching the visual vocabulary is very efficient and new images can be inserted on-the-fly. Based on the scoring function, a ranking of relevant database images is reported and used for post-verification.
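The hierarchical k-means quantization described above can be sketched as follows (a toy version with a single k-means seed per level; the system described here retries several seeds and keeps the lowest-distortion result):

```python
import numpy as np

def build_vocab_tree(desc, branch=10, depth=7, seed=0, iters=10):
    """Hierarchically quantize descriptor vectors into a vocabulary tree.

    desc   -- (n, d) float array of descriptors
    branch -- branch factor (10 in our setting)
    depth  -- maximum number of tree levels (up to 7 in our setting)
    Returns a nested dict {'centers': (branch, d), 'children': [...]},
    or None for a leaf node (a visual word).
    """
    rng = np.random.default_rng(seed)
    if depth == 0 or len(desc) < branch:
        return None
    # initialize cluster centers from random distinct descriptors
    centers = desc[rng.choice(len(desc), branch, replace=False)]
    for _ in range(iters):              # plain Lloyd iterations
        d2 = ((desc[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for c in range(branch):
            pts = desc[labels == c]
            if len(pts):
                centers[c] = pts.mean(0)
    children = [build_vocab_tree(desc[labels == c], branch, depth - 1,
                                 seed, iters) for c in range(branch)]
    return {'centers': centers, 'children': children}
```

Quantizing a new descriptor then means descending the tree, picking the nearest center at every level.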
In our current setting we rely on an entropy-weighted scoring similar to the tf-idf ("term frequency inverse document frequency") scheme described in [18]. Let $D$ be an image in our database and $t$ be the term in the vocabulary associated to feature $f$ of the current query image $Q$; then our scoring function is

$$\sum_{t \in Q \cap D} \log\left( \frac{N}{n(t)} \right), \qquad (5)$$

where $N$ is the total number of images in the collection and $n(t)$ is the number of images that contain term $t$. In order to guarantee fairness between database images with different numbers of features, the query results are normalized by the self-scoring result. Therefore, if a database image is used as query, a score of 1 is returned. At the same time we get an absolute measure of image-to-image similarity, which enables us to set a global threshold for scoring. The $k$ top-ranked images reported by the vocabulary tree are then taken for post-verification. Typically, we set $k$ to 1% of the current database size.
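Treating each image as a set of visual words, the normalized score of Eq. (5) can be computed as follows (our own sketch; names are illustrative):

```python
import math

def tfidf_score(query_terms, db_terms, doc_freq, n_images):
    """Entropy-weighted scoring of Eq. (5), normalized by the self-score.

    query_terms, db_terms -- sets of visual words (vocabulary tree leaves)
    doc_freq[t]           -- number of database images containing term t
    n_images              -- total number of images N in the collection
    Returns a similarity in [0, 1]; querying a database image against
    itself returns exactly 1.
    """
    def raw(a, b):
        return sum(math.log(n_images / doc_freq[t]) for t in a & b)
    self_score = raw(db_terms, db_terms)
    return raw(query_terms, db_terms) / self_score if self_score else 0.0
```

The normalization by the self-score is what makes a single global acceptance threshold meaningful across images with different feature counts.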
Figure 2. (a) Ten sample images from a collection of 105 facade images. (b) Side view of the camera poses and 9097 triangulated feature points.

Our verification procedure is based on exhaustive matching and a RANSAC-based geometric consistency check. First of all, correspondences are computed by mutual nearest neighbor matching of the 128-dimensional SIFT feature vectors. We adopt the idea of [2] and match only features with the same contrast, by taking advantage of the Laplacian sign. In addition, the epipolar geometry is verified using a RANSAC procedure. If sufficient reliable 3D structure for the image measurements is available, we use pose estimation for verification. The calibrated setting enables us to use the Five-Point algorithm [11] for image-to-image matches and the Three-Point algorithm [6] for 3D-to-2D correspondences as minimal hypothesis generators. Images are accepted as neighbors if a quality criterion is satisfied: we generate a number of N hypotheses in advance and then check whether the number of outliers exceeds the precomputed threshold. Since it has often been observed (e.g. in [20]) that the RANSAC stopping criterion is sometimes optimistic, we use a rather conservative confidence factor p of 1 − 10^{-4} and require a minimal number of 20 inliers. If an image passes the post-verification procedure, the neighbors of this image are also matched with the query image. Note that this strategy is highly efficient for our purpose, since frequently the first image ranked by the vocabulary tree is already a candidate match satisfying the geometry constraints.
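The number of hypotheses to generate in advance follows from the standard RANSAC confidence formula; a small helper (our sketch, using the textbook relation rather than the exact thresholding of the system described here):

```python
import math

def ransac_iterations(p, w, s):
    """Number of RANSAC hypotheses needed to hit at least one
    all-inlier minimal sample with confidence p, given an expected
    inlier ratio w and sample size s (s = 5 for the Five-Point,
    s = 3 for the Three-Point algorithm).
    """
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))
```

A conservative confidence such as p = 1 − 10^{-4} noticeably increases the iteration count compared to the common p = 0.99, which is the price paid for robustness against the optimistic stopping criterion.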
Unlike for the vocabulary tree scoring, where the whole set of SIFT descriptors is utilized, we limit the number of features for exhaustive matching. In general we take only the 1500 best features according to their DoG magnitude response for matching. Therefore, searching for correspondences is sufficiently fast (see Section 5 for detailed timings).
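Mutual nearest-neighbor matching restricted to features of equal contrast can be sketched as follows (a NumPy illustration with our own naming; a production system would use an optimized nearest-neighbor library):

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b, sign_a, sign_b):
    """Mutual nearest-neighbor matching of descriptors, comparing only
    features with the same Laplacian (contrast) sign, following the
    idea adopted from [2].

    desc_a, desc_b -- (n, d) descriptor arrays (at most the 1500
                      strongest features per image in our setting)
    sign_a, sign_b -- per-feature contrast signs (+1 / -1)
    Returns a list of index pairs (i, j).
    """
    # pairwise squared distances, with opposite-sign pairs masked out
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    d2[sign_a[:, None] != sign_b[None, :]] = np.inf
    nn_ab = d2.argmin(axis=1)   # best match in B for each A feature
    nn_ba = d2.argmin(axis=0)   # best match in A for each B feature
    return [(i, int(j)) for i, j in enumerate(nn_ab)
            if np.isfinite(d2[i, j]) and nn_ba[j] == i]
```

The sign mask roughly halves the candidate set per feature, which both speeds up matching and removes a class of false correspondences.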
3.3. Upgrading the View Network

Each time a new image is added to the view network, its neighbors are computed and the reconstruction process is started. By reconstruction we mean structure and motion computation, namely the estimation of the camera orientations and the sparse 3D structure of the matched features.

Since images may be taken from different locations, we do not expect to obtain a single coherent reconstruction, but a forest of multiple reconstructions. We require that a reconstruction consists of at least three images (a view triple) and 20 common triangulated points. In general, four different cases can occur when a new image is processed:
1. The view can be robustly registered with exactly one already present reconstruction.

2. The image can be aligned with multiple reconstructions.

3. The current view cannot be (robustly) aligned with an existing reconstruction, but forms a good view triple with two other already present views.

4. The geometric relation of this image with any of the known views cannot be established, and structure and motion determination is postponed until a new suitable view is inserted.
In the first case the position of the current view can be computed immediately by robust absolute pose estimation, since 2D-to-3D point correspondences are known. Thereafter, the camera parameters are optimized by iterative refinement and new correspondences are triangulated to 3D points.

In the second case, where the current image is part of two or more different reconstructions, the reconstructions are merged. We robustly determine a 3D-to-3D similarity transform for the registration of the corresponding 3D point sets. This process is then followed by Euclidean bundle adjustment, where the sparse geometry and the camera parameters are optimized.

After updating the reconstructions in these first two cases, the epipolar neighbors of the newly inserted view are traversed and checked to see whether the updated 3D structure now allows determination of their poses.
Figure 3. Dense reconstruction from a collection of 105 facade images.

In the third scenario the neighbors of the current image are estimated as described previously and a new reconstruction is initialized from a well-conditioned view triple. The view triple should provide a good triangulation angle and at the same time have many correspondences. Therefore, in the first step we identify the view pair which minimizes

$$\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sin^2(\alpha_i)}, \qquad (6)$$
where $\alpha_i$ is the angle between the two camera rays for the 3D point $X_i$. Thereafter, a third view which minimizes the value in Eq. 6 with respect to this first configuration is estimated. Note that $1/\sin(\alpha_i)$ approximates the uncertainty (deviation) of $X_i$ in the depth direction. In [3] the view pair with maximal mean roundness (essentially the same as $\frac{1}{N} \sum \sin(\alpha_i)$) is taken. Such an approach does not consider the number of correspondences between two views. In this work we assume that the accuracy of further (least-squares) computations depending on the initial structure scales with $1/\sqrt{N}$. Consequently, Eq. 6 estimates the mean variance of the initial structure for a given view pair. The relative pose between the first two views is computed by the Five-Point algorithm, and the third camera is inserted by the Three-Point algorithm with respect to the triangulated 3D points. Thereafter, bundle adjustment is used to globally optimize the exterior camera orientations and the initial structure. A view triple is considered a valid reconstruction if the sum of triangulation angles is above a threshold: we require at least twenty 3D points with measurements in all views and a triangulation angle greater than 2°.
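The selection of the initial view pair via Eq. (6) can be sketched in a few lines (our own naming; angles are assumed to be in radians):

```python
import math

def pair_score(angles):
    """Mean depth variance of Eq. (6) for one candidate view pair:
    (1/N) * sum 1/sin^2(alpha_i) over the N triangulated points.
    Lower is better: many correspondences and wide triangulation
    angles both reduce the score.
    """
    return sum(1.0 / math.sin(a) ** 2 for a in angles) / len(angles)

def best_initial_pair(candidates):
    """Pick the view pair minimizing Eq. (6); `candidates` maps each
    pair (i, j) to its list of triangulation angles in radians."""
    return min(candidates, key=lambda p: pair_score(candidates[p]))
```

Because the sum is divided by the number of points, a pair with many moderately wide-angle correspondences can beat a pair with a few wide ones, which is exactly the trade-off argued for above.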
Whenever a number of M (15 in our case) views has been added to a reconstruction, bundle adjustment is run, optimizing all cameras and triangulated 3D points. Our bundle adjustment implementation is similar to the one described in [8]. Thereafter, the reprojection error is computed for each image measurement; 3D points with an average reprojection error larger than 1.3 pixels and a triangulation angle less than 2° are removed. Our experiments suggest that this strategy improves both the accuracy and the robustness of the reconstruction algorithm. A sparse reconstruction result is shown in Figure 2.
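The point-filtering rule applied after bundle adjustment is simple enough to state directly (a sketch under our own naming; per the text, a point is dropped only when both criteria fail):

```python
def filter_points(points, max_error=1.3, min_angle=2.0):
    """Drop weak 3D points after bundle adjustment: a point is removed
    if its average reprojection error exceeds 1.3 pixels AND its
    triangulation angle is below 2 degrees.

    points -- iterable of (avg_reproj_error_px, triangulation_angle_deg)
    """
    return [p for p in points
            if not (p[0] > max_error and p[1] < min_angle)]
```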
4. Dense Reconstruction

Obtaining the initial 3D structure and motion is an essential aspect of image-based modeling, but dense geometry is required for a faithful virtual representation of the captured scene. Since we face a large number of images with potentially associated depth maps, we focus on simple but fast dense depth estimation procedures. Additionally, processing of multiple sensor images with respect to a key view (in contrast to traditional two-frame stereo) is required. Plane-sweep approaches to dense stereo [24, 5] enable an efficient, GPU-accelerated procedure to create the depth maps. In order to obtain reasonable depth values in homogeneous regions, we employ GPU-based scanline optimization in our framework [26]. Note that dense geometry generation is currently still performed in an offline fashion for individual large and connected view networks.
In the case of general view networks, a suitable selection of the sensor views used for depth estimation is necessary. We use the following simple but effective heuristic to select appropriate sensor images, which is based on an estimate of image overlap and viewing directions: for a particular key view and a potential sensor view, we determine the correspondences between these views and compute the convex hull of the respective 2D measurements in the key view. The area of this convex hull $A_H$ in relation to the total key image area $A$ gives an estimate of the relevant image overlap. Note that $A$ does not necessarily denote the whole key image size, since insignificant image portions like sky regions can be excluded. Sensor view candidates with an overlap $A_H/A$ smaller than a given threshold (typically set to 0.3) are discarded. Image overlap is only a partial guideline for view selection. The angle between corresponding camera rays is another suitable criterion. Very small angles give rise to large depth uncertainties, whereas larger angles are susceptible to image distortion and occlusions. From a practical point of view, triangulation angles of about $\alpha_0 = 6°$ are favorable for dense stereo (meaning a distance-to-baseline ratio of about 10:1). Consequently, view pairs with a large overlap and appropriate triangulation angles are preferable.
Hence, we rank the views with sufficient overlap according to the following score:

$$\frac{A_H}{A}\, \operatorname{median}(\psi(\alpha_i)), \qquad (7)$$

where $\alpha_i$ is the angle between corresponding rays and $\psi(\cdot)$ is a unimodal weighting function with its maximum at $\alpha_0$. We choose $\psi$ as

$$\psi(\alpha) = \alpha^2 e^{-2\alpha/\alpha_0}. \qquad (8)$$

The two views with the highest scores are taken as sensor views.
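The ranking score of Eqs. (7)/(8) is easy to evaluate once the overlap ratio and the ray angles are known; a minimal sketch (our own naming; the convex-hull overlap is assumed to be precomputed):

```python
import math
import statistics

def psi(alpha, alpha0=math.radians(6.0)):
    """Unimodal weight of Eq. (8), psi(a) = a^2 * exp(-2a / a0),
    which peaks at the preferred triangulation angle alpha0."""
    return alpha ** 2 * math.exp(-2.0 * alpha / alpha0)

def sensor_score(overlap, angles, min_overlap=0.3):
    """Ranking score of Eq. (7) for one candidate sensor view.

    overlap -- A_H / A, the convex-hull area of the shared features
               over the (relevant) key-image area
    angles  -- per-correspondence ray angles alpha_i in radians
    Candidates below the overlap threshold are discarded (score 0).
    """
    if overlap < min_overlap:
        return 0.0
    return overlap * statistics.median(psi(a) for a in angles)
```

Using the median rather than the mean makes the score robust to a few outlier correspondences with extreme angles.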
Dense depth estimation depends heavily on the quality of the provided epipolar geometry. In order to reduce the influence of inaccuracies in the estimated poses on the depth maps, and to increase the performance, multi-view stereo is applied on downsampled images (512 × 384 pixels for 4:3 format digital images). Matching cost computation and scanline optimization for view triples (one key view and two sensor views) take about 3 s on a GeForce 7800GS.

The set of depth maps provides 2.5D geometry for each view. These depth maps are subsequently fused into a common 3D model using a robust depth image integration approach [25]. This method results in globally optimal 3D models according to the employed energy functional, but it operates in batch mode. The incorporation of this step is the major reason that dense geometry generation remains an offline process. The result obtained by depth map fusion for the dataset depicted in Figure 2 (augmented with per-vertex coloring) is shown in Figure 3.
5. Results

We tested our system on different image collections, varying from a few hundred to thousands of pictures. All images are calibrated with the method described in Section 2. The calibration precision (i.e. the final mean reprojection error) ranges from 1/20 to 1/7 pixel. Our largest test set is a database containing about 7000 street-side images taken with four compact digital cameras from different manufacturers. The images were captured on different days under varying illumination conditions. Furthermore, the images are only partially ordered. The size of the source images varies from two to seven megapixels. In order to remove compression artefacts, the supplied images are resampled to half resolution for further processing. Adding an image to the view network takes approximately half a minute; most of the time is spent on exhaustive matching and bundle adjustment. Average run-times required for each step are listed in Table 1.
In all our tests we use a pre-trained generic vocabulary tree with about 7 × 10^5 leaf nodes for searching the image database. Our experiments suggest that the vocabulary tree approach to image retrieval is very efficient for large-scale structure and motion computation. In Figure 4 the change of retrieval performance depending on the database size is shown.

Operation                                   time [s]
Undistortion (2272 × 1704 pixels)           1.0
Feature extraction (2272 × 1704 pixels)     1.6
Vocabulary scoring                          0.2
Brute-force NN matching (1500 × 1500)       k × 0.35
RANSAC (3000 samples, Five-Point)           k × 0.4
Structure and motion                        0.5
Bundle adjustment (100 views)               60
Dense stereo                                3.0

Table 1. Typical average timings of our system on an AMD Athlon(tm) 64 X2 Dual Core Processor 4400+. The parameter k specifies the number of images taken for post-verification. Typically, we set k to 1% of the current database size.

Figure 4. Vocabulary tree performance for image retrieval depending on the database size. The y-axis shows the probability of finding an epipolar neighbor in the first k-ranked images reported by the vocabulary tree scoring.
shown. On average an image has an overlap with about<br />
eight other images in the database. However, there are<br />
also images with no geometric relation to existing database<br />
entries. The detailed distribution of epipolar neighbors is<br />
shown in Figure 5. The retrieval per<strong>for</strong>mance is measured<br />
by the rank of the first image that passes the geometric verification<br />
procedure. Note, even <strong>for</strong> a relative large database<br />
of about 7000 images, the first ranked image reported by<br />
the vocabulary tree is a valid epipolar neighbor of the query<br />
image with a probability of more than 90%. What we can<br />
observe also is that the scoring saturates rapidly, there<strong>for</strong>e<br />
employing the 1% from the vocabulary tree ranking is a<br />
good choice between speed and retrieval quality. Finally,<br />
two captured sites and the resulting sparse and dense models<br />
are shown in Figure 6 and Figure 7. The final reprojection errors of the 3D structure and camera poses for these datasets after bundle adjustment are about 1/3 pixel.

Figure 5. Probability density function of the number of verified epipolar neighbors for a query image in a database of 7181 street-side images. On average, a query image overlaps with about eight images in the database.
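The retrieval evaluation described above can be sketched as follows; the data and function names are hypothetical, and only the metric itself (the fraction of queries whose top-k shortlist contains a geometrically verified epipolar neighbor) follows the text:

```python
# Sketch with toy data: probability that a true epipolar neighbor appears
# among the first k images ranked by the vocabulary tree scoring.

def hit_rate_at_k(rankings, neighbors, k):
    """rankings: query id -> ranked list of database image ids.
    neighbors: query id -> set of ids passing geometric verification."""
    hits = sum(1 for q, ranked in rankings.items()
               if any(img in neighbors[q] for img in ranked[:k]))
    return hits / len(rankings)

# Toy example with three queries over a tiny "database".
rankings = {"q1": [3, 7, 1], "q2": [5, 2, 9], "q3": [8, 4, 6]}
neighbors = {"q1": {7}, "q2": {5}, "q3": {0}}  # q3 has no ranked neighbor
print(hit_rate_at_k(rankings, neighbors, 1))  # only q2's top-1 is a neighbor
print(hit_rate_at_k(rankings, neighbors, 2))  # q1 is recovered at rank 2
```

Plotting this hit rate against k for increasing database sizes reproduces the kind of saturation curve that Figure 4 reports.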
6. Future Work

Currently, images are contributed by persons associated with this project who have basic knowledge of 3D computer vision. Future work needs to increase the robustness of the structure-from-motion methods before the general public can participate in the creation of the visual database. In particular, it must be ensured that low-quality or defective images do not degrade the 3D models in the generated database.
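Such a safeguard could take the form of simple acceptance gates on the registration result. The thresholds below are illustrative assumptions, with the error bound loosely informed by the roughly 1/3 pixel reprojection error reported above:

```python
# Hypothetical quality gate (not the paper's method): accept an uploaded
# image only if its registration is supported by enough verified matches
# and a sufficiently small mean reprojection error.

def accept_image(num_inliers, inlier_ratio, reproj_error_px,
                 min_inliers=50, min_ratio=0.3, max_error_px=1.0):
    """Return True if the newly registered image meets all quality gates."""
    return (num_inliers >= min_inliers
            and inlier_ratio >= min_ratio
            and reproj_error_px <= max_error_px)

print(accept_image(120, 0.55, 0.4))  # well-registered image -> True
print(accept_image(12, 0.08, 2.5))   # defective or unrelated image -> False
```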
Another future option is to accept uncalibrated images for the purpose of image localization and geometric alignment. Such images will generally not result in an update of the 3D structure.
Dense geometry generation is currently not integrated into the online processing workflow. Depth estimation can easily be adapted to the online setting, but the employed range image fusion method is entirely an offline procedure. A robust and efficient depth image fusion approach that works incrementally is an ongoing research topic.
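One simple way to make fusion incremental, sketched here as an assumption rather than the method used in this work, is a per-pixel running weighted average that folds each new registered depth map into the model without revisiting earlier maps:

```python
# Minimal sketch of incremental depth fusion: a running weighted average
# per pixel, so each new depth map updates the model in O(pixels) and no
# previously fused map has to be stored or reprocessed.

class IncrementalDepthFusion:
    def __init__(self, width, height):
        self.sum_d = [[0.0] * width for _ in range(height)]
        self.sum_w = [[0.0] * width for _ in range(height)]

    def add(self, depth, weight):
        """Accumulate one registered depth map (None marks missing pixels)."""
        for y, row in enumerate(depth):
            for x, d in enumerate(row):
                if d is not None:
                    self.sum_d[y][x] += weight * d
                    self.sum_w[y][x] += weight

    def fused(self):
        """Current fused depth estimate; None where nothing was observed."""
        return [[sd / sw if sw > 0 else None
                 for sd, sw in zip(drow, wrow)]
                for drow, wrow in zip(self.sum_d, self.sum_w)]

f = IncrementalDepthFusion(2, 1)
f.add([[2.0, None]], weight=1.0)
f.add([[4.0, 3.0]], weight=1.0)
print(f.fused())  # [[3.0, 3.0]]
```

Unlike a global offline optimization, such averaging handles outliers poorly, which is precisely why robust incremental fusion remains open.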
Acknowledgements

This work is partly funded by the Vienna Science and Technology Fund (WWTF) and by the Kplus VRVis research center.
Figure 6. (a) Some sample images and (b) sparse reconstruction from more than 400 viewpoints. (c) Dense reconstruction from a subset of registered images.
Figure 7. (a) Some sample images from a collection of 49 views. (b) Sparse reconstruction and camera orientations. (c) Final dense reconstruction after depth map fusion.