
Towards Wiki-based Dense City Modeling

Arnold Irschara
Institute for Computer Graphics and Vision, TU Graz
irschara@icg.tugraz.at

Horst Bischof
Institute for Computer Graphics and Vision, TU Graz
bischof@icg.tugraz.at

Christopher Zach
VRVis Research Center
zach@vrvis.at

Abstract

This work reports on the advances and current status of a terrestrial city modeling approach that uses images contributed by end-users as input. The Wiki principle, well known from textual knowledge databases, is thus transferred to the goal of incrementally building a virtual representation of an inhabited environment. Achieving this objective requires many state-of-the-art computer vision methods to be applied and adapted to the task. We describe the employed 3D vision methods and show initial results obtained from the current image database acquired by in-house participants.

1. Introduction

3D vision methods have recently demonstrated increased robustness and yield high-quality models; hence, multi-view modeling appears more often in industrial projects targeted at large-scale modeling of environments. In particular, efforts like Google Earth and Virtual Earth aim at the systematic creation of virtual models from aerial images. In this work we focus on the uncoordinated generation of digital copies of urban habitats from community-supplied images. Hence, 3D vision methods will become known to a larger audience. Panorama creation tools like Autostitch [4] are already well established with the general public. Now, even end-user applications for the more sensitive structure-from-motion determination exist, at least as technology previews.

Currently, the Photo Tourism software [19] (and the related PhotoSynth application) is the best-known application for automatic structure and motion computation from a large set of images. A collection of supplied images is analyzed and correspondences are established, from which a relevant subset of views and the respective 3D structure are determined. Photo Tourism does not explicitly incorporate calibrated cameras, but relies partially on the focal length specification found in the image meta-data to obtain the initial metric structure. Images with incorrect or missing meta-data can be registered by DLT pose estimation.

In contrast to Photo Tourism, we employ calibrated cameras, thereby substantially easing the structure and motion computation. The calibration procedure proposed in our framework can easily be performed by end-users. An additional advantage of using calibrated cameras is the higher accuracy of the computed poses, which enables the subsequent generation of dense geometry using multi-view stereo techniques. In the first instance, we aim at textured dense models of a quality similar to the results presented in [1]. Obviously, these raw models need to be post-processed in subsequent steps to allow an efficient Internet-based visualization.

The majority of 3D modeling approaches is intended for decentralized use on personal computers. Vergauwen and Van Gool [23] present a Web-based interface to their 3D modeling engine, again working with uncalibrated cameras [14]. Registered users can upload their images and subsequently receive the resulting depth maps. Their proposed system is targeted at reconstructing individual sites, but it is not aimed at building and maintaining a global image and 3D model database.

The goal of our work is the creation of a 3D virtual model representing an urban environment from captured digital images. Instead of using publicly available photo collections (as done in [19]), we rely on images submitted by interested users, since the participants also need to provide the camera calibration data. By adding more images, the virtual representation of an urban habitat can be incrementally maintained and gradually refined. Hence, we apply the well-known and effective Wiki principle to the objective of creating a photorealistic 3D city model. The following sections describe the utilized methods and the results available so far in more detail.



Figure 1. 96 calibration markers arranged in a 4 by 4 layout on the floor.

2. Camera Calibration

In order to avoid distinguishing between scenes with and without dominant planar structures (e.g. [15]), we require the cameras to be calibrated. Additionally, our experience with self-calibration techniques indicates that the accuracy of structure and motion computation is higher with a calibrated setup. In particular, accurate poses are essential to obtain a faithful reconstruction using dense depth techniques.

To ease the calibration effort for the end-user, we employ a procedure aiming for the accuracy of target-based calibration techniques without the need for a precise calibration pattern. The approach is based on simple printed markers imaged in several views. Using specific markers makes it possible to establish robust and correct correspondences between the views. Image-based city modeling generally requires an infinite-focus, wide-angle setup; hence, the calibration pattern needs to be sufficiently large. The marker patterns are therefore printed on several sheets of paper and are typically arranged on the floor (see Figure 1). Since these pages can be laid out arbitrarily, the well-known method of Zhang [27] is not applicable. It is not necessary to have all markers visible in the captured images, but for good calibration results most markers should be visible and well distributed in the images.

The first step in the calibration procedure is the detection of the circular markers in the images and the extraction of the unique marker ID. The 2D feature point associated with a marker is the center of the extracted ellipse, which only approximates the true marker center. The 2D feature position can additionally be refined using the central checkerboard pattern: in this case, a non-linear search for the correct center is performed by aligning a synthetic checkerboard pattern with a suitable section of the marker image.

Matching feature points across multiple views is trivial, since unique and easily extractable IDs are available. Of course, the uniqueness of the extracted markers in every image needs to be checked to avoid incorrect detections in the case of blurred or otherwise low-quality images.

Since the marker images are laid out on a planar surface, corresponding feature points are related by a homography. Hence, the first estimation of the lens distortion parameters attempts to minimize the reprojection error between extracted feature points with free homography and lens distortion parameters [13]. More formally, if $x_i^k$ denotes the position of marker $k$ in the $i$-th image, the initial distortion estimation determines

$$\arg\min_{H_{ij},\,\theta} \sum_{i,j} \left| \tilde{D}(x_j^k, \theta) - \tilde{D}(H_{ij}\, x_i^k, \theta) \right|^2, \qquad (1)$$

where $H_{ij}$ denotes the image homography from view $i$ to view $j$ and $\tilde{D}(x, \theta)$ is the inverse distortion function with coefficients $\theta$. The distortion model is

$$\tilde{D}(x, \theta) = (x - (u_0, v_0)^T) \cdot (1 + k_1 r^2 + k_2 r^4), \qquad (2)$$

with $r = \|x - (u_0, v_0)^T\|$. $\theta$ is the vector $(u_0, v_0, k_1, k_2)$ consisting of the distortion center $(u_0, v_0)$ and the coefficients $k_1$ and $k_2$.

The center of radial distortion $(u_0, v_0)$ is independent of the optical principal point, thus essentially removing the need for decentering distortion parameters [22]. The initial homographies are set to the gold-standard results, and the distortion parameters are initialized with the image center and 0 for the coefficients $k_1$ and $k_2$, respectively. The non-linear minimization is performed with a (sparse) Levenberg-Marquardt method. Note that the homographies are not independent: a consistent set of inter-image homographies should satisfy $H_{ij} = H_{lj} H_{il}$ for all $l$. This is enforced in our implementation by using a minimal parametrization solely based on homographies between adjacent views, $H_{i,i+1}$, and representing $H_{ij} = \prod_{j > l \ge i} H_{l,l+1}$.
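To make this step concrete, the following sketch sets up the residuals of Eq. (1) under the distortion model of Eq. (2) with the minimal adjacent-view parametrization. It is an illustrative reconstruction, not our actual implementation: the layout of `matches` is hypothetical, and SciPy's generic least-squares solver stands in for a dedicated sparse Levenberg-Marquardt routine.

```python
import numpy as np
from scipy.optimize import least_squares

def undistort(x, theta):
    """Inverse distortion of Eq. (2); x is (N, 2), theta = (u0, v0, k1, k2)."""
    d = x - theta[:2]                       # shift to the distortion center
    r2 = np.sum(d ** 2, axis=1, keepdims=True)
    return d * (1.0 + theta[2] * r2 + theta[3] * r2 ** 2)

def apply_h(H, x):
    """Apply a 3x3 homography to inhomogeneous 2D points of shape (N, 2)."""
    xh = np.hstack([x, np.ones((len(x), 1))]) @ H.T
    return xh[:, :2] / xh[:, 2:3]

def residuals(params, matches, n_views):
    """Eq. (1) residuals; params = [theta | 8 entries per adjacent homography],
    with H[2, 2] fixed to 1 for a minimal parametrization."""
    theta = params[:4]
    Hs = [np.append(params[4 + 8 * l: 12 + 8 * l], 1.0).reshape(3, 3)
          for l in range(n_views - 1)]
    res = []
    for i, j, xi, xj in matches:            # marker positions in views i < j
        H_ij = np.eye(3)
        for l in range(i, j):               # chain H_ij = H_{j-1,j} ... H_{i,i+1}
            H_ij = Hs[l] @ H_ij
        res.append(undistort(xj, theta) - undistort(apply_h(H_ij, xi), theta))
    return np.concatenate(res).ravel()

# theta would start at the image center with zero coefficients; the homography
# entries would start from the gold-standard estimates mentioned above, e.g.:
# result = least_squares(residuals, x0, args=(matches, n_views))
```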

After determining the initial estimate for the lens distortion, the focal length of the camera is estimated from the set of homographies. Both [21] and [10] employ a non-linear minimization technique for intrinsic parameter estimation, and an initial estimate is required. We utilize a much simpler search technique to quickly determine the camera intrinsics: at first, we assume that the principal point is close to the image center and that the aspect ratio and skew are one and zero, respectively. Hence, we search for a constant but unknown focal length $f$ determining the calibration matrix $K$. If the correct intrinsic matrix $K$ is known, the image-based homographies $H_{ij}$ can be updated to homographies between metric image planes, $\tilde{H}_{ij} = K^{-1} H_{ij} K$. For a particular view $i$ assumed to have canonical pose, $\tilde{H}_{ij}$ can be decomposed as $\tilde{H}_{ij} = R_{ij} - t_{ij} n_i^T / d_i$, where $(R_{ij}, t_{ij})$ denotes the relative pose and $n_i$ and $d_i$ denote the plane normal and distance (with respect to the coordinate frame of view $i$), respectively. Note that each $\tilde{H}_{ij}$ provides its own estimate of $n_i = n_i(\tilde{H}_{ij})$.

For the true calibration matrix $K$, the extracted normals $n_i(\tilde{H}_{ij})$ should coincide in one common estimate of the plane normal. Hence, a low variance of the set $\{n_i\}$ indicates approximately correct calibration parameters. A slight complication is induced by the fact that decomposing $\tilde{H}_{ij}$ results in two possible relative poses and plane parameters (denoted by $n_i^+$ and $n_i^-$). Let $(n_0^+, n_0^-)$ be the most separated pair of normals among all pairs $(n_i^+(\tilde{H}_{ij}), n_i^-(\tilde{H}_{ij}))$. We use $n_0^+$ and $n_0^-$ as the estimates for the mean of the set $\{n_i\}$. Now, the score for $K$ is the minimum of

$$\sum_{i,j} \min\left( \angle(n_i^+(\tilde{H}_{ij}), n_0^+),\; \angle(n_i^-(\tilde{H}_{ij}), n_0^+) \right) \qquad (3)$$

and

$$\sum_{i,j} \min\left( \angle(n_i^+(\tilde{H}_{ij}), n_0^-),\; \angle(n_i^-(\tilde{H}_{ij}), n_0^-) \right). \qquad (4)$$

This score is evaluated for potential choices of $f$, e.g. $f \in [0.3, 3]$ in terms of normalized pixel coordinates. The value of $f$ with the lowest score is used as the initial estimate for the focal length. This procedure is both simple and very fast, and yields sufficiently accurate focal lengths in our experiments.
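A compact sketch of this search is given below. As simplifications on our part, it relies on OpenCV's homography decomposition and scores a candidate $f$ by the total angular spread of all recovered plane normals around their mean, rather than tracking the $(n^+, n^-)$ ambiguity of Eqs. (3) and (4) explicitly.

```python
import numpy as np
import cv2

def normal_spread(Hs, f, w, h):
    """Score a candidate focal length: decompose every homography under the
    implied K and measure how tightly the plane normals cluster."""
    K = np.array([[f, 0.0, w / 2.0], [0.0, f, h / 2.0], [0.0, 0.0, 1.0]])
    normals = []
    for H in Hs:                      # homographies w.r.t. one reference view
        _, Rs, ts, ns = cv2.decomposeHomographyMat(H, K)
        normals.extend(n.ravel() / np.linalg.norm(n) for n in ns)
    normals = np.array(normals)
    normals *= np.sign(normals @ normals[0])[:, None]  # crude sign alignment
    mean = normals.mean(axis=0)
    mean /= np.linalg.norm(mean)
    return np.arccos(np.clip(normals @ mean, -1.0, 1.0)).sum()

def search_focal(Hs, w, h, steps=60):
    """Evaluate f in [0.3, 3] x image width (normalized coordinates) and keep
    the candidate with the lowest normal-spread score."""
    candidates = np.linspace(0.3 * w, 3.0 * w, steps)
    return min(candidates, key=lambda f: normal_spread(Hs, f, w, h))
```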

With the (approximate) knowledge of the focal length, an initial metric reconstruction based on two appropriate views is generated. The remaining views are added by estimating their absolute poses. The final bundle adjustment procedure optimizes the parameters of the forward distortion function, hence the inverse of the originally obtained distortion parameters is required. Since the employed polynomial distortion model is not closed under function inversion, the initial forward distortion parameters are determined by a least-squares approach. A final bundle adjustment procedure is applied to refine the camera intrinsics and distortion parameters and to improve the only approximately planar 3D structure and the camera poses.

3. Structure and Motion Computation

3.1. Preprocessing

The first step after image uploading is resampling the image according to the obtained lens distortion. In order to avoid frequent recomputation or retrieval of the corresponding lookup table, this procedure runs as a daemon process that caches several distortion lookup tables with a least-recently-used eviction policy. Afterwards, feature points and their respective descriptors are extracted. The current implementation uses SIFT features [9] because of their reported success in the vision community.

3.2. Image Matching

Retrieving similar images for a given one is currently a very active research topic (e.g. [16, 12, 7, 17]). We employ a visual vocabulary tree approach similar to [12] to retrieve a set of potentially similar views from a large collection of images. Other image-related data that limits the candidate images for matching, like GPS positions, is therefore helpful but not strictly necessary. The vocabulary tree enables us to efficiently match a single image against a database containing thousands or even millions of images. Images found in the database with a high score according to the vocabulary tree are investigated further by a more discriminative matching procedure. Since the score induced by the vocabulary tree may miss relevant images, the set of candidate images used for further matching is augmented with the neighbors of highly ranked images with respect to the already constructed view network.

In our system, the vocabulary tree is trained in an unsupervised manner with a subset of 2 × 10^6 SIFT feature vectors randomly taken from 2500 street-side images. The descriptor vectors are hierarchically quantized into clusters using a k-means algorithm. We set the branch factor to 10 and allow up to 7 tree levels. For each level, the k-means algorithm is initialized with different seed clusters, and the result producing the lowest Euclidean distance error is retained. Once the vocabulary tree is trained, searching the visual vocabulary is very efficient, and new images can be inserted on-the-fly. Based on the scoring function, a ranking of relevant database images is reported and used for post-verification.
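A minimal sketch of this hierarchical quantization is shown below; scikit-learn's KMeans with several n_init restarts keeps the run with the lowest summed squared distance, mirroring the seeding strategy described above. For the full 2 × 10^6 training vectors a mini-batch variant would be advisable, and the data layout is assumed.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branch=10, depth=7):
    """Hierarchical k-means: split the descriptors into `branch` clusters,
    then recurse into each cluster until `depth` levels or data run out."""
    node = {"center": descriptors.mean(axis=0), "children": []}
    if depth == 0 or len(descriptors) < branch:
        return node                              # this node becomes a leaf
    km = KMeans(n_clusters=branch, n_init=3).fit(descriptors)
    for c in range(branch):
        subset = descriptors[km.labels_ == c]
        if len(subset) > 0:
            node["children"].append(build_tree(subset, branch, depth - 1))
    return node

def quantize(tree, desc):
    """Greedily descend to the closest child; the leaf is the visual word."""
    node, path = tree, []
    while node["children"]:
        i = int(np.argmin([np.linalg.norm(desc - ch["center"])
                           for ch in node["children"]]))
        path.append(i)
        node = node["children"][i]
    return tuple(path)   # the path through the tree identifies the leaf
```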

In our current setting we rely on an entropy-weighted scoring similar to tf-idf ("term frequency-inverse document frequency") as described in [18]. Let $D$ be an image in our database and $t$ be the term in the vocabulary associated with feature $f$ of the current query image $Q$; then our scoring function is

$$\sum_{t \in Q \cap D} \log\left(\frac{N}{n(t)}\right), \qquad (5)$$

where $N$ is the total number of images in the collection and $n(t)$ is the number of images that contain term $t$. In order to guarantee fairness between database images with different numbers of features, the query results are normalized by the self-scoring result. Therefore, if a database image is used as query, a score of 1 is returned for itself. At the same time we get an absolute measure of image-to-image similarity, which enables us to set a global threshold for scoring. The $k$ top-ranked images reported by the vocabulary tree are then taken for post-verification. Typically, we set $k$ to 1% of the current database size.
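One plausible reading of this scheme is sketched below: each image is represented as a bag of visual words, raw scores follow Eq. (5), and every result is divided by the database image's self-score, so that a database image queried against itself scores exactly 1. The inverted-index layout is an assumption.

```python
import math

def raw_score(terms_a, terms_b, inverted_index, n_images):
    """Eq. (5): sum log(N / n(t)) over the visual words common to both images."""
    return sum(math.log(n_images / len(inverted_index[t]))
               for t in set(terms_a) & set(terms_b))

def query(query_terms, doc_terms, inverted_index, n_images):
    """Rank database images; inverted_index maps a word to the ids using it."""
    candidates = set()
    for t in set(query_terms):
        candidates.update(inverted_index.get(t, ()))
    scores = {}
    for doc in candidates:
        self_score = raw_score(doc_terms[doc], doc_terms[doc],
                               inverted_index, n_images)
        scores[doc] = (raw_score(query_terms, doc_terms[doc],
                                 inverted_index, n_images) / self_score
                       if self_score > 0 else 0.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```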

Our verification procedure is based on exhaustive matching and a RANSAC-based geometric consistency check. First of all, correspondences are computed by mutual nearest-neighbor matching of the 128-dimensional SIFT feature vectors. We adopt the idea of [2] and match only features with the same contrast, taking advantage of the Laplacian sign. In addition, the epipolar geometry is verified using a RANSAC procedure.


Figure 2. (a) Ten sample images from a collection of 105 facade images. (b) Side view of the camera poses and 9097 triangulated feature points.

If sufficient reliable 3D structure for the image measurements is available, we use pose estimation for verification. The calibrated setting enables us to use the Five-Point algorithm [11] for image-to-image matches and the Three-Point algorithm [6] for 3D-to-2D correspondences as minimal hypothesis generators. Images are accepted as neighbors if a quality criterion is satisfied: we generate a number of $N$ hypotheses in advance and then check whether the number of outliers exceeds a precomputed threshold. Since it has often been observed (e.g. in [20]) that the RANSAC stopping criterion is sometimes optimistic, we use a rather conservative confidence factor $p$ of $1 - 10^{-4}$ and require a minimal number of 20 inliers. If an image passes the post-verification procedure, the neighbors of this image are also matched with the query image. Note that this strategy is highly efficient for our purpose, since frequently the first image ranked by the vocabulary tree is already a candidate match satisfying the geometric constraints.
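The pairwise verification step could look roughly as follows. The Laplacian-sign arrays are assumed to be provided by the feature detector, and OpenCV's findEssentialMat stands in for our Five-Point RANSAC; the Three-Point pose-estimation branch is omitted.

```python
import numpy as np
import cv2

def mutual_nn_matches(desc_a, desc_b, sign_a, sign_b):
    """Mutual nearest-neighbor matching of SIFT descriptors, restricted to
    features of the same contrast via the Laplacian sign (as in [2])."""
    d2 = (np.sum(desc_a ** 2, axis=1)[:, None]
          + np.sum(desc_b ** 2, axis=1)[None, :]
          - 2.0 * desc_a @ desc_b.T)               # squared Euclidean distances
    d2[sign_a[:, None] != sign_b[None, :]] = np.inf  # forbid opposite contrast
    ab, ba = d2.argmin(axis=1), d2.argmin(axis=0)
    return [(i, j) for i, j in enumerate(ab)
            if ba[j] == i and np.isfinite(d2[i, j])]

def verify_pair(pts_a, pts_b, K, min_inliers=20):
    """Five-point RANSAC with the conservative confidence p = 1 - 1e-4;
    the pair is accepted as epipolar neighbors given enough inliers."""
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC,
                                   prob=1.0 - 1e-4, threshold=1.0)
    if E is None or mask is None:
        return False
    return int(mask.sum()) >= min_inliers
```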

Unlike for the vocabulary tree scoring, where the whole set of SIFT descriptors is utilized, we limit the number of features for exhaustive matching. In general, we only take the 1500 best features according to their DoG magnitude response for matching. Therefore, searching for correspondences is sufficiently fast (see Section 5 for detailed timings).

3.3. Upgrading the View Network

Each time a new image is added to the view network, its neighbors are computed and the reconstruction process is started. By reconstruction we mean structure and motion computation, namely the estimation of the camera orientations and the sparse 3D structure of the matched features. Since images may be taken at different locations, we do not expect to obtain a single coherent reconstruction, but a forest of multiple reconstructions. We require that a reconstruction consists of at least three images (a view triple) and 20 common triangulated points. In general, four different cases can occur when a new image is processed:

1. The view can be robustly registered with exactly one already present reconstruction.

2. The image can be aligned with multiple reconstructions.

3. The current view cannot be (robustly) aligned with an existing reconstruction, but forms a good view triple with two other already present views.

4. No geometric relation between this image and any of the known views can be established, and structure and motion determination is postponed until a new suitable view is inserted.

In the first case, the position of the current view can be computed immediately by robust absolute pose estimation, since 2D-to-3D point correspondences are known. Thereafter, the camera parameters are optimized by iterative refinement, and new correspondences are triangulated to 3D points.

In the second case, where the current image is part of two or more different reconstructions, the reconstructions are merged. We robustly determine a 3D-to-3D similarity transform for the registration of the corresponding 3D point sets. This process is then followed by Euclidean bundle adjustment, where the sparse geometry and the camera parameters are optimized.

After updating the reconstructions in the first two cases, the epipolar neighbors of the newly inserted view are traversed and checked to determine whether the updated 3D structure now allows their poses to be computed.
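The merge step can be illustrated with Umeyama's closed-form similarity estimate wrapped in a plain RANSAC loop. The inlier tolerance is scene-dependent, and the whole snippet is an assumption-level sketch rather than our implementation.

```python
import numpy as np

def similarity_transform(src, dst):
    """Closed-form least-squares similarity (s, R, t) with dst ~ s * R @ src + t
    (Umeyama); src and dst are (N, 3) arrays of corresponding 3D points."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                        # avoid a reflection
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (xs ** 2).sum() * len(src)
    t = mu_d - s * R @ mu_s
    return s, R, t

def robust_merge(src, dst, iters=500, tol=0.05):
    """RANSAC over minimal 3-point samples; tol is in scene units (assumed)."""
    rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = similarity_transform(src[idx], dst[idx])
        err = np.linalg.norm(src @ (s * R).T + t - dst, axis=1)
        inl = err < tol
        if best_inliers is None or inl.sum() > best_inliers.sum():
            best_inliers = inl
    return similarity_transform(src[best_inliers], dst[best_inliers])
```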

In the third scenario, the neighbors of the current image are estimated as described previously, and a new reconstruction is initialized from a well-conditioned view triple.


Figure 3. Dense reconstruction from a collection of 105 facade images.

The view triple should provide a good triangulation angle and at the same time have many correspondences. Therefore, in a first step we identify the view pair which minimizes

$$\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sin^2(\alpha_i)}, \qquad (6)$$

where $\alpha_i$ is the angle between the two camera rays for the 3D point $X_i$. Thereafter, a third view which minimizes the value in Eq. 6 with respect to this first configuration is determined. Note that $1/\sin(\alpha_i)$ approximates the uncertainty (deviation) of $X_i$ in the depth direction. In [3], the view pair with maximal mean roundness (essentially the same as $1/N \sum \sin(\alpha_i)$) is taken. Such an approach does not consider the number of correspondences between two views. In this work we assume that the accuracy of further (least-squares) computations depending on the initial structure scales with $1/\sqrt{N}$. Consequently, Eq. 6 estimates the mean variance of the initial structure for a given view pair. The relative pose between the first two views is computed by the Five-Point algorithm, and the third camera is inserted by the Three-Point algorithm with respect to the triangulated 3D points. Thereafter, bundle adjustment is used to globally optimize the exterior camera orientations and the initial structure. A view triple is considered a valid reconstruction only if its triangulation is sufficiently well conditioned: we require at least twenty 3D points with measurements in all views and a triangulation angle greater than 2°.
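A sketch of this pair selection, assuming the triangulation angles of each candidate pair's common points have been precomputed:

```python
import numpy as np
from itertools import combinations

def conditioning(angles):
    """Eq. (6): mean of 1 / sin^2(alpha_i); lower is better, so pairs whose
    common points subtend only tiny angles are heavily penalized."""
    a = np.asarray(angles)
    return np.mean(1.0 / np.sin(a) ** 2)

def best_initial_pair(view_ids, pair_angles):
    """pair_angles[(i, j)] holds the triangulation angles (radians) of the
    points common to views i and j (a hypothetical precomputed table)."""
    pairs = [p for p in combinations(sorted(view_ids), 2) if p in pair_angles]
    return min(pairs, key=lambda p: conditioning(pair_angles[p]))
```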

Whenever $M$ (15 in our case) views have been added to a reconstruction, bundle adjustment is run, optimizing all cameras and triangulated 3D points. Our bundle adjustment implementation is similar to the one described in [8]. Thereafter, the reprojection error is computed for each image measurement; 3D points with an average reprojection error larger than 1.3 pixels and a triangulation angle of less than 2° are removed. Our experiments suggest that this strategy improves both the accuracy and the robustness of the reconstruction algorithm. A sparse reconstruction result is shown in Figure 2.
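The post-adjustment point filter reduces to a few lines; reading the two thresholds as a conjunction follows the wording above, and the per-point inputs are assumed to be precomputed.

```python
import numpy as np

def prune_points(points, mean_reproj_err, tri_angle_deg,
                 max_err=1.3, min_angle=2.0):
    """Remove 3D points with an average reprojection error above 1.3 pixels
    and a triangulation angle below 2 degrees, after each bundle adjustment."""
    err = np.asarray(mean_reproj_err)   # mean reprojection error per point [px]
    ang = np.asarray(tri_angle_deg)     # triangulation angle per point [deg]
    keep = ~((err > max_err) & (ang < min_angle))
    return [p for p, k in zip(points, keep) if k]
```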


4. Dense Reconstruction

Obtaining the initial 3D structure and motion is an essential aspect of image-based modeling, but dense geometry is required for a faithful virtual representation of the captured scene. Since we face a large number of images with potentially associated depth maps, we focus on simple but fast dense depth estimation procedures. Additionally, processing multiple sensor images with respect to a key view (in contrast to traditional two-frame stereo) is required. Plane-sweep approaches to dense stereo [24, 5] enable an efficient, GPU-accelerated procedure to create the depth maps. In order to obtain reasonable depth values in homogeneous regions, we employ GPU-based scanline optimization in our framework [26]. Note that dense geometry generation is currently still performed in an offline fashion for individual large and connected view networks.

In the case of general view networks, a suitable selection of the sensor views used for depth estimation is necessary. We use the following simple but effective heuristic to select appropriate sensor images, which is based on an estimate of image overlap and on the viewing directions: for a particular key view and a potential sensor view, we determine the correspondences between these views and compute the convex hull of the respective 2D measurements in the key view. The area of this convex hull $A_H$ in relation to the total key image area $A$ gives an estimate of the relevant image overlap. Note that $A$ does not necessarily denote the whole key image size, since insignificant image portions like sky regions can be excluded. Sensor view candidates with an overlap $A_H/A$ smaller than a given threshold (typically set to 0.3) are discarded. Image overlap is only a partial guideline for view selection; the angle between corresponding camera rays is another suitable criterion. Very small angles give rise to large depth uncertainties, whereas larger angles are susceptible to image distortion and occlusions. From a practical point of view, triangulation angles of about $\alpha_0 = 6°$ are favorable for dense stereo (corresponding to a distance-to-baseline ratio of about 10:1). Consequently, view pairs with a large overlap and appropriate triangulation angles are preferable.

Hence, we rank the views with sufficient overlap according to the following score:

$$\frac{A_H}{A} \, \mathrm{median}(\psi(\alpha_i)), \qquad (7)$$

where $\alpha_i$ is the angle between corresponding rays and $\psi(\cdot)$ is a unimodal weighting function with its maximum at $\alpha_0$. We choose $\psi$ as

$$\psi(\alpha) = \alpha^2 e^{-2\alpha/\alpha_0}. \qquad (8)$$

The two views with the highest scores are taken as sensor views.
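Putting the overlap test and Eqs. (7) and (8) together yields the following scoring sketch. The per-candidate inputs are assumed to be precomputed; note that SciPy's ConvexHull reports the area of a 2D hull in its volume attribute.

```python
import numpy as np
from scipy.spatial import ConvexHull

ALPHA0 = np.deg2rad(6.0)                   # preferred triangulation angle

def psi(alpha):
    """Unimodal weighting of Eq. (8); its maximum lies exactly at alpha_0."""
    return alpha ** 2 * np.exp(-2.0 * alpha / ALPHA0)

def sensor_score(key_pts, ray_angles, usable_area, min_overlap=0.3):
    """Eq. (7): relative overlap (convex hull of the correspondences in the
    key view over the usable key-image area) times the median ray weight."""
    overlap = ConvexHull(key_pts).volume / usable_area   # 'volume' = 2D area
    if overlap < min_overlap:
        return 0.0                          # candidate is discarded
    return overlap * np.median(psi(np.asarray(ray_angles)))

def pick_sensor_views(candidates):
    """candidates: {view_id: (key_pts, ray_angles, usable_area)} (assumed);
    the two highest-scoring candidates become the sensor views."""
    ranked = sorted(candidates, key=lambda v: -sensor_score(*candidates[v]))
    return ranked[:2]
```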

Dense depth estimation depends heavily on the quality of the provided epipolar geometry. In order to reduce the influence of inaccuracies in the estimated poses on the depth maps, and to increase performance, multi-view stereo is applied to downsampled images (512 × 384 pixels for 4:3-format digital images). Matching cost computation and scanline optimization for view triples (one key view and two sensor views) take about 3 s on a GeForce 7800GS.

The set of depth maps provides 2.5D geometry for each view. These depth maps are subsequently fused into a common 3D model using a robust depth image integration approach [25]. This method yields globally optimal 3D models according to the employed energy functional, but it operates in batch mode. The incorporation of this step is the major reason why dense geometry generation is an offline process. The result obtained by depth map fusion for the dataset depicted in Figure 2 (augmented with per-vertex coloring) is shown in Figure 3.

5. Results

We tested our system on different image collections, varying from a few hundred to thousands of pictures. All images are calibrated with the method described in Section 2. The calibration precision (i.e. the final mean reprojection error) ranges from 1/20 to 1/7 pixel. Our largest test set is a database containing about 7000 street-side images taken with four compact digital cameras from different manufacturers. The images were captured on different days under varying illumination conditions. Furthermore, the images are only partially ordered. The size of the source images varies from two to seven megapixels. In order to remove compression artifacts, the supplied images are resampled to half resolution for further processing. Adding an image to the view network takes approximately half a minute; most of the time is spent on exhaustive matching and bundle adjustment. Average run-times required for each step are listed in Table 1.

In all our tests we use a pre-trained generic vocabulary tree of about 7 × 10^5 leaf nodes for searching the image database.

Operation                                | time [s]
-----------------------------------------|---------
Undistortion (2272 × 1704 pixels)        | 1.0
Feature extraction (2272 × 1704 pixels)  | 1.6
Vocabulary scoring                       | 0.2
Brute-force NN matching (1500 × 1500)    | k × 0.35
RANSAC (3000 samples, Five-Point)        | k × 0.4
Structure and motion                     | 0.5
Bundle adjustment (100 views)            | 60
Dense stereo                             | 3.0

Table 1. Typical average timings of our system on an AMD Athlon(tm) 64 X2 Dual Core Processor 4400+. The parameter k specifies the number of images taken for post-verification. Typically, we set k to 1% of the current database size.

Figure 4. Vocabulary tree performance for image retrieval depending on the database size. The y-axis shows the probability of finding an epipolar neighbor within the first k-ranked images reported by the vocabulary tree scoring.

Our experiments suggest that the vocabulary tree approach for image retrieval is very efficient for large-scale structure and motion computation. Figure 4 shows how the retrieval performance changes with the database size. On average, an image overlaps with about eight other images in the database; however, there are also images with no geometric relation to existing database entries. The detailed distribution of epipolar neighbors is shown in Figure 5. The retrieval performance is measured by the rank of the first image that passes the geometric verification procedure. Note that even for a relatively large database of about 7000 images, the first image ranked by the vocabulary tree is a valid epipolar neighbor of the query image with a probability of more than 90%. We also observe that the scoring saturates rapidly; therefore, taking the top 1% of the vocabulary tree ranking is a good trade-off between speed and retrieval quality. Finally, two captured sites and the resulting sparse and dense models are shown in Figure 6 and Figure 7.


Figure 5. Probability density function of the number of verified epipolar neighbors for a query image in a database of 7181 street-side images. On average, a query image overlaps with about eight images in the database.

The final reprojection errors of the 3D structure and camera poses for these datasets after bundle adjustment are about 1/3 pixel.

Figure 6. (a) Some sample images and (b) sparse reconstruction from more than 400 viewpoints. (c) Dense reconstruction from a subset of the registered images.

Figure 7. (a) Some sample images from a collection of 49 views. (b) Sparse reconstruction and camera orientations. (c) Final dense reconstruction after depth map fusion.

6. Future Work

Currently, images are contributed by persons associated with this project and with basic knowledge of 3D computer vision. Future work needs to increase the robustness of the structure-from-motion methods to allow the general public to participate in the creation of the visual database. In particular, it must be ensured that low-quality or defective images do not degrade the 3D models in the generated database.

Another future option is to allow uncalibrated images to be added for the purpose of image localization and geometric alignment. Such images will generally not result in an update of the 3D structure.

Dense geometry generation is currently not integrated into the online processing workflow. Depth estimation can easily be adapted to the online setting, but the employed range image fusion method is entirely an offline procedure. A novel approach for robust and efficient depth image fusion that works incrementally is a current research topic.

Acknowledgements

This work is partly funded by the Vienna Science and Technology Fund (WWTF) and by the Kplus VRVis research center.

References

[1] A. Akbarzadeh et al. Towards urban 3D reconstruction from video. In Proc. 3DPVT, 2006.

[2] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Proc. ECCV, pages 404-417, 2006.

[3] C. Beder and R. Steffen. Determining an initial image pair for fixing the scale of a 3D reconstruction from an image sequence. In Proc. DAGM, pages 657-666, 2006.

[4] M. Brown and D. Lowe. Automatic panoramic image stitching using invariant features. IJCV, 2006.

[5] N. Cornelis and L. Van Gool. Real-time connectivity constrained depth map computation using programmable graphics hardware. In Proc. CVPR, pages 1099-1104, 2005.

[6] R. M. Haralick, C. Lee, K. Ottenberg, and M. Nölle. Analysis and solutions of the three point perspective pose estimation problem. In Proc. CVPR, pages 592-598, 1991.

[7] H. Jegou, H. Harzallah, and C. Schmid. A contextual dissimilarity measure for accurate and efficient image search. In Proc. CVPR, 2007.

[8] M. Lourakis and A. Argyros. The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Technical Report 340, Institute of Computer Science - FORTH, 2004.

[9] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.

[10] E. Malis and R. Cipolla. Camera self-calibration from unknown planar structures enforcing the multiview constraints between collineations. TPAMI, 24(9):1268-1272, 2002.

[11] D. Nistér. An efficient solution to the five-point relative pose problem. TPAMI, 26(6):756-770, 2004.

[12] D. Nistér and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. CVPR, pages 2161-2168, 2006.

[13] T. Pajdla, T. Werner, and V. Hlaváč. Correcting radial lens distortion without knowledge of 3-D structure. Technical report, Center for Machine Perception, Czech Technical University, 1997.

[14] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. IJCV, 59(3):207-232, 2004.

[15] M. Pollefeys, F. Verbiest, and L. Van Gool. Surviving dominant planes in uncalibrated structure and motion recovery. In Proc. ECCV, pages 837-851, 2002.

[16] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". In Proc. ECCV, pages 414-431, 2002.

[17] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In Proc. CVPR, 2007.

[18] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, pages 1470-1477, 2003.

[19] N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In Proc. SIGGRAPH, pages 835-846, 2006.

[20] B. Tordoff and D. W. Murray. Guided sampling and consensus for motion estimation. In Proc. ECCV, pages 82-98, 2002.

[21] B. Triggs. Autocalibration from planar scenes. In Proc. ECCV, pages 89-105, 1998.

[22] R. Y. Tsai. A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3(4):323-344, 1987.

[23] M. Vergauwen and L. Van Gool. Web-based 3D reconstruction service. Machine Vision and Applications, 17(6):411-426, 2006.

[24] R. Yang and M. Pollefeys. Multi-resolution real-time stereo on commodity graphics hardware. In Proc. CVPR, pages 211-217, 2003.

[25] C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In Proc. ICCV, 2007. To appear.

[26] C. Zach, M. Sormann, and K. Karner. Scanline optimization for stereo on graphics hardware. In Proc. 3DPVT, 2006.

[27] Z. Zhang. A flexible new technique for camera calibration. TPAMI, 22(11):1330-1334, 2000.
