
Fast Robust Large-scale Mapping from Video and Internet Photo Collections

Jan-Michael Frahm a, Marc Pollefeys b,a, Svetlana Lazebnik a, Christopher Zach b, David Gallup a, Brian Clipp a, Rahul Raguram a, Changchang Wu a, Tim Johnson a

a University of North Carolina at Chapel Hill, NC, USA
b Eidgenössische Technische Hochschule Zürich

Abstract

Fast automatic modeling of large-scale environments has been enabled by recent research in computer vision. In this paper we present a system approaching the fully automatic modeling of large-scale environments. Our system takes as input either a video stream or a collection of photographs, which can be readily obtained from Internet photo sharing web sites such as Flickr. The design of the system is aimed at achieving high computational performance in the modeling. This is achieved through algorithmic optimizations for efficient robust estimation, a two-stage stereo estimation that reduces the computational cost while maintaining competitive modeling results, and the use of image-based recognition for efficient grouping of corresponding images in the photo collection. A second major improvement in computational speed is achieved through parallelization and execution on commodity graphics hardware. These improvements lead to real-time video processing and to reconstruction from tens of thousands of images within the span of a day on a single commodity computer. We demonstrate the performance of both systems on a variety of real-world data.

Preprint submitted to ISPRS January 22, 2010


Keywords:

1. Introduction

The fully automatic modeling of large-scale environments has been a research goal in photogrammetry and computer vision for a long time. Detailed 3D models automatically acquired from the real world have many uses, including civil and military planning, mapping, virtual tourism, games, and movies. In this paper we present a system approaching the fully automatic modeling of large-scale environments, either from video or from photo collections.

Our system has been designed for efficiency and scalability. GPU implementations of SIFT feature matching, KLT feature tracking, gist feature extraction and multi-view stereo allow our system to rapidly process large amounts of video and photographs. For large photo collections, iconic image clustering allows the dataset to be broken into small, closely related parts. The parts are then processed and merged, allowing tens of thousands of photos to be registered. Similarly, for videos, loop detection finds intersections in the camera path, which reduces drift for long sequences and allows multiple videos to be registered.

Recently, mapping systems like Microsoft Bing Maps and Google Earth have started to use 3D models of cities in their visualizations. Currently these systems still require a human in the loop to deliver models of reasonable quality. Nevertheless, they already achieve impressive results, modeling large areas with regular updates. However, these models have very low complexity and do not provide enough detail for ground-level viewing. Furthermore, their availability is restricted to only a small number of cities across the globe. Our system, on the other hand, can automatically produce 3D models from ground-level images with ground-level detail, and its efficiency and scalability present the possibility of reconstructing 3D models of all the world's cities quickly and cheaply.

Figure 1: The left shows an overview model of Chapel Hill reconstructed from 1.3 million video frames on a single PC in 11.5 hours. On the right, a model of the Statue of Liberty is shown. The reconstruction algorithm registered 9025 cameras out of 47238 images downloaded from the Internet.

In this paper we review the details of our system and present 3D models of large areas reconstructed from thousands to millions of images. These models are produced with state-of-the-art computational efficiency, yet are competitive with the state of the art in quality.

2. Related Work

In recent years there has been considerable progress in the area of large-scale reconstruction from video for urban environments and aerial data [1, 2, 3], as well as significant progress on reconstruction from Internet photo collections. Here we discuss the most related work in the areas of fast structure from motion, camera registration from Internet photo collections, real-time dense stereo, scene summarization, and landmark recognition.

Besides the purely image-based approaches there is also work on modeling the geometry by combining cameras and active range scanners. Früh and Zakhor [4] proposed a mobile system mounted on a vehicle to capture large amounts of data while driving in urban environments. Earlier systems by Stamos and Allen [5] and El-Hakim et al. [6] constrained the scanning laser to a few pre-determined viewpoints. In contrast to the comparably expensive active systems, our approach for real-time reconstruction from video uses only cameras, leveraging the methodology developed by the computer vision community within the last two decades [7, 8].

The first step in our systems is to establish correspondences between the video frames or the different images of the photo collection, respectively. We use two different approaches for establishing correspondences. For video data we use an extended KLT tracker [9] that exploits the inherent parallelism of the tracking problem to improve computational performance through execution on the graphics processor (GPU). The specific approach used is introduced in detail by Zach et al. [10].

In the case of Internet photo collections we have to address the challenge of dataset collection, which is the following problem: starting with the heavily contaminated output of an Internet image search query, extract a high-precision subset of images that are actually relevant to the query. Existing approaches to this problem [11, 12, 13, 14, 15] consider general visual categories not necessarily related by rigid 3D structure. These techniques use statistical models to combine different kinds of 2D image features (texture, color, keypoints), as well as text and tags. The approach by Philbin et al. [16, 17] uses a loose spatial consistency constraint. In contrast, our system enforces a rigid 3D scene structure shared by the images.

Our system determines the camera registration from images or video frames. In general there are two classes of methods used to determine the camera registration. The first class leverages the work in multiple view geometry and typically alternates between robustly estimating camera poses and 3D point locations directly [18, 19, 20, 21]. Often bundle adjustment [22] is used in the process to refine the estimate. The other class of methods uses an extended Kalman filter to estimate both the camera motion and the 3D point locations jointly as the state of the filter [23, 24].

For reconstruction from Internet photo collections our system uses a hierarchical reconstruction method starting from a set of canonical or iconic views representing different viewpoints as well as different parts of the scene. The issue of canonical view selection has been addressed both in the psychology [25, 26] and the computer vision literature [27, 28, 29, 30, 31, 32]. Simon et al. [31] observed that community photo collections provide a likelihood distribution over the viewpoints from which people prefer to take photographs. Hence, canonical view selection identifies prominent clusters or modes of this distribution. Simon et al. find these modes by clustering images based on the output of local feature matching and epipolar geometry verification between every pair of images in the dataset, through a 3D registration of the data. While this solution is effective, it is computationally expensive, and it treats scene summarization as a by-product of 3D reconstruction. By contrast, we view summarization as an image organization step that precedes 3D reconstruction, and we take advantage of the observation that a good subset of representative or "iconic" images can be obtained using relatively simple 2D appearance-based techniques [29, 33].

Since [34, 35] provide excellent surveys of the work on binocular and multiple-view stereo, we refer the reader to these articles. Our system for reconstruction from video streams extends the idea of Collins [36] to dense geometry estimation. This approach permits an efficient execution on the GPU, as proposed originally by Yang and Pollefeys [37]. A number of approaches have been proposed to specifically target urban environments, mainly using the fact that they predominantly consist of planar surfaces [38, 39, 40], or even stricter orthogonality constraints [41]. To ensure computational feasibility, large-scale systems generate partial reconstructions, which are afterwards merged into a common model. Conflicts and errors in the partial reconstructions are identified and resolved during the merging process. Most approaches so far target data produced by range finders, where noise levels and the fraction of outliers are typically significantly lower than those of passive stereo. Turk and Levoy [42] proposed to remove the overlapping mesh parts after registering the triangular meshes. A similar approach was proposed by Soucy and Laurendeau [43]. Volumetric model merging was proposed by Hilton et al. [44] and Curless and Levoy [45], and later extended by Wheeler et al. [46]. Koch et al. [47] presented a volumetric approach for fusion as part of an uncalibrated 3D modeling system. Given depth maps for all images, the depth estimates for all pixels are projected into the voxelized 3D space. Each depth estimate votes for a voxel in a probabilistic way, and the final surfaces are extracted by thresholding the 3D probability volume.

Recently there have been a number of systems covering the large-scale reconstruction of urban environments or landmarks. The 4D Atlanta project of Schindler et al. [48, 49] models the evolution of an urban environment over time from an unordered collection of photos. To achieve fast processing and provide efficient visualization, Cornelis et al. [50] proposed a ruled-surface model for urban environments. Specifically, in their approach the facades are modeled as ruled surfaces parallel to the gravity vector.

One of the first works to demonstrate 3D reconstruction of landmarks from Internet photo collections is the Photo Tourism system [51, 52]. This system achieves high-quality reconstruction results with the help of exhaustive pairwise image matching and global bundle adjustment after inserting each new view. Unfortunately, this process becomes very computationally expensive for large data sets, and it is particularly inefficient for heavily contaminated collections, most of whose images cannot be registered to each other. Since the introduction of the seminal Photo Tourism system, several researchers have developed SfM methods that exploit the redundancy in community photo collections to make reconstruction more efficient. In [53], a technique for out-of-core bundle adjustment is proposed which takes advantage of this redundancy by locally optimizing the "hot spots" and then connecting the local solutions into a global one. Snavely et al. [54] find skeletal sets of images from the collection whose reconstruction provides a good approximation to a reconstruction involving all the images. One remaining limitation of this work is that it still requires as an initial step the exhaustive computation of all two-view relationships in the dataset. Exploiting the independence of the pairwise evaluations, Agarwal et al. [55] address this computational challenge by using a computing cluster with up to 500 cores to reduce the computation time significantly. Similar to our system, they use image recognition techniques like approximate nearest neighbor search [56] and query expansion [57] to reduce the number of candidate relations. In contrast, our system uses single-image descriptors to efficiently identify related images, leading to an even smaller number of candidates. Images can then be registered on a single computation core in time comparable to the 500 cores of Agarwal et al.

3. Overview

In this paper we describe our system for real-time computation of 3D models from ground reconnaissance video and for efficient 3D reconstruction from Internet photo collections downloaded from photo sharing web sites like Flickr (www.flickr.com). The design of the system aims at efficient reconstruction through parallelization of the algorithms used, enabling their execution on the GPU. The processing pipelines for the two types of input share a large number of methods, with some adaptations accounting for the significantly different characteristics of the data sources. In this section we briefly discuss these characteristics and then introduce the main building blocks of the system.

Our method for 30 Hz real-time reconstruction from video uses the video streams of multiple cameras mounted on a vehicle to obtain a 3D reconstruction of the scene. Our system can operate from video alone, but additional sensors, such as differential GPS and an inertial navigation system (INS), reduce accumulated error (drift) and provide registration to the Earth's coordinate system. Our system efficiently fuses the GPS, INS and camera measurements into an absolute camera position. Additionally, our system exploits the temporal order of the video frames, which implies a spatial relationship between the frames, to efficiently perform the camera motion tracking and the dense geometry computation.

In contrast to video, in Internet photo collections we encounter a high number of photos that do not share a rigid geometry with the images of the scene. These images are often mislabeled images in the database, artistic images of the scene, duplicates, or parodies of the scene. Examples of these different classes of unrelated images are shown in Figure 2. To quantify the amount of unrelated images in Internet photo collections we labeled a random subset of three downloaded datasets. Besides the dataset for the Statue of Liberty (47238 images with approximately 40% outliers), we downloaded two more datasets by searching for "San Marco" (45322 images with approximately 60% outliers) and "Notre Dame" (11928 images with approximately 49% outliers). The iconic clustering employed by our system can efficiently handle such high outlier ratios, which is essential for handling large Internet photo collections.

Figure 2: Images downloaded from the Internet by searching for "Statue of Liberty" which do not represent the Statue of Liberty in New York.

Despite the differences in the input data, our systems share a common algorithmic structure, as shown in Algorithm 1:

• Local correspondence estimation establishes correspondences for salient feature points to neighboring views. In the case of video these correspondences are determined through KLT tracking on the GPU with simultaneous estimation of the camera gain (Section 4.1.1). For Internet photo collections our system first establishes a spatial neighborhood relationship for each image through a clustering of single-image descriptors, which effectively clusters close viewpoints. The feature points used are SIFT features [58] due to their enhanced invariance to larger viewpoint changes (Section 4.1).

• Camera pose/motion estimation from local correspondences is performed robustly through an efficient RANSAC technique called ARRSAC (Section 4.1.2). It determines the inlier correspondences and the camera position with respect to the previous image and the known model. If GPS data are available, our system uses the local correspondences to perform sensor fusion with the GPS and INS data (Section 4.1.3).

• Global correspondence estimation is performed to enable global drift correction. This step searches beyond the temporal neighbors in video and beyond the neighbors in the cluster for frames with overlap to the current frame (Section 4.2).

• Bundle adjustment uses the global correspondences to reduce the accumulated drift of the camera registrations from the local camera pose estimates (Section 4.3).

• Dense geometry estimation in real time is performed by our 3D reconstruction from video system to extract the depth of all pixels from the camera (Section 5). Our system uses a two-stage estimation strategy. First we use efficient GPU-based stereo to determine the depth map for every frame. Then we use the temporal redundancy of the stereo estimates to filter out erroneous depth estimates and fuse the correct ones.

• Model extraction is performed to extract a triangular mesh representing the scene from the depth maps.

The next section first introduces our algorithm for real-time 3D reconstruction from video. Afterwards, in Section 4.4, we detail the differences in our algorithm for camera registration of Internet photo collections.


Algorithm 1 Reconstruction Scheme
for all images/frames do
  Local correspondence estimation with immediate spatial neighbors
  Camera pose/motion estimation from local correspondences
end for
for all registered images/frames do
  Global correspondence estimation to non-neighbors to identify path intersections and overlaps not initially found
  Bundle adjustment for global error mitigation using the global and local correspondences
end for
for all frames in video do
  Dense geometry estimation using stereo and depth map fusion
  Model extraction
end for

4. Camera Pose Estimation

In this section we discuss the methods used for camera registration in our system. We first introduce the methods for camera registration of video sequences in Section 4.1. These methods inherently use the known temporal relationships of the frames for the camera motion estimation, providing efficient constraints for camera registration. Second, in Section 4.4 we discuss the extension of the camera registration to Internet photo collections to address the challenge of an unordered data set.



4.1. Camera Pose from Video

Our system for the registration of video streams exploits the fact that in a video sequence the observed motion between consecutive frames is typically small, namely a few pixels. This allows our system to treat the correspondence estimation as a tracking problem, as discussed in Section 4.1.1. From these correspondences we estimate the camera motion through our recently proposed efficient adaptive real-time random sample consensus method (ARRSAC) [59], introduced in Section 4.1.2. Alternatively, if GPS and inertial measurements are available, the camera motion is estimated through a Kalman filter that efficiently fuses the visual correspondences and the six-DOF measurements of the camera pose, as introduced in Section 4.1.3.

4.1.1. Local Correspondences

Reconstructing the structure of a scene from images begins with finding corresponding features between pairs of images. In a video sequence we can take advantage of the temporal ordering and the small camera motion between frames to speed up correspondence finding. Kanade-Lucas-Tomasi (KLT) feature tracking [60, 61] is a differential tracking method that first finds strong corner features in a video frame. These features are then tracked using a linear approximation of the image gradients over a small window of pixels surrounding the feature. One limitation of this approach is that feature motion of more than one pixel cannot be differentially tracked. A scale-space approach based on an image pyramid is generally used to alleviate this problem: if a feature's motion is more than a pixel at the finest scale, it can still be tracked because its motion will be less than one pixel at a coarser image scale.



Building image pyramids and tracking features are both highly parallel operations and so can be programmed efficiently on the graphics processor (GPU). Building an image pyramid is simply a series of convolution operations with Gaussian blur kernels followed by downsampling. Differential tracking can be implemented in parallel by assigning each feature to a separate processor. With typically on the order of 1024 features to track from frame to frame in a 1024x768 image, we can achieve a great deal of parallelism in this way.

One major limitation of standard KLT tracking is that it uses the absolute difference between the windows of pixels in two images to calculate the feature track update. When a large change in intensity occurs between frames, this can cause the tracking to fail. This happens frequently in videos shot outdoors, for example when the camera moves from light into shadow. Kim et al. [62] developed a gain-adaptive KLT tracker that measures the change in mean intensity of all the tracked features and compensates for this change in the KLT update equations. Estimating the gain can also be performed on the GPU at minimal additional cost in comparison to standard GPU KLT.
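To make the tracking step concrete, the following is a minimal sketch of pyramidal KLT tracking using OpenCV's CPU implementation; it is not the paper's GPU tracker, and the simple global gain compensation applied before tracking is only a rough stand-in for the gain-adaptive update of Kim et al.

```python
import cv2
import numpy as np

def track_klt(prev_gray, cur_gray, prev_pts):
    """Pyramidal KLT tracking with a crude global gain compensation.

    prev_pts: Nx1x2 float32 array of corner locations in prev_gray.
    Returns the tracked points and a mask of successfully tracked features.
    """
    # Rough global gain estimate (ratio of mean intensities); the GPU tracker
    # described in the paper estimates the gain jointly with the feature updates.
    gain = (np.mean(prev_gray) + 1e-6) / (np.mean(cur_gray) + 1e-6)
    cur_comp = np.clip(cur_gray.astype(np.float32) * gain, 0, 255).astype(np.uint8)

    # 3-level pyramid, 21x21 tracking window: features moving more than a pixel
    # at the finest scale are still recovered at coarser pyramid levels.
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_comp, prev_pts, None,
        winSize=(21, 21), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    return cur_pts, status.ravel().astype(bool)

# Usage: detect ~1024 strong corners in the first frame, then track frame to frame.
# prev_pts = cv2.goodFeaturesToTrack(first_frame_gray, maxCorners=1024,
#                                    qualityLevel=0.01, minDistance=7)
```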

4.1.2. Robust Pose Estimation

The estimated local correspondences typically contain a significant portion of erroneous correspondences. To determine the correct camera position we apply a Random Sample Consensus (RANSAC) algorithm [63] to estimate the relative camera motion through the essential matrix [20]. While being highly robust, the RANSAC algorithm can also be computationally very expensive, with a runtime that is exponential in the outlier ratio and model complexity. Our recently proposed Adaptive Real-Time Random Sample Consensus (ARRSAC) algorithm [64] is capable of providing accurate real-time estimation over a wide range of inlier ratios.
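As a point of reference for the estimation task, the sketch below recovers relative camera motion from tracked correspondences with OpenCV's standard RANSAC-based essential matrix solver; this is only the baseline robust formulation, not ARRSAC. The calibration matrix K is assumed known.

```python
import cv2
import numpy as np

def relative_pose(pts_prev, pts_cur, K):
    """Estimate relative camera motion (R, t) from 2D-2D correspondences.

    pts_prev, pts_cur: Nx2 float arrays of matched pixel coordinates.
    K: 3x3 intrinsic calibration matrix.
    """
    # Robust essential matrix estimation; standard RANSAC, not ARRSAC.
    E, inliers = cv2.findEssentialMat(pts_prev, pts_cur, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    # Cheirality test to select the physically valid (R, t) decomposition.
    _, R, t, inliers = cv2.recoverPose(E, pts_prev, pts_cur, K, mask=inliers)
    return R, t, inliers.ravel().astype(bool)
```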

ARRSAC moves away from the traditional hypothesize-and-verify framework of RANSAC by adopting a more parallel approach, along the lines of preemptive RANSAC [65]. However, while preemptive RANSAC makes the limiting assumption that a good estimate of the inlier ratio is available beforehand, ARRSAC is capable of efficiently adapting to the contamination level of the data. This results in high computational savings.

In real-time scenarios, the goal of a robust estimation algorithm is to deliver the best possible solution within a fixed time budget. While a parallel, or breadth-first, approach is indeed a natural formulation for real-time applications, adapting the number of hypotheses to the contamination level of the data requires an estimate of the inlier ratio, which necessitates the adoption of a depth-first scan. The ARRSAC framework retains the benefits of both approaches (i.e., bounded run-time as well as adaptivity) by operating in a partially depth-first manner, as detailed in Algorithm 2.

The performance of ARRSAC was evaluated in detail on synthetic and real data over a wide range of inlier ratios and dataset sizes. The results in [64] show that the ARRSAC approach produces significant computational speed-ups while simultaneously providing accurate robust estimation in real time. For the case of epipolar geometry estimation, for instance, ARRSAC operates at estimation speeds ranging between 55 and 350 Hz, which represents speedups ranging from 11.1 to 305.7 compared to standard RANSAC.



Algorithm 2 The ARRSAC algorithm
Partition the data into k equal-size random partitions
for all k partitions do
  if the number of hypotheses is less than required then
    Non-uniform sampling for hypothesis generation
  end if
  Hypothesis evaluation using the Sequential Probability Ratio Test (SPRT) [66]
  Perform local optimization to generate additional hypotheses
  Use surviving hypotheses to provide an estimate of the inlier ratio
  Use inlier ratio estimate to determine the number of hypotheses needed
  for all hypotheses do
    if SPRT(hypothesis) is valid then
      Update inlier ratio estimate based on the current best hypothesis
    end if
  end for
end for

4.1.3. Camera Pose Estimation through Fusion

In some instances we may have not only image or video data but also geo-location or orientation information. This additional information can be derived from global positioning system (GPS) sensors as well as inertial navigation systems containing accelerometers and gyroscopes. We can combine data from these other sensors with image correspondences to obtain a more accurate pose (rotation and translation) than we might get from vision or geo-location/orientation alone.

In order to fuse image correspondences with other types of sensor measurements we must normalize all of the measurements so that they have comparable units. For example, without this normalization one cannot compare pixels of reprojection error with meters of error relative to a GPS measurement. Our sensor fusion approach, first described in [67], fuses measurements from a GPS/INS unit with image correspondences using an extended Kalman filter (EKF). Our EKF maintains a state consisting of the current camera pose, velocity and rotation rate, as well as the 3D features which correspond to features extracted and tracked from image to image in the video. The EKF uses a smooth motion model to predict the next camera pose. The system then updates the current camera pose and 3D feature estimates based on the difference between the predicted and measured feature positions and the GPS/INS measurements.
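The following is a minimal sketch of the generic EKF predict/update cycle that such a fusion filter is built on, using a simple constant-velocity state and numpy; the state layout, motion model and noise values are illustrative assumptions, not the filter of [67].

```python
import numpy as np

# Illustrative state: [x, y, z, vx, vy, vz]; a full camera EKF would also carry
# orientation, rotation rate and the tracked 3D feature points.
def ekf_predict(x, P, dt, q=1e-2):
    F = np.eye(6)
    F[0:3, 3:6] = dt * np.eye(3)           # constant-velocity motion model
    Q = q * np.eye(6)                      # process noise (assumed value)
    return F @ x, F @ P @ F.T + Q

def ekf_update(x, P, z, h, H, R):
    """Generic EKF update.
    z: measurement (e.g. a GPS position or a feature reprojection),
    h: measurement function, H: its Jacobian at x, R: measurement noise."""
    y = z - h(x)                           # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain weights prediction vs. measurement
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Example: fusing a GPS position measurement (observes x, y, z directly):
# H_gps = np.hstack([np.eye(3), np.zeros((3, 3))]); h_gps = lambda s: s[:3]
```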

These corrections are weighted both by the EKF's estimate of its certainty in the camera pose and 3D points (its covariance matrix) and by a measurement error model for the sensors. For the feature measurements we assume additive Gaussian noise. For the GPS/INS measurements we assume a Gaussian distribution of rotation errors and use a constant-bias model for the position error. The Gaussian distribution of rotation error is justified by the fact that the GPS/INS we use delivers post-processed data with extremely high accuracy and precision. However, the position error was occasionally seen to grow when the vehicle was stationary. This may be due to multipath error on the GPS signals, because we were working in an urban environment surrounded by two- to three-level buildings. Using the constant-bias model for the position error, the image features could keep the estimate of the vehicle position stationary while the vehicle was truly not moving, even as the GPS/INS reported that the vehicle moved up 10 cm in one example.

4.2. Global Correspondences

Since our system creates the camera registration sequentially from the input frames, the obtained registrations are always subject to drift. Each small inaccuracy in motion estimation propagates forward, and the absolute positions and motions become inaccurate. It is therefore necessary to perform a global optimization step afterwards to remove the drift. This requires constraints that are capable of removing drift. Such constraints can come from global pose measurements like GPS, but it is more interesting to exploit internal consistencies such as loops and intersections of the camera path. In this section we therefore discuss solutions to the challenging task of detecting loops and intersections and using them for global optimization.

Registering the camera with respect to the previously estimated path provides an estimate of the accumulated drift error. For robustness, the detection of path self-intersections can only rely on the views themselves and not on the estimated camera motion, whose drift is unbounded. Our method determines path intersections by evaluating the similarity of salient image features in the current frame to the features in all previous views. We found that SIFT features [58] serve well as salient features for our purposes. For the fast computation of SIFT features we use our SiftGPU implementation (available online at http://cs.unc.edu/~ccwu/siftgpu/), which can, for example, extract SIFT features at 16 Hz from 1024 × 768 images on an NVidia GTX 280. To further enhance robustness for video streams we could use local geometry (described in Section 5) to employ our viewpoint-invariant VIP features [68].

To avoid a computationally prohibitive exhaustive search over all other frames, we use a vocabulary tree [69] to find a small set of potentially corresponding views. The vocabulary tree provides a computationally efficient indexing of the set of previous views. It provides a mapping from the feature space to an integer called a visual word. The union of the visual words defines the possibly corresponding images through an inverted file.
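A minimal sketch of this visual-word indexing idea is given below: features are quantized to word ids by descending a (hypothetical, pre-trained) vocabulary tree, and an inverted file maps each word to the frames in which it occurred. The tree structure and the simple vote-counting query are placeholders, not the configuration used in the system.

```python
import numpy as np
from collections import defaultdict

def quantize(desc, tree):
    """Descend a vocabulary tree: at each level pick the closest child centroid.

    tree: list of levels, each a dict {parent_id: (child_ids, child_centroids)}.
    Returns the leaf id, i.e. the visual word. (Illustrative structure only.)
    """
    node = 0
    for level in tree:
        child_ids, centroids = level[node]
        node = child_ids[np.argmin(np.linalg.norm(centroids - desc, axis=1))]
    return node

class InvertedFile:
    def __init__(self):
        self.index = defaultdict(set)          # visual word -> set of frame ids

    def add(self, frame_id, words):
        for w in words:
            self.index[w].add(frame_id)

    def query(self, words, top_k=10):
        votes = defaultdict(int)               # count shared visual words per frame
        for w in words:
            for f in self.index[w]:
                votes[f] += 1
        return sorted(votes, key=votes.get, reverse=True)[:top_k]
```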

Determining the visual words for the features extracted from the query image requires a traversal of the vocabulary tree for each extracted feature in the current view, including a number of comparisons of the query feature with the node descriptors. The features from the query image are handled independently, hence the tree traversal can be performed in parallel for each feature. To optimize performance we employ an in-house CUDA-based approach executed on the GPU for faster determination of the respective visual words. The speed-up induced by the GPU is about 20 on a GeForce GTX 280 versus a CPU implementation executed on an Intel Pentium D 3.2 GHz. This allows us to perform more descriptor comparisons than in [69], i.e. a deeper tree with a smaller branching factor can be replaced by a shallower tree with a significantly higher number of branches. As pointed out in [70], a broader tree yields a more uniform and therefore more representative sampling of the high-dimensional descriptor space. To further increase the performance of the global correspondence search described above, we use our recently proposed index compression method [71].

Since the vocabulary tree only delivers a list of previous views potentially overlapping with the current view, we perform a geometric verification of the delivered result. First we generate putative matches on the GPU. Afterwards the potential correspondences are tested for a valid two-view relationship between the views through ARRSAC from Section 4.1.2.

4.3. Bundle Adjustment

As the initial sparse model is subject to drift due to the incremental nature of the estimation process, the reprojection error of the global correspondences is typically higher than the error of the local correspondences if drift has occurred. In order to obtain results with the highest possible accuracy, an additional refinement procedure, generally referred to as bundle adjustment, is necessary. It is a non-linear optimization method which jointly optimizes the camera calibration, the respective camera registrations and the 3D points. It incorporates all available knowledge (initial camera parameters and poses, image correspondences, and optionally other known applicable constraints) and minimizes a global error criterion over a large set of adjustable parameters, in particular the camera poses, the 3D structure, and optionally the intrinsic calibration parameters. In general, bundle adjustment delivers 3D models with sub-pixel accuracy for our system; e.g. an initial mean reprojection error on the order of pixels is typically reduced to 0.2 or even 0.1 pixels mean error. Thus, the estimated two-view geometry is better aligned with the actual image features, and dense 3D models show a significantly higher precision. The major challenges of a bundle adjustment approach are the huge number of variables in the optimization problem and (to a lesser extent) the non-linearity and non-convexity of the objective function. The first issue is addressed by sparse representation of matrices and by utilizing appropriate numerical methods. Sparse techniques are still an active research topic in the computer vision and photogrammetry community [72, 73, 74, 75]. Our implementation of sparse bundle adjustment (available in source form at http://cs.unc.edu/~cmzach/opensource.html) largely follows [72] by utilizing sparse Cholesky decomposition methods in combination with a suitable column reordering scheme [76]. For large data sets this approach is considerably faster than bundle adjustment implementations using dense matrix factorization after applying the Schur complement method (e.g. [77]).
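The core of bundle adjustment is the joint minimization of reprojection error over camera and point parameters; the sketch below sets up that residual with scipy's sparse-aware least-squares solver, which exploits the Jacobian sparsity pattern much as the Cholesky-based implementation described above does. The parameterization (3 rotation + 3 translation parameters per camera, fixed shared intrinsics K) is an illustrative simplification, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_cams, n_pts, K, cam_idx, pt_idx, obs_2d):
    """Residuals for bundle adjustment.
    params: [6 * n_cams camera parameters (rotvec, t), then 3 * n_pts point coords].
    cam_idx, pt_idx: which camera/point each 2D observation belongs to."""
    cams = params[:6 * n_cams].reshape(n_cams, 6)
    pts = params[6 * n_cams:].reshape(n_pts, 3)
    R = Rotation.from_rotvec(cams[cam_idx, :3]).as_matrix()      # per-observation rotations
    p_cam = np.einsum('nij,nj->ni', R, pts[pt_idx]) + cams[cam_idx, 3:6]
    proj = (K @ p_cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]                            # perspective division
    return (proj - obs_2d).ravel()

# Usage sketch (x0 and the sparsity pattern of the Jacobian are assumed to be built
# from the initial registration):
# res = least_squares(reprojection_residuals, x0, jac_sparsity=sparsity, method='trf',
#                     args=(n_cams, n_pts, K, cam_idx, pt_idx, obs_2d))
```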

Our implementation requires less than 20 minutes to perform full bundle adjustment on a 1700-image data set (which amounts to 0.7 s per image). For video sequences, the bundle adjustment can be performed incrementally, adjusting only the most recently computed poses with respect to the existing, already-adjusted camera path. This allows bundle adjustment to run in real time, albeit with a slightly higher error than a full offline bundle adjustment. This bundle adjustment technique is also used extensively by our system for processing photo collections, as described hereafter.

4.4. Camera Pose from Image Collections

This section introduces the extension of the methods described in Section 4.1 to accommodate Internet photo collections. The main difference to the video streams is the unknown initial relationship between the images. While for video the temporal order implies a spatial relationship between the frames, photo collections downloaded from the Internet do not have any order. In fact, these collections are typically highly contaminated with outliers. Hence, in contrast to video, we first need to establish the spatial relationship between the images. In principle we could use feature-based techniques similar to the methods of Philbin et al. and Zheng et al. [16, 17] to detect related frames within the photo collection. These methods use computationally very intensive indexing based on local image features (keypoints) followed by loose spatial verification. Their spatial constraints are given by a 2D affine transformation or by filters based on the proximity of local features. Our method instead effectively enforces 3D structure-from-motion (SfM) constraints on the dataset. Similarly, methods like [51, 52] enforce 3D SfM constraints for the full set of registered frames. Their methods first exhaustively evaluate all possible pairs for a valid epipolar geometry and then enforce the stronger multi-view geometry constraints. Our method avoids the prohibitively expensive exhaustive pairwise matching by using an initial stage in which images are grouped by global image features prior to indexing based on local features. This gives us a larger gain in efficiency and an improved ability to group similar compositions, mainly corresponding to similar viewpoints. We then enforce the SfM constraints on these groups, which reduces the complexity of the computation by orders of magnitude. Snavely et al. [54] also reduce the complexity of enforcing the SfM constraints by minimizing the number of image pairs for which they are computed to a minimal set expected to yield the same overall reconstruction.

4.4.1. Efficiently Finding Corresponding Images in Photo Collections

To efficiently identify related images our system uses the gist feature [78], which encodes the spatial layout and perceptual properties of the image. The gist feature was found to be effective for grouping images by perceptual similarity and retrieving structurally similar scenes [79, 80]. To achieve high computational performance we developed a highly parallel gist feature extraction on the GPU. It derives a gist descriptor for each image as the concatenation of two independently computed sub-vectors. The first sub-vector is computed by convolving a grayscale version of the image, downsampled to 128 × 128 pixels, with Gabor filters at 3 different scales (with 8, 8 and 4 orientations for the three scales, respectively). The filter responses are aggregated to a 4 × 4 spatial resolution by downsampling each convolution result to a 4 × 4 patch, and the results are concatenated, yielding a 320-dimensional vector. In addition, we augment this gist descriptor with color information, consisting of a subsampled L*a*b image at 4 × 4 spatial resolution. We thus obtain a 368-dimensional vector as the representation of each image in the dataset. The implementation on the GPU improves the computation time by a factor of 100 compared to a CPU implementation. For detailed timings of the gist computation please see Table 1.
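The sketch below reproduces the descriptor layout described above (Gabor responses pooled to 4 × 4 plus a 4 × 4 L*a*b thumbnail, 368 dimensions in total) on the CPU with OpenCV and numpy; the filter wavelengths and bandwidths are illustrative choices, and the paper's GPU implementation is of course far faster.

```python
import cv2
import numpy as np

def gist_descriptor(bgr_image):
    """Gist-like descriptor: 320 Gabor-energy values + 48 L*a*b color values = 368-D."""
    gray = cv2.cvtColor(cv2.resize(bgr_image, (128, 128)), cv2.COLOR_BGR2GRAY)
    gray = gray.astype(np.float32) / 255.0

    parts = []
    # 3 scales with 8, 8 and 4 orientations; wavelengths are assumed values.
    for wavelength, n_orient in [(4, 8), (8, 8), (16, 4)]:
        for i in range(n_orient):
            theta = np.pi * i / n_orient
            kern = cv2.getGaborKernel((31, 31), sigma=0.5 * wavelength,
                                      theta=theta, lambd=wavelength, gamma=0.5)
            response = np.abs(cv2.filter2D(gray, cv2.CV_32F, kern))
            # Pool each response map to a 4x4 grid -> 16 values per filter.
            parts.append(cv2.resize(response, (4, 4),
                                    interpolation=cv2.INTER_AREA).ravel())
    # (8 + 8 + 4) filters * 16 values = 320 dimensions from the Gabor bank.

    lab = cv2.cvtColor(cv2.resize(bgr_image, (4, 4), interpolation=cv2.INTER_AREA),
                       cv2.COLOR_BGR2LAB)
    parts.append(lab.astype(np.float32).ravel())   # 4*4*3 = 48 color dimensions
    return np.concatenate(parts)                   # 368-D descriptor
```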

In the next step we use the fact that photos taken from nearby viewpoints with similar camera orientation have similar gist descriptors. Hence, to identify viewpoint clusters we apply k-means clustering to the gist descriptors. At this point we aim for an over-segmentation, since that best reduces our computational complexity. We empirically found that searching for 10% as many clusters as there are images yields a sufficient over-segmentation. As shown in Table 1, the clustering of the gist descriptors can be executed very efficiently. This is key to the computational efficiency of our system, since this early grouping allows us to limit all further geometric verification, avoiding an exhaustive search over the whole dataset.
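As a small illustration of this grouping step, the snippet below clusters a stack of gist descriptors into N/10 clusters with scikit-learn's k-means; the system uses its own implementation, so MiniBatchKMeans is only a convenient stand-in.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_gist(descriptors):
    """descriptors: (N, 368) array of gist vectors. Returns labels and cluster centers."""
    n_clusters = max(1, len(descriptors) // 10)   # ~10% as many clusters as images
    km = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=0)
    labels = km.fit_predict(descriptors)
    return labels, km.cluster_centers_
```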

The clustering successfully identifies the popular viewpoints, although it is sensitive to image variation such as clutter (for example, people in front of the camera), lighting conditions, and camera zoom. We found that the large clusters are typically almost outlier free (see Figure 3 for an example). The smaller clusters experience significantly more noise or belong to the unrelated images in the photo collection. The clusters now define a loose ordering of the dataset that defines corresponding images as being in the same cluster. The next step employs this ordering to efficiently identify overlapping images as well as the outlier images on a per-cluster basis, using the same techniques (Section 4.1.2) as in the case of video streams.

Figure 3: A large gist cluster for the Statue of Liberty dataset. We show the hundred images closest to the cluster mean. The effectiveness of the low-dimensional gist representation can be gauged by noting that even without enforcing geometric consistency, these clusters display a remarkable degree of structural similarity.

4.4.2. Iconic Image Selection

While the clusters define a local relationship, our method still needs to establish a global relationship between the photos. To facilitate an efficient global registration we further summarize the dataset. The objective of the next step is to find a representative image (the iconic) for each cluster of views. This iconic is later used for efficient global registration. The second objective is to enforce a valid two-view geometry with respect to the iconic for each image in the cluster, in order to remove uncorrelated images.

Given the typically significantly larger feature motion in photo collections, we use SIFT features [58], as in the global correspondence computation of Section 4.2. To establish correspondences for each considered pair, putative feature matching is performed on the GPU. In general, the feature matching process may be structured as a matrix multiplication between large, dense matrices. Our approach thus consists of a call to dense matrix multiplication in the CUBLAS library (http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf), with subsequent instructions to apply the distance ratio test [58].
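The sketch below expresses putative SIFT matching in exactly this form: a single dense matrix product between the two descriptor sets followed by Lowe's distance ratio test, here in numpy rather than CUBLAS.

```python
import numpy as np

def match_sift(desc_a, desc_b, ratio=0.8):
    """Putative matching of two SIFT descriptor sets via one dense matrix product.

    desc_a: (Na, 128), desc_b: (Nb, 128), assumed L2-normalized rows.
    Returns a list of (index_in_a, index_in_b) pairs passing the ratio test.
    """
    # For unit-length descriptors, squared distance = 2 - 2 * dot product,
    # so ranking by dot products (one matrix multiply) ranks by distance.
    sim = desc_a @ desc_b.T                               # (Na, Nb) similarity matrix
    order = np.argsort(-sim, axis=1)[:, :2]               # two most similar candidates
    best, second = order[:, 0], order[:, 1]
    d1 = np.sqrt(np.maximum(0, 2 - 2 * sim[np.arange(len(desc_a)), best]))
    d2 = np.sqrt(np.maximum(0, 2 - 2 * sim[np.arange(len(desc_a)), second]))
    keep = d1 < ratio * d2                                # Lowe's distance ratio test
    return list(zip(np.flatnonzero(keep), best[keep]))
```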

Next we use our ARRSAC algorithm from Section 4.1.2 to validate the two-view relationship between the views. To achieve robustness to degenerate data we combine ARRSAC with our RANSAC-based robust model selection method QDEGSAC [81].

Our method first finds the triplet with mutually valid two-view relationships whose gist descriptors are closest to the gist cluster center. The image with the highest number of inliers in this set is selected as the iconic image of the cluster. Then all remaining images in the cluster are tested for a valid two-view relationship to the iconic. Hence, after this geometric verification each cluster only contains images showing the same scene, and each cluster is visually represented by its iconic image.

Typically there is a significant number of images which are not geometrically consistent with the other images in their cluster or which are in clusters that are not geometrically consistent. Through evaluation we found that these rejected images still contain a significant number of images related to the scene of interest. To register these images we perform a re-clustering using the clustering from Section 4.4.1, followed by the above geometric verification. Afterwards, all remaining unregistered images are tested against the five iconic images nearest in gist space. Using our labeling of subsets of the datasets, we verified that we typically register well over 90% of the frames belonging to the scene of interest.

4.4.3. Iconic Image Registration

After using the clusters to establish a local order and to remove unrelated images, the next step aims at establishing a global relationship between the images. For efficiency reasons we use the small number of iconic images to bootstrap the global registration of all images. We first compute the exhaustive pairwise two-view relationships for all iconics using ARRSAC from Section 4.1.2. This defines a graph of pairwise global relationships between the different iconics, which we call the iconic scene graph. The edges of the graph have a cost corresponding to the number of mutual matches between the images. This graph represents the global registration of the iconics for the dataset based on the 2D constraints given by the epipolar geometry or the homography between the views.

Next our system performs an incremental structure-from-motion process on the iconic scene graph to obtain a registration of the iconic images. First an initial image pair is selected which admits a sufficiently accurate registration, as defined by the roundness of the uncertainty ellipsoids of the triangulated 3D points [82]. Then the remaining views are added by selecting the image with the next strongest set of correspondences, until no further iconic can be added to the 3D model. This gives the 3D sub-model for the component of the iconic scene graph containing the initially selected pair. Each completed sub-model is refined by bundle adjustment as described in Section 4.3.

Typically the iconic scene graph contains multiple independent or weakly connected components. These components correspond, for example, to parts of the scene that have no visual connection to the other scene parts, to interior parts of the model, or to other scenes that are consistently mislabeled (an example is Ellis Island, which is often also labeled as "Statue of Liberty"; see Figure 4). Hence we repeat the 3D modeling until all images are registered or none of the remaining images has any connection to any of the 3D sub-models.

To obtain more complete 3D sub-models than is possible by registering the iconics alone, our system afterwards searches for images in the clusters that support a match between the iconics of two different clusters. For efficiency reasons we only test, for each image in a cluster, whether it matches the iconic of the other cluster. If the match to the target iconic is sufficiently strong (empirically determined as more than 18 features), the matching information of this image is stored. Once a sufficient number of connections is identified, our method uses all constraints provided by the matches to merge the two sub-models to which the clusters belong. The merging is performed by estimating the similarity transformation between the sub-models. This merging process is continued until no more merges are possible.
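Merging two sub-models amounts to estimating a 7-DOF similarity transform (scale, rotation, translation) from corresponding 3D points; a minimal sketch of the closed-form Umeyama/Horn solution is given below. In practice it would be wrapped in a robust estimator such as ARRSAC, which is omitted here.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform mapping 3D points src -> dst (Umeyama).

    src, dst: (N, 3) arrays of corresponding points from the two sub-models.
    Returns scale s, rotation R (3x3), translation t such that dst ≈ s * R @ src + t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)               # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # guard against reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / np.mean(np.sum(src_c ** 2, axis=1))
    t = mu_d - s * R @ mu_s
    return s, R, t
```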

4.4.4. Cluster Camera Registration

After the registration of the iconic images we have a valid global registration of the images in the iconic scene graph. In the last step we aim at extending the registration to all cluster images in the clusters corresponding to the registered iconics. This process takes advantage of the feature matches between the non-iconic images and their respective iconics. It again uses the 2D matches to the iconic to determine the 2D-3D correspondences for the image, and then applies ARRSAC (Section 4.1.2) to determine the camera's registration. Given the incremental image registration, drift inherently accumulates during this process. Hence we periodically apply bundle adjustment to perform a global error correction. In our current system, bundle adjustment is performed after every 25 images added to a sub-model. This saves significant computational resources while in practice keeping the accumulated error at a reasonable level. Detailed results of our 3D reconstruction algorithm are shown in Figures 4-5.

Figure 4: Left: view of the 3D model and the 9025 registered cameras. Right: model from the interior of Ellis Island erroneously labeled as "Statue of Liberty".
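Registering a cluster image against the 3D points already triangulated for its iconic is a robust PnP problem; the sketch below uses OpenCV's RANSAC-based PnP solver as a stand-in for the ARRSAC-based registration described above.

```python
import cv2
import numpy as np

def register_cluster_image(pts_3d, pts_2d, K):
    """Register one non-iconic image from 2D-3D correspondences via robust PnP.

    pts_3d: (N, 3) points of the sub-model seen in the iconic,
    pts_2d: (N, 2) matching pixel locations in the new image, K: 3x3 intrinsics.
    Returns (R, t, inlier_indices) or None if registration fails.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, None,
        reprojectionError=4.0, iterationsCount=500, flags=cv2.SOLVEPNP_EPNP)
    if not ok or inliers is None:
        return None
    R, _ = cv2.Rodrigues(rvec)                # rotation vector -> rotation matrix
    return R, tvec, inliers.ravel()
```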

5. Dense Geometry

The above described methods for video and for photo collections provide a registration of the camera poses of all video frames or photographs. These external and internal camera calibrations are the output of the sparse reconstruction step of both systems. For photo collections they allow the generation of an image-based browsing experience as shown in the "Photo Tourism" work [51, 52]. The camera registration can also be used as input to dense stereo methods. The method of [83] can extract dense scene geometry, but with significant computational effort and therefore only for small scenes. Next we introduce our real-time stereo system for video streams, which is capable of reconstructing large scenes.

Figure 5: Left: 3D reconstruction of the San Marco dataset with 10338 cameras. Right: reconstruction from the Notre Dame dataset with 1300 registered cameras.

Table 1: Computation times for the photo collection reconstruction for the Statue of Liberty dataset, the San Marco dataset (SM) and the Notre Dame dataset (ND).

Dataset   Gist    Gist clustering   Geometric verification   Re-clustering   Registered views   Total time
Liberty   4 min   21 min            56 min                   3.3 hr          13888              4 hr 42 min
SM        4 min   19 min            24 min                   2.76 hr         12253              3 hr 34 min
ND        1 min   3 min             22 min                   1.0 hr          3058               1 hr 28 min

We adopt a stereo/fusion approach for dense geometry reconstruction. For every frame of video, a depthmap is generated using a real-time GPU-based multi-view plane-sweep stereo algorithm. The plane-sweep algorithm, comprised primarily of image warping operations, is highly efficient on the GPU [62]. Because it is multi-view, it is quite robust and can efficiently handle occlusions, a major problem in stereo. The stereo depthmaps are then fused using a visibility-based depthmap fusion method, which also runs on the GPU. The fusion method removes outliers from the depthmaps and combines depth estimates to enhance their accuracy. Because of the redundancy in video (each surface is imaged multiple times), fused depthmaps only need to be produced for a subset of the video frames. This reduces processing time as well as the 3D model size, as discussed in Section 6.

The planesweep stereo algorithm [36] computes a depthmap by testing a family of plane hypotheses and, for each depthmap pixel, recording the distance to the plane with the best photoconsistency score. One view is designated as the reference view, and a depthmap is computed for that view. The remaining views are designated matching views. For each plane, all matching views are back-projected onto the plane and then projected into the reference view. This mapping, known as a plane homography [84], can be performed very efficiently on the GPU. Once the matching views are warped, the photoconsistency score is computed. We use a sum of absolute differences (SAD) photoconsistency measure. To handle occlusions, the matching views are divided into a left and a right subset, and the best matching score of the two subsets is kept [85]. Before the differences are computed, the image intensities are multiplied by the relative exposure (gain) computed during KLT tracking (Section 4.1.1). The stereo algorithm can produce a 512 × 384 depthmap from 11 images (10 matching, 1 reference) and 48 plane hypotheses at a rate of 42 Hz on an Nvidia GTX 280.
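To make this procedure concrete, the following Python/NumPy sketch implements a simplified CPU version of the plane sweep. It assumes grayscale images, precomputed plane homographies mapping reference pixels into each matching view, known relative gains, and matching views ordered from one side of the reference camera to the other; the function names, window size and data layout are illustrative and do not reflect our GPU implementation.

    import numpy as np
    from scipy.ndimage import map_coordinates, uniform_filter

    def warp_to_reference(img, H_ref_to_match):
        # Inverse warping: for every reference pixel, sample the matching image
        # at the position given by the plane homography (bilinear interpolation).
        h, w = img.shape
        ys, xs = np.mgrid[0:h, 0:w]
        pts = H_ref_to_match @ np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
        u, v = pts[0] / pts[2], pts[1] / pts[2]
        return map_coordinates(img, [v.reshape(h, w), u.reshape(h, w)],
                               order=1, mode='nearest')

    def plane_sweep(ref, matches, gains, homographies, plane_depths, window=7):
        # ref: reference image (H x W); matches: matching images ordered left to
        # right of the reference camera; gains: relative exposures w.r.t. the
        # reference; homographies[j][k]: homography of matching view j for plane k.
        h, w = ref.shape
        best_cost = np.full((h, w), np.inf)
        best_depth = np.zeros((h, w))
        half = len(matches) // 2
        for k, depth in enumerate(plane_depths):
            costs = []
            for j, img in enumerate(matches):
                warped = warp_to_reference(gains[j] * img, homographies[j][k])
                costs.append(uniform_filter(np.abs(warped - ref), size=window))  # windowed SAD
            cost_left = np.mean(costs[:half], axis=0)    # views on one side of the reference
            cost_right = np.mean(costs[half:], axis=0)   # views on the other side
            cost = np.minimum(cost_left, cost_right)     # keep the better subset (occlusion handling [85])
            better = cost < best_cost
            best_cost[better] = cost[better]
            best_depth[better] = depth                   # record depth of the best plane
        return best_depth, best_cost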

After the depthmaps have been computed, a fused depthmap is computed for a reference view from the surrounding stereo depthmaps [86]. All points from these depthmaps are projected into the reference view, and for each pixel the point with the highest stereo confidence is chosen. The stereo confidence is computed from the matching scores of the planesweep:

C(x) = \sum_{\pi \neq \pi_0} e^{-\left(\mathrm{SAD}(x,\pi) - \mathrm{SAD}(x,\pi_0)\right)^2}    (1)

where x is the pixel in question, π is a plane from the planesweep, and π_0 is the best plane chosen by the planesweep. This confidence measure prefers depth estimates with low uncertainty, where the best matching score is much lower than the scores for all other planes. After the most confident point has been chosen for a pixel, it is scored according to mutual support, free-space violations, and occlusions. Supporting points are those that fall within 5% of the chosen point's depth value. Occlusions are caused by points that would make the chosen point invisible in the reference view, and free-space violations are points that are made invisible in some other view by the chosen point. If the number of occlusions and free-space violations exceeds the number of support points, the chosen point is discarded and the depth value is marked as invalid. If the point survives, it is replaced with a new point computed as the average of the chosen point and its support points. See Figure 6.
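As an illustration, the sketch below evaluates Eq. (1) and the support/occlusion/free-space test for a single pixel. It is a schematic simplification: the depths contributed by the other depthmaps are assumed to be already projected into the reference view, whereas the actual method of [86] performs these tests with full camera projections on the GPU.

    import numpy as np

    def confidence(sad_costs, best_idx):
        # Eq. (1): sum over all non-best planes of exp(-(SAD(x, pi) - SAD(x, pi0))^2).
        diffs = np.delete(sad_costs, best_idx) - sad_costs[best_idx]
        return np.sum(np.exp(-diffs ** 2))

    def fuse_pixel(chosen_depth, other_depths, rel_tol=0.05):
        # chosen_depth: depth of the most confident candidate at this pixel.
        # other_depths: depths contributed by the surrounding depthmaps at this
        # pixel (assumed already projected into the reference view).
        # NOTE: the real method tests occlusions in the reference view and
        # free-space violations in the other views; here both are approximated
        # along the reference ray for brevity.
        support, occlusions, free_space = [], 0, 0
        for d in other_depths:
            if abs(d - chosen_depth) <= rel_tol * chosen_depth:
                support.append(d)          # agrees within 5% of the chosen depth
            elif d < chosen_depth:
                occlusions += 1            # a closer point would hide the chosen point
            else:
                free_space += 1            # the chosen point would hide this measurement
        if occlusions + free_space > len(support):
            return None                    # conflicts outweigh support: mark depth invalid
        return float(np.mean([chosen_depth] + support))   # average with supporting points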

Figure 6: Our depthmap fusion method combines multiple stereo depthmaps (the middle image shows an example depthmap) to produce a single fused depthmap (shown on the right) with greater accuracy and less redundancy.

Currently our dense geometry method is used only for video. The temporal sequence of the video makes it easy to select nearby views for stereo matching and depthmap fusion. For photo collections, a view selection strategy would be needed that identifies views with similar content and compatible appearance (day, night, summer, winter, etc.). This is left as future work.

6. Model Extraction

Once a fused depthmap has been computed, the next step is to convert it into a texture-mapped 3D polygonal mesh. This can be done trivially by overlaying the depthmap image with a quadrilateral mesh and assigning to each vertex the 3D position given by the depthmap. The mesh can also be trivially texture-mapped with the color image of the corresponding depthmap. However, a number of improvements can be made.

First, depth discontinuities must be detected to prevent the mesh from joining discontinuous foreground and background objects. We simply use a threshold on the magnitude of the depth gradient to determine whether a mesh edge may be created between two vertices. Afterwards, connected components with a small number of vertices are discarded.
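A minimal sketch of such a discontinuity test is given below; the relative gradient threshold is an assumed value for illustration, not the one used in our system.

    import numpy as np

    def valid_mesh_edges(depth, rel_grad_threshold=0.05):
        # Allow a mesh edge between two neighboring vertices only if the depth
        # gradient across the edge is small, i.e. there is no discontinuity.
        dz_x = np.abs(np.diff(depth, axis=1))               # horizontal edges
        dz_y = np.abs(np.diff(depth, axis=0))               # vertical edges
        horiz_ok = dz_x < rel_grad_threshold * depth[:, :-1]
        vert_ok = dz_y < rel_grad_threshold * depth[:-1, :]
        # Connected components of the resulting edge graph that contain only a
        # few vertices would subsequently be discarded.
        return horiz_ok, vert_ok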

Figure 7: Models are extracted from the fused depthmaps (original mesh, simplified mesh, textured mesh).

Second, surfaces which have already been reconstructed in previous fused depthmaps are discarded. The surface reconstructed from the previous fused depthmap in the video sequence is projected into the current depthmap. Any vertices that fall within a threshold of the previous surface are marked as invalid. Their neighboring vertices are then adjusted to coincide with the edge of the previous surface to prevent any cracks.
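Schematically, the redundancy check can be written as follows, assuming a hypothetical rendering step that has already projected the previous fused surface into the current view as a depth image; the tolerance value is illustrative, and the crack-closing adjustment of neighboring vertices is omitted.

    import numpy as np

    def remove_redundant_vertices(depth, prev_depth_in_cur, rel_tol=0.05):
        # depth: depthmap of the current fused view.
        # prev_depth_in_cur: previous fused surface rendered into the current
        # view as a depth image (np.inf where it does not project).
        # Vertices already explained by the previous surface are invalidated.
        redundant = np.abs(depth - prev_depth_in_cur) <= rel_tol * depth
        return np.isfinite(depth) & ~redundant              # mask of vertices to keep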

Third, the mesh is simplified by hierarchically merging neighboring quadrilaterals that are nearly co-planar. In practice, we implement this with a top-down approach: we start with a single quadrilateral spanning the whole depthmap, and if the distance from the quadrilateral to any of its contained points is greater than a threshold, the quadrilateral is split. This is repeated recursively until every quadrilateral faithfully represents its underlying points. The splitting is performed along the longest direction, or at random if the quadrilateral is square. This method represents planar and quasi-planar surfaces with fewer polygons and greatly reduces the size of the output models. See Figure 8.
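The top-down splitting can be sketched as a short recursion. For simplicity, the sketch below measures the deviation of the contained depth samples from a least-squares plane fit rather than from the quadrilateral's own corner plane, and the error threshold is an assumed value.

    import numpy as np

    def split_quad(depth, x0, y0, x1, y1, max_err, quads):
        # Fit a plane to the depth samples in the quad [x0,x1) x [y0,y1) and split
        # recursively while any sample deviates from the plane by more than max_err.
        ys, xs = np.mgrid[y0:y1, x0:x1]
        A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
        z = depth[y0:y1, x0:x1].ravel()
        coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)   # least-squares plane fit
        err = np.max(np.abs(A @ coeffs - z))
        if err <= max_err or (x1 - x0 <= 2 and y1 - y0 <= 2):
            quads.append((x0, y0, x1, y1))               # quad represents its points well
            return
        w, h = x1 - x0, y1 - y0
        split_x = w > h or (w == h and np.random.rand() < 0.5)  # longest side, random if square
        if split_x:
            xm = (x0 + x1) // 2
            split_quad(depth, x0, y0, xm, y1, max_err, quads)
            split_quad(depth, xm, y0, x1, y1, max_err, quads)
        else:
            ym = (y0 + y1) // 2
            split_quad(depth, x0, y0, x1, ym, max_err, quads)
            split_quad(depth, x0, ym, x1, y1, max_err, quads)

    def simplify_depthmap(depth, max_err=0.05):
        quads = []
        h, w = depth.shape
        split_quad(depth, 0, 0, w, h, max_err, quads)
        return quads   # axis-aligned quads covering the depthmap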

Figure 8: Several 3D models reconstructed from video sequences.


7. Conclusions

In this paper we presented methods for real-time reconstruction from video and for fast reconstruction from Internet photo collections on a single PC. We demonstrated both methods on a variety of large-scale datasets, confirming their computational performance. We achieved this through consistent parallelization of all computations involved, enabling their execution on the graphics card as a highly parallel processor. Additionally, for modeling from Internet photo collections we combine recognition constraints with geometric constraints, leading to image registration that is orders of magnitude more efficient than existing systems.

Acknowledgements. We would like to acknowledge the DARPA UrbanScape project, NSF Grant IIS-0916829 and other funding from the US government, as well as our collaborators David Nister, Amir Akbarzadeh, Philippos Mordohai, Paul Merrell, Chris Engels, Henrik Stewenius, Brad Talton, Liang Wang, Qingxiong Yang, Ruigang Yang, Greg Welch, Herman Towles and Xiaowei Li.

[1] A. Fischer, T. Kolbe, F. Lang, A. Cremers, W. Förstner, L. Plümer, V. Steinhage, Extracting buildings from aerial images using hierarchical aggregation in 2d and 3d, Computer Vision and Image Understanding 72 (2) (1998) 185–203.

[2] A. Gruen, X. Wang, Cc-modeler: A topology generator for 3-d city models, ISPRS Journal of Photogrammetry & Remote Sensing 53 (5) (1998) 286–295.

[3] Z. Zhu, A. Hanson, E. Riseman, Generalized parallel-perspective stereo mosaics from airborne video, IEEE Trans. on Pattern Analysis and Machine Intelligence 26 (2) (2004) 226–237.

[4] C. Früh, A. Zakhor, An automated method for large-scale, ground-based city model acquisition, Int. J. of Computer Vision 60 (1) (2004) 5–24.

[5] I. Stamos, P. Allen, Geometry and texture recovery of scenes of large scale, Computer Vision and Image Understanding 88 (2) (2002) 94–118.

[6] S. El-Hakim, J.-A. Beraldin, M. Picard, A. Vettore, Effective 3d modeling of heritage sites, in: 4th International Conference of 3D Imaging and Modeling, 2003, pp. 302–309.

[7] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.

[8] R. I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd Edition, Cambridge University Press, 2004.

[9] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Int. Joint Conf. on Artificial Intelligence, 1981, pp. 674–679.

[10] C. Zach, D. Gallup, J. Frahm, Fast gain-adaptive KLT tracking on the GPU, in: CVPR Workshop on Visual Computer Vision on GPU's (CVGPU), 2008.

[11] R. Fergus, P. Perona, A. Zisserman, A visual category filter for Google images, in: ECCV, 2004.

[12] T. Berg, D. Forsyth, Animals on the web, in: CVPR, 2006.

[13] L.-J. Li, G. Wang, L. Fei-Fei, Optimol: automatic object picture collection via incremental model learning, in: CVPR, 2007.

[14] F. Schroff, A. Criminisi, A. Zisserman, Harvesting image databases from the web, in: ICCV, 2007.

[15] B. Collins, J. Deng, L. Kai, L. Fei-Fei, Towards scalable dataset construction: An active learning approach, in: ECCV, 2008.

[16] J. Philbin, A. Zisserman, Object mining using a matching graph on very large image collections, in: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

[17] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, H. Neven, Tour the world: Building a web-scale landmark recognition engine, in: CVPR, 2009.

[18] P. Beardsley, A. Zisserman, D. Murray, Sequential updating of projective and affine structure from motion, Int. J. Computer Vision 23 (3) (1997) 235–259.

[19] A. Fitzgibbon, A. Zisserman, Automatic camera recovery for closed or open image sequences, in: European Conf. on Computer Vision, 1998, pp. 311–326.

[20] D. Nistér, An efficient solution to the five-point relative pose problem, IEEE Trans. on Pattern Analysis and Machine Intelligence 26 (6) (2004) 756–777.

[21] D. Nistér, O. Naroditsky, J. Bergen, Visual odometry for ground vehicle applications, Journal of Field Robotics 23 (1).

[22] American Society of Photogrammetry, Manual of Photogrammetry (5th edition), ASPRS Pubns, 2004.

[23] A. Azarbayejani, A. Pentland, Recursive estimation of motion, structure, and focal length, IEEE Trans. on Pattern Analysis and Machine Intelligence 17 (6) (1995) 562–575. doi:http://doi.ieeecomputersociety.org/10.1109/34.387503.

[24] S. Soatto, P. Perona, R. Frezza, G. Picci, Recursive motion and structure estimation with complete error characterization, in: Int. Conf. on Computer Vision and Pattern Recognition, 1993, pp. 428–433.

[25] S. Palmer, E. Rosch, P. Chase, Canonical perspective and the perception of objects, Attention and Performance IX (1981) 135–151.

[26] V. Blanz, M. Tarr, H. Bulthoff, What object attributes determine canonical views?, Perception 28 (5) (1999) 575–600.

[27] T. Denton, M. Demirci, J. Abrahamson, A. Shokoufandeh, S. Dickinson, Selecting canonical views for view-based 3-d object recognition, in: ICPR, 2004, pp. 273–276.

[28] P. Hall, M. Owen, Simple canonical views, in: BMVC, 2005, pp. 839–848.

[29] T. L. Berg, D. Forsyth, Automatic ranking of iconic images, Tech. rep., University of California, Berkeley (2007).

[30] Y. Jing, S. Baluja, H. Rowley, Canonical image selection from the web, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, 2007.

[31] I. Simon, N. Snavely, S. M. Seitz, Scene summarization for online image collections, in: ICCV, 2007.

[32] T. L. Berg, A. C. Berg, Finding iconic images, in: The 2nd Internet Vision Workshop at IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[33] R. Raguram, S. Lazebnik, Computing iconic summaries of general visual concepts, in: Workshop on Internet Vision, CVPR, 2008.

[34] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. of Computer Vision 47 (1-3) (2002) 7–42.

[35] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, R. Szeliski, A comparison and evaluation of multi-view stereo reconstruction algorithms, in: Int. Conf. on Computer Vision and Pattern Recognition, 2006, pp. 519–528.

[36] R. Collins, A space-sweep approach to true multi-image matching, in: Int. Conf. on Computer Vision and Pattern Recognition, 1996, pp. 358–363.

[37] R. Yang, M. Pollefeys, Multi-resolution real-time stereo on commodity graphics hardware, in: Int. Conf. on Computer Vision and Pattern Recognition, 2003, pp. 211–217.

[38] T. Werner, A. Zisserman, New techniques for automated architectural reconstruction from photographs, in: European Conf. on Computer Vision, 2002, pp. 541–555.

[39] P. Burt, L. Wixson, G. Salgian, Electronically directed "focal" stereo, in: Int. Conf. on Computer Vision, 1995, pp. 94–101.

[40] S. N. Sinha, D. Steedly, R. Szeliski, Piecewise planar stereo for image-based rendering, in: International Conference on Computer Vision (ICCV), 2009.

[41] Y. Furukawa, B. Curless, S. M. Seitz, R. Szeliski, Manhattan-world stereo, in: Computer Vision and Pattern Recognition (CVPR), 2009.

[42] G. Turk, M. Levoy, Zippered polygon meshes from range images, in: SIGGRAPH, 1994, pp. 311–318.

[43] M. Soucy, D. Laurendeau, A general surface approach to the integration of a set of range views, IEEE Trans. on Pattern Analysis and Machine Intelligence 17 (4) (1995) 344–358.

[44] A. Hilton, A. Stoddart, J. Illingworth, T. Windeatt, Reliable surface reconstruction from multiple range images, in: European Conf. on Computer Vision, 1996, pp. 117–126.

[45] B. Curless, M. Levoy, A volumetric method for building complex models from range images, SIGGRAPH 30 (1996) 303–312.

[46] M. Wheeler, Y. Sato, K. Ikeuchi, Consensus surfaces for modeling 3d objects from multiple range images, in: Int. Conf. on Computer Vision, 1998, pp. 917–924.

[47] R. Koch, M. Pollefeys, L. Van Gool, Robust calibration and 3d geometric modeling from large collections of uncalibrated images, in: DAGM, 1999, pp. 413–420.

[48] G. Schindler, P. Krishnamurthy, F. Dellaert, Line-based structure from motion for urban environments, in: 3DPVT, 2006.

[49] G. Schindler, F. Dellaert, S. Kang, Inferring temporal order of images from 3d structure, in: Int. Conf. on Computer Vision and Pattern Recognition, 2007.

[50] N. Cornelis, K. Cornelis, L. Van Gool, Fast compact city modeling for navigation pre-visualization, in: Int. Conf. on Computer Vision and Pattern Recognition, 2006.

[51] N. Snavely, S. M. Seitz, R. Szeliski, Photo tourism: Exploring photo collections in 3d, in: SIGGRAPH, 2006, pp. 835–846.

[52] N. Snavely, S. M. Seitz, R. Szeliski, Modeling the world from Internet photo collections, International Journal of Computer Vision 80 (2) (2008) 189–210.

[53] K. Ni, D. Steedly, F. Dellaert, Out-of-core bundle adjustment for large-scale 3d reconstruction, in: ICCV, 2007.

[54] N. Snavely, S. M. Seitz, R. Szeliski, Skeletal sets for efficient structure from motion, in: CVPR, 2008.

[55] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, R. Szeliski, Building Rome in a day, in: ICCV, 2009.

[56] S. Arya, D. Mount, N. Netanyahu, R. Silverman, A. Wu, An optimal algorithm for approximate nearest neighbor searching fixed dimensions, Journal of the ACM 45 (1998) 891–923.

[57] O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, Total recall: Automatic query expansion with a generative feature model for object retrieval, in: ICCV, 2007.

[58] D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. of Computer Vision 60 (2) (2004) 91–110.

[59] R. Raguram, J.-M. Frahm, M. Pollefeys, A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus, in: ECCV, 2008.

[60] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: International Joint Conference on Artificial Intelligence (IJCAI), 1981, pp. 674–679. URL citeseer.ist.psu.edu/lucas81iterative.html

[61] J. Shi, C. Tomasi, Good features to track, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994.

[62] S. Kim, D. Gallup, J.-M. Frahm, A. Akbarzadeh, Q. Yang, R. Yang, D. Nistér, M. Pollefeys, Gain adaptive real-time stereo streaming, in: Int. Conf. on Vision Systems, 2007.

[63] M. Fischler, R. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Comm. of the ACM 24 (6) (1981) 381–395.

[64] R. Raguram, J.-M. Frahm, M. Pollefeys, A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus, in: Proc. ECCV, 2008.

[65] D. Nister, Preemptive RANSAC for live structure and motion estimation, in: Proc. ICCV, Vol. 1, 2003, pp. 199–206.

[66] J. Matas, O. Chum, Randomized RANSAC with sequential probability ratio test, in: Proc. ICCV, 2005, pp. 1727–1732.

[67] M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, H. Towles, Detailed real-time urban 3d reconstruction from video, Int. Journal of Computer Vision, special issue on Modeling Large-Scale 3D Scenes.

[68] C. Wu, F. Fraundorfer, J.-M. Frahm, M. Pollefeys, 3d model search and pose estimation from single images using VIP features, Computer Vision and Pattern Recognition Workshops (2008) 1–8. doi:http://doi.ieeecomputersociety.org/10.1109/CVPRW.2008.4563037.

[69] D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: Proceedings of IEEE CVPR, Vol. 2, IEEE Computer Society, Los Alamitos, CA, USA, 2006, pp. 2161–2168. doi:http://doi.ieeecomputersociety.org/10.1109/CVPR.2006.264.

[70] G. Schindler, M. Brown, R. Szeliski, City-scale location recognition, in: Proceedings of IEEE CVPR, 2007.

[71] A. Irschara, C. Zach, J.-M. Frahm, H. Bischof, From structure-from-motion point clouds to fast location recognition, in: Proceedings of IEEE CVPR, 2009, pp. 2599–2606.

[72] F. Dellaert, M. Kaess, Square root SAM: Simultaneous location and mapping via square root information smoothing, International Journal of Robotics Research.

[73] C. Engels, H. Stewénius, D. Nistér, Bundle adjustment rules, in: Photogrammetric Computer Vision (PCV), 2006.

[74] K. Ni, D. Steedly, F. Dellaert, Out-of-core bundle adjustment for large-scale 3D reconstruction, in: IEEE International Conference on Computer Vision (ICCV), 2007.

[75] M. Kaess, A. Ranganathan, F. Dellaert, iSAM: Incremental smoothing and mapping, IEEE Transactions on Robotics.

[76] T. A. Davis, J. R. Gilbert, S. Larimore, E. Ng, A column approximate minimum degree ordering algorithm, ACM Transactions on Mathematical Software 30 (3) (2004) 353–376.

[77] M. Lourakis, A. Argyros, The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm, Tech. Rep. 340, Institute of Computer Science - FORTH (2004).

[78] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, IJCV 42 (3) (2001) 145–175.

[79] J. Hays, A. A. Efros, Scene completion using millions of photographs, in: SIGGRAPH, 2007.

[80] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, C. Schmid, Evaluation of gist descriptors for web-scale image search, in: International Conference on Image and Video Retrieval, 2009.

[81] J.-M. Frahm, M. Pollefeys, RANSAC for (quasi-)degenerate data (QDEGSAC), in: CVPR, Vol. 1, 2006, pp. 453–460.

[82] C. Beder, R. Steffen, Determining an initial image pair for fixing the scale of a 3d reconstruction from an image sequence, in: Proc. DAGM, 2006, pp. 657–666.

[83] M. Goesele, N. Snavely, B. Curless, H. Hoppe, S. M. Seitz, Multi-view stereo for community photo collections, in: ICCV, 2007.

[84] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.

[85] S. Kang, R. Szeliski, J. Chai, Handling occlusions in dense multi-view stereo, in: Int. Conf. on Computer Vision and Pattern Recognition, 2001, pp. 103–110.

[86] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nister, M. Pollefeys, Real-time visibility-based fusion of depth maps, in: Proceedings of International Conf. on Computer Vision, 2007.
