Fast Robust Large-scale Mapping from Video and
Internet Photo Collections

Jan-Michael Frahm a, Marc Pollefeys b,a, Svetlana Lazebnik a, Christopher
Zach b, David Gallup a, Brian Clipp a, Rahul Raguram a, Changchang Wu a,
Tim Johnson a

a University of North Carolina at Chapel Hill, NC, USA
b Eidgenössische Technische Hochschule Zürich
Abstract

Fast automatic modeling of large-scale environments has been enabled by
recent research in computer vision. In this paper we present a system
approaching the fully automatic modeling of large-scale environments. Our
system takes as input either a video stream or a collection of photographs,
which can be readily obtained from Internet photo sharing websites such as
Flickr. The design of the system is aimed at achieving high computational
performance in the modeling. This is achieved through algorithmic
optimizations for efficient robust estimation, a two-stage stereo estimation
that reduces the computational cost while maintaining competitive modeling
results, and the use of image-based recognition for efficient grouping of
corresponding images in the photo collection. The second major improvement
in computational speed is achieved through parallelization and execution on
commodity graphics hardware. These improvements lead to real-time video
processing and to reconstruction from tens of thousands of images within the
span of a day on a single commodity computer. We demonstrate the performance
of both systems on a variety of real-world data.
Preprint submitted to ISPRS January 22, 2010
Keywords:

1. Introduction
The fully automatic modeling of large-scale environments has been a research
goal in photogrammetry and computer vision for a long time. Detailed 3D
models automatically acquired from the real world have many uses, including
civil and military planning, mapping, virtual tourism, games, and movies. In
this paper we present a system approaching the fully automatic modeling of
large-scale environments, either from video or from photo collections.

Our system has been designed for efficiency and scalability. GPU
implementations of SIFT feature matching, KLT feature tracking, gist feature
extraction and multi-view stereo allow our system to rapidly process large
amounts of video and photographs. For large photo collections, iconic image
clustering allows the dataset to be broken into small, closely related
parts. The parts are then processed and merged, allowing tens of thousands
of photos to be registered. Similarly, for videos, loop detection finds
intersections in the camera path, which reduces drift for long sequences and
allows multiple videos to be registered.
Recently, mapping systems like Microsoft Bing Maps and Google Earth have
started to use 3D models of cities in their visualizations. Currently these
systems still require a human in the loop to deliver models of reasonable
quality. Nevertheless they already achieve impressive results, modeling
large areas with regular updates. However, these models are of very low
complexity and do not provide enough detail for ground-level viewing.
Furthermore, the
Figure 1: The left shows an overview model of Chapel Hill reconstructed from
1.3 million video frames on a single PC in 11.5 hrs. On the right a model of
the Statue of Liberty is shown. The reconstruction algorithm registered 9025
cameras out of 47238 images downloaded from the Internet.
availability is restricted to only a small number of cities across the
globe. On the other hand, our system can automatically produce 3D models
from ground-level images with ground-level detail. And the efficiency and
scalability of our system opens up the possibility of reconstructing 3D
models for all the world's cities quickly and cheaply.

In this paper we review the details of our system and present 3D models of
large areas reconstructed from thousands to millions of images. These models
are produced with state-of-the-art computational efficiency, yet are
competitive with the state of the art in quality.
2. Related Work
In recent years there has been considerable progress on large-scale
reconstruction from video of urban environments and aerial data [1, 2, 3],
as well as on reconstruction from Internet photo collections. Here we
discuss the most related work in the areas of fast structure from motion,
camera registration from Internet photo collections, real-time dense stereo,
scene summarization, and landmark recognition.
Besides the purely image-based approaches, there is also work on modeling
the geometry by combining cameras and active range scanners. Früh and Zakhor
[4] proposed a mobile system mounted on a vehicle to capture large amounts
of data while driving in urban environments. Earlier systems by Stamos and
Allen [5] and El-Hakim et al. [6] constrained the scanning laser to a few
pre-determined viewpoints. In contrast to these comparatively expensive
active systems, our approach for real-time reconstruction from video uses
only cameras, leveraging the methodology developed by the computer vision
community over the last two decades [7, 8].
The first step in our system is to establish correspondences between the
video frames or between the different images of the photo collection,
respectively. We use two different approaches for establishing
correspondences. For video data we use an extended KLT tracker [9] that
exploits the inherent parallelism of the tracking problem to improve
computational performance through execution on the graphics processor (GPU).
The specific approach is described in detail by Zach et al. [10].
In the case of Internet photo collections we have to address the challenge
of dataset collection, which is the following problem: starting with the
heavily contaminated output of an Internet image search query, extract a
high-precision subset of images that are actually relevant to the query.
Existing approaches to this problem [11, 12, 13, 14, 15] consider general
visual categories not necessarily related by rigid 3D structure. These
techniques use statistical models to combine different kinds of 2D image
features (texture, color, keypoints), as well as text and tags. The approach
by Philbin et al. [16, 17] uses a loose spatial consistency constraint. In
contrast, our system enforces a rigid 3D scene structure shared by the
images.
Our system determines the camera registration from images or video frames.
In general there are two classes of methods used to determine the camera
registration. The first class leverages the work in multiple view geometry
and typically alternates between robustly estimating camera poses and 3D
point locations directly [18, 19, 20, 21]. Often bundle adjustment [22] is
used in the process to refine the estimate. The other class of methods uses
an extended Kalman filter to estimate both camera motion and 3D point
locations jointly as the state of the filter [23, 24].
For reconstruction from Internet photo collections our system uses a
hierarchical reconstruction method starting from a set of canonical or
iconic views representing different viewpoints as well as different parts of
the scene. The issue of canonical view selection has been addressed both in
the psychology [25, 26] and the computer vision literature [27, 28, 29, 30,
31, 32]. Simon et al. [31] observed that community photo collections provide
a likelihood distribution over the viewpoints from which people prefer to
take photographs. Hence, canonical view selection identifies prominent
clusters or modes of this distribution. Simon et al. find these modes by
clustering images based on the output of local feature matching and epipolar
geometry verification between every pair of images in the dataset, i.e.,
through a 3D registration of the data. While this solution is effective, it
is computationally expensive, and it treats scene summarization as a
by-product of 3D reconstruction. By contrast, we view summarization as an
image organization step that precedes 3D reconstruction, and we take
advantage of the observation that a good subset of representative or
"iconic" images can be obtained using relatively simple 2D appearance-based
techniques [29, 33].
Since [34, 35] provide excellent surveys of the work on binocular and
multiple-view stereo, we refer the reader to these articles. Our system for
reconstruction from video streams extends the idea of Collins [36] to dense
geometry estimation. This approach permits an efficient execution on the
GPU, as proposed originally by Yang and Pollefeys [37]. A number of
approaches have been proposed to specifically target urban environments,
mainly exploiting the fact that they predominantly consist of planar
surfaces [38, 39, 40], or even stricter orthogonality constraints [41]. To
ensure computational feasibility, large-scale systems generate partial
reconstructions, which are afterwards merged into a common model. Conflicts
and errors in the partial reconstructions are identified and resolved during
the merging process. Most approaches so far target data produced by range
finders, where noise levels and the fraction of outliers are typically
significantly lower than those of passive stereo. Turk and Levoy [42]
proposed to remove the overlapping mesh parts after registering the
triangular meshes. A similar approach was proposed by Soucy and Laurendeau
[43]. Volumetric model merging was proposed by Hilton et al. [44] and
Curless and Levoy [45], and was later extended by Wheeler et al. [46]. Koch
et al. [47] presented a volumetric approach for fusion as part of an
uncalibrated 3D modeling system. Given depth maps for all images, the depth
estimates for all pixels are projected into the voxelized 3D space. Each
depth estimate votes for a voxel in a probabilistic way, and the final
surfaces are extracted by thresholding the 3D probability volume.
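For illustration, such a voting scheme can be condensed to a few lines. The
sketch below is our own simplified construction, not the code of [47]: depth
is discretized into bins along the viewing ray (a 1-D stand-in for the voxel
grid), each depth map casts one vote per pixel, and only depths with enough
support survive the thresholding.

```python
import numpy as np

def fuse_depth_votes(depth_maps, z_bins, min_support):
    """Per-pixel depth voting over a discretized depth axis.

    Each input depth map votes for the bin containing its estimate;
    the winning bin is kept only if enough maps agree (thresholding
    the vote volume), which rejects erroneous estimates.
    """
    H, W = depth_maps[0].shape
    votes = np.zeros((H, W, len(z_bins) - 1))
    for d in depth_maps:
        k = np.clip(np.digitize(d, z_bins) - 1, 0, len(z_bins) - 2)
        i, j = np.indices((H, W))
        votes[i, j, k] += 1
    centers = 0.5 * (z_bins[:-1] + z_bins[1:])
    fused = centers[votes.argmax(-1)]
    fused[votes.max(-1) < min_support] = np.nan  # weakly supported: reject
    return fused
```

A real implementation votes into a full 3D voxel grid after back-projecting
each pixel through its camera, but the accumulate-then-threshold logic is
the same.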
Recently, a number of systems have addressed the large-scale reconstruction
of urban environments or landmarks. The 4D Atlanta project of Schindler et
al. [48, 49] models the evolution of an urban environment over time from an
unordered collection of photos. To achieve fast processing and provide
efficient visualization, Cornelis et al. [50] proposed a ruled surface model
for urban environments. Specifically, in their approach the facades are
modeled as ruled surfaces parallel to the gravity vector.
One of the first works to demonstrate 3D reconstruction of landmarks from
Internet photo collections is the Photo Tourism system [51, 52]. This system
achieves high-quality reconstruction results with the help of exhaustive
pairwise image matching and global bundle adjustment after inserting each
new view. Unfortunately, this process becomes very computationally expensive
for large datasets, and it is particularly inefficient for heavily
contaminated collections, most of whose images cannot be registered to each
other. Since the introduction of the seminal Photo Tourism system, several
researchers have developed SfM methods that exploit the redundancy in
community photo collections to make reconstruction more efficient. In [53],
a technique for out-of-core bundle adjustment is proposed which takes
advantage of this redundancy by locally optimizing the "hot spots" and then
connecting the local solutions into a global one. Snavely et al. [54] find
skeletal sets of images from the collection whose reconstruction provides a
good approximation to a reconstruction involving all the images. One
remaining limitation of this work is that it still requires as an initial
step the exhaustive computation of all two-view relationships in the
dataset. Exploiting the independence of the pairwise evaluations, Agarwal et
al. [55] address this computational challenge by using a computing cluster
with up to 500 cores to reduce the computation time significantly. Similar
to our system, they use image recognition techniques like approximate
nearest neighbor search [56] and query expansion [57] to reduce the number
of candidate relations. In contrast, our system uses single-image
descriptors to efficiently identify related images, leading to an even
smaller number of candidates. Images can then be registered on a single
computation core in time comparable to the 500 cores of Agarwal et al.
3. Overview
In this paper we describe our system for real-time computation of 3D models
from ground reconnaissance video and for efficient 3D reconstruction from
Internet photo collections downloaded from photo sharing websites like
Flickr 1. The design of the system aims at efficient reconstruction through
parallelization of the algorithms used, enabling their execution on the GPU.
The processing pipelines for the two types of input share a large number of
methods algorithmically, with some adaptations accounting for the
significantly different characteristics of the data sources. In this section
we briefly discuss the different characteristics of the data sources and
then introduce the main building blocks of the system.
1 www.flickr.com

Our method for 30 Hz real-time reconstruction from video uses the video
streams of multiple cameras mounted on a vehicle to obtain a 3D
reconstruction of the scene. Our system can operate from video alone, but
additional sensors, such as differential GPS and an inertial navigation
system (INS), reduce accumulated error (drift) and provide registration to
the Earth's coordinate system. Our system efficiently fuses the GPS, INS and
camera measurements into an absolute camera position. Additionally, our
system exploits the temporal order of the video frames, which implies a
spatial relationship between the frames, to efficiently perform the camera
motion tracking and the dense geometry computation.
In contrast to video, in Internet photo collections we encounter a high
number of photos that do not share a rigid geometry with the images of the
scene. These images are often mislabeled images in the database, artistic
images of the scene, duplicates, or parodies of the scene. Examples of these
different classes of unrelated images are shown in Figure 2. To quantify the
amount of unrelated images in Internet photo collections, we labeled a
random subset of three downloaded datasets. Besides the dataset for the
Statue of Liberty (47238 images with approximately 40% outliers), we
downloaded two more datasets by searching for "San Marco" (45322 images with
approximately 60% outliers) and "Notre Dame" (11928 images with
approximately 49% outliers). The iconic clustering employed by our system
can efficiently handle such high outlier ratios, which is essential for
handling large Internet photo collections.
Despite the differences in the input data, our systems share a common
algorithmic structure, as shown in Algorithm 1:

Figure 2: Images downloaded from the Internet by searching for "Statue of
Liberty" that do not represent the Statue of Liberty in New York.

• Local correspondence estimation establishes correspondences for
salient feature points to neighboring views. In the case of video these
correspondences are determined through KLT tracking on the GPU with
simultaneous estimation of the camera gain (Section 4.1.1). For Internet
photo collections our system first establishes a spatial neighborhood
relationship for each image through a clustering of single-image
descriptors, which effectively clusters close viewpoints. The feature points
used are SIFT features [58], due to their enhanced invariance to larger
viewpoint changes (Section 4.1).
• Camera pose/motion estimation from local correspondences is robustly
performed through an efficient RANSAC technique called ARRSAC
(Section 4.1.2). It determines the inlier correspondences and the camera
position with respect to the previous image and the known model. If GPS data
are available, our system uses the local correspondences to perform sensor
fusion with the GPS and INS data (Section 4.1.3).
• Global correspondence estimation is performed to enable global drift
correction. This step searches beyond the temporal neighbors in video, and
beyond the neighbors in the cluster, for frames that overlap with the
current frame (Section 4.2).

• Bundle adjustment uses the global correspondences to reduce the
accumulated drift of the camera registrations from the local camera pose
estimates (Section 4.3).

• Dense geometry estimation in real time is performed by our 3D
reconstruction from video system to extract the depth of all pixels from the
camera (Section 5). Our system uses a two-stage estimation strategy. First
we use efficient GPU-based stereo to determine the depth map for every
frame. Then we use the temporal redundancy of the stereo estimations to
filter out erroneous depth estimates and fuse the correct ones.

• Model extraction is performed to extract a triangular mesh representing
the scene from the depth maps.
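The viewpoint clustering in the first step can be illustrated with a small
sketch. The snippet below is our own simplified illustration, not the
system's actual gist pipeline: each image is treated as a single global
descriptor vector, the vectors are grouped with Lloyd's k-means under a
deterministic farthest-point initialization, and the member closest to each
cluster center would serve as the iconic view.

```python
import numpy as np

def cluster_descriptors(X, k, iters=20):
    """Group single-image descriptors (rows of X) into k viewpoint clusters."""
    # farthest-point initialization: deterministic and spreads the seeds
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    C = np.array(centers, dtype=float)
    for _ in range(iters):  # standard Lloyd iterations
        labels = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(0)
    # the "iconic" image of each cluster: the member closest to its center
    iconic = [int(np.where(labels == j)[0][
        ((X[labels == j] - C[j]) ** 2).sum(1).argmin()]) for j in range(k)]
    return labels, iconic
```

The point of the sketch is the cost model: clustering N single-image
descriptors is linear in N per iteration, instead of the quadratic pairwise
matching used by exhaustive approaches.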
The next section first introduces our algorithm for real-time 3D
reconstruction from video. Afterwards, in Section 4.4, we detail the
differences in our algorithm for camera registration of Internet photo
collections.
Algorithm 1 Reconstruction Scheme
for all images/frames do
  Local correspondence estimation with immediate spatial neighbors
  Camera pose/motion estimation from local correspondences
end for
for all registered images/frames do
  Global correspondence estimation to non-neighbors to identify path
  intersections and overlaps not initially found
  Bundle adjustment for global error mitigation using the global and local
  correspondences
end for
for all frames in video do
  Dense geometry estimation using stereo and depth map fusion
  Model extraction
end for
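The control flow of Algorithm 1 can be written as a thin driver. All stage
functions below are hypothetical placeholders for the components described
in Sections 4 and 5; only the pass structure, per-frame local processing
first, then global correction over the registered frames, then dense
geometry, mirrors the scheme above.

```python
def reconstruct(frames, match_local, estimate_pose,
                match_global, bundle_adjust, dense_stereo, extract_model):
    """Driver mirroring Algorithm 1; every callable is a pluggable stage."""
    # pass 1: sequential local correspondences and pose estimation
    poses = {}
    for f in frames:
        poses[f] = estimate_pose(match_local(f))
    # pass 2: global correspondences and bundle adjustment over registered frames
    global_corr = {f: match_global(f) for f in poses}
    poses = bundle_adjust(poses, global_corr)
    # pass 3: dense geometry and model extraction (video only)
    depths = {f: dense_stereo(f, poses) for f in frames}
    return extract_model(depths)
```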
4. Camera Pose Estimation

In this section we discuss the methods used for camera registration in our
system. We first introduce the methods for camera registration of video
sequences in Section 4.1. These methods inherently use the known temporal
relationships of the frames for camera motion estimation, providing
efficient constraints for camera registration. Second, in Section 4.4 we
discuss the extension of the camera registration to Internet photo
collections to address the challenge of an unordered dataset.
4.1. Camera Pose from Video

Our system for the registration of video streams exploits the fact that the
observed motion between consecutive frames of a video sequence is typically
small, namely a few pixels. This allows our system to treat correspondence
estimation as a tracking problem, as discussed in Section 4.1.1. From these
correspondences we estimate the camera motion through our recently proposed
efficient adaptive real-time random sampling consensus method (ARRSAC) [59],
introduced in Section 4.1.2. Alternatively, if GPS and inertial measurements
are available, the camera motion is estimated through a Kalman filter that
efficiently fuses visual correspondences and the six-DOF measurements of the
camera pose, as introduced in Section 4.1.3.
4.1.1. Local Correspondences

Reconstructing the structure of a scene from images begins with finding
corresponding features between pairs of images. In a video sequence we can
take advantage of the temporal ordering and small camera motion between
frames to speed up correspondence finding. Kanade-Lucas-Tomasi feature
tracking [60, 61] is a differential tracking method that first finds strong
corner features in a video frame. These strong features are then tracked
using a linear approximation of the image gradients over a small window of
pixels surrounding the feature. One of the limitations of this approach is
that feature motion of more than one pixel cannot be differentially tracked.
A scale-space approach based on an image pyramid is generally used to
alleviate this problem. If a feature's motion is more than a pixel at the
finest scale, we can still track it because its motion will be less than one
pixel at a coarser image scale.
Building image pyramids and tracking features are both highly parallel
operations and so can be programmed efficiently on the graphics processor
(GPU). Building an image pyramid is simply a series of convolution
operations with Gaussian blur kernels followed by downsampling. Differential
tracking can be implemented in parallel by assigning each feature to be
tracked to a separate processor. With typically on the order of 1024
features to track from frame to frame in a 1024×768 image, we can achieve a
great deal of parallelism in this way.
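The core differential update can be condensed to a few lines. The sketch
below is our own single-window, single-level illustration, not the GPU
tracker: it solves the 2×2 normal equations of the linearized
brightness-constancy constraint for a translation d.

```python
import numpy as np

def lk_translation(I, J, win):
    """One Lucas-Kanade step: least-squares translation d with J(x) ~ I(x - d)."""
    Iy, Ix = np.gradient(I)          # image gradients (rows = y, cols = x)
    It = J - I                       # temporal difference
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    d, *_ = np.linalg.lstsq(A, -It[win].ravel(), rcond=None)
    return d                         # (dx, dy)
```

A pyramidal tracker runs this coarse to fine, doubling and refining d at
each level, which is what makes motions of more than a pixel trackable; on
the GPU, each feature window is solved by a separate thread block.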
One major limitation of standard KLT tracking is that it uses the absolute
difference between the windows of pixels in two images to calculate the
feature track update. When a large change in intensity occurs between
frames, this can cause the tracking to fail. This happens frequently in
videos shot outdoors, for example when the camera moves from light into
shadow. Kim et al. [62] developed a gain-adaptive KLT tracker that measures
the change in mean intensity of all the tracked features and compensates for
this change in the KLT update equations. Estimating the gain can also be
performed on the GPU at minimal additional cost compared to standard GPU
KLT.
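Gain adaptation fits naturally into the same least-squares system. Modeling
the new frame as J(x) ≈ (1 + λ) I(x − d) and linearizing yields one extra
unknown λ alongside the translation. The sketch below is again our own
illustrative version, not the formulation or GPU implementation of [62]: it
simply appends a −I column to the design matrix.

```python
import numpy as np

def lk_translation_gain(I, J, win):
    """Joint LK step for translation (dx, dy) and multiplicative gain lambda,
    from the linearization J - I ~ lambda * I - Ix*dx - Iy*dy."""
    Iy, Ix = np.gradient(I)
    It = J - I
    A = np.stack([Ix[win].ravel(), Iy[win].ravel(), -I[win].ravel()], axis=1)
    p, *_ = np.linalg.lstsq(A, -It[win].ravel(), rcond=None)
    return p  # (dx, dy, lambda)
```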
4.1.2. Robust Pose Estimation

The estimated local correspondences typically contain a significant portion
of erroneous correspondences. To determine the correct camera position we
apply a Random Sample Consensus (RANSAC) algorithm [63] to estimate the
relative camera motion through the essential matrix [20]. While highly
robust, the RANSAC algorithm can also be computationally very expensive,
with a runtime that is exponential in the outlier ratio and model
complexity. Our recently proposed Adaptive Real-Time Random Sample Consensus
(ARRSAC) algorithm [64] is capable of providing accurate real-time
estimation over a wide range of inlier ratios.
ARRSAC moves away from the traditional hypothesize-and-verify framework of
RANSAC by adopting a more parallel approach, along the lines of preemptive
RANSAC [65]. However, while preemptive RANSAC makes the limiting assumption
that a good estimate of the inlier ratio is available beforehand, ARRSAC is
capable of efficiently adapting to the contamination level of the data. This
results in high computational savings.
In real-time scenarios, the goal of a robust estimation algorithm is to
deliver the best possible solution within a fixed time budget. While a
parallel, or breadth-first, approach is indeed a natural formulation for
real-time applications, adapting the number of hypotheses to the
contamination level of the data requires an estimate of the inlier ratio,
which necessitates the adoption of a depth-first scan. The ARRSAC framework
retains the benefits of both approaches (i.e., bounded run-time as well as
adaptivity) by operating in a partially depth-first manner, as detailed in
Algorithm 2.
The performance of ARRSAC was evaluated in detail on synthetic and real
data, over a wide range of inlier ratios and dataset sizes. From the results
in [64], it can be seen that the ARRSAC approach produces significant
computational speed-ups, while simultaneously providing accurate robust
estimation in real time. For the case of epipolar geometry estimation, for
instance, ARRSAC operates at estimation speeds between 55 and 350 Hz, which
represents speedups of 11.1× to 305.7× over standard RANSAC.
Algorithm 2 The ARRSAC algorithm
Partition the data into k equal-size random partitions
for all k partitions do
  if #hypotheses less than required then
    Non-uniform sampling for hypothesis generation
  end if
  Hypothesis evaluation using the Sequential Probability Ratio Test (SPRT) [66]
  Perform local optimization to generate additional hypotheses
  Use surviving hypotheses to provide an estimate of the inlier ratio
  Use inlier ratio estimate to determine the number of hypotheses needed
  for all hypotheses do
    if SPRT(hypothesis) is valid then
      Update inlier ratio estimate based on the current best hypothesis
    end if
  end for
end for
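The key adaptivity can be illustrated on a toy problem. The snippet below is
a deliberately simplified sketch of the idea rather than ARRSAC itself: it
is a plain RANSAC for 2-D line fitting whose hypothesis budget is re-derived
from the current best inlier ratio, and it omits ARRSAC's partitioning, SPRT
evaluation, and local optimization.

```python
import numpy as np

def adaptive_ransac_line(pts, thresh=0.05, confidence=0.99, max_hyp=1000, seed=0):
    """Fit a 2-D line robustly; the required hypothesis count is updated
    from the current best inlier ratio w via N = log(1-p) / log(1 - w^2)."""
    rng = np.random.default_rng(seed)
    best_mask, n_req, i = None, max_hyp, 0
    while i < min(n_req, max_hyp):
        a, b = pts[rng.choice(len(pts), 2, replace=False)]  # minimal sample
        n = np.array([a[1] - b[1], b[0] - a[0]])            # line normal
        n = n / (np.linalg.norm(n) + 1e-12)
        mask = np.abs((pts - a) @ n) < thresh               # point-line distances
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask = mask
            w = min(mask.sum() / len(pts), 0.999)           # inlier-ratio estimate
            n_req = np.log(1 - confidence) / np.log(1 - w ** 2 + 1e-12)
        i += 1
    return best_mask
```

As the best model improves, w rises and n_req collapses from the worst-case
budget to a handful of hypotheses, which is the behavior that lets an
adaptive sampler stop early on clean data while still coping with heavily
contaminated data.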
4.1.3. Camera Pose Estimation through Fusion

In some instances we may have not only image or video data but also
geo-location or orientation information. This additional information can be
derived from global positioning system (GPS) sensors as well as inertial
navigation systems (INS) containing accelerometers and gyroscopes. We can
combine data from these other sensors with image correspondences to get a
more accurate pose (rotation and translation) than we might get with vision
or geo-location/orientation alone.

In order to fuse image correspondences with other types of sensor
measurements, we must normalize all of the measurements so that they have
comparable units. For example, without this normalization one cannot compare
pixels of reprojection error with meters of error relative to a GPS
measurement. Our sensor fusion approach, first described in [67], fuses
measurements from a GPS/INS unit with image correspondences using an
extended Kalman filter (EKF). Our EKF maintains a state comprising the
current camera pose, velocity and rotation rate, as well as 3D features
corresponding to features extracted and tracked from image to image in the
video. The EKF uses a smooth motion model to predict the next camera pose.
The system then updates the current camera pose and 3D feature estimates
based on the difference between the predicted and measured feature positions
and the GPS/INS measurements.
These corrections are weighted both by the EKF’s estimate of its certainty<br />
of the camera pose <strong>and</strong> 3D points (its covariance matrix) <strong>and</strong> a measurement<br />
error model for the sensors. For the feature measurements we assume the<br />
presence of additive Gaussian noise. For the GPS/INS measurements we<br />
assume a Gaussian distribution of rotation errors <strong>and</strong> use a constant bias<br />
model for the position error. The Gaussian distribution of rotation error is<br />
justified by the fact that the GPS/INS we use gives us post processed data<br />
with extremely high accuracy <strong>and</strong> precision. However, the position error was<br />
seen to occasionally grow when the vehicle was stationary. This might be due<br />
to multipath error on the GPS signals because we were working in an urban<br />
environment surrounded by two to three level buildings. Using the constant<br />
bias model for position error the image features could keep the estimate of<br />
the vehicle position stationary while the vehicle was truly not moving even<br />
17
as the GPS/INS reported that the vehicle moved up 10 cm in one example.<br />
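The predict/update cycle above can be sketched with a minimal Kalman filter. This is a deliberately simplified, linear stand-in for the paper’s EKF: the state here is just 1D position and velocity under a constant-velocity motion model, the measurement is a GPS-like position reading, and all numerical values are assumptions for illustration.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Propagate the state with the smooth (constant-velocity) motion model."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def kf_update(x, P, z, H, R):
    """Correct the prediction, weighted by state covariance P and sensor noise R."""
    y = z - H @ x                      # innovation: measured minus predicted
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
Q = 1e-4 * np.eye(2)                    # process noise
H = np.array([[1.0, 0.0]])              # GPS-like sensor observes position only
R = np.array([[0.25]])                  # additive Gaussian measurement noise

x = np.array([0.0, 0.0])                # initial position and velocity estimate
P = np.eye(2)
for t in range(1, 51):                  # vehicle moving at 1 unit/s
    x, P = kf_predict(x, P, F, Q)
    z = np.array([t * dt * 1.0])        # noise-free measurement for the sketch
    x, P = kf_update(x, P, z, H, R)

print(round(x[0], 2), round(x[1], 2))   # estimate approaches position 5, velocity 1
```

The weighting described in the text appears here as the gain K, which balances the filter’s own covariance P against the sensor error model R.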
4.2. Global Correspondences<br />
Since our system creates the camera registration sequentially from the input frames, the obtained registrations are always subject to drift: each small inaccuracy in the motion estimation propagates forward, so the absolute positions and motions become inaccurate. It is therefore necessary to perform a global optimization step afterwards to remove the drift. This requires constraints that are capable of removing drift. Such constraints can come from global pose measurements like GPS, but it is more interesting to exploit internal consistencies such as loops and intersections of the camera path. In this section we therefore discuss solutions to the challenging task of detecting loops and intersections and using them for global optimization.
Registering the camera with respect to the previously estimated path provides an estimate of the accumulated drift error. For robustness, the detection of path self-intersections can rely only on the views themselves and not on the estimated camera motion, which can drift without bound. Our method determines path intersections by evaluating the similarity of salient image features in the current frame to the features in all previous views. We found that SIFT features [58] serve well as salient features for our purposes. For fast computation of SIFT features we use our SiftGPU 2, which can, for example, extract SIFT features at 16 Hz from 1024 × 768 images on an NVidia GTX 280. To enhance robustness for the video streams we could use local geometry (described in Section 5) to employ our view-invariant VIP features [68].
2 Available online: http://cs.unc.edu/∼ccwu/siftgpu/
To avoid a computationally prohibitive exhaustive search over all other frames, we use a vocabulary tree [69] to find a small set of potentially corresponding views. The vocabulary tree provides computationally efficient indexing of the set of previous views: it maps the feature space to integers called visual words, and the union of the visual words determines the potentially corresponding images through an inverted file.
Determining the visual words for the features extracted from the query image requires a traversal of the vocabulary tree for each feature in the current view, including a number of comparisons between the query feature and the node descriptors. The features from the query image are handled independently, hence the tree traversal can be performed in parallel for each feature. To optimize performance we employ an in-house CUDA-based approach executed on the GPU for faster determination of the respective visual words. The GPU delivers a speed-up of about 20 on a GeForce GTX 280 versus a CPU implementation executed on an Intel Pentium D 3.2 GHz. This allows us to perform more descriptor comparisons than in [69]; i.e., a deeper tree with a smaller branching factor can be replaced by a shallower tree with a significantly higher number of branches. As pointed out in [70], a broader tree yields a more uniform and therefore more representative sampling of the high-dimensional descriptor space. To further increase the performance of the global correspondence search described above, we use our recently proposed index compression method [71].
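The visual-word lookup and inverted file can be sketched as follows. This is a toy, hand-built two-level tree over 2-D “descriptors” (the real system quantizes 128-D SIFT descriptors with a trained tree on the GPU); the node centers and the tiny image database are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

branch = 2
level0 = np.array([[0.0, 0.0], [10.0, 10.0]])      # children of the root
level1 = np.array([[[-1.0, -1.0], [1.0, 1.0]],     # children of node 0
                   [[9.0, 9.0], [11.0, 11.0]]])    # children of node 1

def visual_word(desc):
    """Traverse the tree: pick the nearest child at each level; leaf index = word."""
    i = int(np.argmin(np.linalg.norm(level0 - desc, axis=1)))
    j = int(np.argmin(np.linalg.norm(level1[i] - desc, axis=1)))
    return i * branch + j                          # leaf id in [0, 4)

# Build the inverted file for a small database of previous views.
database = {
    "img_a": [np.array([0.9, 1.1]), np.array([-0.8, -1.2])],
    "img_b": [np.array([9.2, 8.8])],
}
inverted = defaultdict(set)
for name, descs in database.items():
    for d in descs:
        inverted[visual_word(d)].add(name)

# Query: the union of the query's visual words yields the candidate images.
query = [np.array([1.0, 0.9])]
candidates = set().union(*(inverted[visual_word(d)] for d in query))
print(sorted(candidates))   # candidate views passed on to geometric verification
```

Because each query feature traverses the tree independently, exactly this loop is what the CUDA implementation parallelizes per feature.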
Since the vocabulary tree only delivers a list of previous views potentially overlapping with the current view, we perform a geometric verification of the delivered result. First we generate putative matches on the GPU. Afterwards the potential correspondences are tested for a valid two-view relationship between the views through ARRSAC from Section 4.1.2.
4.3. Bundle Adjustment<br />
As the initial sparse model is subject to drift due to the incremental nature of the estimation process, the reprojection error of the global correspondences is typically higher than that of the local correspondences once drift has occurred. In order to obtain results with the highest possible accuracy, an additional refinement procedure, generally referred to as bundle adjustment, is necessary. Bundle adjustment is a non-linear optimization that jointly refines the camera calibration, the camera registration, and the 3D points. It incorporates all available knowledge (initial camera parameters and poses, image correspondences, and optionally other known applicable constraints) and minimizes a global error criterion over a large set of adjustable parameters, in particular the camera poses, the 3D structure, and optionally the intrinsic calibration parameters. In general, bundle adjustment delivers 3D models with sub-pixel accuracy for our system; e.g., an initial mean reprojection error on the order of pixels is typically reduced to 0.2 or even 0.1 pixels. Thus the estimated two-view geometry is better aligned with the actual image features, and dense 3D models show significantly higher precision. The major challenges of bundle adjustment are the huge number of variables in the optimization problem and (to a lesser extent) the non-linearity and non-convexity of the objective function. The first issue is addressed by sparse matrix representations and by appropriate numerical methods. Sparse techniques are still an active research topic in the computer vision and photogrammetry community [72, 73, 74, 75]. Our implementation of sparse bundle adjustment 3 largely follows [72], utilizing sparse Cholesky decomposition in combination with a suitable column reordering scheme [76]. For large data sets this approach is considerably faster than bundle adjustment implementations using dense matrix factorization after applying the Schur complement method (e.g. [77]).
Our implementation requires less than 20 min to perform full bundle adjustment on a 1700-image data set (which amounts to 0.7 s per image). For video sequences, the bundle adjustment can be performed incrementally, adjusting only the most recently computed poses with respect to the existing, already-adjusted camera path. This allows bundle adjustment to run in real time, albeit with slightly higher error than a full offline bundle adjustment. This bundle adjustment technique is also used extensively by our system for processing photo collections, as described hereafter.
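The core idea of bundle adjustment, minimizing reprojection error by non-linear least squares, can be sketched in a structure-only form: refining a single 3D point against fixed pinhole cameras with Gauss-Newton. The full system additionally refines poses and intrinsics and relies on sparse Cholesky solvers; the camera model, focal length, and synthetic data below are assumptions for illustration.

```python
import numpy as np

f = 500.0                                    # assumed focal length in pixels
centers = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]])

def project(X, C):
    """Pinhole projection; all cameras share the identity rotation."""
    d = X - C
    return f * d[:2] / d[2]

X_true = np.array([0.5, 0.5, 5.0])
obs = np.array([project(X_true, C) for C in centers])   # observed features

def residuals(X):
    """Stacked reprojection errors of X over all views."""
    return np.concatenate([project(X, C) - o for C, o in zip(centers, obs)])

X = X_true + np.array([0.2, -0.3, 0.5])      # drifted initial estimate
for _ in range(10):                          # Gauss-Newton iterations
    r = residuals(X)
    J = np.zeros((len(r), 3))
    eps = 1e-6
    for k in range(3):                       # numerical Jacobian, column by column
        dX = np.zeros(3); dX[k] = eps
        J[:, k] = (residuals(X + dX) - r) / eps
    X = X - np.linalg.solve(J.T @ J, J.T @ r)

rms = np.sqrt(np.mean(residuals(X) ** 2))
print(round(rms, 4))                         # reprojection error driven toward zero
```

The normal equations J.T J dX = -J.T r solved here are exactly the system that, at full scale, becomes sparse and motivates the Cholesky machinery discussed above.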
4.4. Camera Pose <strong>from</strong> Image Collections<br />
This section extends the methods described in Section 4.1 to accommodate Internet photo collections. The main difference to the video streams is the unknown initial relationship between the images: for video, the temporal order implies a spatial relationship between the frames, whereas photo collections downloaded from the Internet do not have any order. In fact, these collections are typically highly
contaminated with outliers. Hence, in contrast to video, we first need to establish the spatial relationship between the images.
3 Available in source at http://cs.unc.edu/∼cmzach/opensource.html
In principle we can use
feature-based techniques similar to the methods of Philbin et al. and Zheng et al. [16, 17] to detect related frames within the photo collection. These methods use computationally very intensive indexing based on local image features (keypoints), followed by loose spatial verification; their spatial constraints are given by a 2D affine transformation or by filters based on the proximity of local features. Our method, in contrast, effectively enforces 3D structure-from-motion (SfM) constraints on the dataset. Similarly, methods like [51, 52] enforce 3D SfM constraints for the full set of registered frames; they first exhaustively evaluate all possible pairs for a valid epipolar geometry and then enforce the stronger multi-view geometry constraints. Our method avoids the prohibitively expensive exhaustive pairwise matching by using an initial stage in which images are grouped using global image features prior to indexing based on local features. This yields a large gain in efficiency and an improved ability to group similar compositions, mainly corresponding to similar viewpoints. We then enforce the SfM constraints on these groups, which reduces the complexity of the computation by orders of magnitude. Snavely et al. [54] also reduced the complexity of enforcing the SfM constraints by computing them only for a minimal set of image pairs expected to yield the same overall reconstruction.
4.4.1. Efficiently Finding Corresponding Images in <strong>Photo</strong> Collections<br />
To efficiently identify related images, our system uses the gist feature [78], which encodes the spatial layout and perceptual properties of the image. The gist feature has been found effective for grouping images by perceptual similarity and for retrieving structurally similar scenes [79, 80]. To achieve high computational performance we developed a highly parallel gist feature extraction on the GPU. It derives a gist descriptor for each image as the concatenation of two independently computed sub-vectors. The first sub-vector is computed by convolving a grayscale version of the image, downsampled to 128 × 128 pixels, with Gabor filters at 3 different scales (with 8, 8 and 4 orientations for the three scales, respectively). The filter responses are aggregated to a 4 × 4 spatial resolution by downsampling each convolution to a 4 × 4 patch, and the results are concatenated, yielding a 320-dimensional vector. In addition, we augment this gist descriptor with color information, consisting of a subsampled L*a*b* image at 4 × 4 spatial resolution. We thus obtain a 368-dimensional vector as the representation of each image in the dataset. The GPU implementation improves the computation time by a factor of 100 compared to a CPU implementation. For detailed timings of the gist computation, see Table 1.
In the next step we use the fact that photos taken from nearby viewpoints with similar camera orientation have similar gist descriptors. Hence, to identify viewpoint clusters, we apply k-means clustering to the gist descriptors. At this point we aim for an over-segmentation, since that best reduces our computational complexity. We empirically found that searching for 10% as many clusters as there are images yields a sufficient over-segmentation. As shown in Table 1, the clustering of the gist descriptors can be executed very efficiently. This is key to the computational efficiency of our system, since this early grouping allows us to limit all further geometric verification, avoiding an exhaustive search over the whole dataset.
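The clustering step can be sketched with a plain k-means over descriptor vectors, using k equal to 10% of the number of images as described above. The 368-D random vectors standing in for real gist descriptors, and the deterministic initialization, are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    # deterministic init: evenly spaced data points as initial centers
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each descriptor to its nearest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):                      # recompute the centers
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two synthetic "viewpoints": descriptors concentrated around two means.
gist = np.vstack([rng.normal(0.0, 0.05, (30, 368)),
                  rng.normal(1.0, 0.05, (30, 368))])
k = max(1, len(gist) // 10)                     # 10% as many clusters as images
labels, centers = kmeans(gist, k)
print(k, len(set(labels.tolist())))
```

With k larger than the number of true viewpoints, the result is exactly the intended over-segmentation: each cluster stays within one viewpoint, and several clusters may share one.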
The clustering successfully identifies the popular viewpoints, although it is sensitive to image variations such as clutter (for example, people in front of the camera), lighting conditions, and camera zoom. We found that the large clusters are typically almost outlier free (see Figure 3 for an example), while the smaller clusters experience significantly more noise or belong to the unrelated images in the photo collection. The clusters now define a loose ordering of the dataset in which corresponding images are those in the same cluster. The next step employs this ordering to efficiently identify overlapping images as well as outlier images on a per-cluster basis, using the same techniques (Section 4.1.2) as in the case of video streams.

Figure 3: A large gist cluster for the Statue of Liberty dataset. We show the hundred images closest to the cluster mean. The effectiveness of the low-dimensional gist representation can be gauged by noting that even without enforcing geometric consistency, these clusters display a remarkable degree of structural similarity.
4.4.2. Iconic Image Selection
While the clusters define a local relationship, our method still needs to establish a global relationship between the photos. To facilitate an efficient global registration we further summarize the dataset. The objective of the next step is to find a representative image (the iconic) for each cluster of views; this iconic is later used for efficient global registration. The second objective is to enforce a valid two-view geometry with respect to the iconic for each image in the cluster, in order to remove unrelated images.
Given the typically much larger feature motion in photo collections, we use SIFT features [58], as in the global correspondence computation of Section 4.2. To establish correspondences for each considered pair, putative feature matching is performed on the GPU. The feature matching process can be structured as a multiplication of large, dense matrices. Our approach thus consists of a call to the dense matrix multiplication of the CUBLAS 4 library, with subsequent instructions to apply the distance ratio test [58].
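The matching-as-matrix-multiplication idea can be sketched in a few lines: for unit-norm descriptors, one dense product gives all pairwise similarities, from which Euclidean distances and the distance ratio test [58] follow. The 8-D descriptors and the 0.8 ratio threshold are assumptions of this sketch (real SIFT descriptors are 128-D, and the product runs in CUBLAS on the GPU).

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(D):
    return D / np.linalg.norm(D, axis=1, keepdims=True)

def match(D1, D2, ratio=0.8):
    sim = D1 @ D2.T                           # the dense matrix multiplication
    dist = np.sqrt(np.maximum(2.0 - 2.0 * sim, 0.0))   # ||a-b|| for unit vectors
    matches = []
    for i in range(len(D1)):
        order = np.argsort(dist[i])
        best, second = dist[i, order[0]], dist[i, order[1]]
        if best < ratio * second:             # Lowe's distance ratio test
            matches.append((i, int(order[0])))
    return matches

D2 = normalize(rng.normal(size=(50, 8)))                       # database descriptors
D1 = normalize(D2[:10] + rng.normal(scale=0.01, size=(10, 8))) # 10 true matches
matches = match(D1, D2)
print(matches[:3])
```

The expensive part is the single `D1 @ D2.T` product, which is why the whole step maps so naturally onto a GPU gemm call.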
Next we use our ARRSAC algorithm from Section 4.1.2 to validate the two-view relationship between the views. To achieve robustness to degenerate data we combine ARRSAC with our RANSAC-based robust model selection method QDEGSAC [81].
Our method first finds the triplet of images with mutually valid two-view relationships whose gist descriptors are closest to the gist cluster center. The image with the highest number of inliers in this set is selected as the iconic image of the cluster. Then all remaining images in the cluster are tested for a valid two-view relationship to the iconic. Hence, after this geometric verification, each cluster only contains images showing the same scene, and each cluster is visually represented by its iconic image.
Typically there is a significant number of images that are not geometrically consistent with the other images in their cluster, or that are in clusters that are not geometrically consistent. Through evaluation we found that these rejected images still contain a significant number of images related to the scene of interest. To register these images we perform a re-clustering using the clustering from Section 4.4.1, followed by the above geometric verification. Afterwards, all still unregistered images are tested against the five iconic images nearest in gist space. Using our labeling of subsets of the datasets, we verified that we typically register well over 90% of the frames belonging to the scene of interest.
4 http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf
4.4.3. Iconic Image Registration<br />
After using the clusters to establish a local order and to remove unrelated images, the next step aims at establishing a global relationship between the images. For efficiency we use the small number of iconic images to bootstrap the global registration of all images. We first compute the exhaustive pairwise two-view relationships between all iconics using ARRSAC from Section 4.1.2. This defines a graph of pairwise global relationships between the different iconics, which we call the iconic scene graph. Each edge of the graph carries a weight corresponding to the number of mutual matches between the images. This graph represents the global registration of the iconics for the dataset, based on the 2D constraints given by the epipolar geometry or the homography between the views.
Next our system performs an incremental structure-from-motion process on the iconic scene graph to obtain a registration of the iconic images. First an initial image pair is selected that has a sufficiently accurate registration, as defined by the roundness of the uncertainty ellipsoids of the triangulated 3D points [82]. Then the remaining iconics are added, each time selecting the image with the next strongest set of correspondences, until no further iconic can be added to the 3D model. This yields the 3D sub-model for the component of the iconic scene graph containing the initially selected pair. Each completed sub-model is refined by bundle adjustment as described in Section 4.3.
Typically the iconic scene graph contains multiple independent or weakly connected components. These components correspond, for example, to parts of the scene that have no visual connection to the other scene parts, to interior parts of the model, or to other scenes that are consistently mislabeled (an example is Ellis Island, which is often also labeled as “Statue of Liberty”; see Figure 4). Hence we repeat the 3D modeling until all images are registered or none of the remaining images has any connection to any of the 3D sub-models.
To obtain more complete 3D sub-models than is possible by registering the iconics alone, our system afterwards searches for images in the clusters that support a match between the iconics of two clusters. For efficiency we only test, for each image in a cluster, whether it matches the iconic of the other cluster. If the match to the target iconic is sufficiently strong (empirically determined as more than 18 features), the matching information of this image is stored. Once a sufficient number of connections has been identified, our method uses all constraints provided by the matches to merge the two sub-models to which the clusters belong. The merging is performed by estimating the similarity transformation between the sub-models. This merging process is continued until no further merges are possible.
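The sub-model merging step rests on estimating a similarity transformation from corresponding 3D points. Below is a closed-form sketch using the standard SVD-based (Umeyama/Procrustes) solution; the paper does not specify its solver, and the synthetic point sets and transform are assumptions for illustration.

```python
import numpy as np

def similarity_transform(A, B):
    """Return s, R, t with B ≈ s * R @ A + t, for 3xN corresponding points A, B."""
    ca, cb = A.mean(axis=1, keepdims=True), B.mean(axis=1, keepdims=True)
    A0, B0 = A - ca, B - cb                       # centered point sets
    U, S, Vt = np.linalg.svd(B0 @ A0.T)
    d = np.sign(np.linalg.det(U @ Vt))            # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                                # optimal rotation
    s = np.trace(np.diag(S) @ D) / (A0 ** 2).sum()  # optimal scale
    t = cb - s * R @ ca                           # optimal translation
    return s, R, t

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 20))                      # points shared by sub-model 1
angle = 0.5
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
B = 2.0 * R_true @ A + np.array([[1.0], [2.0], [3.0]])  # same points in sub-model 2

s, R, t = similarity_transform(A, B)
err = np.abs(s * R @ A + t - B).max()
print(round(s, 3), round(err, 6))
```

Seven degrees of freedom (scale, rotation, translation) are recovered here, which is exactly what aligning two independently reconstructed, gauge-free sub-models requires.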
4.4.4. Cluster Camera Registration<br />
After the registration of the iconic images we have a valid global registration for the images of the iconic scene graph. In the last step we aim at extending the registration to all images in the clusters corresponding to the registered iconics. This process takes advantage of the feature matches between the non-iconic images and their respective iconics: it uses the 2D matches to the iconic to determine 2D-3D correspondences for the image, and then applies ARRSAC (Section 4.1.2) to determine the camera’s registration. Given the incremental image registration, drift inherently accumulates during this process. Hence we periodically apply bundle adjustment to perform a global error correction. In our current system, bundle adjustment is performed after every 25 images added to a sub-model. This saves significant computational resources while in practice keeping the accumulated error at a reasonable level. Detailed results of our 3D reconstruction algorithm are shown in Figures 4-5.

Figure 4: Left: view of the 3D model and the 9025 registered cameras. Right: model from the interior of Ellis Island erroneously labeled as “Statue of Liberty”.
5. Dense Geometry<br />
The above described methods for video and for photo collections provide a registration of the camera poses of all video frames or photographs. These external and internal camera calibrations are the output of the sparse reconstruction step of both systems. For photo collections, they allow the generation of an image-based browsing experience as shown by the “Photo Tourism” paper [51, 52]. The camera registration can also be used as input to dense stereo methods. The method of [83] can extract dense scene geometry, but with significant computational effort and therefore only for small scenes. Next we introduce our real-time stereo system for video streams, capable of reconstructing large scenes.

Figure 5: Left: 3D reconstruction of the San Marco dataset with 10338 cameras. Right: reconstruction from the Notre Dame dataset with 1300 registered cameras.

Dataset | Gist  | Gist clustering | Geometric | Re-clustering | registered views | total time
Liberty | 4 min | 21 min          | 56 min    | 3.3 hr        | 13888            | 4 hr 42 min
SM      | 4 min | 19 min          | 24 min    | 2.76 hr       | 12253            | 3 hr 34 min
ND      | 1 min | 3 min           | 22 min    | 1.0 hr        | 3058             | 1 hr 28 min

Table 1: Computation times for the photo collection reconstruction for the Statue of Liberty dataset, the San Marco dataset (SM) and the Notre Dame dataset (ND).
We adopt a stereo/fusion approach for dense geometry reconstruction. For every frame of video, a depthmap is generated using a real-time GPU-based multi-view planesweep stereo algorithm. The planesweep algorithm, comprised primarily of image warping operations, is highly efficient on the GPU [62]. Because it is multi-view, it is quite robust and can efficiently handle occlusions, a major problem in stereo. The stereo depthmaps are then fused using a visibility-based depthmap fusion method, which also runs on the GPU. The fusion method removes outliers from the depthmaps and combines depth estimates to enhance their accuracy. Because of the redundancy in video (each surface is imaged multiple times), fused depthmaps only need to be produced for a subset of the video frames. This reduces processing time as well as the 3D model size, as discussed in Section 6.
The planesweep stereo algorithm [36] computes a depthmap by testing a family of plane hypotheses and, for each depthmap pixel, recording the distance to the plane with the best photoconsistency score. One view is designated as the reference view, and a depthmap is computed for that view; the remaining views are designated matching views. For each plane, all matching views are back-projected onto the plane and then projected into the reference view. This mapping, known as a plane homography [84], can be performed very efficiently on the GPU. Once the matching views are warped, the photoconsistency score is computed; we use a sum of absolute differences (SAD) measure. To handle occlusions, the matching views are divided into a left and a right subset, and the better matching score of the two subsets is kept [85]. Before the difference is computed, the image intensities are multiplied by the relative exposure (gain) computed during KLT tracking (Section 4.1.1). The stereo algorithm can produce a 512 × 384 depthmap from 11 images (10 matching, 1 reference) and 48 plane hypotheses at a rate of 42 Hz on an Nvidia GTX 280.
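A drastically simplified plane sweep can be sketched for the special case of purely horizontal camera motion and fronto-parallel planes, where each plane hypothesis reduces to an integer disparity shift. The real system warps matching views by full plane homographies on the GPU and aggregates SAD over many views; the synthetic images and the per-pixel (1 × 1 window) cost below are assumptions of this sketch.

```python
import numpy as np

def planesweep(ref, match, disparities):
    """For each hypothesis, 'warp' the matching view and keep the best cost per pixel."""
    best_cost = np.full(ref.shape, np.inf)
    best_disp = np.zeros(ref.shape, dtype=int)
    for d in disparities:
        warped = np.roll(match, -d, axis=1)   # the warp for this hypothesis
        cost = np.abs(ref - warped)           # per-pixel absolute difference (SAD)
        better = cost < best_cost
        best_disp[better] = d                 # record the winning hypothesis
        best_cost[better] = cost[better]
    return best_disp

rng = np.random.default_rng(3)
ref = rng.random((32, 64))                    # synthetic reference image
true_d = 4
match = np.roll(ref, true_d, axis=1)          # matching view shifted by 4 pixels
disp = planesweep(ref, match, range(8))
print(int(np.median(disp)))
```

Each hypothesis costs one warp plus one per-pixel difference, which is why the full algorithm maps so well onto GPU image-warping hardware.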
After the depthmaps have been computed, a fused depthmap is computed for a reference view using the surrounding stereo depthmaps [86]. All points from the depthmaps are projected into the reference view, and for each pixel the point with the highest stereo confidence is chosen. The stereo confidence is computed from the matching scores of the planesweep:

C(x) = ( Σ_{π≠π0} exp(−(SAD(x, π) − SAD(x, π0))²) )^(−1)    (1)

where x is the pixel in question, π is a plane from the planesweep, and π0 is the best plane chosen by the planesweep. This confidence measure prefers depth estimates with low uncertainty, where the best matching score is much lower than the scores for all other planes. After the most confident point has been chosen for each pixel, it is scored according to mutual support, free-space violations, and occlusions. Supporting points are those that fall within 5% of the chosen point’s depth value. Occlusions are caused by points that would make the chosen point invisible in the reference view, and free-space violations are points that are made invisible in some other view by the chosen point. If the number of occlusions and free-space violations exceeds the number of supporting points, the chosen point is discarded and the depth value is marked as invalid. If the point survives, it is replaced with a new point computed as the average of the chosen point and its supporting points. See Figure 6.
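One way to read the confidence measure of Eq. (1), so that a sharply peaked cost curve scores higher than an ambiguous one, is as the inverse of the summed exponential terms over all non-best planes. The sketch below applies this reading to two hand-made SAD cost vectors, which are assumptions for illustration.

```python
import numpy as np

def confidence(costs):
    """Inverse of the summed exp(-(SAD(x, pi) - SAD(x, pi0))^2) over pi != pi0."""
    i0 = int(np.argmin(costs))            # best plane pi0
    others = np.delete(costs, i0)         # all remaining plane hypotheses
    return 1.0 / np.sum(np.exp(-(others - costs[i0]) ** 2))

sharp = np.array([0.9, 0.8, 0.05, 0.85, 0.9])   # one clear cost minimum
flat = np.array([0.5, 0.45, 0.4, 0.45, 0.5])    # ambiguous cost curve
print(confidence(sharp), confidence(flat))
```

When the best cost is far below all others, every exponential term is small, the sum shrinks, and the confidence rises, matching the stated preference for low-uncertainty depth estimates.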
Currently our dense geometry method is used only for video. The temporal ordering of the video makes it easy to select nearby views for stereo matching and depthmap fusion. For photo collections, a view selection strategy would be needed that identifies views with similar content and compatible appearance (day, night, summer, winter, etc.). This is left as future work.

Figure 6: Our depthmap fusion method combines multiple stereo depthmaps (the middle image shows an example depthmap) to produce a single fused depthmap (shown on the right) with greater accuracy and less redundancy.
6. Model Extraction<br />
Once a fused depthmap has been computed, the next step is to convert it into a texture-mapped 3D polygonal mesh. This can be done trivially by overlaying the depthmap image with a quadrilateral mesh and assigning each vertex the 3D position given by the depthmap. The mesh can likewise be trivially texture-mapped with the color image corresponding to the depthmap. However, a number of improvements can be made.
First, depth discontinuities must be detected to prevent the mesh from joining discontinuous foreground and background objects. We simply threshold the magnitude of the depth gradient to decide whether a mesh edge may be created between two vertices. Afterwards, connected components with a small number of vertices are discarded.
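The discontinuity test can be sketched as a simple threshold on the discrete depth gradient: an edge between neighboring vertices is suppressed where the depth difference is too large. The depthmap and the threshold value below are assumptions of this sketch.

```python
import numpy as np

def horizontal_edges(depth, thresh=0.5):
    """Boolean mask: True where vertex (i, j) may connect to vertex (i, j+1)."""
    return np.abs(np.diff(depth, axis=1)) < thresh

depth = np.ones((4, 6))
depth[:, 3:] = 5.0                 # foreground/background step at column 3
edges = horizontal_edges(depth)
print(edges[0])                    # the edge across the step is suppressed
```

The same test applied along axis 0 handles vertical edges; together the two masks decide which quadrilaterals of the mesh survive.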
Second, surfaces that have already been reconstructed in previous fused depthmaps are discarded. The surface reconstructed from the previous fused depthmap in the video sequence is projected into the current depthmap. Any vertices that fall within a threshold of the previous surface are marked as invalid. Their neighboring vertices are then adjusted to coincide with the edge of the previous surface to prevent cracks.

Figure 7: Models are extracted from the fused depthmaps. Left to right: original mesh, simplified mesh, textured mesh.
Third, the mesh is simplified by hierarchically merging neighboring quadrilaterals that are nearly co-planar. In practice we use a top-down approach: we start with a single quadrilateral spanning the whole depthmap. If the distance from any of its contained points to the quadrilateral exceeds a threshold, the quadrilateral is split. This is repeated recursively until every quadrilateral faithfully represents its underlying points. The split is performed along the longer direction, or at random if the quadrilateral is square. This method represents planar and quasi-planar surfaces with fewer polygons and greatly reduces the size of the output models. See Figure 8.
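The top-down splitting can be sketched as a recursive quadtree over the depthmap: fit a plane to the samples a quad covers and split along its longer side while any sample deviates by more than a threshold. The least-squares plane fit, the deterministic tie-break for square quads, and the threshold value are assumptions of this sketch.

```python
import numpy as np

def fit_plane_residual(depth, r0, r1, c0, c1):
    """Fit z = a*x + b*y + c to the quad's samples; return the worst deviation."""
    ys, xs = np.mgrid[r0:r1, c0:c1]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    z = depth[r0:r1, c0:c1].ravel()
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return np.abs(A @ coef - z).max()

def split(depth, r0, r1, c0, c1, thresh, quads):
    # stop when the quad fits its points, or when it cannot shrink further
    if fit_plane_residual(depth, r0, r1, c0, c1) <= thresh or \
       (r1 - r0 <= 2 and c1 - c0 <= 2):
        quads.append((r0, r1, c0, c1))
        return
    if (c1 - c0) >= (r1 - r0):                # split along the longer side
        cm = (c0 + c1) // 2
        split(depth, r0, r1, c0, cm, thresh, quads)
        split(depth, r0, r1, cm, c1, thresh, quads)
    else:
        rm = (r0 + r1) // 2
        split(depth, r0, rm, c0, c1, thresh, quads)
        split(depth, rm, r1, c0, c1, thresh, quads)

ys, xs = np.mgrid[0:16, 0:16]
planar = 0.1 * xs + 0.2 * ys                  # perfectly planar depthmap
quads = []
split(planar, 0, 16, 0, 16, 0.01, quads)
print(len(quads))                             # a single quad represents the plane
```

On a planar region the recursion terminates immediately with one quad, which is exactly the polygon saving the text describes for quasi-planar surfaces.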
Figure 8: Several 3D models reconstructed <strong>from</strong> video sequences.<br />
7. Conclusions<br />
In this paper we presented methods for real-time reconstruction from video and for fast reconstruction from Internet photo collections on a single PC. We demonstrated the performance of the methods on a variety of large-scale datasets, proving the computational performance of the proposed methods. We achieved this through the consistent parallelization of all computations involved, enabling execution on the graphics card as a highly parallel processor. Additionally, for modeling from Internet photo collections we combine recognition constraints with geometric constraints, leading to image registration that is orders of magnitude more efficient than any existing system.
Acknowledgements. We would like to acknowledge the DARPA UrbanScape project, NSF Grant IIS-0916829 and other funding from the US government, as well as our collaborators David Nister, Amir Akbarzadeh, Philippos Mordohai, Paul Merrell, Chris Engels, Henrik Stewenius, Brad Talton, Liang Wang, Qingxiong Yang, Ruigang Yang, Greg Welch, Herman Towles, and Xiaowei Li.
[1] A. Fischer, T. Kolbe, F. Lang, A. Cremers, W. Förstner, L. Plümer, V. Steinhage, Extracting buildings from aerial images using hierarchical aggregation in 2D and 3D, Computer Vision and Image Understanding 72 (2) (1998) 185–203.
[2] A. Gruen, X. Wang, CC-Modeler: A topology generator for 3-D city models, ISPRS Journal of Photogrammetry & Remote Sensing 53 (5) (1998) 286–295.
[3] Z. Zhu, A. Hanson, E. Riseman, Generalized parallel-perspective stereo mosaics from airborne video, IEEE Trans. on Pattern Analysis and Machine Intelligence 26 (2) (2004) 226–237.
[4] C. Früh, A. Zakhor, An automated method for large-scale, ground-based city model acquisition, Int. J. of Computer Vision 60 (1) (2004) 5–24.
[5] I. Stamos, P. Allen, Geometry and texture recovery of scenes of large scale, Computer Vision and Image Understanding 88 (2) (2002) 94–118.
[6] S. El-Hakim, J.-A. Beraldin, M. Picard, A. Vettore, Effective 3D modeling of heritage sites, in: 4th International Conference on 3D Imaging and Modeling, 2003, pp. 302–309.
[7] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
[8] R. I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd Edition, Cambridge University Press, 2004.
[9] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Int. Joint Conf. on Artificial Intelligence, 1981, pp. 674–679.
[10] C. Zach, D. Gallup, J. Frahm, Fast gain-adaptive KLT tracking on the GPU, in: CVPR Workshop on Visual Computer Vision on GPUs (CVGPU), 2008.
[11] R. Fergus, P. Perona, A. Zisserman, A visual category filter for Google images, in: ECCV, 2004.
[12] T. Berg, D. Forsyth, Animals on the web, in: CVPR, 2006.
[13] L.-J. Li, G. Wang, L. Fei-Fei, OPTIMOL: automatic object picture collection via incremental model learning, in: CVPR, 2007.
[14] F. Schroff, A. Criminisi, A. Zisserman, Harvesting image databases from the web, in: ICCV, 2007.
[15] B. Collins, J. Deng, L. Kai, L. Fei-Fei, Towards scalable dataset construction: An active learning approach, in: ECCV, 2008.
[16] J. Philbin, A. Zisserman, Object mining using a matching graph on very large image collections, in: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
[17] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, H. Neven, Tour the world: Building a web-scale
l<strong>and</strong>mark recognition engine, in: CVPR, 2009.<br />
[18] P. Beardsley, A. Zisserman, D. Murray, Sequential updating of projective<br />
<strong>and</strong> affine structure <strong>from</strong> motion, Int. J. Computer Vision 23 (3) (1997)<br />
235–259.<br />
[19] A. Fitzgibbon, A. Zisserman, Automatic camera recovery for closed or<br />
open image sequences, in: European Conf. on Computer Vision, 1998,<br />
pp. 311–326.<br />
[20] D. Nistér, An efficient solution to the five-point relative pose problem,<br />
IEEE Trans. on Pattern Analysis <strong>and</strong> Machine Intelligence 26 (6) (2004)<br />
756–777.<br />
37
[21] D. Nistér, O. Naroditsky, J. Bergen, Visual odometry for ground vehicle<br />
applications, Journal of Field Robotics 23 (1).<br />
[22] American Society of <strong>Photo</strong>grammetry, Manual of <strong>Photo</strong>grammetry (5th<br />
edition), Asprs Pubns, 2004.<br />
[23] A. Azarbayejani, A. Pentl<strong>and</strong>, Recursive estimation of motion,<br />
structure, <strong>and</strong> focal length, IEEE Trans. on Pattern<br />
Analysis <strong>and</strong> Machine Intelligence 17 (6) (1995) 562–575.<br />
doi:http://doi.ieeecomputersociety.org/10.1109/34.387503.<br />
[24] S. Soatto, P. Perona, R. Frezza, G. Picci, Recursive motion <strong>and</strong> structure<br />
estimation with complete error characterization, in: Int. Conf. on<br />
Computer Vision <strong>and</strong> Pattern Recognition, 1993, pp. 428–433.<br />
[25] S. Palmer, E. Rosch, P. Chase, Canonical perspective <strong>and</strong> the perception<br />
of objects, Attention <strong>and</strong> Performance IX (1981) 135–151.<br />
[26] V. Blanz, M. Tarr, H. Bulthoff, What object attributes determine canonical<br />
views?, Perception 28 (5) (1999) 575–600.<br />
[27] T. Denton, M. Demirci, J. Abrahamson, A. Shokouf<strong>and</strong>eh, S. Dickinson,<br />
Selecting canonical views for view-based 3-d object recognition, in:<br />
ICPR, 2004, pp. 273–276.<br />
[28] P. Hall, M. Owen, Simple canonical views, in: BMVC, 2005, pp. 839–<br />
848.<br />
[29] T. L. Berg, D. Forsyth, Automatic ranking of iconic images, Tech. rep.,<br />
University of California, Berkeley (2007).<br />
38
[30] Y. Jing, S. Baluja, H. Rowley, Canonical image selection <strong>from</strong> the web,<br />
in: Proceedings of the 6th ACM international Conference on Image <strong>and</strong><br />
<strong>Video</strong> Retrieval, 2007.<br />
[31] I. Simon, N. Snavely, S. M. Seitz, Scene summarization for online image<br />
collections, in: ICCV, 2007.<br />
[32] T. L. Berg, A. C. Berg, Finding iconic images, in: The 2nd <strong>Internet</strong><br />
Vision Workshop at IEEE Conference on Computer Vision <strong>and</strong> Pattern<br />
Recognition, 2009.<br />
[33] R. Raguram, S. Lazebnik, Computing iconic summaries of general visual<br />
concepts, in: Workshop on <strong>Internet</strong> Vision CVPR, 2008.<br />
[34] D. Scharstein, R. Szeliski, A taxonomy <strong>and</strong> evaluation of dense twoframe<br />
stereo correspondence algorithms, Int. J. of Computer Vision<br />
47 (1-3) (2002) 7–42.<br />
[35] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, R. Szeliski, A comparison<br />
<strong>and</strong> evaluation of multi-view stereo reconstruction algorithms,<br />
in: Int. Conf. on Computer Vision <strong>and</strong> Pattern Recognition, 2006, pp.<br />
519–528.<br />
[36] R. Collins, A space-sweep approach to true multi-image matching, in:<br />
Int. Conf. on Computer Vision <strong>and</strong> Pattern Recognition, 1996, pp. 358–<br />
363.<br />
[37] R. Yang, M. Pollefeys, Multi-resolution real-time stereo on commodity<br />
graphics hardware, in: Int. Conf. on Computer Vision <strong>and</strong> Pattern<br />
Recognition, 2003, pp. 211–217.<br />
39
[38] T. Werner, A. Zisserman, New techniques for automated architectural<br />
reconstruction <strong>from</strong> photographs, in: European Conf. on Computer Vision,<br />
2002, pp. 541–555.<br />
[39] P. Burt, L. Wixson, G. Salgian, Electronically directed ”focal” stereo,<br />
in: Int. Conf. on Computer Vision, 1995, pp. 94–101.<br />
[40] S. N. Sinha, D. Steedly, R. Szeliski, Piecewise planar stereo for imagebased<br />
rendering, in: International Conference on Computer Vision<br />
(ICCV), 2009.<br />
[41] Y. Furukawa, B. Curless, S. M. Seitz, , R. Szeliski, Manhattan-world<br />
stereo, in: Computer Vision <strong>and</strong> Pattern Recognition (CVPR), 2009.<br />
[42] G. Turk, M. Levoy, Zippered polygon meshes <strong>from</strong> range images., in:<br />
SIGGRAPH, 1994, pp. 311–318.<br />
[43] M. Soucy, D. Laurendeau, A general surface approach to the integration<br />
of a set of range views, IEEE Trans. on Pattern Analysis <strong>and</strong> Machine<br />
Intelligence 17 (4) (1995) 344–358.<br />
[44] A. Hilton, A. Stoddart, J. Illingworth, T. Windeatt, Reliable surface<br />
reconstruction <strong>from</strong> multiple range images, in: European Conf. on Computer<br />
Vision, 1996, pp. 117–126.<br />
[45] B. Curless, M. Levoy, A volumetric method for building complex models<br />
<strong>from</strong> range images, SIGGRAPH 30 (1996) 303–312.<br />
[46] M. Wheeler, Y. Sato, K. Ikeuchi, Consensus surfaces for modeling 3d<br />
40
objects <strong>from</strong> multiple range images, in: Int. Conf. on Computer Vision,<br />
1998, pp. 917–924.<br />
[47] R. Koch, M. Pollefeys, L. Van Gool, <strong>Robust</strong> calibration <strong>and</strong> 3d geometric<br />
modeling <strong>from</strong> large collections of uncalibrated images, in: DAGM, 1999,<br />
pp. 413–420.<br />
[48] G. Schindler, P. Krishnamurthy, F. Dellaert, Line-based structure <strong>from</strong><br />
motion for urban environments, in: 3DPVT, 2006.<br />
[49] G. Schindler, F. Dellaert, S. Kang, Inferring temporal order of images<br />
<strong>from</strong> 3d structure, in: Int. Conf. on Computer Vision <strong>and</strong> Pattern Recognition,<br />
2007.<br />
[50] N. Cornelis, K. Cornelis, L. Van Gool, <strong>Fast</strong> compact city modeling for<br />
navigation pre-visualization, in: Int. Conf. on Computer Vision <strong>and</strong><br />
Pattern Recognition, 2006.<br />
[51] N. Snavely, S. M. Seitz, R. Szeliski, <strong>Photo</strong> tourism: Exploring photo<br />
collections in 3d, in: SIGGRAPH, 2006, pp. 835–846.<br />
[52] N. Snavely, S. M. Seitz, R. Szeliski, Modeling the world <strong>from</strong> <strong>Internet</strong><br />
photo collections, International Journal of Computer Vision 80 (2)<br />
(2008) 189–210.<br />
[53] K. Ni, D. Steedly, F. Dellaert, Out-of-core bundle adjustment for large<strong>scale</strong><br />
3d reconstruction, in: ICCV, 2007.<br />
[54] N. Snavely, S. M. Seitz, R. Szeliski, Skeletal sets for efficient structure<br />
<strong>from</strong> motion, in: CVPR, 2008.<br />
41
[55] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, R. Szeliski, Building Rome<br />
in a day, in: ICCV, 2009.<br />
[56] S. Arya, D. Mount, N. Netanyahu, R. Silverman, A. Wu, An optimal<br />
algorithm for approximate nearest neighbor searching fixed dimensions,<br />
in: JACM, Vol. 45, 1998, pp. 891–923.<br />
[57] O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, Total recall:<br />
Automatic query expansion with a generative feature model for object<br />
retrieval, in: ICCV, 2007.<br />
[58] D. Lowe, Distinctive image features <strong>from</strong> <strong>scale</strong>-invariant keypoints, Int.<br />
J. of Computer Vision 60 (2) (2004) 91–110.<br />
[59] R. Raguram, J.-M. Frahm, M. Pollefeys, A comparative analysis of<br />
RANSAC techniques leading to adaptive real-time r<strong>and</strong>om sample consensus,<br />
in: ECCV, 2008.<br />
[60] B. Lucas, T. Kanade, An iterative image registration technique with<br />
an application to stereo vision, in: International Joint Conference on<br />
Artificial Intelligence (IJCAI), 1981, pp. 674–679.<br />
URL citeseer.ist.psu.edu/lucas81iterative.html<br />
[61] J. Shi, C. Tomasi, Good features to track, in: IEEE Conference on<br />
Computer Vision <strong>and</strong> Pattern Recognition (CVPR), 1994.<br />
[62] S. Kim, D. Gallup, J.-M. Frahm, A. Akbarzadeh, Q. Yang, R. Yang,<br />
D. Nistér, M. Pollefeys, Gain adaptive real-time stereo streaming, in:<br />
Int. Conf. on Vision Systems, 2007.<br />
42
[63] M. Fischler, R. Bolles, R<strong>and</strong>om sample consensus: A paradigm for model<br />
fitting with applications to image analysis <strong>and</strong> automated cartography,<br />
Comm. of the ACM 24 (6) (1981) 381–395.<br />
[64] R. Raguram, J.-M. Frahm, M. Pollefeys, A comparative analysis of<br />
ransac techniques leading to adaptive real-time r<strong>and</strong>om sample consensus,<br />
in: Proc. ECCV, 2008.<br />
[65] D. Nister, Preemptive RANSAC for live structure <strong>and</strong> motion estimation,<br />
in: Proc. ICCV, Vol. 1, 2003, pp. 199–206.<br />
[66] J. Matas, O. Chum, R<strong>and</strong>omized ransac with sequential probability ratio<br />
test, in: Proc. ICCV, 2005, pp. 1727–1732.<br />
[67] M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai,<br />
B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha,<br />
B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch,<br />
H. Towles, Detailed real-time urban 3d reconstruction <strong>from</strong> video, Int.<br />
Journal of Computer Vision special issue on Modeling <strong>Large</strong>-Scale 3D<br />
Scenes.<br />
[68] C. Wu, F. Fraundorfer, J.-M. Frahm, M. Pollefeys, 3d model<br />
search <strong>and</strong> pose estimation <strong>from</strong> single images using vip features,<br />
Computer Vision <strong>and</strong> Pattern Recognition Workshop 0 (2008) 1–8.<br />
doi:http://doi.ieeecomputersociety.org/10.1109/CVPRW.2008.4563037.<br />
[69] D. Nister, H. Stewenius, Scalable recognition with a vocabulary<br />
tree, in: In Proceedings of IEEE CVPR, Vol. 2, IEEE Com-<br />
43
puter Society, Los Alamitos, CA, USA, 2006, pp. 2161–2168.<br />
doi:http://doi.ieeecomputersociety.org/10.1109/CVPR.2006.264.<br />
[70] G. Schindler, M. Brown, R. Szeliski, City-<strong>scale</strong> location recognition, in:<br />
In Proceedings of IEEE CVPR, 2007.<br />
[71] A. Irschara, C. Zach, J.-M. Frahm, H. Bischof, From structure-<strong>from</strong>motion<br />
point clouds to fast location recognition, in: In Proceedings of<br />
IEEE CVPR, 2009, pp. 2599–2606.<br />
[72] F. Dellaert, M. Kaess, Square root SAM: Simultaneous location <strong>and</strong><br />
mapping via square root information smoothing, International Journal<br />
of Robotics Research.<br />
[73] C. Engels, H. Stewénius, D. Nistér, Bundle adjustment rules, in: <strong>Photo</strong>grammetric<br />
Computer Vision (PCV), 2006.<br />
[74] K. Ni, D. Steedly, F. Dellaert, Out-of-core bundle adjustment for large<strong>scale</strong><br />
3D reconstruction, in: IEEE International Conference on Computer<br />
Vision (ICCV), 2007.<br />
[75] M. Kaess, A. Ranganathan, F. Dellaert, iSAM: Incremental smoothing<br />
<strong>and</strong> mapping, IEEE Transactions on Robotics.<br />
[76] T. A. Davis, J. R. Gilbert, S. Larimore, E. Ng, A column approximate<br />
minimum degree ordering algorithm, ACM Transactions on Mathematical<br />
Software 30 (3) (2004) 353–376.<br />
[77] M. Lourakis, A. Argyros, The design <strong>and</strong> implementation of a generic<br />
sparse bundle adjustment software package based on the Levenberg-<br />
44
Marquardt algorithm, Tech. Rep. 340, Institute of Computer Science -<br />
FORTH (2004).<br />
[78] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation<br />
of the spatial envelope, IJCV 42 (3) (2001) 145–175.<br />
[79] J. Hays, A. A. Efros, Scene completion using millions of photographs,<br />
in: SIGGRAPH, 2007.<br />
[80] M. Douze, H. Jégou, H. S<strong>and</strong>hawalia, L. Amsaleg, C. Schmid, Evaluation<br />
of gist descriptors for web-<strong>scale</strong> image search, in: International<br />
Conference on Image <strong>and</strong> <strong>Video</strong> Retrieval, 2009.<br />
[81] J.-M. Frahm, M. Pollefeys, RANSAC for (quasi-) degenerate data<br />
(QDEGSAC), in: CVPR, Vol. 1, 2006, pp. 453–460.<br />
[82] C. Beder, R. Steffen, Determining an initial image pair for fixing the<br />
<strong>scale</strong> of a 3d reconstruction <strong>from</strong> an image sequence, in: Proc. DAGM,<br />
2006, pp. 657–666.<br />
[83] M. Goesele, N. Snavely, B. Curless, H. Hoppe, S. M. Seitz, Multi-view<br />
stereo for community photo collections, in: ICCV, 2007.<br />
[84] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision,<br />
Cambridge University Press, 2000.<br />
[85] S. Kang, R. Szeliski, J. Chai, H<strong>and</strong>ling occlusions in dense multi-view<br />
stereo, in: Int. Conf. on Computer Vision <strong>and</strong> Pattern Recognition,<br />
2001, pp. 103–110.<br />
45
[86] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm,<br />
R. Yang, D. Nister, M. Pollefeys, Real-Time Visibility-Based Fusion<br />
of Depth Maps, in: Proceedings of International Conf. on Computer<br />
Vision, 2007.<br />
46