Towards Wiki-based Dense City Modeling

Arnold Irschara
Institute for Computer Graphics and Vision, TU Graz
irschara@icg.tugraz.at

Horst Bischof
Institute for Computer Graphics and Vision, TU Graz
bischof@icg.tugraz.at

Christopher Zach
VRVis Research Center
zach@vrvis.at

Abstract
This work reports on the advances and on the current status of a terrestrial city modeling approach which uses images contributed by end-users as input. Hence, the Wiki principle well known from textual knowledge databases is transferred to the goal of incrementally building a virtual representation of the occupied habitat. In order to achieve this objective, many state-of-the-art computer vision methods must be applied and adapted to this task. We describe the utilized 3D vision methods and show initial results obtained from the current image database acquired by in-house participants.
1. Introduction

Recently, 3D vision methods have demonstrated increased robustness and yield high-quality models; hence, multi-view modeling appears more often in industry projects targeted at large-scale modeling of environments. In particular, efforts like Google Earth and Virtual Earth aim at the systematic creation of virtual models from aerial images. In this work we focus on the uncoordinated generation of digital copies of urban habitats from community-supplied images. Hence, more 3D vision methods will become known to a larger audience. Panorama creation tools like Autostitch [4] are already well established with the public. Now, even end-user applications for the more sensitive structure-from-motion determination exist, at least as technology previews.
Currently, the Photo Tourism software [19] (and the related PhotoSynth application) is the best-known application for automatic structure and motion computation from a large set of images. A collection of supplied images is analyzed and correspondences are established, from which a relevant subset of views and the respective 3D structure are determined. Photo Tourism does not explicitly incorporate calibrated cameras, but relies partially on the focal length specification found in the image meta-data to obtain the initial metric structure. Images with incorrect or missing meta-data can be registered by DLT pose estimation.
In contrast to Photo Tourism, we employ calibrated cameras, thereby substantially easing the structure and motion computation. The calibration procedure proposed in our framework can easily be performed by end-users. One additional advantage of using calibrated cameras is the higher accuracy of the computed poses, which enables the subsequent generation of dense geometry using multi-view stereo techniques. In the first instance, we aim at textured dense models of quality similar to the results presented in [1]. These raw models need to be post-processed in subsequent steps to allow an efficient Internet-based visualization.
The majority of 3D modeling approaches is intended for decentralized use on personal computers. Vergauwen and Van Gool [23] present a Web-based interface to their 3D modeling engine, again working with uncalibrated cameras [14]. Registered users can upload their images and subsequently receive the resulting depth maps. Their proposed system is targeted at reconstructing individual sites, but is not aimed at building and maintaining a global image and 3D model database.
The goal of our work is the creation of a 3D virtual model representing an urban environment from captured digital images. Instead of using publicly available photo collections (as done in [19]), we rely on images submitted by interested users, since the participants additionally need to provide the camera calibration data. By adding more images, the virtual representation of an urban habitat can be incrementally maintained and gradually refined. Hence, we apply the well-known and effective Wiki principle to the objective of creating a photorealistic 3D city model. The following sections describe the utilized methods and the results available so far in more detail.
Figure 1. 96 calibration markers arranged in a 4 by 4 layout on the floor.
2. Camera Calibration

In order to avoid distinguishing between scenes with and without dominant planar structures (e.g. [15]), we require the cameras to be calibrated. Additionally, our experience with self-calibration techniques indicates that the accuracy of structure and motion computation is expected to be higher with a calibrated setup. In particular, accurate pose is essential to obtain a faithful reconstruction using dense depth techniques.
To ease the calibration effort for the end-user, we employ a procedure aiming for the accuracy of target calibration techniques without the need for a precise calibration pattern. The approach is based on simple printed markers imaged in several views. Using specific markers makes it possible to establish robust and correct correspondences between the views. Image-based city modeling generally requires an infinite-focus and wide-angle setup, hence the calibration pattern needs to be sufficiently large. Thus, the marker patterns are printed on several sheets of paper and are typically arranged on the floor (see Figures 1(a) and (b)). These pages can be laid out arbitrarily, hence the well-known method of Zhang [27] is not applicable. It is not necessary to have all markers visible in the captured images, but for good calibration results most markers should be visible and well distributed in the images.
The first step in the calibration procedure is the detection of the circular markers in the images and the extraction of the unique marker ID. The 2D feature point associated with a marker is initially the center of the extracted ellipse, which only approximates the true marker center. Additionally, the 2D feature position can be refined using the central checkerboard pattern. In this case a non-linear search for the correct center is performed by aligning a synthetic checkerboard pattern with a suitable section of the marker image.
Matching feature points across multiple views is trivial, since unique and easily extractable IDs are available. Of course, the uniqueness of extracted markers in every image needs to be checked to avoid incorrect detections in case of blurred or otherwise low-quality images.
Since the marker images are laid out on a planar surface, corresponding feature points are related by a homography.
Hence, the first estimation of the lens distortion parameters attempts to minimize the reprojection error between extracted feature points with free homography and lens distortion parameters [13]. More formally, if $x_i^k$ denotes the position of marker $k$ in the $i$-th image, the initial distortion estimation determines

$$\arg\min_{H_{ij},\,\theta} \sum_{i,j} \left\| \tilde{D}(x_j^k, \theta) - \tilde{D}(H_{ij}\, x_i^k, \theta) \right\|^2, \qquad (1)$$

where $H_{ij}$ denotes the image homography from view $i$ to view $j$ and $\tilde{D}(x, \theta)$ is the inverse distortion function with coefficients $\theta$. The distortion model is

$$\tilde{D}(x, \theta) = \left(x - (u_0, v_0)^T\right) \cdot \left(1 + k_1 r^2 + k_2 r^4\right), \qquad (2)$$

with $r = \|x - (u_0, v_0)^T\|$. $\theta$ is the vector $(u_0, v_0, k_1, k_2)$ consisting of the distortion center $(u_0, v_0)$ and the coefficients $k_1$ and $k_2$.
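As an illustration, the function $\tilde{D}$ of Eq. (2) can be written in a few lines; a minimal NumPy sketch (the function name and parameter layout are our own, not taken from the paper's implementation):

```python
import numpy as np

def undistort(x, theta):
    """Inverse distortion function D~(x, theta) of Eq. (2).

    x     -- 2D image point as array [u, v]
    theta -- parameter vector (u0, v0, k1, k2): distortion center
             and radial coefficients
    Returns the corrected point relative to the distortion center,
    exactly as the formula is stated.
    """
    u0, v0, k1, k2 = theta
    d = x - np.array([u0, v0])     # offset from the distortion center
    r2 = d @ d                     # r^2 = ||x - (u0, v0)||^2
    return d * (1.0 + k1 * r2 + k2 * r2 ** 2)
```

With all coefficients zero the function reduces to the identity (up to the recentering), which is a convenient sanity check.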
The center of radial distortion $(u_0, v_0)$ is independent of the optical principal point, thus essentially removing the need for decentering distortion parameters [22]. The initial homographies are set to the gold standard results, and the distortion parameters are initialized with the image center and 0 for the coefficients $k_1$ and $k_2$, respectively. The non-linear minimization is performed with a (sparse) Levenberg-Marquardt method. Note that the homographies are not independent: a consistent set of inter-image homographies should satisfy $H_{ij} = H_{lj} H_{il}$ for all $l$. This can be enforced in our implementation by using a minimal parametrization solely based on homographies between adjacent views, $H_{i,i+1}$, and representing $H_{ij} = \prod_{j > l \ge i} H_{l,l+1}$.
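The minimal parametrization above can be sketched as follows (a NumPy illustration under our own naming; `H_adj[l]` stands for the adjacent-view homography $H_{l,l+1}$):

```python
import numpy as np

def chained_homography(H_adj, i, j):
    """Compose H_ij from adjacent-view homographies H_adj[l] = H_{l,l+1}.

    The consistency constraint H_ij = H_lj H_il holds by construction,
    since every H_ij is a product of the minimal parameters H_{l,l+1}.
    Assumes i <= j; H_ii is the identity.
    """
    H = np.eye(3)
    for l in range(i, j):
        H = H_adj[l] @ H   # left-multiply: H_ij = H_{j-1,j} ... H_{i,i+1}
    return H
```

Composing from the left matches the convention that $H_{i,i+1}$ maps points of view $i$ into view $i+1$.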
After determining the initial estimate for the lens distortion, the focal length of the camera is estimated from the set of homographies. Both [21] and [10] employ a non-linear minimization technique for intrinsic parameter estimation, and an initial estimate is required. We utilize a much simpler search technique to quickly determine the camera intrinsics: at first, we assume that the principal point is close to the image center and that the aspect ratio and skew are one and zero, respectively. Hence, we search for a constant but unknown focal length $f$ determining the calibration matrix $K$. If the correct intrinsic matrix $K$ is known, the image-based homographies $H_{ij}$ can be updated to homographies between metric image planes, $\tilde{H}_{ij} = K^{-1} H_{ij} K$. For a particular view $i$ assumed to be in canonical pose, $\tilde{H}_{ij}$ can be decomposed as $\tilde{H}_{ij} = R_{ij} - t_{ij} n_i^T / d_i$, where $(R_{ij}, t_{ij})$ denotes the relative pose, and $n_i$ and $d_i$ denote the plane normal and distance (with respect to the coordinate frame of view $i$), respectively. Note that each $\tilde{H}_{ij}$ provides its own estimate of $n_i = n_i(\tilde{H}_{ij})$.
For the true calibration matrix $K$, the extracted normals $n_i(\tilde{H}_{ij})$ should coincide in one common estimate of the plane normal. Hence, a low variance of the set $\{n_i\}$ indicates approximately correct calibration parameters. A slight complication is induced by the fact that decomposing $\tilde{H}_{ij}$ results in two possible relative poses and plane parameters (denoted by $n_i^+$ and $n_i^-$). Let $(n_0^+, n_0^-)$ be the most separated pair of normals from all pairs $(n_i^+(\tilde{H}_{ij}), n_i^-(\tilde{H}_{ij}))$. We use $n_0^+$ and $n_0^-$ as the estimates for the mean of the set $\{n_i\}$. Now, the score for $K$ is the minimum of

$$\sum_{i,j} \min\left( \angle(n_i^+(\tilde{H}_{ij}), n_0^+),\; \angle(n_i^-(\tilde{H}_{ij}), n_0^+) \right) \qquad (3)$$

and

$$\sum_{i,j} \min\left( \angle(n_i^+(\tilde{H}_{ij}), n_0^-),\; \angle(n_i^-(\tilde{H}_{ij}), n_0^-) \right). \qquad (4)$$

This score is evaluated for potential choices of $f$, e.g. $f \in [0.3, 3]$ in terms of normalized pixel coordinates. The value of $f$ with the lowest score is used as the initial estimate for the focal length. This procedure is both simple and very fast, and yields sufficiently accurate focal lengths in our experiments.
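Assuming the normal pairs $(n_i^+, n_i^-)$ have already been extracted from the homography decompositions, the score of Eqs. (3)/(4) for one candidate $K$ might be evaluated as follows (our own sketch, not the paper's code):

```python
import numpy as np

def angle(a, b):
    """Angle between two unit vectors, clipped for numerical safety."""
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

def calibration_score(normal_pairs):
    """Score of Eqs. (3)/(4) for one candidate calibration matrix K.

    normal_pairs -- list of (n_plus, n_minus) unit normals obtained by
    decomposing each metric homography ~H_ij. A low score means the
    normals agree, i.e. the candidate focal length is about right.
    """
    # reference pair: the decomposition whose two normals are most separated
    n0p, n0m = max(normal_pairs, key=lambda p: angle(p[0], p[1]))
    s_plus = sum(min(angle(p, n0p), angle(m, n0p)) for p, m in normal_pairs)
    s_minus = sum(min(angle(p, n0m), angle(m, n0m)) for p, m in normal_pairs)
    return min(s_plus, s_minus)
```

The grid search over $f$ then simply evaluates this score per candidate focal length and keeps the minimizer.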
With the (approximate) knowledge of the focal length, an initial metric reconstruction based on two appropriate views is generated. The remaining views are added by estimating their absolute poses. The final bundle adjustment procedure optimizes for the parameters of the forward distortion function, hence the inverse of the originally obtained distortion parameters is required. Since the employed polynomial distortion model is not closed under function inversion, the initial forward distortion parameters are determined by a least-squares approach. A final bundle adjustment procedure is applied to refine the camera intrinsics and distortion parameters and to improve the only approximately planar 3D structure and the camera poses.
3. Structure and Motion Computation

3.1. Preprocessing

The first step after image uploading is resampling the image according to the obtained lens distortion. In order to avoid frequent recomputation or retrieval of the corresponding lookup table, this procedure is run as a daemon process caching the several least recently used distortion lookup tables. Afterwards, feature points and their respective descriptors are extracted. The current implementation uses SIFT features [9] because of their reported success in the vision community.
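The caching behaviour of such a daemon can be imitated with a least-recently-used cache; a minimal sketch (the cache key and table layout are illustrative assumptions; the radial model shown is the one from Section 2):

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=8)   # keep only the most recently used tables
def undistortion_lut(width, height, u0, v0, k1, k2):
    """Per-camera undistortion lookup table, cached LRU-style.

    Maps every target pixel to a source position in the distorted
    image using the radial model of Section 2. Returns an array of
    shape (2, height, width): source u and v coordinates.
    """
    u, v = np.meshgrid(np.arange(width, dtype=float),
                       np.arange(height, dtype=float))
    du, dv = u - u0, v - v0
    r2 = du ** 2 + dv ** 2
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2
    return np.stack([u0 + du * scale, v0 + dv * scale])
```

Repeated uploads from the same camera then reuse the cached table; only the least recently used entries are evicted, mirroring the daemon described above.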
3.2. Image Matching

Retrieving images similar to a given one is currently a very active research topic (e.g. [16, 12, 7, 17]). We employ a visual vocabulary tree approach similar to [12] to retrieve a set of potentially similar views from a large collection of images. Additional image-related data that limits the candidate images for matching, such as a GPS position, is therefore helpful but not strictly necessary. The vocabulary tree enables us to efficiently match a single image against a database containing thousands or even millions of images. Images found in the database with a high score according to the vocabulary tree are investigated further by a more discriminant matching procedure. Since the score induced by the vocabulary tree may miss relevant images, the set of candidate images used for further matching is augmented with the neighbors of highly ranked images with respect to the already constructed view network.
In our system the vocabulary tree is trained in an unsupervised manner with a subset of 2 × 10^6 SIFT feature vectors randomly taken from 2500 street-side images. The descriptor vectors are then hierarchically quantized into clusters using a k-means algorithm. We set the branch factor to 10 and allow up to 7 tree levels. For each level the k-means algorithm is initialized with different seed clusters, and the result producing the lowest Euclidean distance error is retained. Once the vocabulary tree is trained, searching the visual vocabulary is very efficient and new images can be inserted on-the-fly. Based on the scoring function, a ranking of relevant database images is reported and used for post-verification.
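The hierarchical k-means quantization described above can be sketched as follows (a toy version with a single k-means seed per level; the system described here retries several seeds and keeps the lowest-distortion result):

```python
import numpy as np

def build_vocab_tree(desc, branch=10, depth=7, seed=0, iters=10):
    """Hierarchically quantize descriptor vectors into a vocabulary tree.

    desc   -- (n, d) float array of descriptors
    branch -- branch factor (10 in our setting)
    depth  -- maximum number of tree levels (up to 7 in our setting)
    Returns a nested dict {'centers': (branch, d), 'children': [...]},
    or None for a leaf node (a visual word).
    """
    rng = np.random.default_rng(seed)
    if depth == 0 or len(desc) < branch:
        return None
    # initialize cluster centers from random distinct descriptors
    centers = desc[rng.choice(len(desc), branch, replace=False)]
    for _ in range(iters):              # plain Lloyd iterations
        d2 = ((desc[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for c in range(branch):
            pts = desc[labels == c]
            if len(pts):
                centers[c] = pts.mean(0)
    children = [build_vocab_tree(desc[labels == c], branch, depth - 1,
                                 seed, iters) for c in range(branch)]
    return {'centers': centers, 'children': children}
```

Quantizing a new descriptor then means descending the tree, picking the nearest center at every level.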
In our current setting we rely on an entropy-weighted scoring similar to the tf-idf ("term frequency inverse document frequency") scheme described in [18]. Let $D$ be an image in our database and $t$ be the term in the vocabulary associated to feature $f$ of the current query image $Q$; then our scoring function is

$$\sum_{t \in Q \cap D} \log\left( \frac{N}{n(t)} \right), \qquad (5)$$

where $N$ is the total number of images in the collection and $n(t)$ is the number of images that contain term $t$. In order to guarantee fairness between database images with different numbers of features, the query results are normalized by the self-scoring result. Therefore, if a database image is used as query, a score of 1 is returned. At the same time we get an absolute measure of image-to-image similarity, which enables us to set a global threshold for scoring. The $k$ top-ranked images reported by the vocabulary tree are then taken for post-verification. Typically, we set $k$ to 1% of the current database size.
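Treating each image as a set of visual words, the normalized score of Eq. (5) can be computed as follows (our own sketch; names are illustrative):

```python
import math

def tfidf_score(query_terms, db_terms, doc_freq, n_images):
    """Entropy-weighted scoring of Eq. (5), normalized by the self-score.

    query_terms, db_terms -- sets of visual words (vocabulary tree leaves)
    doc_freq[t]           -- number of database images containing term t
    n_images              -- total number of images N in the collection
    Returns a similarity in [0, 1]; querying a database image against
    itself returns exactly 1.
    """
    def raw(a, b):
        return sum(math.log(n_images / doc_freq[t]) for t in a & b)
    self_score = raw(db_terms, db_terms)
    return raw(query_terms, db_terms) / self_score if self_score else 0.0
```

The normalization by the self-score is what makes a single global acceptance threshold meaningful across images with different feature counts.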
Figure 2. (a) Ten sample images from a collection of 105 facade images. (b) Side view of the camera poses and 9097 triangulated feature points.

Our verification procedure is based on exhaustive matching and a RANSAC-based geometric consistency check. First of all, correspondences are computed by mutual nearest neighbor matching of the 128-dimensional SIFT feature vectors. We adopt the idea of [2] and match only features with the same contrast, by taking advantage of the Laplacian sign. In addition, the epipolar geometry is verified using a RANSAC procedure. If sufficient reliable 3D structure for the image measurements is available, we use pose estimation for verification. The calibrated setting enables us to use the Five-Point algorithm [11] for image-to-image matches and the Three-Point algorithm [6] for 3D-to-2D correspondences as minimal hypothesis generators. Images are accepted as neighbors if a quality criterion is satisfied: we generate a number of N hypotheses in advance and then check whether the number of outliers exceeds the precomputed threshold. Since it has often been observed (e.g. in [20]) that the RANSAC stopping criterion is sometimes optimistic, we use a rather conservative confidence factor p of 1 − 10^{-4} and require a minimal number of 20 inliers. If an image passes the post-verification procedure, the neighbors of this image are also matched with the query image. Note that this strategy is highly efficient for our purpose, since frequently the first image ranked by the vocabulary tree is already a candidate match satisfying the geometry constraints.
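The number of hypotheses to generate in advance follows from the standard RANSAC confidence formula; a small helper (our sketch, using the textbook relation rather than the exact thresholding of the system described here):

```python
import math

def ransac_iterations(p, w, s):
    """Number of RANSAC hypotheses needed to hit at least one
    all-inlier minimal sample with confidence p, given an expected
    inlier ratio w and sample size s (s = 5 for the Five-Point,
    s = 3 for the Three-Point algorithm).
    """
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))
```

A conservative confidence such as p = 1 − 10^{-4} noticeably increases the iteration count compared to the common p = 0.99, which is the price paid for robustness against the optimistic stopping criterion.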
Unlike for the vocabulary tree scoring, where the whole set of SIFT descriptors is utilized, we limit the number of features for exhaustive matching. In general we take only the 1500 best features according to their DoG magnitude response for matching. Therefore, searching for correspondences is sufficiently fast (see Section 5 for detailed timings).
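Mutual nearest-neighbor matching restricted to features of equal contrast can be sketched as follows (a NumPy illustration with our own naming; a production system would use an optimized nearest-neighbor library):

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b, sign_a, sign_b):
    """Mutual nearest-neighbor matching of descriptors, comparing only
    features with the same Laplacian (contrast) sign, following the
    idea adopted from [2].

    desc_a, desc_b -- (n, d) descriptor arrays (at most the 1500
                      strongest features per image in our setting)
    sign_a, sign_b -- per-feature contrast signs (+1 / -1)
    Returns a list of index pairs (i, j).
    """
    # pairwise squared distances, with opposite-sign pairs masked out
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    d2[sign_a[:, None] != sign_b[None, :]] = np.inf
    nn_ab = d2.argmin(axis=1)   # best match in B for each A feature
    nn_ba = d2.argmin(axis=0)   # best match in A for each B feature
    return [(i, int(j)) for i, j in enumerate(nn_ab)
            if np.isfinite(d2[i, j]) and nn_ba[j] == i]
```

The sign mask roughly halves the candidate set per feature, which both speeds up matching and removes a class of false correspondences.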
3.3. Upgrading the View Network

Each time a new image is added to the view network, its neighbors are computed and the reconstruction process is started. By reconstruction we mean structure and motion computation, namely the estimation of the camera orientations and the sparse 3D structure of the matched features.

Since images may be taken from different locations, we do not expect to obtain a single coherent reconstruction, but a forest of multiple reconstructions. We require that a reconstruction consists of at least three images (a view triple) and 20 common triangulated points. In general, four different cases can occur when a new image is processed:
1. The view can be robustly registered with exactly one already present reconstruction.

2. The image can be aligned with multiple reconstructions.

3. The current view cannot be (robustly) aligned with an existing reconstruction, but forms a good view triple with two other already present views.

4. The geometric relation of this image with any of the known views cannot be established, and structure and motion determination is postponed until a new suitable view is inserted.
In the first case the position of the current view can be computed immediately by robust absolute pose estimation, since 2D-to-3D point correspondences are known. Thereafter, the camera parameters are optimized by iterative refinement and new correspondences are triangulated to 3D points.

In the second case, where the current image is part of two or more different reconstructions, the reconstructions are merged. We robustly determine a 3D-to-3D similarity transform for the registration of the corresponding 3D point sets. This process is then followed by Euclidean bundle adjustment, where the sparse geometry and the camera parameters are optimized.

After updating the reconstructions in these first two cases, the epipolar neighbors of the newly inserted view are traversed and checked to see whether the updated 3D structure now allows determination of their poses.
Figure 3. Dense reconstruction from a collection of 105 facade images.

In the third scenario the neighbors of the current image are estimated as described previously and a new reconstruction is initialized from a well-conditioned view triple. The view triple should provide a good triangulation angle and at the same time have many correspondences. Therefore, in the first step we identify the view pair which minimizes

$$\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sin^2(\alpha_i)}, \qquad (6)$$
where $\alpha_i$ is the angle between the two camera rays for the 3D point $X_i$. Thereafter, a third view which minimizes the value in Eq. 6 with respect to this first configuration is estimated. Note that $1/\sin(\alpha_i)$ approximates the uncertainty (deviation) of $X_i$ in the depth direction. In [3] the view pair with maximal mean roundness (essentially the same as $\frac{1}{N} \sum \sin(\alpha_i)$) is taken. Such an approach does not consider the number of correspondences between two views. In this work we assume that the accuracy of further (least-squares) computations depending on the initial structure scales with $1/\sqrt{N}$. Consequently, Eq. 6 estimates the mean variance of the initial structure for a given view pair. The relative pose between the first two views is computed by the Five-Point algorithm, and the third camera is inserted by the Three-Point algorithm with respect to the triangulated 3D points. Thereafter, bundle adjustment is used to globally optimize the exterior camera orientations and the initial structure. A view triple is considered a valid reconstruction if the sum of triangulation angles is above a threshold: we require at least twenty 3D points with measurements in all views and a triangulation angle greater than 2°.
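The selection of the initial view pair via Eq. (6) can be sketched in a few lines (our own naming; angles are assumed to be in radians):

```python
import math

def pair_score(angles):
    """Mean depth variance of Eq. (6) for one candidate view pair:
    (1/N) * sum 1/sin^2(alpha_i) over the N triangulated points.
    Lower is better: many correspondences and wide triangulation
    angles both reduce the score.
    """
    return sum(1.0 / math.sin(a) ** 2 for a in angles) / len(angles)

def best_initial_pair(candidates):
    """Pick the view pair minimizing Eq. (6); `candidates` maps each
    pair (i, j) to its list of triangulation angles in radians."""
    return min(candidates, key=lambda p: pair_score(candidates[p]))
```

Because the sum is divided by the number of points, a pair with many moderately wide-angle correspondences can beat a pair with a few wide ones, which is exactly the trade-off argued for above.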
Whenever a number of M (15 in our case) views has been added to a reconstruction, bundle adjustment is run, optimizing all cameras and triangulated 3D points. Our bundle adjustment implementation is similar to the one described in [8]. Thereafter, the reprojection error is computed for each image measurement; 3D points with an average reprojection error larger than 1.3 pixels and a triangulation angle less than 2° are removed. Our experiments suggest that this strategy improves both the accuracy and the robustness of the reconstruction algorithm. A sparse reconstruction result is shown in Figure 2.
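The point-filtering rule applied after bundle adjustment is simple enough to state directly (a sketch under our own naming; per the text, a point is dropped only when both criteria fail):

```python
def filter_points(points, max_error=1.3, min_angle=2.0):
    """Drop weak 3D points after bundle adjustment: a point is removed
    if its average reprojection error exceeds 1.3 pixels AND its
    triangulation angle is below 2 degrees.

    points -- iterable of (avg_reproj_error_px, triangulation_angle_deg)
    """
    return [p for p in points
            if not (p[0] > max_error and p[1] < min_angle)]
```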
4. Dense Reconstruction

Obtaining the initial 3D structure and motion is an essential aspect of image-based modeling, but dense geometry is required for a faithful virtual representation of the captured scene. Since we face a large number of images with potentially associated depth maps, we focus on simple but fast dense depth estimation procedures. Additionally, processing of multiple sensor images with respect to a key view (in contrast to traditional two-frame stereo) is required. Plane-sweep approaches to dense stereo [24, 5] enable an efficient, GPU-accelerated procedure to create the depth maps. In order to obtain reasonable depth values in homogeneous regions, we employ GPU-based scanline optimization in our framework [26]. Note that dense geometry generation is currently still performed in an offline fashion for individual large and connected view networks.
In the case of general view networks, a suitable selection of the sensor views used for depth estimation is necessary. We use the following simple but effective heuristic to select appropriate sensor images, which is based on an estimate of image overlap and viewing directions: for a particular key view and a potential sensor view, we determine the correspondences between these views and compute the convex hull of the respective 2D measurements in the key view. The area of this convex hull $A_H$ in relation to the total key image area $A$ gives an estimate of the relevant image overlap. Note that $A$ does not necessarily denote the whole key image size, since insignificant image portions like sky regions can be excluded. Sensor view candidates with an overlap $A_H/A$ smaller than a given threshold (typically set to 0.3) are discarded. Image overlap is only a partial guideline for view selection. The angle between corresponding camera rays is another suitable criterion. Very small angles give rise to large depth uncertainties, whereas larger angles are susceptible to image distortion and occlusions. From a practical point of view, triangulation angles of about $\alpha_0 = 6°$ are favorable for dense stereo (meaning a distance-to-baseline ratio of about 10:1). Consequently, view pairs with a large overlap and appropriate triangulation angles are preferable.
Hence, we rank the views with sufficient overlap according to the following score:

$$\frac{A_H}{A}\, \operatorname{median}(\psi(\alpha_i)), \qquad (7)$$

where $\alpha_i$ is the angle between corresponding rays and $\psi(\cdot)$ is a unimodal weighting function with its maximum at $\alpha_0$. We choose $\psi$ as

$$\psi(\alpha) = \alpha^2 e^{-2\alpha/\alpha_0}. \qquad (8)$$

The two views with the highest scores are taken as sensor views.
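The ranking score of Eqs. (7)/(8) is easy to evaluate once the overlap ratio and the ray angles are known; a minimal sketch (our own naming; the convex-hull overlap is assumed to be precomputed):

```python
import math
import statistics

def psi(alpha, alpha0=math.radians(6.0)):
    """Unimodal weight of Eq. (8), psi(a) = a^2 * exp(-2a / a0),
    which peaks at the preferred triangulation angle alpha0."""
    return alpha ** 2 * math.exp(-2.0 * alpha / alpha0)

def sensor_score(overlap, angles, min_overlap=0.3):
    """Ranking score of Eq. (7) for one candidate sensor view.

    overlap -- A_H / A, the convex-hull area of the shared features
               over the (relevant) key-image area
    angles  -- per-correspondence ray angles alpha_i in radians
    Candidates below the overlap threshold are discarded (score 0).
    """
    if overlap < min_overlap:
        return 0.0
    return overlap * statistics.median(psi(a) for a in angles)
```

Using the median rather than the mean makes the score robust to a few outlier correspondences with extreme angles.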
Dense depth estimation depends heavily on the quality of the provided epipolar geometry. In order to reduce the influence of inaccuracies in the estimated poses on the depth maps, and to increase the performance, multi-view stereo is applied on downsampled images (512 × 384 pixels for 4:3 format digital images). Matching cost computation and scanline optimization for view triples (one key view and two sensor views) take about 3 s on a GeForce 7800GS.

The set of depth maps provides 2.5D geometry for each view. These depth maps are subsequently fused into a common 3D model using a robust depth image integration approach [25]. This method results in globally optimal 3D models according to the employed energy functional, but it operates in batch mode. The incorporation of this step is the major reason that dense geometry generation remains an offline process. The result obtained by depth map fusion for the dataset depicted in Figure 2 (augmented with per-vertex coloring) is shown in Figure 3.
5. Results

We tested our system on different image collections, varying from a few hundred to thousands of pictures. All images are calibrated with the method described in Section 2. The calibration precision (i.e. the final mean reprojection error) ranges from 1/20 to 1/7 pixel. Our largest test set is a database containing about 7000 street-side images taken with four compact digital cameras from different manufacturers. The images were captured on different days under varying illumination conditions. Furthermore, the images are only partially ordered. The size of the source images varies from two to seven megapixels. In order to remove compression artefacts, the supplied images are resampled to half resolution for further processing. Adding an image to the view network takes approximately half a minute; most of the time is spent on exhaustive matching and bundle adjustment. Average run-times required for each step are listed in Table 1.
In all our tests we use a pre-trained generic vocabulary tree with about 7 × 10^5 leaf nodes for searching the image database. Our experiments suggest that the vocabulary tree approach to image retrieval is very efficient for large-scale structure and motion computation. In Figure 4 the change of retrieval performance depending on the database size is shown.

Operation                                   time [s]
Undistortion (2272 × 1704 pixels)           1.0
Feature extraction (2272 × 1704 pixels)     1.6
Vocabulary scoring                          0.2
Brute-force NN matching (1500 × 1500)       k × 0.35
RANSAC (3000 samples, Five-Point)           k × 0.4
Structure and motion                        0.5
Bundle adjustment (100 views)               60
Dense stereo                                3.0

Table 1. Typical average timings of our system on an AMD Athlon(tm) 64 X2 Dual Core Processor 4400+. The parameter k specifies the number of images taken for post-verification. Typically, we set k to 1% of the current database size.

Figure 4. Vocabulary tree performance for image retrieval depending on the database size. The y-axis shows the probability of finding an epipolar neighbor in the first k-ranked images reported by the vocabulary tree scoring.
shown. On average an image has an overlap with about<br />
eight other images in the database. However, there are<br />
also images with no geometric relation to existing database<br />
entries. The detailed distribution of epipolar neighbors is<br />
shown in Figure 5. The retrieval per<strong>for</strong>mance is measured<br />
by the rank of the first image that passes the geometric verification<br />
procedure. Note, even <strong>for</strong> a relative large database<br />
of about 7000 images, the first ranked image reported by<br />
the vocabulary tree is a valid epipolar neighbor of the query<br />
image with a probability of more than 90%. What we can<br />
observe also is that the scoring saturates rapidly, there<strong>for</strong>e<br />
employing the 1% from the vocabulary tree ranking is a<br />
good choice between speed and retrieval quality. Finally,<br />
two captured sites and the resulting sparse and dense models<br />
are shown in Figure 6 and Figure 7. The final reprojection errors of the 3D structure and camera poses for these datasets after bundle adjustment are about 1/3 pixel.

Figure 5. Probability density function of the number of verified epipolar neighbors for a query image in a database of 7181 street-side images. On average, a query image overlaps with about eight images in the database.
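The retrieval evaluation described above can be sketched as follows; the data and function names are hypothetical, and only the metric itself (the fraction of queries whose top-k shortlist contains a geometrically verified epipolar neighbor) follows the text:

```python
# Sketch with toy data: probability that a true epipolar neighbor appears
# among the first k images ranked by the vocabulary tree scoring.

def hit_rate_at_k(rankings, neighbors, k):
    """rankings: query id -> ranked list of database image ids.
    neighbors: query id -> set of ids passing geometric verification."""
    hits = sum(1 for q, ranked in rankings.items()
               if any(img in neighbors[q] for img in ranked[:k]))
    return hits / len(rankings)

# Toy example with three queries over a tiny "database".
rankings = {"q1": [3, 7, 1], "q2": [5, 2, 9], "q3": [8, 4, 6]}
neighbors = {"q1": {7}, "q2": {5}, "q3": {0}}  # q3 has no ranked neighbor
print(hit_rate_at_k(rankings, neighbors, 1))  # only q2's top-1 is a neighbor
print(hit_rate_at_k(rankings, neighbors, 2))  # q1 is recovered at rank 2
```

Plotting this hit rate against k for increasing database sizes reproduces the kind of saturation curve that Figure 4 reports.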
6. Future Work

Currently, images are contributed by persons associated with this project who have basic knowledge of 3D computer vision. Future work needs to increase the robustness of the structure-from-motion methods before the general public can participate in the creation of the visual database. In particular, it must be ensured that low-quality or defective images do not degrade the 3D models in the generated database.
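Such a safeguard could take the form of simple acceptance gates on the registration result. The thresholds below are illustrative assumptions, with the error bound loosely informed by the roughly 1/3 pixel reprojection error reported above:

```python
# Hypothetical quality gate (not the paper's method): accept an uploaded
# image only if its registration is supported by enough verified matches
# and a sufficiently small mean reprojection error.

def accept_image(num_inliers, inlier_ratio, reproj_error_px,
                 min_inliers=50, min_ratio=0.3, max_error_px=1.0):
    """Return True if the newly registered image meets all quality gates."""
    return (num_inliers >= min_inliers
            and inlier_ratio >= min_ratio
            and reproj_error_px <= max_error_px)

print(accept_image(120, 0.55, 0.4))  # well-registered image -> True
print(accept_image(12, 0.08, 2.5))   # defective or unrelated image -> False
```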
Another future option is to accept uncalibrated images for the purpose of image localization and geometric alignment. Such images will generally not result in an update of the 3D structure.
Dense geometry generation is currently not integrated into the online processing workflow. Depth estimation can easily be adapted to the online setting, but the employed range image fusion method is entirely an offline procedure. A robust and efficient depth image fusion approach that works incrementally is an ongoing research topic.
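One simple way to make fusion incremental, sketched here as an assumption rather than the method used in this work, is a per-pixel running weighted average that folds each new registered depth map into the model without revisiting earlier maps:

```python
# Minimal sketch of incremental depth fusion: a running weighted average
# per pixel, so each new depth map updates the model in O(pixels) and no
# previously fused map has to be stored or reprocessed.

class IncrementalDepthFusion:
    def __init__(self, width, height):
        self.sum_d = [[0.0] * width for _ in range(height)]
        self.sum_w = [[0.0] * width for _ in range(height)]

    def add(self, depth, weight):
        """Accumulate one registered depth map (None marks missing pixels)."""
        for y, row in enumerate(depth):
            for x, d in enumerate(row):
                if d is not None:
                    self.sum_d[y][x] += weight * d
                    self.sum_w[y][x] += weight

    def fused(self):
        """Current fused depth estimate; None where nothing was observed."""
        return [[sd / sw if sw > 0 else None
                 for sd, sw in zip(drow, wrow)]
                for drow, wrow in zip(self.sum_d, self.sum_w)]

f = IncrementalDepthFusion(2, 1)
f.add([[2.0, None]], weight=1.0)
f.add([[4.0, 3.0]], weight=1.0)
print(f.fused())  # [[3.0, 3.0]]
```

Unlike a global offline optimization, such averaging handles outliers poorly, which is precisely why robust incremental fusion remains open.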
Acknowledgements

This work is partly funded by the Vienna Science and Technology Fund (WWTF) and by the Kplus VRVis research center.
Figure 6. (a) Some sample images and (b) sparse reconstruction from more than 400 viewpoints. (c) Dense reconstruction from a subset of registered images.
Figure 7. (a) Some sample images from a collection of 49 views. (b) Sparse reconstruction and camera orientations. (c) Final dense reconstruction after depth map fusion.