Fast Robust Large-scale Mapping from Video and Internet Photo ...


establish the spatial relationship between the images. In principle we can use feature-based techniques similar to the methods of Philbin et al. and Zheng et al. [16, 17] to detect related frames within the photo collection. These methods use computationally intensive indexing based on local image features (keypoints), followed by loose spatial verification. Their spatial constraints are given by a 2D affine transformation or by filters based on the proximity of local features. Our method effectively enforces 3D structure from motion constraints (SfM constraints) for the dataset.

Similarly, methods like [51, 52] enforce 3D SfM constraints for the full set of registered frames. These methods first exhaustively evaluate all possible pairs for a valid epipolar geometry and then enforce the stronger multi-view geometry constraints. Our method avoids the prohibitively expensive exhaustive pairwise matching through an initial stage in which images are grouped using global image features, prior to indexing based on local features. This gives us a larger gain in efficiency and an improved ability to group similar compositions, which mainly correspond to similar viewpoints. We then enforce the SfM constraints on these groups, which reduces the complexity of the computation by orders of magnitude. Snavely et al. [54] also reduced the cost of enforcing the SfM constraints by restricting the image pairs for which they are computed to a minimal set expected to yield the same overall reconstruction.

4.4.1. Efficiently Finding Corresponding Images in Photo Collections

To efficiently identify related images our system uses the gist feature [78], which encodes the spatial layout and the perceptual properties of the image. The gist feature was found to be effective for grouping images by perceptual similarity and for retrieving structurally similar scenes [79, 80]. To achieve high computational performance we developed a highly parallel gist feature extraction on the GPU. It derives a gist descriptor for each image as the concatenation of two independently computed sub-vectors.

The first sub-vector is computed by convolving a grayscale version of the image, downsampled to 128 × 128, with Gabor filters at 3 different scales (with 8, 8 and 4 orientations for the three scales, respectively). Each of the 20 filter responses is downsampled to a 4 × 4 patch, and the patches are concatenated, yielding a 320-dimensional vector (20 × 16). In addition, we augment this gist descriptor with color information, consisting of a subsampled L*a*b image at 4 × 4 spatial resolution, which contributes a further 48 values (3 × 16). We thus obtain a 368-dimensional vector as a representation of each image in the dataset. The implementation on the GPU improves the computation time by a factor of 100 compared to a CPU implementation. For detailed timings of the gist computation please see Table 1.

In the next step we use the fact that photos from nearby viewpoints with similar camera orientation have similar gist descriptors. Hence, to identify viewpoint clusters we use k-means clustering of the gist descriptors. At this point we aim for an over-segmentation, since that best reduces our computational complexity. We empirically found that searching for 10% as many clusters as images yields a sufficient over-segmentation. As shown in Table 1, the clustering of the gist descriptors can be executed very efficiently. This is key to the computational efficiency of our system, since this early grouping allows us to limit all further geometric verification to within the clusters, avoiding an exhaustive search over the whole dataset. The clustering successfully identifies the popular viewpoints, although it
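
The descriptor layout and the over-segmenting clustering step described in this section can be sketched as follows. This is only a minimal illustration under stated assumptions, not the GPU implementation of the paper: it uses a plain Lloyd's k-means in pure Python, and all names (DESCRIPTOR_DIMS, num_clusters, kmeans, cluster_descriptors) are hypothetical. The constants merely restate the dimensions given in the text, and the number of clusters follows the reported rule of 10% as many clusters as images.

```python
import random

# Layout stated in the text: Gabor filters at 3 scales with 8, 8 and 4
# orientations, each response reduced to a 4 x 4 patch, plus a 4 x 4
# three-channel L*a*b image.
ORIENTATIONS_PER_SCALE = (8, 8, 4)
GRID = 4
GABOR_DIMS = sum(ORIENTATIONS_PER_SCALE) * GRID * GRID  # 20 * 16 = 320
COLOR_DIMS = 3 * GRID * GRID                            # 3 * 16 = 48
DESCRIPTOR_DIMS = GABOR_DIMS + COLOR_DIMS               # 368


def num_clusters(num_images, images_per_cluster=10):
    """10% as many clusters as images (the empirical rule), at least one."""
    return max(1, num_images // images_per_cluster)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (an assumption; the paper does not name a variant)."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct input descriptors.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every descriptor.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def cluster_descriptors(descriptors):
    """Over-segment a list of gist descriptors into candidate viewpoint clusters."""
    return kmeans(descriptors, num_clusters(len(descriptors)))
```

For a collection of 1000 images this rule asks for 100 clusters. Later geometric verification would then only need to consider images that share a cluster label, rather than all image pairs.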
