Fast Robust Large-scale Mapping from Video and Internet Photo ...

photo collections. Here we discuss the most closely related work in the areas of fast structure from motion, camera registration from Internet photo collections, real-time dense stereo, scene summarization, and landmark recognition.

Besides the purely image-based approaches, there is also work on modeling geometry by combining cameras and active range scanners. Früh and Zakhor [4] proposed a mobile system mounted on a vehicle to capture large amounts of data while driving in urban environments. Earlier systems by Stamos and Allen [5] and El-Hakim et al. [6] constrained the scanning laser to a few predetermined viewpoints. In contrast to these comparatively expensive active systems, our approach for real-time reconstruction from video uses only cameras, leveraging the methodology developed by the computer vision community within the last two decades [7, 8].

The first step in our systems is to establish correspondences between the video frames or between the different images of the photo collection, respectively. We use two different approaches for establishing correspondences. For video data we use an extended KLT tracker [9] that exploits the inherent parallelism of the tracking problem to improve computational performance through execution on the graphics processor (GPU). The specific approach is described in detail by Zach et al. [10].

In the case of Internet photo collections we have to address the challenge of dataset collection, which is the following problem: starting with the heavily contaminated output of an Internet image search query, extract a high-precision subset of images that are actually relevant to the query. Existing approaches to this problem [11, 12, 13, 14, 15] consider general visual categories not necessarily related by rigid 3D structure. These techniques use statistical models to combine different kinds of 2D image features (texture, color, keypoints), as well as text and tags. The approach by Philbin et al. [16, 17] uses a loose spatial consistency constraint. In contrast, our system enforces a rigid 3D scene structure shared by the images.

Our systems determine the camera registration from images or video frames. In general, there are two classes of methods for determining the camera registration. The first class leverages the work in multiple view geometry and typically alternates between robustly estimating camera poses and 3D point locations directly [18, 19, 20, 21]. Often bundle adjustment [22] is used in the process to refine the estimate. The other class of methods uses an extended Kalman filter to estimate both camera motion and 3D point locations jointly as the state of the filter [23, 24].

For reconstruction from Internet photo collections our system uses a hierarchical reconstruction method starting from a set of canonical or iconic views representing different viewpoints as well as different parts of the scene. The issue of canonical view selection has been addressed both in the psychology literature [25, 26] and in the computer vision literature [27, 28, 29, 30, 31, 32]. Simon et al. [31] observed that community photo collections provide a likelihood distribution over the viewpoints from which people prefer to take photographs. Hence, canonical view selection identifies prominent clusters or modes of this distribution. Simon et al. find these modes by clustering images based on the output of local feature matching and epipolar geometry verification between every pair of images in the dataset through a 3D registration of the data. While this solution is effective, it is computationally expensive, and it treats scene summarization as a by-product of 3D recon-
