Towards Wiki-based Dense City Modeling - Institute for Computer ...

Figure 2. (a) Ten sample images from a collection of 105 facade images. (b) Side view of the camera poses and 9097 triangulated feature points.

measurements is available, we use pose estimation for verification. The calibrated setting enables us to use the Five-Point algorithm [11] for image-to-image matches and the Three-Point algorithm [6] for 3D-to-2D correspondences as minimal hypothesis generators. Images are accepted as neighbors if a quality criterion is satisfied: we generate a fixed number N of hypotheses in advance and then check whether the number of outliers stays within the precomputed threshold. Since it has often been observed (e.g. in [20]) that the RANSAC stopping criterion is sometimes optimistic, we use a rather conservative confidence factor $p = 1 - 10^{-4}$ and require a minimum of 20 inliers. If an image passes the post-verification procedure, the neighbors of this image are also matched with the query image. Note that this strategy is highly efficient for our purpose, since the image ranked first by the vocabulary tree is frequently already a candidate match satisfying the geometric constraints.

Unlike for the vocabulary tree scoring, where the whole set of SIFT descriptors is utilized, we limit the number of features for exhaustive matching. In general, we take only the 1500 best features, ranked by their DoG magnitude response, for matching. Therefore, searching for correspondences is sufficiently fast (see Section 5 for detailed timings).

3.3. Upgrading the View Network

Each time a new image is added to the view network, its neighbors are computed and the reconstruction process is started. By reconstruction we mean structure and motion computation, namely the estimation of the camera orientations and the sparse 3D structure of the matched features. Since images may be taken from different locations, we do not expect to obtain a single coherent reconstruction, but a forest of multiple reconstructions.
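As a side note on the verification step described before Section 3.3, the number N of hypotheses generated in advance can be related to the confidence factor p through the standard RANSAC bound. The text does not say how N is obtained, so the sketch below is only an illustration under that assumption; the function name `ransac_num_hypotheses` and the inlier ratio of 0.5 are hypothetical.

```python
import math


def ransac_num_hypotheses(confidence, inlier_ratio, sample_size):
    """Standard RANSAC bound (an assumption here, not taken from the paper):
    number of minimal samples needed so that, with probability `confidence`,
    at least one sample contains only inliers."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    if not 0.0 < inlier_ratio < 1.0:
        raise ValueError("inlier_ratio must lie strictly between 0 and 1")
    # Probability that one minimal sample is free of outliers.
    p_clean = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))


# Confidence factor from the text: p = 1 - 10^-4.
P = 1.0 - 1e-4

# Five-Point algorithm (5 correspondences per sample), hypothetical 50% inliers.
n_five_point = ransac_num_hypotheses(P, inlier_ratio=0.5, sample_size=5)

# Three-Point algorithm (3 correspondences per sample), same inlier ratio.
n_three_point = ransac_num_hypotheses(P, inlier_ratio=0.5, sample_size=3)
```

The smaller minimal sample of the Three-Point algorithm needs fewer hypotheses for the same confidence, which is one reason absolute pose estimation is cheap once 3D points are available.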
We require that a reconstruction consists of at least three images (a view triple) and 20 common triangulated points. In general, four different cases can occur when a new image is processed:

1. The view can be robustly registered with exactly one already present reconstruction.
2. The image can be aligned with multiple reconstructions.
3. The current view cannot be (robustly) aligned with an existing reconstruction, but forms a good view triple with two other already present views.
4. The geometric relation of this image with any of the known views cannot be established, and structure and motion determination is postponed until a new suitable view is inserted.

In the first case, the position of the current view can be computed immediately by robust absolute pose estimation, since 2D-to-3D point correspondences are known. Thereafter, the camera parameters are optimized by iterative refinement and new correspondences are triangulated to 3D points.

In the second case, where the current image is part of two or more different reconstructions, the reconstructions are merged. We robustly determine a 3D-to-3D similarity transform that registers the two sets of corresponding 3D points. This process is then followed by Euclidean bundle adjustment, where the sparse geometry and the camera parameters are optimized. After updating the reconstructions in these first two cases, the epipolar neighbors of the newly inserted view are traversed and checked to see whether the updated 3D structure now allows their poses to be determined.

In the third scenario, the neighbors of the current image are estimated as described previously and a new reconstruction is initialized from a well-conditioned view triple. The view triple should provide a good triangulation angle and at
the same time have many correspondences. Therefore, in the first step we identify the view pair that minimizes

$$\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sin^2(\alpha_i)} \qquad (6)$$

where $\alpha_i$ is the angle between the two camera rays for the 3D point $X_i$. Thereafter, a third view that minimizes the value in Eq. 6 with respect to this first configuration is selected. Note that $1/\sin(\alpha_i)$ approximates the uncertainty (deviation) of $X_i$ in the depth direction. In [3] the view pair with maximal mean roundness (essentially the same as $\frac{1}{N} \sum \sin(\alpha_i)$) is taken. Such an approach does not consider the number of correspondences between two views. In this work we assume that the accuracy of further (least-squares) computations depending on the initial structure scales with $1/\sqrt{N}$. Consequently, Eq. 6 estimates the mean variance of the initial structure for a given view pair.

The relative pose between the first two views is computed by the Five-Point algorithm, and the third camera is inserted by the Three-Point algorithm with respect to the triangulated 3D points. Thereafter, bundle adjustment is used to globally optimize the exterior camera orientations and the initial structure. A view triple is considered a valid reconstruction if the sum of triangulation angles is above a threshold: we require at least twenty 3D points with measurements in all views and a triangulation angle greater than 2°.

Figure 3. Dense reconstruction from a collection of 105 facade images.

Whenever M views (15 in our case) have been added to a reconstruction, bundle adjustment is run, optimizing all cameras and triangulated 3D points. Our bundle adjustment implementation is similar to the one described in [8]. Thereafter, the reprojection error is computed for each image measurement; 3D points with an average reprojection error larger than 1.3 pixels and a triangulation angle less than 2° are removed. Our experiments suggest that this strategy improves both the accuracy and the robustness of the reconstruction algorithm.
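The view-pair criterion of Eq. 6 can be illustrated with a short NumPy sketch. It assumes the criterion is the mean of 1/sin²(α_i) over the N points triangulated from a candidate pair, and that camera centers and 3D points are given in a common coordinate frame; the function names and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np


def triangulation_angles(center_a, center_b, points):
    """Angle alpha_i (in radians) between the two camera rays through each 3D point."""
    rays_a = points - center_a
    rays_b = points - center_b
    cos_alpha = np.sum(rays_a * rays_b, axis=1) / (
        np.linalg.norm(rays_a, axis=1) * np.linalg.norm(rays_b, axis=1)
    )
    return np.arccos(np.clip(cos_alpha, -1.0, 1.0))


def pair_score(center_a, center_b, points):
    """Assumed form of Eq. 6: mean of 1 / sin^2(alpha_i). Lower is better."""
    alphas = triangulation_angles(center_a, center_b, points)
    with np.errstate(divide="ignore"):
        # A zero angle (degenerate ray pair) yields an infinite score.
        return float(np.mean(1.0 / np.sin(alphas) ** 2))


# Toy scene: points roughly 5 units in front of the cameras.
points = np.array([[0.0, 0.0, 5.0], [0.5, 0.2, 5.5], [-0.4, 0.1, 4.8]])
cam_0 = np.array([0.0, 0.0, 0.0])
cam_near = np.array([0.1, 0.0, 0.0])   # short baseline, small angles
cam_far = np.array([1.0, 0.0, 0.0])    # wider baseline, larger angles

score_near = pair_score(cam_0, cam_near, points)
score_far = pair_score(cam_0, cam_far, points)
```

In this toy setup the wider baseline gives the lower score and would therefore be preferred as the initial pair.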
A sparse reconstruction result is shown in Figure 2.

4. Dense Reconstruction

Obtaining the initial 3D structure and motion is an essential aspect of image-based modeling, but dense geometry is required for a faithful virtual representation of the captured scene. Since we face a large number of images with potentially associated depth maps, we focus on simple but fast dense depth estimation procedures. Additionally, multiple sensor images must be processed with respect to a key view (in contrast to traditional two-frame stereo). Plane-sweep approaches to dense stereo [24, 5] enable an efficient, GPU-accelerated procedure to create the depth maps. In order to obtain reasonable depth values in homogeneous regions, we employ GPU-based scanline optimization in our framework [26]. Note that dense geometry generation is currently still performed offline for individual large and connected view networks.

In the case of general view networks, a suitable selection of the sensor views used for depth estimation is necessary. We use the following simple but effective heuristic to select appropriate sensor images, which is based on an estimate of image overlap and viewing directions: for a particular key view and a potential sensor view, we determine the correspondences between these views and compute the convex hull of the respective 2D measurements in the key view. The area of this convex hull, $A_H$, in relation to the total key image area $A$ gives an estimate of the relevant image overlap. Note that $A$ does not necessarily denote the whole key image size, since insignificant image portions such as sky regions can be excluded. Sensor view candidates with an overlap $A_H/A$ smaller than a given threshold (typically set to 0.3) are discarded.

Image overlap is only a partial guideline for view selection. The angle between corresponding camera rays is another suitable criterion.
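The overlap heuristic based on the convex hull area can be sketched in plain Python as follows. This is a minimal illustration, assuming the matched 2D measurements in the key view are available as (x, y) pixel coordinates; the hull is computed with Andrew's monotone chain algorithm, which is a choice made here and not necessarily what the authors use, and all function names are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0 for fewer than three vertices."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def overlap_ratio(key_view_points, key_image_area):
    """A_H / A: hull area of the matched measurements over the relevant key image area."""
    return polygon_area(convex_hull(key_view_points)) / key_image_area


def keep_sensor_view(key_view_points, key_image_area, min_overlap=0.3):
    """Discard sensor views whose overlap falls below the threshold (0.3 in the text)."""
    return overlap_ratio(key_view_points, key_image_area) >= min_overlap


# Toy example: matches cover a 100 x 50 pixel region of the key view.
matches = [(0, 0), (100, 0), (100, 50), (0, 50), (50, 25)]
ratio_large_image = overlap_ratio(matches, key_image_area=200 * 100)
```

Passing a reduced `key_image_area` is one way to account for excluded regions such as sky, as the text suggests for $A$.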
Very small angles give rise to large depth uncertainties, whereas larger angles are susceptible to image distortion and occlusions. From a practical point of view, triangulation angles of about $\alpha_0 = 6^\circ$ are favorable for dense stereo (meaning a distance to baseline