
Graz University of Technology
Institute for Computer Graphics and Vision
Head: Prof. Dr. Franz Leberl

Dissertation

Visual Localization within a World composed of Planes

Friedrich Fraundorfer

Graz, May 2006

Thesis supervisor and first reviewer:
Prof. Dr. Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology

Second reviewer:
Prof. Dr. David Nister
Center for Visualization and Virtual Environments, University of Kentucky


The prize is the pleasure of finding the thing out, the kick in the discovery, the observation that other people use it [...]

Richard P. Feynman (when asked about the honors of the Nobel Prize in 1967)


Abstract

Visual map building and localization for mobile robots is a widespread field of research. Research done so far has already produced a vast variety of approaches, yet key questions are still open. In this work we present novel approaches focusing on visual map building and global localization. First, we propose a piece-wise planar world representation which uses small planar patches as landmarks. The new world representation is designed to ease the landmark correspondence problem. The map is augmented with the original appearances of the landmarks and with invariant descriptors, combining geometry- and appearance-based features in a local approach. For building the piece-wise planar map we make use of recent advances in wide-baseline stereo matching based on local detectors. The current state of the art in local detectors is reviewed in this work, and a new method to evaluate the performance of the different detectors is proposed. Based on the evaluation results, new methods for wide-baseline region matching and piece-wise planar scene reconstruction are presented. A map building algorithm is presented which creates a piece-wise planar world map consisting of a set of linked metric sub-maps. Second, a novel algorithm for global localization from a single small landmark match is presented. The method produces an accurate 6 DOF pose estimate, benefiting from the piece-wise planar world representation. Accurate pose estimation from a single small landmark makes the localization very robust, even against large occlusions. Map building and localization are experimentally evaluated on two indoor scenarios and prove to be competitive with other state-of-the-art approaches. In fact, the localization accuracy is comparable to recent approaches, although it is computed from a single landmark match only. The experimental results demonstrate the benefits and strengths of our novel approach.


Kurzfassung

The localization of mobile robots and automatic map building by means of optical systems is a broad field of research. Research so far has resulted in a large variety of different approaches, but still leaves fundamental questions open. In this thesis we present new methods for map building and localization. First, we propose a piece-wise planar world representation in which small planar segments form the landmarks. The new world representation is tailored to easing the landmark correspondence problem. The map contains the original images of the landmarks as well as an invariant description, with the result that geometric and appearance-based features are combined in a local approach. To build the piece-wise planar map, we exploit the recent advances in wide-baseline stereo matching with local detectors. The current state of the art in local detectors is reviewed in this thesis, and a new method for evaluating the different approaches is presented. Based on the evaluation results, new methods for wide-baseline stereo matching and for piece-wise planar scene reconstruction are proposed. A map building algorithm is presented that generates a piece-wise planar world map consisting of a set of linked metric sub-maps. Second, a new algorithm for global localization from a single small landmark is presented. The method computes a pose (with full 6 degrees of freedom) by exploiting the piece-wise planar world representation. Accurate pose estimation from a single small landmark makes the localization very robust, even against large occlusions. Map building and localization are evaluated experimentally on two scenes. The results of map building and localization prove comparable to the current state of the art. In fact, the localization accuracy matches the current state of the art even though it is computed from a single small landmark only. The experiments further demonstrate convincingly the advantages and strengths of our new approach.


Contents

1 Introduction to mobile robotics and vision
  1.1 Localization and map building in mobile robotics
  1.2 Why vision?
  1.3 What has already been achieved?
  1.4 Why is it hard?
  1.5 How can it get solved?
  1.6 Contribution of this thesis
  1.7 Structure of the thesis

2 Visual localization
  2.1 Localization in metric maps
  2.2 Localization from point features
  2.3 Localization from line features
  2.4 Localization from plane features
  2.5 Summary

3 Local detectors
  3.1 Interest point detectors
    3.1.1 Harris detector
    3.1.2 Hessian detector
  3.2 Scale invariant detectors
    3.2.1 Scale-invariant Harris detector
    3.2.2 Scale-invariant Hessian detector
    3.2.3 Difference of Gaussian detector (DOG)
    3.2.4 Salient region detector
    3.2.5 Normalization
  3.3 Affine invariant detectors
    3.3.1 Affine-invariant Harris detector
    3.3.2 Affine-invariant Hessian detector
    3.3.3 Maximally stable region detector (MSER)
    3.3.4 Affine-invariant salient region detector
    3.3.5 Intensity extrema-based region detector (IBR)
    3.3.6 Edge based region detector (EBR)
    3.3.7 Normalization
  3.4 Comparison of the described methods

4 Evaluation on non-planar scenes
  4.1 Measures
    4.1.1 Repeatability score
    4.1.2 Matching score
    4.1.3 Complementary score
  4.2 Representation of the detections
  4.3 Detection correspondence
    4.3.1 Transferring an elliptic region
    4.3.2 Calculating the overlap area from the point set representation
    4.3.3 Justification of the approximation
  4.4 Point transfer using the trifocal tensor
  4.5 Ground truth generation
    4.5.1 Trifocal tensor
    4.5.2 Dense matching
    4.5.3 Ground truth quality
  4.6 Experimental evaluation
    4.6.1 Repeatability and matching score
    4.6.2 Combining local detectors

5 Maximally Stable Corner Clusters (MSCC's)
  5.1 The MSCC detector
    5.1.1 Interest point detection
    5.1.2 Multi scale clustering
    5.1.3 Selection of stable clusters
  5.2 Region representation
  5.3 Computational complexity
  5.4 Parameters
  5.5 Detection examples
  5.6 Detector evaluation: Repeatability and matching score
    5.6.1 Evaluation of the "Doors" scene
    5.6.2 Evaluation of the "Group" and "Room" scene
  5.7 Combining MSCC with other local detectors

6 Wide-baseline methods
  6.1 Wide-baseline region matching
    6.1.1 Matching and registration
  6.2 Piece-wise planar scene reconstruction
    6.2.1 Reconstruction using homographies
    6.2.2 Piece-wise planar reconstruction
    6.2.3 Experimental evaluation
    6.2.4 Real Images

7 Living in a piecewise planar world
  7.1 Map building
    7.1.1 Sub-map identification
    7.1.2 Sub-map creation
    7.1.3 Structure computation
    7.1.4 Landmark extraction
    7.1.5 Sub-map linking
  7.2 Localization
    7.2.1 Localization from a single landmark
    7.2.2 The local plane score
    7.2.3 Algorithms

8 Map building and localization experiments
  8.1 Experimental setup
    8.1.1 ActivMedia PeopleBot
    8.1.2 Laser range finder
    8.1.3 Camera setup
  8.2 Map building experiments
    8.2.1 Office environment
    8.2.2 Hallway environment
  8.3 Localization experiments
    8.3.1 Localization accuracy
    8.3.2 Path reconstruction
    8.3.3 Evaluation of the sub-sampling scheme
  8.4 Summary

9 Conclusion
  9.1 Future work

A Projective transformation of ellipses
  A.1 Projective ellipse transfer
  A.2 Affine approximation of ellipse transfer

B The trifocal tensor and point transfer
  B.1 The trifocal tensor
  B.2 Point transfer

Bibliography


Chapter 1

Introduction to mobile robotics and vision

Computer vision is a fascinating and challenging scientific area. Vision science is clearly motivated by our own ability to see. Our visual system allows us to complete tasks with ease which would be difficult or almost impossible without our eyes. Our eyes allow us to identify friends and colleagues, recognize objects we have seen before, categorize objects (even unseen ones), estimate properties of objects such as size and shape, estimate distances to objects, and let us know where we are. Generally speaking, we get an idea of the world around us. For computer vision researchers, the challenge is to develop computer systems capable of achieving these tasks. In other words, it is the challenge to build computers that see. Although the satisfaction of building such computer systems is by itself motivation enough for most researchers, the applications are manifold. While vision capabilities already play a big role for immovable computer systems, e.g. access systems or surveillance systems, they may play an even bigger role for mobile systems. A lot of research in mobile robotics has already been done and much has been achieved. Nowadays mobile robots have entered domestic areas and operate outside laboratory environments, for example the museum tour guide robots Rhino [14] and Minerva [104]. Mobile robots have already been sent to other planets, like the NASA Mars rover "Sojourner" in 1997 and its successors "Spirit" and "Opportunity" in 2004. However, state-of-the-art systems still lack autonomy: they can only be used in very constrained environments or have to be supervised by human operators. Therefore current research is focusing on building more autonomous systems, on building cognitive systems (see the EU project CoSy, http://www.cognitivesystems.org). A very important necessity for a mobile robot is the capability of knowing where it is, that is, answering the "Where am I?" question. Localizing itself in the environment is essential for navigation; it is necessary to compute a path to the target destination or even to recognize that the target destination has already been reached. In 1991 Cox [20] stated that "using sensory information to locate the robot in its environment is the most fundamental problem to providing a mobile robot with autonomous capabilities". Current systems like Minerva [104] rely on laser range finders and sonar sensors to localize themselves. However, a vision sensor provides the most general world representation, and research has already been done on using vision sensors for mobile robot localization. First attempts date back to Moravec [80] in 1980, but progress has not been as fast as anticipated after the first astonishing results. Key problems remain open and leave much research still to be done in visual robot localization.



1.1 Localization and map building in mobile robotics

Closely connected to navigation and localization is the issue of map building. Maps play a key role in mobile robotics: they are needed if a mobile robot wants to localize itself or wants to plan a route to a certain position. Based on the role of the world map, robot navigation can be split into three broad groups [23]:

Map-less navigation: This first category of systems does not use a map at all. There is no world representation and no map is ever created during operation. In such a case the robot only has a limited view of the world, given by its current sensor readings. This is, however, enough to allow collision detection. If the application is solely to roam around and detect a searched object, a map is not necessary, but it is then not possible for the robot to return to where it started. Furthermore, the fact that no global localization is possible restricts the set of possible applications. But localization is not a necessity for every application; the currently very popular robotic vacuum cleaners (e.g. iRobot's Roomba, http://www.irobot.com) are very successful despite the lack of localization capabilities.

Map-based navigation: Map-based systems depend on a user-provided world representation. The map must be available in a form usable by a mobile robot. Maps can be 2D floor plans or complete 3D CAD models of the environment. The robot must have the ability to sense its current environment and compare it to the given map. In particular, the map must contain landmarks which can be detected and identified by the sensors of the mobile robot. In visual localization such landmarks may be characteristic edges, corners, or even artificial markers with a characteristic texture which are easy to identify. However, creating such a map is very tedious. Furthermore, once a map has been created, changes in the environment have to be maintained and the map has to be updated frequently. The map building requirement therefore complicates the deployment of robotic systems into domestic areas: the mobile robot cannot simply be moved to a new environment and switched on. Thus map-based systems will be restricted to some special applications only.

Map-building based navigation: Map-building based systems are closely related to map-based systems. They also use a map to localize themselves, but in addition they have the ability to build the necessary map with their own sensors. This is, however, a difficult task. Map building and localization have to be done simultaneously, which is called SLAM (Simultaneous Localization and Map building). The robot must possess the ability to update its world representation, adding new structure and new details. In addition it must be possible to use the current, sometimes incomplete map to localize itself therein. In such a framework the mobile robot frequently has to deal with measurement uncertainties. Map features which are found to be wrong after some time have to be removed or updated. However, the SLAM approach provides the greatest versatility for mobile robots. Such a robot could be switched on in a previously unknown environment; it would start roaming around and building a map of its environment. After the map has been completed the robot is fully operational and can do all the navigation and path planning. In addition it can sense changes in the environment and adapt the map accordingly. A toy illustration of such a map-building and localization loop is sketched below.
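To make the loop concrete, the following deliberately simplified, runnable Python sketch illustrates the cycle of predicting from odometry, correcting with known landmarks, and inserting new ones. It is illustrative only and not the method of this thesis: the robot lives on a line, data association is given for free via landmark ids, and the correction is a crude averaging filter.

import random

# Toy map-building-and-localization loop (illustrative only, not this thesis's method).
random.seed(0)
true_landmarks = {0: 2.0, 1: 5.0, 2: 9.0}      # true 1-D landmark positions (made up)
true_pose, est_pose, est_map = 0.0, 0.0, {}

def sense(pose):
    """Noisy relative measurements: landmark position minus robot position."""
    return {i: (p - pose) + random.gauss(0, 0.05) for i, p in true_landmarks.items()}

for step in range(20):
    # 1. Move and predict the new pose from noisy odometry (dead reckoning).
    true_pose += 0.5
    est_pose += 0.5 + random.gauss(0, 0.02)

    # 2. Observe landmarks; those already in the map can correct the pose.
    obs = sense(true_pose)
    known = {i: z for i, z in obs.items() if i in est_map}
    if known:
        implied = [est_map[i] - z for i, z in known.items()]
        est_pose = 0.5 * est_pose + 0.5 * sum(implied) / len(implied)

    # 3. Insert newly observed landmarks relative to the corrected pose.
    for i, z in obs.items():
        est_map.setdefault(i, est_pose + z)

print("pose estimate %.2f (true %.2f)" % (est_pose, true_pose))
print("map estimate", {i: round(p, 2) for i, p in est_map.items()})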

In the following we will focus on the map-building based approach. Previous research on map-building based robot navigation has generated a lot of different approaches to world representation. A possible classification of the different paradigms, following [65], is: topological, metric, and appearance based.

Topological maps: Topological maps represent the environment as connected graphs. The nodes of the graph are possible places, e.g. rooms. Connected nodes therefore represent places which are located close to each other and are reachable for the robot. Navigation and path planning in such a map can be difficult: the robot only gets the information which places need to be traversed to get to its goal, but in the absence of metric information no direction or distance information can be given.

Metric maps: In a metric map the individual map elements are spatially organized, that is, the position of a map element (landmark) is known in a common world coordinate frame. Metric maps can differ widely in the landmarks they use. One possible metric world representation partitions the known world by a grid into discrete cells [81]. For each cell it is stored whether the position is occupied by an object (e.g. wall, table, etc.) or whether it is free. Such a map is often called an occupancy grid, and basically represents the 2D floor plan of an environment. As the size of each grid cell is known, metric information is available and allows distance computations and metric path planning (a minimal occupancy-grid sketch is given at the end of this classification). Another possibility is to represent the world by geometric features which are positioned in 3D [96]. Such features can be 3D points, 3D lines, etc. Localization in such a world can be done by triangulation. The main difficulty, however, is to find the correspondences between the features in the map and the features detected in the current sensor readings.

Appearance-based maps: In such an approach the world is represented by raw sensor data. The map is simply a collection of all previously acquired sensor readings [51]. Guided navigation and localization are difficult. The main problem, however, is the scalability of the approach: simply storing all the sensor readings is very memory consuming and poses a big problem for large scale maps.

In contrast to this classification, combinations of the different approaches have also been proposed in the literature. In [103] metric grid maps are connected by a topological approach on top to generate the world representation.
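As a concrete illustration of the occupancy-grid idea from the classification above, the following minimal Python sketch (cell size, room dimensions and the wall segment are assumed values, not taken from [81]) stores one occupancy flag per cell; because the cell size is known, metric queries come for free.

import numpy as np

CELL_SIZE = 0.1  # metres per grid cell (assumed value)

class OccupancyGrid:
    """Minimal occupancy grid: one boolean flag per cell of a discretized 2D floor plan."""
    def __init__(self, width_m, height_m):
        rows, cols = int(round(height_m / CELL_SIZE)), int(round(width_m / CELL_SIZE))
        self.grid = np.zeros((rows, cols), dtype=bool)

    def world_to_cell(self, x, y):
        # Convert metric world coordinates to integer grid indices.
        return int(round(y / CELL_SIZE)), int(round(x / CELL_SIZE))

    def mark_occupied(self, x, y):
        self.grid[self.world_to_cell(x, y)] = True

    def is_free(self, x, y):
        return not self.grid[self.world_to_cell(x, y)]

# Example: a 10 m x 10 m room with a 5 m wall segment along y = 2 m.
grid = OccupancyGrid(10.0, 10.0)
for x in np.arange(0.0, 5.0, CELL_SIZE):
    grid.mark_occupied(float(x), 2.0)
print(grid.is_free(1.0, 1.0), grid.is_free(1.0, 2.0))   # True False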

1.2 Why vision?

Most of the maps described in the previous section do not depend on a specific kind of sensor. In fact, research is done with a variety of different sensors. The prominent sensors for robot localization are wheel encoders (odometry), inertial sensors, sonar, infrared, laser range finders and of course vision sensors. Each sensor type has different advantages and disadvantages; a list may be found in [65]. Wheel encoders and inertial sensors provide direct information about the path of the robot. Sonar, infrared sensors and laser range finders are ranging devices: they provide the robot with more or less (depending on the type of sensor) accurate distances to objects in its vicinity, but they only provide distance information. Compared to these sensors, a vision sensor seems to be the most powerful one. A vision sensor can provide odometry information as described in [84]. It can also act as a ranging device, either in a stereo setup (demonstrated in [80]) or with a structure-from-motion approach [106]. In addition, a vision sensor allows recording the appearance of the world surrounding the robot, so the visual appearance of landmarks can be associated with range information. A vision sensor would probably give the most general world representation. In fact, certain tasks require the use of vision sensors. Imagine a mobile robot with the task of finding a certain object, let's say a coffee cup, for its user. Detecting the coffee cup can certainly not be achieved with ranging devices alone. Although one could think of detecting a cup by its 3D shape with a laser range finder, this method cannot distinguish between similar cups differing only in color. Such a task requires a vision sensor, and since vision is then already on board it is tempting to use it for navigation and localization too.

1.3 What has already been achieved?

The use of vision sensors for mobile robot localization has not yet reached as elaborate a state as the use of laser range finders. Mobile robots equipped with laser range finders already navigate safely in unknown and crowded environments [104] and are able to build large and accurate maps [105]. But let us discuss what has been achieved using visual sensors in odometry, localization and map building.

In the absence of a map, or within a featureless environment, visual odometry can be used to compute the path a robot has travelled, and thus its actual position can be derived from that path. For visual odometry, point features are tracked from frame to frame and the robot's movement for each frame is computed with a structure-from-motion approach. The estimation has to be very accurate, because the final position is computed incrementally from all the small movements, and even small inaccuracies may result in large deviations. The capabilities of the current state of the art in visual odometry were shown impressively by NASA's Mars Exploration Rovers "Spirit" and "Opportunity" [86]. The slippery surface did not allow for accurate wheel odometry and laser range finders could not be used in the open outdoor environment. However, fully autonomous vision-based navigation was still not possible, and in the end the rovers were controlled by human operators to compensate for errors of the visual localization system.
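The incremental nature of visual odometry, and why even small per-frame errors matter, can be made explicit with a short sketch. This is illustrative only and not code from any of the cited systems: a planar 3-DOF pose stands in for the full 6-DOF case, and the noise levels are invented.

import numpy as np

def se2(dx, dy, dtheta):
    """Homogeneous 2-D rigid transform (a 3-DOF stand-in for the full 6-DOF pose)."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
pose_true, pose_est = np.eye(3), np.eye(3)
for _ in range(200):                          # 200 frames of 5 cm forward motion
    pose_true = pose_true @ se2(0.05, 0.0, 0.0)
    # Each frame-to-frame estimate carries a tiny error in translation and heading ...
    pose_est = pose_est @ se2(0.05 + rng.normal(0, 0.001), 0.0, rng.normal(0, 0.002))

# ... and because the pose is the product of all increments, the errors accumulate.
drift = np.linalg.norm(pose_est[:2, 2] - pose_true[:2, 2])
print("travelled %.1f m, accumulated position drift %.3f m" % (200 * 0.05, drift))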

The current state of the art in visual localization is defined by vSlam [56] and the method described in [96]. Both systems are SLAM approaches based on SIFT landmarks [67] and show very similar performance. They allow map building in indoor environments ranging from a single room up to small flats. The robot explores the environment autonomously and creates a map of 3D point landmarks. After map creation is finished, the robot can perform global localization and path planning tasks. The achieved localization accuracy is about 10-15 cm on average. For vSlam the robot has to be equipped with a single camera only; the 3D reconstruction of the landmarks is done with a structure-from-motion approach. The other system uses a stereo setup on the robot for 3D reconstruction. A main limitation of these systems is the size of the maintained environment map: for environments bigger than room size, the map becomes too big to be handled in real time.

The last example deals with the automatic map building of large scale and outdoor environments. The system proposed in [10] is capable of accurately mapping a path several kilometers long. The large scale map is composed of connected metric sub-maps which contain 3D line features. The system allows loop closing by matching the 3D lines from the current reconstruction to the 3D lines of the sub-maps, and a global optimization ensures the high accuracy of the map. However, the map features are purely geometric, so the system will have difficulties in buildings with highly similar structures. Moreover, the method does not allow global localization in general. The map in the presented form cannot be used for robot navigation and localization.



These three examples show that visual methods are basically capable of performing odometry, SLAM in constrained environments, and large scale map building. More details about the potential of the state-of-the-art methods in visual localization are given in Chapter 2.

1.4 Why is it hard?

The previous section showed the current state of the art in visual localization and gave an idea of what is still missing. But what are the difficulties in visual localization that have so far prevented researchers from coming up with reliable and versatile solutions? Difficulties arise as technical issues and as conceptual issues.

High complexity of image processing algorithms: A major technical difficulty is that mobile robots require real-time processing of the sensor data, that is, real-time image processing. Current computer systems which can be used on mobile robots have very limited computational power. This means that the most advanced methods in computer vision simply cannot be used on a mobile robot. Very often sub-optimal algorithms are used which are known not to produce the best results but which are computable on mobile robots. Although the processing speed of computers is increasing fast, the need for real-time processing is a major constraint, and a lot of achievements in computer vision are therefore simply not used in mobile robotics.

Changing environment: Other technical difficulties lie in the changing environment, for example illumination changes and background changes. Computer vision methods in particular are strongly affected by such changes, which are certainly one of the main reasons why visual localization has not yet reached the same level as localization with other sensors. Illumination and background changes, for instance, pose no problems for laser range finders or sonar sensors.

Scalability: Scalability also poses a big technical difficulty for mobile robots. Systems which work for small indoor environments (see [56]) will not work in larger or outdoor scenarios. Computation time and memory requirements depend on the size of the maintained environment map: real-time processing simply cannot be carried out for large maps and the robot may run out of memory. Scalability is a problem independent of the sensors used, but it is a much more critical issue for vision based systems. Digital cameras, especially those with high resolution sensors, produce an enormous amount of data, posing problems for real-time processing.

Uncertainties: A rather conceptual difficulty arises in the treatment of uncertainties. Uncertainties occur as simple measurement uncertainties, but they also occur in top level reasoning and navigation processes. Uncertainties in measurements and in geometric representations are already well understood, e.g. for elements like 3D points and 3D lines [96]. But uncertainties in navigational or task oriented decisions pose an enormous problem [23]; it is not well understood at which precision such uncertainties should be taken into account.

Cognitive abilities: The most challenging difficulty, however, is that for reliable autonomous navigation the robot must be aware of the meaning of objects and structures in its environment. The cognitive abilities necessary for tasks like automatic scene interpretation, fully autonomous navigation and interaction with people are still the subject of intense research. The current state of the art in computer vision allows very specific and constrained tasks to be solved, but has not yet reached a level where it could provide the desired autonomy for mobile robots.

The above list covers the most prominent difficulties for visually controlled mobile robots. However, having identified the difficulties and problems, research can be focused in the right directions.

1.5 How can it get solved?

Reconsidering the difficulties listed in the previous section, the question arises how they can be overcome, or, formulated differently, how we should proceed to build a vision based mobile robot. Some researchers believe that there will not be a single, general algorithm which suffices to guide a mobile robot [22]. In fact, there is some evidence that this is also not the case for biological navigation systems [90]. Instead, it seems that vision for animals and humans alike works as a collection of specialized behaviors developed over long periods of evolution. This has been stated very memorably by Ramachandran: "Vision is just a bag of tricks" [90]. Carried over to the domain of mobile robotics, this can be interpreted as the mobile robot having a collection of different methods, each very specialized to well defined cases. To solve an actual problem, only the method that works has to be selected. The individual methods would only be required to work in very well constrained situations, which would ease their development. Such an approach has already been proposed by Brooks [12]. The proposed robot control system consists of a collection of specialized behaviors organized in hierarchical layers. All the behaviors run concurrently and the overall robot action is determined by voting. The different behaviors do not share a common world representation; each method has its own specialized representation. Such a scheme is already very familiar from sensor fusion, where measurements from different modalities are combined to achieve higher accuracy and robustness. It is a straightforward step to apply this scheme at the level of localization methods.

The scheme sounds very promising. The deficiencies of current vision based solutions can be analyzed and specialized methods can be developed to overcome them in a decoupled way. With such an approach it will be possible to produce a very versatile visual localization. Leading-edge visual localization like vSlam might already cope with 95% of all encountered situations; in the remaining 5% of cases, human assistance is necessary to resolve problems. For non-stop operation of a mobile robot, even these 5% of problematic cases are too much, and they will require a whole set of specialized methods. Doing computer vision research with the goal of completing the "bag of tricks" therefore seems a very promising way forward. It will provide robot engineers with specialized methods for many difficult cases, and research to identify special cases and provide proper solutions will be very valuable. The fusion of all the tricks, however, will be the responsibility of AI researchers and possibly carried out as described in [12].

1.6 Contribution of this thesis

In the spirit of the approach described in the previous section, this thesis does not deal with the development of a complete visual SLAM method but focuses on the development of some primary key technologies. The main focus of this thesis is on global localization in indoor environments. Global localization applies when a robot operates in a known environment, that is, the map has already been built. Global localization is then the computation of the pose of the robot, consisting of position and heading, from the actual sensor reading, i.e. the camera image, without using previous pose information. Global localization is needed quite frequently, namely in the following situations:

Switching the robot on: After switching on a mobile robot, its position is not known. However, the robot may already possess a complete environment map, e.g. from a previous run. But before it can start useful operations (like navigation and path planning) its position has to be determined first. Global localization allows the pose of the mobile robot to be computed from the actual camera image.

Kidnapped robot problem: The kidnapped robot problem has been stated by Engelson and McDermott [27]. In the kidnapped robot problem a well-localized robot is teleported (or simply moved with switched-off sensors) to some other location. The problem with this scenario is that the robot still believes it is at the location from which it has been kidnapped. Path planning and navigation based on such an assumption will not work: the environment predicted from the map and its assumed position will not match. The kidnapped robot problem is basically a test of the ability of a robot to recover from a catastrophic localization failure. Global localization is the solution to the kidnapped robot problem; with the ability of global localization the robot can immediately determine that it has been moved from outside.

Recover from failure: Recovery from a failure is also possible with global localization. Consider the case when a mobile robot moves into an area without landmarks. Abruptly, the current sensor readings do not contain any landmarks. For vision based systems this can easily occur when the robot moves into an untextured area, e.g. a part of a room containing only white walls. In such a case the robot would lose track. The robot would then move around randomly to get back into an area where landmarks can be detected. However, when it finally enters an area where landmarks reappear, its global position has been lost and global localization is necessary. It would be possible to rely on the robot's wheel odometry in the landmark-less area, but keeping track with odometry only would introduce too large deviations in the pose estimate and global localization would still be necessary.

Loop closing: Loop closing is the ability of a mobile robot to recognize previously visited areas. It is important in the stage of map building. During map building a mobile robot traverses the environment and adds new structure and landmarks to the world map. If the robot enters an already mapped area and does not recognize this, the same features will be added twice, usually not at the same position because of small drift errors. Global localization can be used to notice that the current location has already been visited, that is, that the loop has been closed. This provides the information that the landmarks may already be in the map and that an update is appropriate rather than a simple insertion.

Homing: The last example of a situation requiring global localization is homing. Homing is the task of a mobile robot having to go back to some start position. The start position may for instance mark an automatic charging device, and the robot will go there to recharge its batteries. Global localization can tell the robot when its target position is reached.



The vision based global localization proposed in the following will be based on wide-baseline stereo methods and will work with a fully 3D piece-wise planar world representation. A new approach which allows global localization from a single landmark will be presented in this work. Furthermore, the methods to build a piece-wise planar world representation from an image sequence acquired by a mobile robot will be described. Localization and map building are implemented for an ActivMedia PeopleBot (http://www.activrobots.com/robots/peoplebot.html, see Figure 1.1). The robot is equipped with a single camera and a wide-angle lens, and localization and map building are done solely with this camera setup. The robot is also equipped with a laser range finder, infrared and sonar sensors; these additional sensors are used to obtain ground truth for the localization experiments.

Figure 1.1: (a) The mobile robot used for localization and map building (ActivMedia PeopleBot). It is equipped with a single camera. (b) A closeup of the camera setup.

The following topics are the main contributions of this thesis:

Performance evaluation of local detectors: Local detectors are a key ingredient for solving the correspondence problem in robot localization. There already exists a variety of different methods with different properties and performance, but it is not clear which of the proposed methods is best suited for visual localization. So far the best source of information is the comparison by Mikolajczyk et al. [76], which reveals the differences and properties of the various methods. However, that comparison evaluates the detectors on simple planar test scenes, and the method is not applicable to the realistic, complex scenes that will be encountered in mobile robot experiments. One contribution of this thesis therefore is the development of a method to evaluate the different local detectors on realistic complex scenes. The resulting comparison shows a significant difference to the previous evaluation on the restricted test cases.

Maximally Stable Corner Clusters: Based on the new evaluation results we propose a new local detector, the so called Maximally Stable Corner Cluster (MSCC) detector. Interest regions are formed by clusters of simple corner points in images. The detection algorithm includes a stability criterion for robust detection. The evaluation of the new detector shows good repeatability. Comparison with other methods reveals that the new detector finds regions at image locations left out by the other methods; it is thus complementary to them. This complementarity is the key property of the new detector, as it allows an effective combination with current state-of-the-art methods.

3D piece-wise planar world map: Another key contribution of this thesis is the development of a new world representation. We propose a piece-wise planar world map, where each landmark is a small planar patch associated with a SIFT descriptor and the original appearance from the image. The world representation is a crucial element for localization, and we will show that our proposed map design enables new, successful localization methods. A batch method is proposed to automatically build the piece-wise planar map from an image sequence acquired by a mobile robot.

Global localization from a single landmark: Based on the newly developed map, a global localization algorithm is proposed which computes the pose of the robot by solving the 3D-2D correspondence problem. The main achievement of this algorithm is that full 3D pose estimation is possible from a single landmark match. Furthermore, a selection criterion, the lp-score, is introduced to select the best pose estimate from a set of hypotheses, which allows accurate pose estimation from an extremely small image region (an area of around 400 pixels). Thus the global localization can deal with a high level of occlusion, as is necessary for crowded environments. A minimal sketch of this kind of 3D-2D pose computation is given below.
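The sketch below illustrates only the generic 3D-2D step, with OpenCV's general-purpose solvePnP used as a stand-in and with made-up intrinsics and patch coordinates; the thesis's own single-landmark algorithm and the lp-score selection are the subject of Chapter 7.

import numpy as np
import cv2

# Pose from the 3D-2D correspondences of one small planar landmark (illustrative sketch).
K = np.array([[500.0, 0.0, 320.0],            # assumed camera intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Four corners of a 10 cm planar patch (the landmark), expressed in map coordinates.
patch_3d = np.array([[0.0, 0.0, 2.0],
                     [0.1, 0.0, 2.0],
                     [0.1, 0.1, 2.0],
                     [0.0, 0.1, 2.0]])

# Their image positions in the current view; here they are synthesized from a known
# ground-truth pose purely to keep the example self-contained.
rvec_true = np.array([0.0, 0.2, 0.0])
tvec_true = np.array([0.05, -0.02, 0.5])
patch_2d, _ = cv2.projectPoints(patch_3d, rvec_true, tvec_true, K, None)

# Solving the 3D-2D correspondence problem recovers the full 6 DOF camera pose.
ok, rvec, tvec = cv2.solvePnP(patch_3d, patch_2d, K, None)
print(ok, rvec.ravel(), tvec.ravel())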

1.7 Structure of the thesis

The next two chapters discuss the current state of the art in visual localization (Chapter 2) and local detectors for wide-baseline stereo (Chapter 3). Next, it will be discussed how wide-baseline methods were used and extended for mobile robot localization. For wide-baseline stereo [70], methods have been developed to allow stereo reconstruction for scenes which deviate largely from the normal stereo case, including large baselines, large projective distortions, scale change and rotation. The main achievement of wide-baseline methods is to solve the correspondence problem for wide-baseline cases, i.e. to compute point matches between the images. Having solved the initial correspondence problem, the epipolar geometry between the images can be estimated and 3D reconstruction can be performed with known standard methods [44]. The correspondence problem, however, is a key issue in mobile robot localization too. For robot localization it is necessary to detect correspondences between map landmarks and landmarks extracted from the current image. This is a difficult task, as there are big viewpoint changes while the robot moves around. The problems in mobile robotics are therefore very similar to wide-baseline stereo, and the application of wide-baseline methods to mobile robot localization represents the main focus of this thesis.

The key ingredients for solving the correspondence problem in wide-baseline stereo are local detectors [76], i.e. interest point and interest region detectors which allow repeated detection of the same locations in images taken from widely different viewpoints. Recent research in wide-baseline stereo has produced a variety of detectors with quite different properties. Available comparisons of the different methods [76] were made on very restricted test cases, not comparable to the scenarios which appear in mobile robot localization: the evaluation method in [76] only allows an evaluation of the detectors on scenes containing a single plane, whereas the scenes encountered by mobile robots (offices, hallways, etc.) usually contain complex and arbitrary structure. Hence, to choose the best detector, a new evaluation method (based on the trifocal tensor) has been developed which allows evaluation on realistic complex scenes (see Chapter 4). A comparison was performed with the currently available local detectors and a significant difference to the previous evaluation could be observed. Based on these evaluation results we propose a new local detector, the so called Maximally Stable Corner Cluster (MSCC) detector (see Chapter 5). MSCC regions are clusters of simple corner points and are detected in structured, textured image parts. The evaluation showed that the areas where MSCC's are detected are often left out by the other methods; the MSCC regions are thus complementary to the other detections, which allows an effective combination with other methods. MSCC regions are a valuable enrichment of the current pool of local detectors.
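To give a rough feel for the idea of grouping corners into regions, here is a schematic Python sketch on a synthetic image. It is illustrative only and deliberately incomplete: the actual MSCC detector, including its multi-scale clustering and the stability-based selection of clusters, is defined in Chapter 5, and the corner detector, clustering method and parameters below (DBSCAN with an assumed radius) are stand-ins.

import numpy as np
import cv2
from sklearn.cluster import DBSCAN

# Schematic only: nearby corner points are grouped into candidate regions.
# The real MSCC detector additionally clusters over multiple scales and keeps
# only clusters that remain stable across scales; that step is omitted here.
rng = np.random.default_rng(1)
img = np.zeros((240, 320), dtype=np.uint8)
img[60:140, 80:200] = rng.integers(0, 256, (80, 120), dtype=np.uint8)  # textured patch

corners = cv2.goodFeaturesToTrack(img, maxCorners=300, qualityLevel=0.01, minDistance=3)
pts = corners.reshape(-1, 2)

labels = DBSCAN(eps=15, min_samples=5).fit(pts).labels_     # spatial clustering of corners
for label in sorted(set(labels) - {-1}):
    cluster = pts[labels == label]
    print("cluster %d: %d corners around (%.0f, %.0f)"
          % (label, len(cluster), cluster[:, 0].mean(), cluster[:, 1].mean()))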

For localization and map building, specialized wide-baseline methods were developed. A wide-baseline region matcher, which solves the correspondence problem reliably and robustly, is described in Chapter 6. The key technique is to iteratively register planar regions detected in images from different viewpoints. The proposed method produces very accurate and highly reliable matches with a very small false-positive rate. For the map building algorithm, a method to reconstruct piece-wise planar scenes from wide-baseline images has been developed (also in Chapter 6). The method works by using inter-image homographies and produces a segmentation of an image into scene planes and a piece-wise planar 3D reconstruction of the scene.
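For reference, the two-view relation that underlies such homography-based reconstruction is the standard plane-induced homography (textbook material, stated here under one common sign convention, not the thesis's specific formulation from Chapter 6): for a scene plane satisfying $\mathbf{n}^{\top}\mathbf{X} = d$ in the first camera frame and a relative motion $(R, \mathbf{t})$ between the views,
\[
\mathbf{x}' \simeq H\,\mathbf{x}, \qquad
H = K' \left( R + \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d} \right) K^{-1},
\]
so that, conversely, an estimated inter-image homography together with the camera matrices constrains the plane parameters $(\mathbf{n}, d)$.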

This method is used in the map building algorithm described in Chapter 7. The landmarks in the proposed map are small planar patches associated with a SIFT descriptor and the original appearance from the image. The landmark planes are fully parameterized in 3D. The map is composed of connected metric sub-maps; the individual sub-maps are connected by rigid transformations into a common world coordinate frame. Map building is described as an automatic batch process. The algorithm gets as input an image sequence acquired from an arbitrary robot run. Map building is performed in three steps. In the first step, images for sub-map reconstruction are automatically identified. In the second step, the individual sub-maps are created by piece-wise planar scene reconstruction and landmark extraction. In the third step, the individual sub-maps are connected to form the complete world representation using wide-baseline region matching. The map created in this way can then be used for global localization from a single current view. The proposed global localization method works by computing the pose of the robot within a local sub-map; the global pose is then computed by transforming the position from the local to the global coordinate frame. The pose in a local sub-map is computed from 3D-2D point correspondences. First, landmark matches between the current view and the map are detected, representing 2D-2D matches. By using the 3D plane information, 3D parameters of the map landmarks can be computed, which yield the necessary 3D-2D point correspondences. Each single landmark yields a set of 3D-2D point correspondences, and the correspondences computed from a single landmark are enough for pose estimation. Experiments with the map building and localization algorithm are described in Chapter 8. Finally, a discussion and an outlook conclude the thesis in Chapter 9.


Chapter 2

Visual localization

This chapter discusses the current state-of-the-art in visual localization, or, as coined in the previous chapter, the current "bag of tricks". Different approaches to visual localization will be investigated and their strengths and weaknesses discussed. The focus, however, is on the methods which provide the basis for this thesis and directly influence the methods proposed here. The chapter closes with a summary and a comparison of the different methods in table form.

2.1 Localization in metric maps

The most complete positional representation for mobile robots is a full 6 DOF pose in a global world coordinate system. As this is difficult to achieve, many approaches were developed that allow robot applications without a complete pose description or without an explicit pose computation. Nevertheless, a localization method which produces a full 6 DOF pose would be highly favorable for all mobile robot applications.

One possibility to compute a 6 DOF pose is visual odometry. Visual features are tracked from frame to frame and the translation and rotation of the robot are updated with the movements between the frames. Successful visual odometry systems were implemented by Nister et al. [84] and Olson et al. [86]. Such approaches do not even need a world map. However, as with wheel odometry, they suffer from a fundamental problem: small errors in the computation of the inter-frame movements accumulate in the final pose, so such systems run into serious trouble when used over longer periods of time. Map building based approaches alleviate these problems and allow pose computation without knowledge of the previous position; we will therefore focus on them. However, not all map building based approaches compute a full 6 DOF position. Some approaches are limited to place recognition, where the map is partitioned into distinct places and localization returns the information at which place the robot currently is. Although very simple, this still allows navigation and path planning. Such an approach has been described by Lowe [67]. In [60] the method of Lowe has been extended with a Hidden Markov Model and localization results are presented for an indoor scenario. Another approach [35] describes place recognition which assumes the planarity of the landmarks to increase the reliability of landmark matching. One can understand place recognition as a pre-stage to full 6 DOF pose estimation: it is able to restrict the complex pose estimation to a smaller part of the entire map, thus gaining a speedup. Accurate navigation, as needed for service robots for instance, however requires full 6 DOF pose estimation.


Computing a full 6 DOF pose in a map building based approach requires a metric map, where each map feature is positioned in a global coordinate frame. Usable map features are points, lines and planes. Different features require different localization methods, and in the next sections the methods developed so far for localization in the different kinds of metric maps will be described.

2.2 Localization from point features

Most of the localization methods developed so far work with world maps containing point features. The map landmarks are 3D points associated with a feature vector to solve the correspondence problem. Localization from point features is already very well understood. A well known method is triangulation: angle measurements to three distinct landmarks allow the pose estimation. This approach has been used in the work of Davison and Murray [21]. Angular measurements were made with an active stereo head carrying two digital cameras. The stereo head can perform panning and turning movements where both cameras are moved together; in addition, each camera can rotate around a vertical axis to produce converging viewpoints. The mechanical resolution is very accurate and the stereo head delivers accurate odometry information to relate the camera position to the robot position. The angle measurements are performed by fixation. Fixation means directing the cameras to point directly at a landmark, i.e. the landmark gets located at the principal point of the image. For fixation, the left camera is first centered onto a landmark. Then the right image is searched along the corresponding epipolar line for a matching landmark using normalized cross-correlation, and the right camera is moved to the found match. The angle to the landmark can be computed from the angles of the cameras and the stereo head. Multiple landmarks are fixated in this way and the measured angles are used for triangulation. Although the angles are determined mechanically, the approach is quite accurate. The measurement accuracy depends on the geometry of the stereo head and the accuracy of the image matching. The two fixated image points may show a localization error of at most 1 pixel, which results in a possible angular error of 0.3°. With the 0.338 m baseline of the stereo setup this allows accurate depth measurements in a range from 0 to 2 m. The uncertainty of the depth estimate increases with the distance; for a distance of 5 m the expected uncertainty is almost 1 m. The angular measurements, however, remain very accurate even at large distances. The analysis of the authors also shows that the mechanical odometry of the head is more than accurate enough for this task. The maximal angular error of the head odometry is 0.005°, which is orders of magnitude lower than the 0.3° error introduced by the allowed 1 pixel error in image fixation. This fixation method is also used for the initial map construction. A possible landmark, detected by the Harris corner detector [40], is fixated to determine its angle and distance to the robot. The measurements are then used to compute the full 3D parameters of the point feature, and the resulting point is added to the map. However, this approach is rather slow as the fixation process requires mechanical movements of the cameras, separately for each of multiple landmarks.
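To make the geometry behind fixation-based depth measurement concrete, the following minimal NumPy sketch triangulates the depth of a fixated point from the signed pan angles of two verging cameras with a known baseline. It is not the implementation of Davison and Murray; the function name and the numeric values (a 0.338 m baseline and a 0.3° angular error) are only illustrative.

```python
import numpy as np

def depth_from_fixation(angle_left_deg, angle_right_deg, baseline_m):
    """Triangulate the depth of a fixated point from the signed pan angles of
    two verging cameras (left camera at -b/2, right camera at +b/2 on the
    x-axis, angles measured from each camera's forward direction).

    Geometry: tan(theta_left) - tan(theta_right) = baseline / depth.
    """
    t_l = np.tan(np.radians(angle_left_deg))
    t_r = np.tan(np.radians(angle_right_deg))
    return baseline_m / (t_l - t_r)

# Illustrative numbers only: a point roughly 1 m ahead of a 0.338 m baseline
# head, and the same point with a 0.3 degree fixation error on the left camera.
z = depth_from_fixation(9.6, -9.6, 0.338)
z_err = depth_from_fixation(9.9, -9.6, 0.338)
print(f"depth: {z:.3f} m, with 0.3 deg error: {z_err:.3f} m")
```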

A different approach has been presented by Karlsson et al. [56]. Localization uses computer vision methods and works by computing the robot pose from 3D-2D point correspondences. In the current view, interest points are detected and matched with the landmarks in the map by SIFT [96] feature matching, which establishes 3D-2D point correspondences. The pose is computed from the 3D-2D correspondences with the POSIT algorithm [24], an iterative method which requires at least 4 non-coplanar points. The pose estimated by the POSIT algorithm is then refined by non-linear minimization, and the full 6 DOF pose is recovered. Typically a pose is estimated from 10 to 40 3D-2D correspondences. The visual pose is then combined with measurements from wheel odometry within a probabilistic SLAM framework [78]; in fact, during frames with no detected visual landmarks, navigation continues based on wheel odometry alone. Map building is also vision based. The robot starts driving around in an initially unknown environment, building a world map. The 3D points in the map are associated with SIFT features and an original view of the landmark cropped from the original image. Each landmark can have a set of associated SIFT features, describing the landmark for various viewpoints. A 3D landmark is reconstructed from three images, taken in sequence at a spacing of 20 cm. Interest points are detected and matched between the three images using SIFT feature matching. With a structure-from-motion approach the 3D landmarks are reconstructed and the camera positions (robot positions) are computed. The landmarks are reconstructed in a local coordinate frame; by adding this position to the current position of the robot, the landmarks are transformed into the global coordinate system and this position is stored in the map database. Map building continues until the whole environment has been traversed and no new landmarks are found. The authors describe experiments for a 2-bedroom apartment: map building lasted 32 minutes and the robot created a map containing 82 landmarks. During operation, map updates are possible; updates of the landmark positions are maintained by a Kalman filter [54]. The average localization error measured in the experiments is about 20 cm to 25 cm, which is quite high. However, it should be stressed that rather simple methods are used in this approach to let the software run in real time on low-cost computers. The approach of Karlsson et al. is especially interesting as it is available as the commercial localization software vSlam (http://www.evolution.com/core/navigation/vslam.masn) for the robots sold by Evolution Robotics (http://www.evolution.com). vSlam achieves map building and navigation with a single low-cost camera. The most limiting factor of the approach, according to the authors, is the size of the landmark database: each landmark needs about 40 kB to 500 kB of memory, which restricts the method to small indoor environments. Another critical issue worth discussing is the reconstruction of the landmarks during map building. A landmark is reconstructed from three images at different positions. However, as the camera usually faces forward, the three views contain only translational forward motion, which imposes very bad conditions for 3D reconstruction. In fact, the reconstruction of a plane (e.g. a wall) will show depth estimation errors of about 10 cm in practice. Such uncertainties are, however, handled within the SLAM framework.
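The core computational step of this approach, pose from 3D-2D correspondences, can be sketched as follows. The sketch uses OpenCV's iterative solvePnP as a stand-in for the POSIT plus non-linear refinement pipeline described above; the intrinsic matrix, the landmark coordinates and the synthetic ground-truth pose are placeholders.

```python
import numpy as np
import cv2

def pose_from_3d_2d(points_3d, points_2d, K):
    """Recover the camera rotation R and translation t from 3D map landmarks
    and their 2D detections in the current view (full 6 DOF pose)."""
    ok, rvec, tvec = cv2.solvePnP(points_3d.astype(np.float64),
                                  points_2d.astype(np.float64),
                                  K, None)          # iterative PnP by default
    if not ok:
        raise RuntimeError("pose estimation failed")
    R, _ = cv2.Rodrigues(rvec)                      # rotation vector -> 3x3 matrix
    return R, tvec

# Toy example: project six hypothetical landmarks with a known pose, then
# recover that pose from the resulting 3D-2D correspondences.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pts3d = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.2], [0.0, 0.5, 1.8],
                  [0.4, 0.6, 2.5], [-0.3, 0.2, 2.1], [0.2, -0.4, 1.9]])
rvec_true = np.array([0.05, 0.10, 0.00])
tvec_true = np.array([0.10, -0.05, 0.30])
pts2d, _ = cv2.projectPoints(pts3d, rvec_true, tvec_true, K, None)
R, t = pose_from_3d_2d(pts3d, pts2d.reshape(-1, 2), K)
```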

A different approach has been presented by Se, Lowe and Little [96]. In their work they actually propose three different localization methods. The robot movement is, however, assumed to be restricted to a plane, so the pose estimate only contains 3 DOF; the map itself contains the full 3D coordinates of the landmarks. All three methods basically work by computing the pose from 3D-3D landmark matches. The robot is equipped with a trinocular stereo head (Triclops, http://www.ptgrey.com) which produces 3D coordinates for each landmark in the current view. The first localization approach is based on the Hough transform [48]. A discretized 3D Hough space representing the robot poses with three parameters (X, Z, θ) is constructed. Each landmark match votes for possible poses in the Hough space, and the maximum vote determines the parameters (X, Z, θ) of the robot pose. The second proposed method is a RANSAC scheme [28]. From two landmark matches the translation and rotation necessary for alignment, and thus the robot pose, can be computed. This is repeated for a number of randomly chosen landmark samples within a RANSAC scheme. For each sample the pose hypothesis is verified by checking how many landmark matches out of the complete set agree with the pose estimate. The landmarks supporting the pose estimate form the consensus set and are called inliers. Finally, a least-squares estimate of the pose is performed using all inlier landmarks of the pose hypothesis with the largest consensus set. The third method computes the pose by map alignment. It works by constructing a local sub-map from landmarks of multiple frames; this local sub-map is then aligned with a part of the world map. The local sub-map is created while the robot rotates a little, from −15° to 15°. The map alignment is implemented with the RANSAC scheme of the previous method. This method is to be preferred if only a few landmarks are currently in the field of view of the robot.

Besides localization, the authors describe a complete framework for visual SLAM including global localization. The system is designed for indoor operation. Without an a priori map, the robot starts to construct a map by driving around randomly; map building is completed when no new features are detected. DoG keypoints [67] are detected in each image frame and a SIFT descriptor [67] is computed for each detection. The 3D parameters for each detected image point are computed with the calibrated trinocular stereo system. The reconstructed image points are stored in the map as landmarks associated with the corresponding SIFT description. The detected image points are tracked in the subsequent frames and the SIFT descriptions from these frames are additionally added to the 3D landmarks. Thus a landmark's entry in the database consists of the 3D parameters of the point and a collection of SIFT descriptions from different viewpoints. The acquired image data is not stored any further. A sub-map concept is used for map building: 3D landmarks extracted from an image are not immediately added to the map, but to a local sub-map first. If the landmarks can be tracked for some time, the whole sub-map is added to the global map. The local sub-map is aligned to the already existing landmarks in the global map; new landmarks are added, while already existing landmarks are updated. Each landmark has an associated uncertainty which decreases with multiple measurements. The uncertainty is represented by a 3 × 3 covariance matrix, and a Kalman filter [54] is used to propagate the uncertainty of the landmarks. If a landmark is re-detected, the uncertainty shrinks, indicating that the landmark is better localized. Experiments for map building and localization are shown for a room of size 10 × 10 m. The measured average position error for global localization was reported to be 7 cm, while the average rotational error was about 1°. The experiments show that reliable pose estimation requires a minimum of 10 landmark matches. The approach is a very reliable visual SLAM algorithm. With a frame rate of 2 Hz reported on a relatively slow computer, it basically runs in real time. The key component of the method is the use of the SIFT descriptor for the landmarks. This makes it possible to generate a map of natural landmarks which can be reliably re-detected and matched. The SIFT descriptor is based on orientation histograms and is therefore very robust to illumination changes. It allows the correspondence problem, which is basically the most crucial part of visual systems, to be solved quickly and reliably. The achieved localization accuracy is high enough to allow safe and useful navigation through the environment. Difficulties in 3D reconstruction are avoided by using a fixed stereo setup which directly outputs 3D coordinates. However, this is much more expensive than the use of a single camera and is not suited for small-scale robots. It is worth mentioning that the created 3D map is a sparse set of 3D landmarks. It cannot be used for visualization purposes and it is difficult to use for navigation and path planning tasks, because much of the structure of the environment is not contained in the map, but only some distinct landmark points.
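The two-landmark RANSAC scheme described above can be sketched in a few lines of NumPy for the planar case, where a landmark match pairs a 2D ground-plane position observed from the current view with its 2D map position. The sample count, inlier threshold and the Procrustes-style least-squares refinement are illustrative choices of this sketch, not the authors' implementation.

```python
import numpy as np

def two_point_rigid_2d(p, q):
    """Rotation and translation in the ground plane mapping the two observed
    landmark positions p (2x2 array) onto their map counterparts q (2x2)."""
    dp, dq = p[1] - p[0], q[1] - q[0]
    theta = np.arctan2(dq[1], dq[0]) - np.arctan2(dp[1], dp[0])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return R, q[0] - R @ p[0]

def ransac_pose_2d(obs, mapped, iters=200, thresh=0.1, seed=0):
    """RANSAC over randomly drawn 2-landmark samples: keep the pose hypothesis
    with the largest consensus set and refine it on all of its inliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(obs), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(obs), size=2, replace=False)
        R, t = two_point_rigid_2d(obs[[i, j]], mapped[[i, j]])
        inliers = np.linalg.norm(mapped - (obs @ R.T + t), axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    # Least-squares refinement (orthogonal Procrustes) on the consensus set.
    a, b = obs[best], mapped[best]
    ac, bc = a - a.mean(axis=0), b - b.mean(axis=0)
    U, _, Vt = np.linalg.svd(ac.T @ bc)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = b.mean(axis=0) - R @ a.mean(axis=0)
    return R, t, best
```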

Another approach to visual localization uses invariant sets of points. In the work by Atiya and Hager [1] the pose is computed from invariant point triples. A different approach has been developed by Sim and Dudek [98], where the pose is computed from the transformation of learned natural landmarks.

2.3 Localization from line features

Using lines as landmarks was investigated already in the early days of mobile robotics. One reason might be that line extraction works well even under large viewpoint changes: an edge detector, e.g. [15], will detect lines repeatably despite viewpoint and illumination changes. The difficulty, however, remains in matching the lines extracted from an image to the landmark lines in the map. As a solution for this correspondence problem, geometric matching was investigated.

One approach is known as the FINALE system developed by Kosaka and Kak [59]. The approach uses a CAD model of the environment as map. The CAD model is composed of lines only. The goal is to match the lines extracted from the current view with the lines in the CAD model and thus determine the position of the robot. The FINALE system allows for incremental localization only, that is, the robot's previous position must be known and must be close to the actual position. The robot's position is therefore maintained using a Kalman filter. For localization, the lines of the CAD model visible from the previous position are first projected onto the image plane. The 2D representations of the map lines created in this way are then matched with a simple nearest-neighbor approach to the lines extracted from the current view. Once the correspondences are established, the position maintained by the Kalman filter is updated depending on the deviation of the matched lines from the projected lines. For this approach one has to take care that the CAD model contains edges which are detectable by an edge detector. This is usually the case for the edges created where a wall meets the ceiling and the floor, or for edges created by doors. However, the need for a pose estimate for the projection of the lines is a drawback of this method.

A much more recent line based approach has been proposed by Bosse et al. [10]. In their work a method for large scale mapping and localization using line features is described. The proposed localization algorithm works by sub-map matching. The world map is composed of multiple sub-maps, each covering a small area created from a small number of image frames, and the sub-maps are linked by rigid 3D transformations. A sub-map contains 3D lines extracted and reconstructed from an image sequence, but also 3D points and vanishing points. Only lines which correspond to a vanishing point are stored as landmarks in the sub-map. This discards a lot of small edge segments and selects mostly vertical and horizontal lines coming from the gross structure of buildings. The 3D points are reconstructed from KLT feature tracks [107] along the image frames. Localization is performed by building a local sub-map from a short image sequence and aligning it with the world map using an extension of ICP [8] which handles point and line features. The method uses omnidirectional images generated by a catadioptric camera-mirror system, together with a very original method to extract lines and vanishing points from such images. The authors demonstrated the mapping of large areas where the robot traversed several kilometers, and encountered loops were successfully closed. The method is not limited to planar movements but allows full 6 DOF localization.

Another example of localization from line features is the work by Goedeme et al. [39]. The approach uses a topological map, so localization does not provide a metric robot pose but works in the sense of place recognition. The line features used in this work differ from those of other approaches in that they do not originate from an edge detector. Instead, vertical lines are detected in a gradient image of the original image. For this, a gradient magnitude image is computed first using the Sobel operator. Then the image is processed column by column to detect line segments: a line feature is defined by the line segment between two local gradient magnitude maxima. For each detected line feature a descriptor based on viewpoint-invariant measures is computed. The description vector is of length 10 and combines color and intensity properties of the pixels of the line feature. Line features detected in this way are invariant to viewpoint changes, provided that the robot movements are restricted to a horizontal plane. Map building is described as an off-line batch process. A KD-tree is built to store the descriptors of the detected line features. Localization then proceeds by extracting line features from the current view and matching them with the map features in the KD-tree. The approach has been used for autonomous wheelchair navigation.
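A minimal sketch of the KD-tree lookup used for this kind of descriptor matching is given below, using SciPy's cKDTree; the random arrays stand in for the 10-dimensional line descriptors described above, and the rejection threshold is purely illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Placeholder map: one 10-dimensional descriptor per stored line feature.
map_descriptors = rng.normal(size=(5000, 10))
tree = cKDTree(map_descriptors)

# Descriptors of the line features extracted from the current view.
query_descriptors = rng.normal(size=(40, 10))

# Nearest map feature for every query; keep only sufficiently close matches.
dist, idx = tree.query(query_descriptors, k=1)
matches = [(q, int(m)) for q, (d, m) in enumerate(zip(dist, idx)) if d < 2.5]
```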

Further examples of model-based localization using a CAD model are described in [2, 17, 110, 114]. In the work of Tsubouchi and Yuta [110], color information is used as an additional cue. The system developed by Vincze et al. [114] deals with robot navigation within a ship structure. The work of Folkesson et al. [29] uses lines detected on the ceiling for localization. Lines extracted from images of a panoramic stereo sensor are used by Yuen and MacDonald [118]. In [82] Neira et al. describe how to build a stochastic map from line features.

2.4 Localization from plane features

Only few approaches exist that use plane features in their world representation. The identification of plane features requires algorithms which are more complicated than those for the detection of line or point features. Furthermore, outdoor scenes often do not contain many scene planes, so such approaches would be restricted to indoor scenarios.

One of the methods proposed so far has been developed by Hayet et al. [45]. The map contains planar landmarks such as posters, doors, windows etc. The landmarks are represented by their contour, and for localization the robot pose can be computed from the 3D contour in the map and the extracted 2D contour of the landmark in the current view. For their approach they use a single camera mounted on a pan-tilt unit; active vision, however, is not the main focus of their approach. The key concepts are:

• Detection of planar quadrangular visual landmarks

• Map building using a laser range finder and stereo reconstruction

• A visibility map

The choice of planar landmarks seems very suitable for indoor environments such as offices: posters or paintings attached to walls provide reliable landmarks. However, only quadrangular landmarks are selected; this restriction eases the detection process. Landmark detection is based on perceptual grouping of edge segments. First, edge detection is applied to the images and grouping is applied to obtain connected edge segments. Then combinations of edge segments are searched which fulfill the necessary constraints of a perspective projection of a quadrangular landmark. After identification, the landmark is normalized so that it can be stored invariantly in a map database. In a first step the landmark is rectified to a quadrangular area of fixed size by applying a homography transform. This representation is invariant to scale and viewpoint change. A describing feature vector is extracted from this normalized representation; two approaches are proposed. In the first approach, Harris corners [40] detected in the image patch are used as descriptor. Landmark matching can then be done by computing the partial Hausdorff distance [49] between a landmark in the map and a landmark detected in the current view. In addition to the Hausdorff distance, Hayet et al. propose to incorporate the gray-level information as a feature vector too. The second representation approach uses Principal Component Analysis to compute a representative feature vector. Using these landmarks, map building is described as an off-line approach. For map building, a robot equipped with a camera and a laser range finder is steered through the environment. Planar landmarks are detected by the previously described method. For each landmark, a 3D reconstruction of the contour is performed from two successive frames. The reconstructed landmark is put into the global coordinate system by using the robot pose information from the laser range finder. Robot localization then works by matching planar landmarks detected in the current view with the map landmarks. For a matched landmark, the four corner points are used to create 3D-2D point correspondences, and the pose of the robot is computed from these four 3D-2D point correspondences using the planar P4P method and a subsequent iterative refinement [45]. A key concept of this approach is the visibility map for localization. It is assumed that a path-planning process defines a trajectory in the world coordinate frame. According to this path, the best suited landmarks are selected for the different sections of the path. Localization is performed in this approach from a single landmark; the active camera is used to keep the landmarks selected from the visibility map in the field of view. The restriction to quadrangular landmarks, however, strongly limits the set of potential landmarks, even in indoor environments. Moreover, the localization algorithm relies on the four corner points: a single occluded corner point renders the whole landmark invalid. For practical applications these constraints will be too rigid.
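The rectification step used above, warping a detected quadrangle to a fixed-size patch, can be sketched with a single homography. The patch size, the corner ordering and the synthetic test image are assumptions of this sketch, not details taken from [45].

```python
import numpy as np
import cv2

def rectify_quad(image, corners, size=64):
    """Warp the quadrangular landmark delimited by four image corners (ordered
    top-left, top-right, bottom-right, bottom-left) to a size x size patch."""
    src = np.asarray(corners, dtype=np.float32)
    dst = np.float32([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]])
    H = cv2.getPerspectiveTransform(src, dst)           # 3x3 homography
    return cv2.warpPerspective(image, H, (size, size))  # normalized patch

# Synthetic image with a bright quadrilateral and hypothetical corner detections.
img = np.zeros((480, 640), np.uint8)
quad = np.array([[210, 160], [390, 140], [410, 310], [190, 290]], dtype=np.int32)
cv2.fillConvexPoly(img, quad, 255)
patch = rectify_quad(img, [(210, 160), (390, 140), (410, 310), (190, 290)])
```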

A different SLAM approach using plane features has been proposed by Molton et al. [77]. The pose is computed by image alignment between learned and detected landmarks.

2.5 Summary

The focus of the state-of-the-art presented in the last sections was on methods which allow the construction of a world map and global localization therein. Both SLAM approaches and batch approaches were covered. A SLAM approach allows incremental map building during operation, whereas in a batch approach a map has to be created prior to robot operation. This does not mean that the map building could not be automatic. A mobile robot navigating with a laser range finder could traverse the environment and acquire image data; the world map could then be constructed off-line from the image data. Afterwards this map could be used for localization and navigation by other robots equipped only with a digital camera. This makes sense as digital cameras are much cheaper than a laser range finder and the robots which operate later on only need a digital camera. In general a SLAM approach is more challenging to develop, but in most cases it would be possible to extend existing batch versions to full SLAM approaches. The approaches proposed by Se et al. [96], Karlsson et al. [56] and Davison et al. [21] are SLAM approaches; the others are batch approaches. The approach of Kosaka and Kak [59] even requires a manually constructed CAD model of the world. All of the presented SLAM approaches use a metric map containing 3D point features. Extraction and reconstruction of point landmarks can be done very fast. Other landmarks such as lines, planes or vanishing points require more complex detection and reconstruction methods and might not be applicable for the real-time updating of the landmarks required in a SLAM framework. Until now the influence of the map features on navigation and path planning has been neglected; however, it is worth discussing. For path planning the robot has to know which locations are obstructed and which are free. Clearly, this information should be provided by the world representation. The described SLAM approaches construct sparse world representations consisting of distinct point features only. A path planning algorithm does not know whether the space between the landmarks is occupied or not, which is clearly a bad situation for navigation. The situation is similar for maps based on line features: most of the line features will be vertical, so the situation is not very different from that of point features. In contrast, planar landmarks have a spatial extension in 3D. A planar landmark projected onto the ground floor appears as a line and provides more information about obstructing objects than point and line features.

The discussed methods also differ strongly in the camera systems used. The approach described in [96] uses a fixed stereo head. It allows 3D reconstruction of landmarks from a single location, which eases map building enormously; it also simplifies localization to the computation of a rigid transformation between two sets of matched landmarks. However, equipping a robot with a stereo head is costly and in any case more expensive than using a single camera. In [56] only a single camera is used. Landmark reconstruction is thus not possible from a single location; instead, the robot has to use three images acquired from different positions. For localization, the pose has to be computed from 3D-2D landmark correspondences, which, however, can be done very efficiently. A stereo head is also used in the approach described in [21]. In their work it is an active stereo head, i.e. the cameras can be moved independently, which allows an original method for landmark reconstruction and localization by triangulation. The method described in [10] even uses an omnidirectional sensor. This complicates map building, and the approach described is a batch method; however, one gains a lot of benefits from having a 360° field of view. Nevertheless, a mobile robot which needs to be equipped with only a single standard camera would be most preferable, for reasons of cost and simplicity.

Another major difference between the compared methods lies in the pose representation. In [56, 59, 96] the robot's pose is only represented in 2D, by the triple (x, y, θ), where x and y are the position and θ is the heading of the robot. It is a quite common assumption that the mobile robot is restricted to move in a horizontal plane. A much more general representation, however, is a full 6 DOF representation by a 3D translation and a 3D rotation. This would allow ramps and different height levels to be contained in the world map.

The characteristics of the reviewed methods are summarized in Table 2.1. Based on this review of the current state-of-the-art we can also identify the following main deficiencies:

6 DOF pose: A lot of approaches for mobile robot localization simply assume that the robot is moving on a horizontal plane only. Imposing such restrictions simplifies the localization algorithms as only a 3 DOF pose has to be estimated. Clearly the more general 6 DOF pose representation is favorable as it allows the robot to operate on different height levels or move onto ramps etc. For outdoor environments the horizontal plane assumption does not hold anyway.

Single camera solution: Many systems proposed so far use specialized camera setups, like stereo setups, active stereo setups or even trinocular camera systems. The use of advanced imaging devices certainly eases many tasks. However, specialized hardware is expensive and often more prone to malfunctions. If robotic systems are to be deployed in domestic environments, e.g. as service robots, the cost factor cannot be neglected and therefore the cheaper single camera solution is needed.

Landmark correspondence problem: The correspondence problem is well known in the computer vision community and it is known to be hard. Detecting landmark correspondences is also one of the most important issues in robot localization. Recent advances in wide-baseline stereo already allow efficient and reliable landmark matching (see [56, 96]). However, still a high number of false matches are produced, carrying the potential to compromise the localization algorithm. Any new method which provides more reliable landmark matching will therefore increase the overall localization performance.

Localization despite large occlusions: Occlusions of the robot's view will occur frequently if the robot is operating in a crowded environment. The view onto landmarks will therefore quite often be limited. Localization algorithms should therefore be capable of computing an accurate pose from only a minimal number of detected landmark matches. The methods described in [56, 96] require about 10-20 landmark matches for a reliable pose estimate, quite a high number to be met in a crowded and heavily occluded environment.

Automatic map interpretation: Automatic map interpretation is a necessity to allow mobile robots to interact autonomously with the world and to carry out more complex tasks than vacuum cleaning. Nowadays systems can already get confused by a simple door. Assume that the mobile robot maps a room with an open door. In the map this will be reflected as an opening to traverse. Imagine that the other day the robot is heading towards the door and finds it closed. A simple localization algorithm will believe in a false position estimate. If however the robot knows about the functionality, it can reason that the door has an open and a closed state and thus does not get confused. A well working service robot needs to know even more about the environment: the names of the objects, the functionalities of the objects, which objects are movable, etc. Clearly this goes hand in hand with research in object recognition, but it should be considered how the world representation of a mobile robot can support achieving this goal.

Authors               | World map                | Sensor                     | Map features                                        | Landmark matching | Map building | Global localization (# landmarks*)   | Pose representation
----------------------|--------------------------|----------------------------|-----------------------------------------------------|-------------------|--------------|--------------------------------------|---------------------
Se, Lowe, Little [96] | sparse metric            | stereo                     | 3D points + SIFT                                    | feature matching  | SLAM         | triangulation, map alignment (>= 10) | 2D (3 DOF)
Karlsson et al. [56]  | sparse metric            | monocular                  | 3D points + SIFT + appearance                       | feature matching  | SLAM         | 3D-2D (>= 4)                         | 2D (3 DOF)
Davison et al. [21]   | sparse metric            | active stereo              | 3D points                                           | correlation       | SLAM         | triangulation (>= 3)                 | 3D (6 DOF)
Bosse et al. [10]     | sparse metric            | omnidirectional            | 3D points + 3D lines + vanishing points             | nearest neighbor  | batch        | map matching (approx. 30)            | 3D (6 DOF)
Goedeme et al. [39]   | topological              | monocular, omnidirectional | 2D lines + color descriptor + intensity descriptor  | feature matching  | batch        | line matching and voting             | topological location
Kosaka et al. [59]    | sparse metric, CAD model | monocular                  | 3D lines                                            | nearest neighbor  | manual       | -                                    | 2D (3 DOF)
Hayet et al. [45]     | sparse metric            | monocular                  | quadrangular 3D planes + PCA descriptor             | feature matching  | batch        | 3D-2D (1)                            | 3D (6 DOF)

Table 2.1: Main characteristics of the reviewed literature approaches. (* number of landmark matches necessary for robust pose estimation)


Chapter 3

Local detectors

Research on local detectors can be dated back to 1977, when Hans Moravec described an interest operator which is today known as the Moravec operator [79]. In [80] Hans Moravec described obstacle avoidance and navigation for a mobile robot. He used his interest operator to detect interest points in stereo image pairs and in images from different viewpoints, using them as features to build a 3D map of the environment. Feature matching was achieved by correlation of 6 × 6 pixel image patches around the detected feature locations. The Moravec operator is based on the auto-correlation function, that is, it measures the gray-level difference between a window and a shifted window in four directions. Calculating the sum of squared differences in the window gives a measure for every shift. The values are high if the gray-level variance is high (textured region) and low if the gray-level variance is low (e.g. a homogeneous region). If the measures for every direction are high, the pixel location is a good candidate for an interest point, and the smallest measure is used as a quality measure for the interest point. In most cases the detected locations lie on edges and corners, where already a small shift causes a difference. An obvious deficiency, however, is the anisotropic behavior caused by using only a discrete set of shifts. This basic idea was carried on, leading to the well known Harris corner detector [40]. The idea was re-formulated using the structure tensor [9] and the second moment matrix respectively, leading to different variants of corner detectors [30, 61, 91, 107]. Other approaches [7, 57] use the second derivatives (Hessian matrix [115]) instead of the first derivatives. All these approaches can be considered as belonging to one class of simple interest point detectors. They all have in common that they detect a location only. That means that for a subsequent task like image matching via cross-correlation, the size of the necessary matching window has to be chosen independently. This limitation shows up when dealing with images that exhibit a scale change: although the detector might be able to detect the corresponding location, the correlation window will not contain the same gray-values and the matching will fail.

This limitation was addressed by estimating a proper scale for every detected interest point. With this information the scale of the matching window can be normalized and cross-correlation works again. The first work going in this direction was done by Tony Lindeberg [64] in 1998. Other approaches followed shortly by David Lowe [66] and Krystian Mikolajczyk [72]. This class of interest operators is usually called scale-invariant interest operators.

However, research went one step further. Following the success of interest operators invariant to scale change, methods were sought to create interest operators invariant to a larger class of image transformations. This was driven mostly by developments in wide-baseline image matching, where significant perspective distortions occur. Research therein led to a new class of interest detectors, the affine-invariant detectors. In most cases such a detection consists of a point location and an elliptical delineation of the detection. The ellipse representation captures the affine transformation of the detection; by normalizing the ellipse to a unit circle the affine transformation can be removed. This method was first suggested in 2000 by Baumberg et al. [6] and led to a wide variety of affine-invariant detectors [53, 70, 73, 112]. The common property of these approaches is that they provide information about how the region around the detection can be normalized to allow image matching. The detections themselves, however, may no longer be simple point locations. In the case of the MSER detector [70] a detection is a whole image region showing similar gray-values. Approaches like that are usually referred to as distinguished region detectors; moreover, every affine detector defines its own support region too. Thus the term 'local detector' emerged, which stands for simple point detectors as well as region detectors.

3.1 Interest point detectors

Interest point detectors are equivalent to corner detectors. A corner point shows strong intensity change in both the x and y direction; see Figure 3.1 for an example. Such corner points can easily be detected by examining the gradients in the x and y direction.

Figure 3.1: Corner point showing strong intensity change in x and y direction. (Image adapted from [87])

When speaking of an interest point one usually means the x and y coordinates of a corner point; the interest point is defined only by its position. When using interest points for feature matching, a description of a certain window around the interest point position has to be computed. An interest point detector, however, does not define the size and shape of such a window. Let us call such a window the 'measurement region'. We will see that other local detectors, which will be described later on, are able to define different types of measurement regions. For now, let us re-state that an interest point detector defines a position, but no measurement region. Let us now look at the details of two popular interest point detectors, the Harris detector and the Hessian detector.

3.1.1 Harris detector

The Harris detector is probably the best known and most widely used interest point and corner detector. It is an extension of the Moravec operator [79] and dates back to 1988 [40]. The Moravec operator calculates the auto-correlation function (that is, the gray-level difference between a window and a shifted window) in four directions. The auto-correlation function will be high if the gray-level variance is high (textured region) and low if the gray-level variance is low (e.g. a homogeneous region). If the measures for every direction are high, the pixel location is a good candidate for an interest point. This idea is carried over to the Harris corner detector, but the anisotropic behavior caused by using only a discrete set of shifts is extended to an isotropic formulation. This is done by a first-order Taylor-series expansion of the auto-correlation function; to cope with image noise, Gaussian filtering is applied as well. Written in matrix form, the resulting value of the auto-correlation E for a small shift (x, y) is

matrix <strong>for</strong>m the resulting value of the auto-correlation E <strong>for</strong> a small shift (x, y) is<br />

E(x, y) = (x, y)M(x, y) T (3.1)<br />

where M is the 2 × 2 matrix<br />

[<br />

M = exp − x2 +y 2<br />

2σ 2 ⊗<br />

( ∂I<br />

∂x )2<br />

( ∂I ∂I<br />

∂x<br />

)(<br />

∂y )<br />

( ∂I ∂I<br />

∂x<br />

)(<br />

∂y )<br />

]<br />

( ∂I<br />

∂y )2<br />

[<br />

= exp − x2 +y 2 I<br />

2<br />

2σ 2 ⊗ x I x I y<br />

I x I y<br />

I 2 y<br />

]<br />

. (3.2)<br />

I(x, y) is the gray-level intensity of an image I at position (x, y), and \exp\left(-\frac{x^2+y^2}{2\sigma^2}\right) \otimes denotes convolution with a 2D Gaussian filter with some predefined σ. The matrix M is computed for every pixel location in the image I, and from M a cornerness measure for every pixel location is computed. Harris and Stephens defined the following cornerness measure R:

R = \det M - k\,(\operatorname{trace} M)^2 \qquad (3.3)

R is often also denoted as the 'corner response'. The scalar factor k is set to 0.04, a value defined by experimental validation. A positive value of R characterizes a corner: the higher the value of R, the stronger the corner. A small value close to zero denotes a homogeneous image region. The value of R can be negative as well, in which case it indicates an edge, i.e. a pixel location with R < 0 is an edge pixel. Figure 3.2(a) shows example detections; the interest points are marked with yellow crosses.

The original Harris corner algorithm computes the partial derivatives in x and y direction by simple difference computation. The gradients are computed by convolution with the following kernels:

\frac{\partial I}{\partial x} = I(x, y) \otimes (-1, 0, 1) \qquad (3.4)

\frac{\partial I}{\partial y} = I(x, y) \otimes (-1, 0, 1)^T \qquad (3.5)

By computing the gradients with Gaussian derivatives, Schmid et al. reported a significant improvement in robustness and stability [95]. The choice of the standard deviation σ for the Gaussian filters is also very important, as the corner response differs strongly for different values of σ. The parameter σ can be seen as a scale parameter. For large values of σ only strong corners will be detected; for small values of σ smaller corners will be detected as well, and usually a small σ leads to multiple close-by detections. Non-maxima suppression should therefore be performed, which reduces the number of nearby detections also in the case of small values of σ.
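The Harris pipeline described so far fits in a few lines of NumPy/SciPy. The sketch below follows Equations (3.1)-(3.5): difference-kernel gradients, Gaussian smoothing of the structure tensor entries, the cornerness measure R, and a simple 3 × 3 non-maxima suppression. The parameter values and the response threshold are illustrative, and this is not the implementation used in this thesis.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter, maximum_filter

def harris_response(image, sigma=1.5, k=0.04):
    """Cornerness R = det(M) - k * trace(M)^2 of Eq. (3.3), with M built from
    Gaussian-smoothed products of the difference-kernel gradients."""
    img = image.astype(float)
    ix = convolve(img, np.array([[-1.0, 0.0, 1.0]]))      # dI/dx, Eq. (3.4)
    iy = convolve(img, np.array([[-1.0], [0.0], [1.0]]))  # dI/dy, Eq. (3.5)
    ixx = gaussian_filter(ix * ix, sigma)                 # entries of M, Eq. (3.2)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2   # R > 0: corner, R < 0: edge, |R| ~ 0: flat

def harris_corners(image, sigma=1.5, k=0.04, thresh=1e6):
    """Corner locations after 3x3 non-maxima suppression (the threshold is
    image-dependent and purely illustrative)."""
    r = harris_response(image, sigma, k)
    peaks = (r == maximum_filter(r, size=3)) & (r > thresh)
    return np.argwhere(peaks)     # (row, col) coordinates of detected corners
```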

The matrix M is also known as the structure tensor from the work of Bigün [9]. This relation allows a different interpretation of the Harris corner measure in terms of the eigenvalues of the structure tensor, which gives significant insight into the properties of the detector. The reader is referred to [87] and [26] for details. Besides the Harris corner measure, a vast variety of detectors based on the structure tensor exists; a good overview can be found in [87].



3.1.2 Hessian detector

The Hessian detector is very similar to the Harris detector. Instead of the structure tensor, the Hessian matrix is computed to identify corners:

H = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{bmatrix}
\frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} \\
\frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial y^2}
\end{bmatrix}
= \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{bmatrix}
I_{xx} & I_{xy} \\
I_{xy} & I_{yy}
\end{bmatrix} \qquad (3.6)

As a measure for interest points, the determinant det(H) of the Hessian matrix is used:

\det(H) = I_{xx} I_{yy} - I_{xy}^2 \qquad (3.7)

This measure was first introduced by Beaudet [7] in 1978. Figure 3.2(b) shows example detections; the interest points are marked with yellow crosses.
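A corresponding determinant-of-Hessian response (Equation (3.7)) can be sketched as follows. In this sketch the second derivatives are computed as Gaussian derivatives at scale sigma rather than with plain difference kernels, which is an implementation choice of the sketch, not a detail taken from the detectors above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(image, sigma=1.5):
    """Determinant-of-Hessian measure of Eq. (3.7); large positive values
    indicate corner/blob-like structures."""
    img = image.astype(float)
    ixx = gaussian_filter(img, sigma, order=(0, 2))  # d2I/dx2
    iyy = gaussian_filter(img, sigma, order=(2, 0))  # d2I/dy2
    ixy = gaussian_filter(img, sigma, order=(1, 1))  # d2I/dxdy
    return ixx * iyy - ixy ** 2
```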

Another measure based on the Hessian matrix has been proposed by Kitchen and Rosenfeld [57]. The measure K is defined as

K = \frac{I_{xx} I_y^2 + I_{yy} I_x^2 - 2 I_{xy} I_x I_y}{I_x^2 + I_y^2}. \qquad (3.8)

Recently it has been shown by Mikolajczyk [75] that the Hessian matrix can be used for scale selection. It will be described in detail in the next section.

Figure 3.2: Detection examples for interest point detectors on the "Group" scene. (a) Harris detector. (b) Hessian detector.

3.2 Scale invariant detectors

Scale invariant detectors are interest point detectors which additionally define a circular measurement region. In addition to x and y, a third parameter s, the size of the circular measurement region, is found by the detectors. The important point is that the measurement region is found identically in two images where one is a scaled version of the other. That means that if an image is reduced in size by a factor of 2, then the measurement region for the same interest point in the smaller image is half the size of the one in the original; thus the detectors are called scale invariant. This is illustrated in Figure 3.3. This property is important for feature matching: it is now possible to normalize detections so that the feature vector is always computed from a support region of identical size. Normalization means a scale transformation of one of the measurement regions to the size of the other. Feature vectors computed from normalized patches ease correspondence detection enormously and allow much more complicated situations to be handled than when using simple interest point detectors. In the following we describe four different scale invariant detectors.

Figure 3.3: Example for scale-invariant Harris detector. The left image shows a detection with scale estimate on the original image. The right image shows the detection on a smaller version of the image (60% size of original). The scale is estimated so that the same image region as in the original is selected.

3.2.1 Scale-invariant Harris detector

The scale-invariant Harris detector has been proposed by Mikolajczyk et al. [72]. It detects interest points with an associated circular measurement region around the center. The scale of the measurement region is geometrically stable, that means applying the detector to a re-scaled version of the image produces a detection with identical (but re-scaled) image content within the measurement region. The detection is a two-step process. First, Harris corners are detected on multiple scales. For this a scale-adapted Harris detector is used. In a second step a characteristic scale for each Harris corner is identified. The characteristic scale directly determines the size of the resulting measurement region. Extrema of the Laplacian-of-Gaussian are used to detect the characteristic scale of an interest point. Thus the detector is also known as Harris-Laplace detector.

A necessity for the first step is the scale-adapted Harris detector. The original Harris detector [40] is not invariant to scale change. To overcome this, the authors of [72] propose a combination with the automatic scale selection described by Lindeberg [64]. The combination leads to the scale-adapted second moment matrix. The second moment matrix describes the gradient distribution in a local neighborhood of a point and is the basis for corner detection with the Harris method. The scale-adapted second moment matrix is defined by:

\[ M(\mathbf{x}, \sigma_i, \sigma_d) = \sigma_d^2 \, g(\sigma_i) \otimes \begin{bmatrix} I_x^2(\mathbf{x}, \sigma_d) & I_x I_y(\mathbf{x}, \sigma_d) \\ I_x I_y(\mathbf{x}, \sigma_d) & I_y^2(\mathbf{x}, \sigma_d) \end{bmatrix} \tag{3.9} \]



g(σ_i) is a 2-dimensional Gaussian kernel with standard deviation σ_i, and σ_d is the differentiation scale. The local derivatives are computed with Gaussian derivatives, and the differentiation scale σ_d determines the size of the Gaussian filter. σ_i is the so-called integration scale and determines the size of the Gaussian window which is used for smoothing the gradients in the local neighborhood. The Harris measure for the scale-adapted second moment matrix is now defined by:

\[ R = \det M(\mathbf{x}, \sigma_i, \sigma_d) - k \, \mathrm{trace}^2 M(\mathbf{x}, \sigma_i, \sigma_d) \tag{3.10} \]

By computing the second moment matrix and the cornerness measure R for different values of σ_i and σ_d, a scale-space representation of Harris corners can be established. The authors propose in [75] to compute a scale-space representation for pre-selected scales σ_n = ξ^n σ_0, where ξ is the scale factor between successive levels. In [64] Lindeberg suggests ξ = 1.4. The integration scale σ_i for computation of the second moment matrix is set to σ_i = σ_n. The differentiation scale σ_d is set to σ_d = sσ_n = sσ_i, where s is a constant factor; s is set to 0.7 in [75]. This couples the integration scale and the differentiation scale by a multiplicative scalar factor. Harris corners are finally identified by thresholding the cornerness value R and non-maxima suppression in an 8-neighborhood for every scale level.

The next step in the algorithm is the detection of a characteristic scale for the Harris detections. In the previous step Harris corners were independently detected on each scale level. Now for each such detection a characteristic scale is estimated using the Laplacian-of-Gaussian (LoG) function. A characteristic scale is determined by a local maximum of the following function:

\[ |LoG(\mathbf{x}, \sigma_n)| = \sigma_n^2 \, |L_{xx}(\mathbf{x}, \sigma_n) + L_{yy}(\mathbf{x}, \sigma_n)| \tag{3.11} \]

For every detected point location the function is evaluated over all available scales. The characteristic scale corresponds to the local maximum. If more than one local maximum exists, multiple characteristic scales are assigned to the detection. Besides the LoG other functions would be possible; however, an evaluation in [72] revealed that the LoG performs best. If an evaluated point location does not show a LoG maximum, or if the response is below a threshold, the point will be discarded. All the other detections are reported as results of the detector, where the characteristic scale directly determines the size of the measurement region in pixels. In [72] the radius of the measurement region in pixels is 2.8σ_n. The performance of the Harris-Laplace detector is evaluated very thoroughly in [75] and compared to other methods. Figure 3.4(a) shows example detections for the Harris-Laplace method. Each detection is visualized by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.
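The two steps could be sketched as follows; the code combines the scale-adapted Harris measure of Eq. (3.10) with the LoG scale selection of Eq. (3.11). It is a simplified illustration with assumed parameter values, not the reference implementation of [72, 75].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def scale_normalized_log(img, sigma):
    """|LoG| response sigma^2 |L_xx + L_yy| used for characteristic-scale selection."""
    Lxx = gaussian_filter(img, sigma, order=(0, 2))
    Lyy = gaussian_filter(img, sigma, order=(2, 0))
    return sigma ** 2 * np.abs(Lxx + Lyy)

def harris_laplace(img, sigma0=1.0, xi=1.4, n_levels=8, s=0.7, k=0.04, thresh=0.01):
    """Sketch of Harris-Laplace: multi-scale Harris corners plus LoG scale selection."""
    img = img.astype(np.float64)
    sigmas = [sigma0 * xi ** n for n in range(n_levels)]
    log_stack = np.stack([scale_normalized_log(img, s_n) for s_n in sigmas])
    keypoints = []  # entries: (row, col, characteristic scale)
    for level, sigma_i in enumerate(sigmas):
        sigma_d = s * sigma_i
        Ix = gaussian_filter(img, sigma_d, order=(0, 1))
        Iy = gaussian_filter(img, sigma_d, order=(1, 0))
        # scale-adapted second moment matrix entries, cf. Eq. (3.9)
        Ixx = gaussian_filter(Ix * Ix, sigma_i) * sigma_d ** 2
        Iyy = gaussian_filter(Iy * Iy, sigma_i) * sigma_d ** 2
        Ixy = gaussian_filter(Ix * Iy, sigma_i) * sigma_d ** 2
        R = Ixx * Iyy - Ixy ** 2 - k * (Ixx + Iyy) ** 2
        peaks = (R == maximum_filter(R, size=3)) & (R > thresh * R.max())
        for r, c in np.argwhere(peaks):
            # keep the corner if the LoG attains a local maximum over scale at this level
            lo, hi = max(level - 1, 0), min(level + 1, n_levels - 1)
            if log_stack[level, r, c] == log_stack[lo:hi + 1, r, c].max():
                keypoints.append((r, c, sigma_i))
    return keypoints
```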

3.2.2 Scale-invariant Hessian detector

The scale-invariant Hessian detector is very similar to the previously described Harris-Laplace detector. It is also known as Hessian-Laplace detector and described in [71]. The detection algorithm is basically identical to the Harris-Laplace detector, with the only exception that the initial interest points are identified with the Hessian matrix instead of the second moment matrix. As cornerness measure the determinant of the Hessian matrix is used. The Hessian-Laplace detector produces very similar results to the Harris-Laplace detector, which is not very surprising as the algorithms are almost identical. Example detections for the Hessian-Laplace method are shown in Figure 3.4(b). Each detection is represented by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.



Figure 3.4: Detection examples for the scale invariant Harris and Hessian detectors on "Group" scene. (a) Harris-Laplace detector. (b) Hessian-Laplace detector.

3.2.3 Difference of Gaussian detector (DOG)

The Difference of Gaussian detector has been developed by David Lowe and was first presented in [66]. The DOG-keypoints were introduced in combination with a suitable descriptor, the SIFT-descriptor. This has led to quite a misconception, and often DOG-keypoints are called SIFT-keypoints. However, despite the fact that DOG-keypoints and SIFT-descriptor are often used in combination, each method also stands on its own, and DOG-keypoints should therefore not be reduced to SIFT-keypoints.

The essence of the DOG-detector is to find blob-like structures in a scale-space [117] created from the input image. This is done by computing the difference of Gaussians for multiple scales and searching for local extrema therein. The difference of Gaussians is a close approximation of the Laplacian of Gaussians investigated in [63]. The main reason for the use of the difference of Gaussians is computational efficiency. Moreover, most of the DOG-detection algorithm is designed for efficiency.

A scale-space of the image I is defined as a function L(x, y, σ). It is gained by convolution of a variable-scale Gaussian G(x, y, σ) with the image I(x, y):

\[ L(x, y, \sigma) = G(x, y, \sigma) * I(x, y). \tag{3.12} \]

G(x, y, σ) is a 2D Gaussian kernel with the scale parameter σ:

\[ G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(\frac{-(x^2 + y^2)}{2\sigma^2}\right) \tag{3.13} \]

The difference-of-Gaussian function D(x, y, σ) is now defined as

\[ D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma) \tag{3.14} \]



where k is a constant multiplicative factor. This means that D(x, y, σ) is simply the subtraction of two neighboring discrete scale-space representations of the image I. The scale-space for DOG detection is defined in the following manner. It consists of a pre-defined number of partitions, called octaves. Each new octave starts with a σ of double the value of the previous octave. Each octave is partitioned into a number of s discrete scale-space representations, where s is an integer number. With this condition the parameter k is defined as k = √2. For each octave the image I is re-sampled down to half of the size of the previous image. Re-sampling is done by simply selecting every other pixel of the image. This is done for computational efficiency. Doing the re-sampling every time σ doubles is consistent with scale-space theory. The difference of Gaussian function D(x, y, σ) is now produced by subtracting the neighboring scale-space slices within each octave. The next step after computation of D(x, y, σ) is the detection of local extrema therein. The extrema to be detected are the local minima and maxima of D(x, y, σ). Every pixel of the scale-space representation is checked for being an extremum of D(x, y, σ). If a pixel is an extremum then it is selected as a DOG-keypoint. If the extremum is located on one of the re-sampled octaves, the x and y coordinates in the original image scale have to be computed. The characteristic scale of the DOG-point is the value of the σ of the scale-space slice on which the extremum has been found. For extremum detection all 26 neighbor pixels in scale-space are investigated. The pixel is a local maximum if its value is higher than the values of all its neighbors, and it is a local minimum if it is smaller than all of its neighbors. The 26 neighbors are defined by an 8-connected neighborhood extended to scale-space: they consist of the 8 neighbors in the same slice, 9 neighbors on the upper scale level and 9 neighbors on the lower scale level. Point detection in such a way only gives detections with pixel accuracy. In a subsequent step, a sub-pixel keypoint localization is performed. This step ensures that keypoints are located exactly on corners or edges. To gain sub-pixel accuracy a 3D quadratic function is fitted to the local scale-space region. The keypoint is finally localized at the interpolated maximum or minimum of the quadratic function (for more details see [13]).
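A simplified single-octave sketch of the DOG construction and the 26-neighbor extremum test is given below; octave re-sampling, sub-pixel refinement and contrast filtering are omitted for brevity, and the parameter values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(img, sigma0=1.6, k=np.sqrt(2), n_slices=5):
    """Single-octave sketch: build DoG slices and find 3x3x3 scale-space extrema."""
    img = img.astype(np.float64)
    sigmas = [sigma0 * k ** i for i in range(n_slices)]
    gaussians = np.stack([gaussian_filter(img, s) for s in sigmas])
    dog = gaussians[1:] - gaussians[:-1]          # difference of neighboring slices
    # A pixel is an extremum if it is the maximum or minimum of its 26 scale-space neighbors
    is_max = dog == maximum_filter(dog, size=(3, 3, 3))
    is_min = dog == minimum_filter(dog, size=(3, 3, 3))
    candidates = []
    for s_idx, r, c in np.argwhere(is_max | is_min):
        if 0 < s_idx < dog.shape[0] - 1:          # require a slice above and below in scale
            candidates.append((r, c, sigmas[s_idx]))
    return candidates   # entries: (row, col, approximate characteristic scale)
```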

However, not all detected extrema are suited to finally act as keypoints. Detected extrema with low contrast are not well suited as keypoints. Scale-space extrema also tend to be located on edges; however, they are not well localized along the edge itself. A final filtering step eliminates such ambiguous detections. Edge responses are eliminated by eigenvalue analysis of the Hessian matrix H at the keypoint location. The process is very similar to corner detection using the Hessian matrix. The ratio of the two principal curvatures is computed and the keypoint is eliminated if one direction is significantly stronger than the other one. The ratio is approximated by the ratio of the squared trace to the determinant. If

\[ \frac{\mathrm{trace}(H)^2}{\det(H)} < \frac{(r + 1)^2}{r} \tag{3.15} \]

the location is accepted as DOG-keypoint, where r = 10 is a reasonable value for many situations.
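The edge-response test of Eq. (3.15) can be sketched as follows; in practice the second derivative images would of course be computed once per scale level rather than once per keypoint.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def passes_edge_test(img, r_idx, c_idx, sigma, r=10.0):
    """Reject keypoints on edges using the trace^2/det ratio of the 2x2 Hessian."""
    img = img.astype(np.float64)
    # second derivatives at the keypoint location (full images computed only for brevity)
    Dxx = gaussian_filter(img, sigma, order=(0, 2))[r_idx, c_idx]
    Dyy = gaussian_filter(img, sigma, order=(2, 0))[r_idx, c_idx]
    Dxy = gaussian_filter(img, sigma, order=(1, 1))[r_idx, c_idx]
    det_H = Dxx * Dyy - Dxy ** 2
    if det_H <= 0:                       # principal curvatures of opposite sign: discard
        return False
    return (Dxx + Dyy) ** 2 / det_H < (r + 1) ** 2 / r
```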

It is possible to implement the necessary steps of the DOG-detector very efficiently. The DOG-detector is therefore a candidate of choice if one wants to build a real-time system. Figure 3.6(a) shows examples for DOG-keypoints. Each keypoint is represented by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.



3.2.4 Salient region detector

The salient region detector has been proposed by Kadir and Brady [52]. As with the other scale invariant detectors, a location and a characteristic scale are detected. However, a major difference lies in the selection of the location. The goal is to detect salient image regions. Kadir and Brady propose as a measure for saliency the entropy of the gray-value distribution within an image region. The entropy H of an image region is defined by

\[ H = -\sum_i p(d_i) \log_2 p(d_i) \tag{3.16} \]

where p(d_i) is the probability of gray-value d_i in the image region. The values p(d_i) can be computed from the histogram of the image region. The histogram counts the frequency of occurrence of each gray-value, and the entropy can be computed from the normalized histogram counts. The goal is to select regions which show a distributed histogram. A distributed histogram indicates highly textured, thus salient regions. A peaked histogram indicates low texture and lots of similar gray-values. Figure 3.5 depicts examples for peaked and distributed histograms.

Figure 3.5: Example for peaked and distributed histograms. The image patch corresponding to the peaked histogram shows low texture. The distributed histogram corresponds to a highly textured region.

Peaked and distributed histograms can be distinguished by their entropy value H. Distributed histograms show a larger entropy value than peaked histograms. To detect salient regions the entropy is computed for different window sizes. Different window sizes lead to different histograms. Consider an image with a homogeneous background showing a textured object. Computing the histogram for the object yields a distributed histogram. If the window size for the histogram is increased, the window will contain more and more of the homogeneous background and the histogram will change from distributed to peaked. Such changes now indicate salient regions. In detail, a peak in the function H(w) indicates a salient region. The window size w of the peak in H(w) can be seen as the characteristic scale of the salient region.

The algorithm can be summarized as follows. First, compute the entropy value for multiple window sizes for every pixel location. Search for a peak in H(w) for every pixel location. Select the locations which show a peak in H(w). The selected locations can be stored as triplets ⟨x, y, s⟩, where x, y is the location in the image and s is the window size, or scale respectively. Each triplet corresponds to an entropy value computed at location x, y with window size s. For many pixel locations H(w) does not contain a peak but is monotonically increasing or decreasing. Such pixel locations will be discarded. The remaining triplets indicate salient regions; however, a sort of non-maximum suppression is needed. With clustering in x, y, s space, nearby detections are merged and the cluster centers are the resulting salient regions, each containing a position x, y and a scale s. Different to other detectors, an absolute saliency measure can be computed for each detection, which introduces an ordering of the detections. The saliency Y is computed by

\[ Y = H(w) W(w). \tag{3.17} \]

W(w) is a weight function which measures the inter-scale unpredictability. In simple words, it measures the gray-value difference between two adjacent scales. A scale step which produces a high gray-value difference is a measure for high saliency. The inter-scale unpredictability W(w) is defined as

\[ W(w) = w \int \left| \frac{\partial}{\partial w} p(d_i, w) \right| \mathrm{d}i. \tag{3.18} \]

W(w) can be computed practically as the absolute sum of differences between the histograms of adjacent scales, multiplied with the current window size. The absolute saliency measure Y allows limiting the detection to the n most salient features.
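A sketch of the entropy and saliency computation for a single pixel location follows; the histogram binning, the square window shape and the discrete approximation of W(w) are illustrative assumptions.

```python
import numpy as np

def saliency_over_scales(img, x, y, window_sizes, n_bins=32):
    """Sketch of the Kadir-Brady measure: entropy H(w) and Y(w) = H(w) * W(w)
    for square windows of half-width w around pixel (x, y)."""
    img = np.asarray(img, dtype=np.float64)
    hists = []
    for w in window_sizes:
        patch = img[max(y - w, 0):y + w + 1, max(x - w, 0):x + w + 1]
        h, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
        hists.append(h / h.sum())                       # normalized histogram p(d_i)
    hists = np.array(hists)
    # Entropy per window size; empty bins contribute 0 (0 * log 0 := 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -np.sum(np.where(hists > 0, hists * np.log2(hists), 0.0), axis=1)
    # Inter-scale unpredictability: histogram change between adjacent sizes, times w
    W = np.array([w * np.abs(h1 - h0).sum()
                  for w, h0, h1 in zip(window_sizes[1:], hists[:-1], hists[1:])])
    Y = H[1:] * W
    return H, Y   # a peak in H(w) together with a large Y marks a salient region / scale
```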

In Figure 3.6(b) examples for detected salient regions are shown. Each salient region is visualized by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.

The entropy need not necessarily be computed from the image gray-values. The method can also be applied to directional data as produced by an edge detector, as shown in [52]. The method can readily be applied to a lot of different descriptors, which gives it great versatility. For example, in [31] salient regions are detected by computing the entropy on the cornerness value of the Harris detector.

3.2.5 Normalization

The additional scale information of the scale-invariant interest detectors can be used for normalization. The scale parameter defines a circular measurement region around the detection. This is important for image matching, where corresponding detections are searched. In most cases matching works by extracting an image descriptor from the measurement region of the detection or by area-based correlation of the measurement regions. Scale changes are therefore a big problem for matching. This can be overcome by normalization of the detection. By knowing the characteristic scale of the detection, the measurement region can be re-sampled to a fixed canonical size which is used for correlation or descriptor extraction. Re-sampling and interpolation, however, have to be performed carefully. Best results are obtained if the bigger measurement region is downsized to fit the smaller one. If the scale change is high, downsizing according to scale-space theory (Gaussian filtering and re-sizing) should be performed. Normalizing the measurement region in the described way is possible for all previously described detector methods.

Figure 3.6: Detection examples for scale invariant DOG and salient region detector on "Group" scene. (a) DOG detector. (b) Salient region detector.

In addition to scale normalization, Lowe describes a method to normalize the DOG-keypoints for an arbitrary rotation [66]. The method works by computing a histogram of the gradients which occur in the measurement region. A histogram with 36 bins covering the 360° is reported to give good results. The histogram maximum defines the principal orientation of the detection. If the histogram contains multiple equally strong local maxima, the detection gets assigned multiple orientations. To compensate for the low resolution of the histogram (every bin accounts for 10°), a parabola is fit to the maximum and its neighbors and the interpolated peak of the parabola is used as the principal orientation. This principal orientation can now be used in the re-sampling stage of the normalization to rotate the detection into a canonical orientation. This orientation estimation has been successfully used for the DOG-keypoints. However, the method can also be applied without restrictions to all of the previously described scale-invariant interest detectors.
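A sketch of this orientation assignment on a normalized patch, including the parabolic refinement of the histogram peak, is given below. The 80% acceptance rule for secondary peaks is an assumption borrowed from common practice, not something stated above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def principal_orientations(patch, n_bins=36, peak_ratio=0.8):
    """Gradient-orientation histogram of a normalized patch; returns the dominant
    angles in degrees, each refined by a parabola fit through the peak bin."""
    patch = patch.astype(np.float64)
    gx = gaussian_filter(patch, 1.0, order=(0, 1))
    gy = gaussian_filter(patch, 1.0, order=(1, 0))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, _ = np.histogram(angle, bins=n_bins, range=(0, 360), weights=magnitude)
    orientations = []
    for i in np.flatnonzero(hist >= peak_ratio * hist.max()):
        left, right = hist[(i - 1) % n_bins], hist[(i + 1) % n_bins]
        if hist[i] < left or hist[i] < right:   # keep only local maxima of the histogram
            continue
        # parabolic interpolation of the peak position between the three bins
        denom = left - 2 * hist[i] + right
        offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
        orientations.append(((i + 0.5 + offset) * 360.0 / n_bins) % 360.0)
    return orientations
```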

3.3 Affine invariant detectors

Affine invariant detectors are designed to produce repetitive detections despite an arbitrary affine transformation of the image. For an example see Figure 3.7. Figure 3.7(a) shows a single detection on the original image. Figure 3.7(b) shows the detection on an affine transformed version of the original. The affine transformation contains a rotation of 30°, an axis shear of 10°, and a scale change in x and y of 90% and 80% respectively. Despite the distortion the center point is detected repetitively and the ellipse detected in the transformed image covers the same content as in the original image. Affine invariant detectors were developed to cope with heavy perspective distortions in wide-baseline scenarios where large view-point changes occur. The effect of a perspective distortion can be approximated locally by an affine transformation. Locally means in this respect a small area (measurement region) around an interest point detection. The affine invariant detectors presented in the following are based on different concepts. The affine-invariant Harris and Hessian detectors are based on simple interest point detectors and search for an affine invariant measurement region for each point detection. Another method, the MSER detector, finds homogeneous, characteristically delineated image regions with a method not affected by an affine transformation. However, independent of the detection method, each detector detects a measurement region and a characteristic point within the measurement region. The shape of the measurement region may vary between methods. Some detect elliptical measurement regions, others detect measurement regions of arbitrary shape; however, every method finds an affine transformation to transform the measurement region into a normalized canonical coordinate frame. In the case of an elliptical measurement region the computed affine transformation transforms the ellipse into a circular region of unit size. The affine transformation removes the different scaling in the two principal directions and the shear. What remains, however, is an arbitrary rotation between the canonical representations. A different normalization scheme can be used for the MSER detector, where a characteristic outline of the region is detected. Here the normalization works by creating a so-called local affine frame (LAF). Such a normalization also accounts for the rotation. In the following the most prominent methods are described in detail.

Figure 3.7: Example for an affine invariant detector. The left image shows a detection on the original image. The right image shows the detection on an affine transformed version of the original. The center point is detected repetitively and the ellipse detected in the transformed image covers the same content as in the original image.

3.3.1 Affine-invariant Harris detector

The affine-invariant Harris detector, also known as Harris-Affine detector, has been introduced by Mikolajczyk [73] as an extension to the scale invariant Harris-Laplace detector [72]. The detector works by estimating the affine shape of a local structure in the neighborhood of a scale-adapted Harris corner. The method assumes that the local neighborhood of an interest point is an affine transformed, and thus anisotropic, version of an originally isotropic structure. By finding the parameters of this affine transformation the local anisotropic structure can be transformed back to the isotropic structure. An isotropic structure could be represented by a circular region; its affine transformation results in an ellipse. The regions of the Harris-Affine detector are therefore ellipses representing the affine, anisotropic transformation of the local structure. The detector does not only detect the interest regions invariant to an affine transformation, but also returns the transformation parameters to normalize them into shapes which show an isotropic local neighborhood.

The first step of the algorithm is the detection of scale-adapted Harris corners on different scale levels. This step is identical to the Harris-Laplace detector. In a next step, for each detected Harris corner the shape adaptation is performed to estimate the anisotropic structure of the local neighborhood. The characteristic scale as defined previously is used as an initial value for the affine shape adaptation. The anisotropic shape of a local image structure can be estimated with the second moment matrix. This has been shown by Lindeberg [64] and later Baumberg [6]. The second moment matrix in an affine scale-space is given by:

\[ M(\mathbf{x}, \Sigma_I, \Sigma_D) = \det(\Sigma_D) \, g(\Sigma_I) \otimes \left( (\nabla I)(\mathbf{x}, \Sigma_D) (\nabla I)(\mathbf{x}, \Sigma_D)^T \right) \tag{3.19} \]

Σ_I is a covariance matrix which determines the integration Gaussian kernel, used for smoothing the gradient values over a local neighborhood. Σ_D is the covariance matrix for the differentiation Gaussian kernel, which steers the Gaussian derivatives for the gradient computation. In [73] the authors propose to set Σ_I = sΣ_D, where s is a scalar. This limits, with a little loss of generality, the number of possible kernel combinations to make the computation feasible in practice. Basically this means that the differentiation and integration kernel differ only in size and not in shape. The second moment matrix M(x, Σ_I, Σ_D) can now be used to transform the local structure into an isotropic structure with

\[ \mathbf{x} = M^{-\frac{1}{2}} \mathbf{x}' \tag{3.20} \]

where x′ is a point in the original anisotropic neighborhood and x is a point in the normalized isotropic neighborhood. The transformation matrix is the inverse matrix square root of the second moment matrix M of the local structure at the point x′. For two points x_L in a left image and x_R in a right image which are related by an affine transformation x_R = A x_L, a relation between x_L and x_R can be derived which relates both points in terms of the second moment matrices:

\[ M_R^{\frac{1}{2}} \mathbf{x}_R = R \, M_L^{\frac{1}{2}} \mathbf{x}_L \tag{3.21} \]

This relation is determined up to an arbitrary rotation R. The task of shape adaptation is now to estimate the second moment matrix M which transforms the local neighborhood into an isotropic structure. The eigenvalues of the second moment matrix can be interpreted as a measure of isotropy. Equal eigenvalues indicate an isotropic structure. The ratio of the eigenvalues then gives a normalized measure for the isotropy:

\[ Q = \frac{\lambda_{min}}{\lambda_{max}} \tag{3.22} \]

The value of Q lies in the range [0..1], where 1 indicates a perfectly isotropic structure. This measure is now used to evaluate the current estimate of the transformation matrix U which transforms a local structure into a perfectly isotropic one. The transformation matrix U is a concatenation of square roots of second moment matrices:

\[ U = \prod_k \left(M^{-\frac{1}{2}}\right)^{(k)} U^{(0)} \tag{3.23} \]

(M^{-1/2})^{(k)} is the inverse square root of the second moment matrix estimated in step (k) of the iterative algorithm, and U^{(0)} is the 2 × 2 identity matrix. For each iteration the second moment matrix is estimated, with the characteristic scale as initial value for Σ_I and Σ_D. The estimated transformation is then applied to the local neighborhood and U is updated. In the next step the second moment matrix is estimated again and the transformation is applied again. This is iterated until the measure Q of the second moment matrix is close to 1, that means the structure is almost isotropic. The sequence of transformations U is then used to represent the elliptic shape of the detection. The algorithm converges fast; usually fewer than 10 iterations are necessary.
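The iterative shape adaptation loop can be sketched independently of the image-warping machinery. In the sketch below, second_moment(U) is an assumed placeholder callback that warps the local neighborhood of the interest point with U and evaluates Eq. (3.19) on it.

```python
import numpy as np

def isotropy(M):
    """Q = lambda_min / lambda_max of a 2x2 second moment matrix (1 = isotropic)."""
    eigvals = np.linalg.eigvalsh(M)          # ascending order
    return eigvals[0] / eigvals[1]

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive definite 2x2 matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def affine_shape_adaptation(second_moment, q_target=0.95, max_iter=10):
    """Iterate U <- M^{-1/2} U until the local structure is (almost) isotropic.

    `second_moment(U)` is an assumed callback returning the 2x2 second moment
    matrix of the neighborhood warped by the current transformation U."""
    U = np.eye(2)
    for _ in range(max_iter):
        M = second_moment(U)
        if isotropy(M) >= q_target:
            break
        U = inv_sqrt(M) @ U
        U /= np.sqrt(np.linalg.det(U))   # keep the overall scale of the transformation fixed
    return U   # maps the anisotropic neighborhood to the isotropic (circular) frame
```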

Figure 3.8(a) shows examples for the Harris-Affine detector. Each detection is visualized by its center point (yellow cross) and its associated elliptical measurement region. For more details, including implementation details, the interested reader may be referred to [73].



3.3.2 Affine-invariant Hessian detector

The affine-invariant Hessian detector is very similar to the previously described Harris-Affine detector. It is also known as Hessian-Affine detector and described in [71]. The detection algorithm is basically identical to the Harris-Affine detector, with the only exception that the initial interest points are identified with the Hessian-Laplace method instead of the Harris-Laplace method. The Hessian-Affine detector produces very similar results to the Harris-Affine detector, which is not very surprising as the algorithms are almost identical. Figure 3.8(b) shows examples for the Hessian-Affine detector. Each detection is represented by its center point (yellow cross) and its associated elliptical measurement region.

Figure 3.8: Detection examples for affine invariant Harris and Hessian detectors on "Group" scene. (a) Harris-Affine detector. (b) Hessian-Affine detector.

3.3.3 Maximally stable region detector (MSER)

The MSER detector is a currently very popular affine invariant region detector developed by Matas [70]. The concept of the detector is very different from the previously described detectors. One of the biggest differences is that the measurement region can be of arbitrary shape. The MSER region is defined by its border pixels, a connected set of pixels. The border pixels and all pixels inside constitute the MSER region. An MSER region is a part of an image, delineated by a boundary, where all pixels inside the boundary are either brighter or darker than the pixels outside the boundary. Such image regions have a variety of interesting, favorable properties. First, the region definition is unaffected by monotonic changes of image intensities. The region is defined only by a relative ordering of the intensities. Common models for photometric changes, like a change in illumination, will not affect the detection of the interest regions. The most important property, however, is that the definition is invariant to continuous geometric transformations. A connected set of pixels will again be transformed into a connected set of pixels by a continuous geometric transformation. Rotation, scale change and plane-perspective transformations will not influence the repetitive detection of an MSER region. This is an enormously valuable property when dealing with wide-baseline scenarios.

The MSER detection algorithm is related to thresholding, which is already anticipated by the definition of an MSER region. In terms of thresholding the algorithm can easily be defined as follows. Imagine all possible thresholdings of a gray-level image; for an 8-bit gray-level image we have the 256 thresholds t_0 < t_1 < ... < t_255. Let the thresholded binary images show white pixels if the gray-value is higher than the threshold and black otherwise. Now imagine a movie showing the binary images, starting with the one computed with t_0 and the others following in increasing order. The first frame will be completely white, but soon black regions will appear and grow with increasing thresholds. Some of the appearing black regions will stay stable for a series of thresholds, and these regions are the ones to be detected. Maximally stable regions are the ones whose area does not change for a certain number of thresholds. These image regions will be reported by the algorithm. With increasing threshold, initially distinct regions will merge and eventually create another stable region out of two others. Generally the algorithm might produce nested detections on different scales. Referring back to the definition of the MSER regions, the above described thresholding method will detect regions where all the inside pixels have a higher gray-value than the boundary pixels. The other variant of the MSER regions can be computed by reversing the order and starting with the binary image for the highest threshold t_255. This will produce the regions where the pixels inside the border show lower gray-values than the outside pixels. An efficient implementation, however, will not compute the single binary images. Instead a sorted list of pixel values is created, similar to a histogram but including the pixel locations. For an image with n pixels this can be done in O(n) time. Now connected components at each level have to be detected and maintained over all different levels. The change in area for the identified regions has to be computed to identify stable regions. This can efficiently be solved by using the union-find algorithm [97] as proposed in [70]. The union-find algorithm then determines the overall complexity of O(n log log n). The described algorithm returns an MSER region as a set of connected pixels. A lot of applications, however, e.g. epipolar geometry estimation, need a single point location associated with each detection. For the MSER detection this can be done by computing the center of gravity (COG) of the MSER pixels. As shown in [85] the COG of an MSER region is invariant to affine transformations, in the sense that the COGs of two MSER regions connected by an affine transformation are connected by the same affine transformation. The COG, however, is not localized on a special image feature like a corner or an edge feature; thus the localization accuracy is determined by the pixel set of the MSER region only.
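For illustration only, the following naive sketch mimics the thresholding view of MSER detection (dark-on-bright variant) by labelling connected components for a sweep of thresholds and flagging components whose area stays nearly constant. It does not use the efficient union-find formulation described above and may report duplicate regions across neighboring thresholds; all parameter values are assumptions.

```python
import numpy as np
from scipy.ndimage import label

def naive_mser(img, delta=5, step=4, max_variation=0.25):
    """Very simplified MSER sweep for illustration.

    For every threshold t the connected components of {I <= t} are labelled and
    the area of the component containing each pixel is recorded.  A pixel marks a
    maximally stable region at t if the relative area change over +-delta
    thresholds is small."""
    img = np.asarray(img)
    thresholds = np.arange(0, 256, step)
    areas = np.zeros((len(thresholds),) + img.shape, dtype=np.int64)
    labels_per_t = []
    for i, t in enumerate(thresholds):
        lab, _ = label(img <= t)
        counts = np.bincount(lab.ravel())
        areas[i] = counts[lab]          # area of the component each pixel belongs to
        areas[i][lab == 0] = 0          # pixels above the threshold have no component
        labels_per_t.append(lab)
    regions = []
    d = max(delta // step, 1)
    for i in range(d, len(thresholds) - d):
        a = areas[i].astype(np.float64)
        with np.errstate(divide="ignore", invalid="ignore"):
            variation = (areas[i + d] - areas[i - d]) / a
        stable = (a > 0) & (variation < max_variation)
        for comp in np.unique(labels_per_t[i][stable]):
            if comp != 0:
                regions.append(np.argwhere(labels_per_t[i] == comp))
    return regions   # each entry: pixel coordinates of one (possibly duplicated) region
```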

The MSER detection algorithm is not only theoretically efficient but also really fast in an actual implementation. Current implementations allow a real-time detection of MSER regions at about 25 frames per second. This behavior makes the MSER detector extremely interesting for the development of computer vision systems. Figure 3.9 shows examples for the MSER detector. In Figure 3.9(a) the detections are visualized by ellipses fitted to the MSER detections. Figure 3.9(b) shows the original contour based representation of the detections.

Figure 3.9: Detection example for MSER regions on "Group" scene. (a) Ellipse representation. (b) Original contour based representation.

3.3.4 Affine-invariant salient region detector

The affine-invariant salient region detector is a straightforward extension of the salient region detector presented in the previous section. The extension has been proposed by the original authors in [53]. The affine invariance is gained by an affine shape adaptation of the detections of the original salient region detector. The algorithm works by first detecting scale invariant salient regions. Starting with a circular salient region, the shape adaptation allows transforming the original region to an elliptical shape. The goal is to adapt the shape so that the saliency measure Y is maximal. The elliptical shape is parameterized by 3 values: the scale s, the ratio of the major to the minor axis ρ and the orientation of the major axis θ. The length of the major axis of the ellipse is defined by √(s/ρ) and the minor axis by sρ. This 3-vector replaces the previously scalar scale parameter s. Let us remember that the absolute saliency measure is defined by Y = H(w)W(w). W(w) is the inter-scale unpredictability and is to be maximized by the shape adaptation. The axis ratio ρ and the orientation θ are changed until a maximum value for W(w) has been found. After the orientation and axis ratio of the ellipse have been fixed, the scale is varied to find again the peak of H(w), now for the elliptically shaped region.

As with many other affine invariant detectors, the detection is parameterized as a center point and an elliptical measurement region around the detection. Different to others, the ellipse is, however, not parameterized by the second moment matrix (gained from the gray-value distribution) but by the orientation θ and the major and minor axes. There is a difference therein which is worth discussing. The shape adaptation follows the idea initially presented by Baumberg [6], where the second moment matrix of the gray-values from the measurement region of the detection is used to remove an arbitrary affine transformation. The transformation specified by the second moment matrix accounts for the different modalities of the affine transform, that is rotation, scale change in x and y direction and shear (a possible translation is assumed to be removed already). In the shape adaptation used for the salient regions, however, the transformation is not computed from the second moment matrix but from the ellipse parameters found through shape adaptation. The transformation includes orientation and scaling in x and y direction (from the axis ratio) but not the shear! Parameterizing an affine transform from such an ellipse description cannot produce a transformation containing shear. There is no way, using this method, to normalize two regions differing by an affine transformation containing shear into the same canonical coordinate frame. This is a big limitation of the proposed method and unfortunately it is not discussed in their paper.

Another issue to be discussed is the local search strategy for the optimal shape. The authors propose a brute-force strategy which is computationally expensive. In addition, the tested ellipse parameters are discrete. It is not clear from the paper if the method will really find the optimal shape with the brute-force approach, and theoretical considerations about the convergence are not given.

A last point concerns the practical application of the method. The method contains a lot of computationally expensive steps: histogram computation for different scales, clustering, brute-force shape adaptation. The method is inherently very slow. For most practical applications running on state-of-the-art computers the method is in fact too slow. Examples for the affine salient region detector are depicted in Figure 3.10. Each affine salient region is visualized by its center point (yellow cross) and its associated elliptical measurement region.

Figure 3.10: Detection example for affine invariant salient region detector on "Group" scene.

3.3.5 Intensity extrema-based region detector (IBR)

Intensity extrema-based regions were first introduced by Tuytelaars and Van Gool [113]. The detector selects anchor points using a gray-value intensity criterion and then identifies a region border around the anchor point in an affine invariant way. The resulting regions show in general arbitrary shapes around a blob-like homogeneous anchor point. The first step of the algorithm is the detection of anchor points. Unlike previous methods, the algorithm does not use a corner or edge detector. Instead, image locations which show a local intensity extremum are used. For this, in a first step the image I is smoothed to remove image noise, e.g. with a Gaussian filter. Local intensity extrema are then identified by non-maximum suppression. Due to the smoothing the intensity extrema do not show a strong peak and therefore are weakly localized; however, without smoothing a lot of anchor points would be produced because of image noise. The identified anchor points are invariant to monotonic intensity transformations. In a second step a region delineation is searched in an affine invariant way for every detected anchor point.

Searching for a border works by emanating rays from the center of the anchor point. The rays are distributed uniformly around the full 360°. Along each ray the intensity profile gets analyzed to find a characteristic gray-value change which is invariant to an affine transform. The function f_I(t) evaluated for each ray is defined by

\[ f_I(t) = \frac{|I(t) - I_0|}{\max\left( \dfrac{\int_0^t |I(t) - I_0| \, \mathrm{d}t}{t}, \; d \right)} \tag{3.24} \]

where t is the distance of the current evaluation position on a ray from the anchor point, I(t) is the intensity value along the ray at distance t, I_0 is the intensity value of the anchor point, and d is a small number which prevents a division by zero. f_I(t) typically shows a maximum when the intensity along the ray is changing significantly compared to the average changes along the ray.

For instance this will happen if the ray crosses the border of a rather homogeneous image region. The function f_I(t) is chosen to produce easily detectable extrema on intensity changes. It would be possible to detect the extrema in the plain intensity function I(t) along the rays, which would in theory be affine invariant as well. However, the extrema in I(t) are shallow and not as stable as for f_I(t). In the case that the global extremum along the ray does not significantly differ from other local extrema, the extremum is selected which is located at a similar distance as the ones from the neighboring rays. After analyzing all rays, the border of the region is given by a distinct set of point locations around the anchor point. By connecting the distinct points and computing the convex hull a possible region delineation is created. Another possibility, which is also preferred by the original authors, is to fit an ellipse to the distinct points. The ellipse then defines the region border. The ellipse parametrization provides a simpler handling of the regions for subsequent matching tasks. It is important to note that the ellipse fitting creates an ellipse which is not necessarily centered around the original anchor point. The original anchor point is then replaced by the computed ellipse center in the region description. Figure 3.11(a) shows example detections for the IBR detector. Each detection is visualized by its center point (yellow cross) and its associated elliptical measurement region.
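A sketch of evaluating f_I(t) of Eq. (3.24) along a single ray is given below; nearest-neighbor sampling, the fixed maximum ray length and the role of eps as the constant d are simplifications.

```python
import numpy as np

def ray_profile_extremum(img, anchor, direction, max_len=50, eps=1e-6):
    """Evaluate f_I(t) of Eq. (3.24) along one ray; return the distance of its maximum."""
    img = np.asarray(img, dtype=np.float64)
    y0, x0 = anchor
    I0 = img[y0, x0]
    dy, dx = np.asarray(direction, dtype=np.float64) / np.linalg.norm(direction)
    f = []
    running_sum = 0.0
    for t in range(1, max_len):
        y, x = int(round(y0 + t * dy)), int(round(x0 + t * dx))
        if not (0 <= y < img.shape[0] and 0 <= x < img.shape[1]):
            break
        diff = abs(img[y, x] - I0)
        running_sum += diff                        # approximates the integral of |I(t)-I0|
        f.append(diff / max(running_sum / t, eps)) # eps plays the role of d in Eq. (3.24)
    return int(np.argmax(f)) + 1 if f else None    # t with the strongest relative change
```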

For an implementation of this algorithm a proper value for the angle between two neighboring rays has to be set, which trades off speed against the accuracy of the region border. Using a high number of rays it will take longer to evaluate the intensity profiles, but the region border will be approximated more accurately. A small number of rays will be faster but will provide a poorer approximation of the region border. Another implementation issue is the sampling of the intensity profiles. As the intensity profiles are sampled in different directions, a proper interpolation or smoothing will be necessary. Both issues are not addressed by the original authors.

As a final remark I would like to mention one specific property stated by the authors in [113]. As the used anchor points are not corner points, the chance that the region is located on a 3D corner is much smaller. Regions located on 3D corners are much more complicated to match than planar regions.

3.3.6 Edge based region detector (EBR)

The edge based region detector (EBR) has been described by Tuytelaars et al. in [111]. As the method is based on geometric constraints it is also known as a geometry-based method. The method exploits the fact that an image corner usually appears where two image edges meet. The image corner and the two edges are then used to define an affine invariant region. In a first step of the algorithm, corners and edges have to be detected in the image. The authors propose to use the Harris corner detector [40] to detect the anchor points for the algorithm. For edge detection the authors propose to use the Canny edge detector [15]. As corner and edge detection are performed by different methods, it is not guaranteed that the corner is located exactly at the intersection of the edges. This is, however, not a necessary criterion for the region detection. The method described in the following works on non-straight lines; for straight lines a special adaptation of the method will be described afterwards. The method works by constructing parallelograms from the corner point p and points p_1 and p_2 located on each edge.

The parallelogram construction is driven by an affine invariant. The functions l_1 and l_2 are relative affine invariants:

\[ l_1 = \int \left| \det\!\left( \frac{\mathrm{d}p_1(s_1)}{\mathrm{d}s_1} \quad p - p_1(s_1) \right) \right| \mathrm{d}s_1 \tag{3.25} \]

\[ l_2 = \int \left| \det\!\left( \frac{\mathrm{d}p_2(s_2)}{\mathrm{d}s_2} \quad p - p_2(s_2) \right) \right| \mathrm{d}s_2 \tag{3.26} \]

The ratio l_1/l_2 is an absolute affine invariant, and the association of a point on the one edge with a point on the other edge is also affine invariant. Two points p_1 and p_2 are associated when l_1 = l_2.

We will denote this relation simply as l. The points p_1 and p_2 are then parameterized by a single parameter l, which ensures a family of affine invariant parallelogram constructions. Now certain photometric properties of the pixels inside the defined parallelograms are evaluated, and the parallelogram constructions yielding an extremum of the photometric properties are reported as affine invariant regions. The following functions on the pixels inside a parallelogram can be used for this task.

\[ f_1(\Omega) = \frac{1}{|\Omega|} \sum_{\Omega} d_i \tag{3.27} \]

The function f_1 represents the average intensity over the region of the parallelogram Ω. Ω is the set of all pixels inside the parallelogram, d_i is the intensity of a single pixel and |.| denotes the cardinality. Note that the average intensity itself is not invariant to affine photometric and geometric changes, but an extremum of the average intensities over a family of parallelograms is. The goal is therefore to identify the parallelogram construction which shows an extremum in the average intensity function f_1. The function f_2 represents an absolute affine invariant:

\[ f_2(\Omega) = \frac{|p - q \quad p - p_g|}{|p - p_1 \quad p - p_2|} \tag{3.28} \]

The function f_2 is a ratio of areas depending on the center of gravity p_g. q is the corner of the parallelogram opposite to the point p and is defined as q = p_1 + p_2 - p. Although f_2 is an absolute affine invariant, in practice the best results are obtained when searching again for extrema of the function. In further work by Tuytelaars [112] two additional evaluation functions are introduced.
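A sketch of evaluating f_1 and f_2 for one parallelogram follows. The interior is sampled on a regular grid, and p_g is taken here as the intensity-weighted center of gravity, which is an assumption since the weighting is not specified above.

```python
import numpy as np

def parallelogram_functions(img, p, p1, p2, n_samples=40):
    """Evaluate f1 (mean intensity) and f2 (area ratio of Eq. 3.28) for the
    parallelogram spanned by the corner p and the edge points p1, p2."""
    img = np.asarray(img, dtype=np.float64)
    p, p1, p2 = (np.asarray(v, dtype=np.float64) for v in (p, p1, p2))
    q = p1 + p2 - p                               # corner opposite to p
    # sample the interior: x = p + a*(p1-p) + b*(p2-p) with a, b in [0, 1]
    a, b = np.meshgrid(np.linspace(0, 1, n_samples), np.linspace(0, 1, n_samples))
    pts = p + a[..., None] * (p1 - p) + b[..., None] * (p2 - p)
    rows = np.clip(np.round(pts[..., 0]).astype(int), 0, img.shape[0] - 1)
    cols = np.clip(np.round(pts[..., 1]).astype(int), 0, img.shape[1] - 1)
    intensities = img[rows, cols]
    f1 = intensities.mean()
    # intensity-weighted center of gravity (assumed weighting)
    pg = (pts * intensities[..., None]).sum(axis=(0, 1)) / intensities.sum()
    area = lambda u, v: abs(u[0] * v[1] - u[1] * v[0])   # |det(u v)|
    f2 = area(p - q, p - pg) / area(p - p1, p - p2)
    return f1, f2
```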

Let us now discuss the case of straight lines. Straight lines emanating from a corner point occur frequently in images, thus this case cannot be neglected. For straight lines the functions l_1 and l_2 yield l_1 = l_2 = 0. Thus it is not possible to use the relation l_1 = l_2 to associate points on one edge with points on the other edge. It is therefore necessary to construct parallelograms for all combinations of both edge points. This gives a 2-dimensional search space in the parameters s_1 and s_2. However, as shown in [111], a single function does not give a well localized extremum but a valley. By simultaneous evaluation of two functions, say f_1 and f_2, two valleys are created which intersect each other. The intersection of the valleys then defines the parameters of the parallelogram reported as affine invariant region.

The edge based regions differ from the regions of other detectors as they are not centered around the initial anchor point. Instead the anchor point is located at one corner of the parallelogram-shaped region. It would be possible to extend the parallelogram in a way that the anchor point is located at the intersection of the diagonals. But this would enlarge the initial detection, and as corners are very often located on depth discontinuities, it increases the chance that the enlarged region is located on a depth discontinuity, which is not a desired property for region matching. In [112] the authors also describe fitting an ellipse to the parallelogram-shaped regions to create a representation similar to other detectors in order to compare the performance of different detectors. Figure 3.11(b) shows examples for such detections, where each EBR region is represented by its center point (yellow cross) and its associated elliptical measurement region.

Figure 3.11: Detection examples for affine invariant detectors on "Group" scene. (a) Intensity based regions. (b) Edge based regions.

3.3.7 Normalization

All of the previously described methods allow representing the detections as a center point associated with an elliptical measurement region. The elliptical measurement region is given by an estimate of the affine second moment matrix. This representation is natural for the Harris-Affine and Hessian-Affine methods and the affine invariant salient region detector. The regions of the MSER, EBR and IBR detectors are originally represented differently; however, they can be represented as ellipses as well, although this generally causes the loss of some information. The following normalization method is based on the elliptical shape representation using the affine second moment matrix. For the EBR and IBR regions the authors did not provide their own normalization method, therefore this method applies as well. For the MSER regions the original authors propose a normalization based on a local affine frame (LAF) [85], which will also be outlined in this section.

Normalization of the elliptical point detections works by re-sampling the region area into a canonical isotropic coordinate system. The necessary affine transformation is directly given by the inverse square root of the second moment matrix. The points within the measurement region can be transformed into the canonical representation by

\[ x_c = M^{-\frac{1}{2}} x \tag{3.29} \]

where x is the pixel location in the original coordinate system, x_c is the pixel location in the canonical coordinate system and M is the corresponding affine second moment matrix. Please note that the transformation M^{-1/2} assumes the center point of the detection to be the origin of the transformation coordinate system. Normalization with the second moment matrix results in a circular image region, where the different scalings and the shear are removed. However, the patch is arbitrarily rotated. For image matching a rotation invariant descriptor has to be used, or the additional rotation normalization as described for scale-invariant regions above has to be applied. The normalization is illustrated in Figure 3.12. An original isotropic local structure is transformed using two different affine transformations. The isotropic structure is represented by two orthogonal lines. Figures 3.12(a),(b) show the initial detections using the Harris-Affine detector. The elliptical measurement region is represented using the second moment matrix. Figure 3.12(c) shows the normalization of the detection in Figure 3.12(a), and Figure 3.12(d) the normalization of the detection in Figure 3.12(b). The isotropic structure (visualized by the two lines) has been reconstructed nicely by the normalization; the original orthogonality has been recovered. However, the normalized detections still differ by an arbitrary rotation.
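To make the normalization step concrete, the following minimal sketch resamples a detection into the canonical isotropic frame of Eq. (3.29). It is an illustration only, not the implementation used in this work; the patch radius and the handling of the overall scale factor that maps the measurement ellipse to a fixed patch size are assumptions, and the remaining rotation ambiguity is, as discussed above, left unresolved.

import numpy as np
from scipy.linalg import sqrtm
from scipy.ndimage import map_coordinates

def normalize_region(image, center, M, radius=20):
    """Resample an elliptical detection into a canonical isotropic patch.

    image  : 2D gray-value array
    center : (x, y) location of the detection
    M      : 2x2 affine second moment matrix of the detection
    radius : half-size of the canonical patch in pixels (a free choice here)
    """
    # Regular grid of canonical coordinates x_c centered at the origin.
    coords = np.arange(-radius, radius + 1, dtype=float)
    xc, yc = np.meshgrid(coords, coords)
    pts_c = np.stack([xc.ravel(), yc.ravel()])        # 2 x N

    # Invert Eq. (3.29): x = M^{1/2} x_c maps canonical points into the image.
    A = np.real(sqrtm(M))
    pts = A @ pts_c
    xs = pts[0] + center[0]
    ys = pts[1] + center[1]

    # Bilinear sampling of the gray values (map_coordinates expects rows, cols).
    patch = map_coordinates(image, [ys, xs], order=1, mode='nearest')
    return patch.reshape(2 * radius + 1, 2 * radius + 1)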

For the regions of the MSER detector the normalization can be done using a so-called local affine frame (LAF) [85]. The basic idea is to identify 3 points which are invariant to an affine transformation. These three points define the axes of a coordinate system which represents the LAF. The points can then be used to parameterize an affine transformation to a canonical coordinate system in which the axes are orthogonal and of equal length in both directions. Normalization is done by applying the affine transform constructed in this way to the detection. The normalized patches will be perfectly aligned; also the orientation will be recovered. The critical point of this method is, however, the identification of the affine invariant points within the measurement region. This is possible for MSER regions because the region is represented by its contour. The first invariant point is the center of gravity (COG) of the detected region. It is shown in [85] that the COG of an MSER region is invariant to an affine transformation. Further points are topological extremal points of the region's contour. Such extremal points are invariant to affine transformations. With the COG and two additional contour points a LAF can be constructed and used for normalization. In [85] several possible methods to create LAFs for MSERs are described.
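As an illustration of the LAF idea, the following sketch constructs the affine transformation that maps a frame given by the COG and two contour points to a canonical unit frame. The choice of the canonical frame and the function name are assumptions of this sketch; [85] describes several refined ways to select the contour points.

import numpy as np

def laf_to_canonical(cog, p1, p2):
    """Affine map taking a local affine frame (COG plus two contour points)
    to a canonical frame with origin (0, 0) and axis endpoints (1, 0), (0, 1)."""
    cog, p1, p2 = (np.asarray(v, dtype=float) for v in (cog, p1, p2))
    B = np.column_stack([p1 - cog, p2 - cog])   # frame axes as matrix columns
    A = np.linalg.inv(B)                        # maps the frame axes to the unit axes
    t = -A @ cog                                # maps the COG to the origin
    return A, t                                 # x_canonical = A @ x + t

Warping the region with (A, t) aligns the patches of corresponding MSERs including their orientation, as described above.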

3.4 Comparison of the described methods

As the long list of detectors described above shows, there exists a vast variety of different methods. Each method has its pros, cons and peculiarities. In this section we present a table (Table 3.1) comparing the properties of the most important local detectors against each other. The table is based on the extensive evaluation performed by Mikolajczyk and Schmid [72–74, 76] and on the publicly available implementations of the detectors used in



Figure 3.12: Example for normalization; an isotropic local structure is transformed using two different affine transformations. (a)(b) Initial detections using the Harris-Affine detector; the elliptical measurement region is represented using the second moment matrix. (c) Normalization of the detection in (a). (d) Normalization of the detection in (b). The isotropic structure (visualized by the two lines) has been reconstructed nicely by the normalization. However, the normalized detections still differ by an arbitrary rotation.

the evaluation.¹ The table is useful when one needs to select a proper method for an application or wants to get a general overview of the performance of state-of-the-art methods.

The table contains ratings for invariance, the number of detections, the repeatability score, the matching score, the speed of the method and an overall rating based on a combination of the other ratings. In the following, the ratings and the terminology used in the table are described.

Invariance: In this column the detector's invariance to a class of transformations is given. The detectors are classified into three groups: no invariance ('none'), invariant to scale change ('scale') and invariant to affine transformations ('affine'). One method is rated with 'affine*' because the detector is not fully invariant to an affine transformation; for more details see the description of the detector in the previous section.

Number of detections: The number of detections is quite different for the various detectors. Although for most methods the number of detections depends on the parameter settings, each method has a quite characteristic number of useful detections. The detection number is classified qualitatively into four categories: low, medium, high, very high. The rating 'low' corresponds to about 100 detections, whereas 'very high' corresponds to several thousand detections.

¹ Implementations were collected by Krystian Mikolajczyk and are available at http://www.robots.ox.ac.uk/~vgg/research/affine/

Repeatability: The repeatability score is an important quality criterion for a local detector. It has been introduced in [72]. The repeatability scores published in [72–74, 76] have been used to rank the detectors. The different values have been qualitatively divided into four categories: low, medium, high, very high.

Matching score: The matching score is also a measure introduced in [72] and is used in combination with the repeatability score. The scores published in [72–74, 76] have been used to rank the detectors based on their matching properties using the SIFT descriptor [66]. The different values have been qualitatively divided into four categories: low, medium, high, very high.

Speed: The detection speed is very relevant for building practical applications. The speed has been evaluated using the publicly available implementations and is divided into five categories: very slow, slow, medium, fast, very fast. Methods rated 'fast' or 'very fast' can achieve real-time frame rates.

Overall rate: The overall rate assesses the usefulness of the different methods for practical applications. The rating is based on the evaluations but also reflects our personal experience with the different methods. It is divided into four categories: bad, ok, good, very good.

Detector                invariance   number of     repeat.   matching   speed       overall
                                     detections              score
Harris                  none         very high     high      low        very fast   ok
Hessian                 none         very high     high      low        very fast   ok
Harris-Laplace          scale        medium        high      medium     medium      ok
Hessian-Laplace         scale        medium        high      medium     medium      ok
DOG                     scale        medium        high      medium     fast        very good
Salient region          scale        low           low       low        very slow   bad
Harris-Affine           affine       medium        high      high       medium      good
Hessian-Affine          affine       medium        high      high       medium      good
MSER                    affine       low           high      high       fast        very good
Affine salient region   affine*      low           low       low        very slow   bad
IBR                     affine       low           high      high       slow        ok
EBR                     affine       low           medium    medium     slow        ok

Table 3.1: Comparison of the properties of different local detectors. The ratings are based on the evaluations in [72–74, 76]. Please see the text for a description of the different properties.


Chapter 4

Evaluation on non-planar scenes¹

From the previous chapter we already know that there exists an astonishing variety of different local detectors. Each method is based on different image features and in most cases was developed to perform well on a specific set of image data, mostly driven by the application. The development of a new method is then justified by achieving a better performance compared to previous methods. Thus it is quite common to compare a new method with current state-of-the-art methods. One example of this procedure is the work of Carneiro and Jepson [16]. They present a new local detector, so-called phase-based local features, and compare this method to the Harris-Laplace detector [72] and the DoG detector [66]. Although the testing is extensive, the new method is not compared to all state-of-the-art detectors, mainly because this would involve a big effort to gather implementations of all detectors and to put them into a common framework.

Nevertheless, this task was pursued by Mikolajczyk and Schmid. With considerable effort they collected implementations of most state-of-the-art detectors and put them into a common evaluation framework. They managed to obtain the implementations from the original authors themselves to assure that the compared algorithms are the most efficient versions. The test results as well as the evaluation methods are published in [71, 74]. For measuring the performance of the detectors a repeatability score and a matching score are evaluated. A local detector is assumed to be good if it produces interest points and regions repetitively at the same locations on an object, independent of acquisition conditions like viewpoint, illumination and scale changes. The evaluation of the repeatability of local detectors in the case of a viewpoint change needs an automatic procedure for ground truth generation. Obviously this cannot be done by matching because every known method will introduce mis-matches or will miss corresponding regions. Nevertheless, the ground truth can be established by geometric means. On planar surfaces a homography can be estimated. By using this homography it is possible to check whether an interest point or region on a planar patch will occur in an image from a different viewpoint at the same location. The homography describes the geometry of the test scene. It acts as ground truth and has to be verified for each plane manually, but it allows an automatic verification of all the interest point correspondences in the scene.

¹ Based on the publications:
F. Fraundorfer and H. Bischof. Evaluation of local detectors on non-planar scenes. In Proc. 28th Workshop of the Austrian Association for Pattern Recognition, Hagenberg, Austria, pages 125–132, 2004 [32]
F. Fraundorfer and H. Bischof. A novel performance evaluation method of local detectors on non-planar scenes. In Workshop Proceedings Empirical Evaluation Methods in Computer Vision, IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, 2005 [33]

But using a plane-to-plane homography limits the possible test cases to planar scenes only. Because of this limitation it is questionable whether the results of previous detector evaluations will hold for realistic, non-planar scenes, especially for changing viewpoints. If interest regions are primarily detected on depth discontinuities, their appearance changes significantly when viewed from a different viewpoint, which will result in lower matching performance. Therefore the results of the detector evaluation may change considerably when using 3D scenes. Figure 4.1 shows an example for Hessian-Affine regions and MSER regions. A significant number of Hessian-Affine regions are located on depth discontinuities while the MSER detector seems to avoid such areas.

This motivates our approach to apply the evaluation of local detectors to complex, realistic, and practically relevant scenes. The basic idea enabling this extension is to exploit the properties of the trifocal geometry [44]. Instead of defining the ground truth for 2 images we propose to use 3 images of the scene. A fundamental property of the trifocal geometry allows the coordinate transfer of a point correspondence from 2 views into the third view. This transfer is not restricted to planes but is valid for arbitrary scenes. The proposed evaluation framework compares different local detectors according to 3 measures. A repeatability score measures the capability of local detectors to produce detections repetitively at the same locations in the presence of viewpoint changes. A matching score compares the descriptive and discriminative qualities of the detected regions. The matching score will also reflect the cases where a local detector tends to produce detections on depth discontinuities. As a last measure the absolute number of correct matches is introduced, which is interesting in object recognition, where a higher number of matches increases the robustness against partial occlusions, as well as in geometry estimation, where a higher number usually increases the accuracy.

Figure 4.1: (a) Hessian-Affine regions (a significant fraction of the detections is located on depth discontinuities). (b) MSER regions (no detections on depth discontinuities).



4.1 Measures

This section defines the measures which are used to evaluate the different local detectors. In the previous evaluations of Mikolajczyk and Schmid [71] a repeatability and a matching score were defined. To be comparable with the previous evaluations we chose to use the same measures. In fact, the repeatability and matching score capture the most important property of local detectors, their repeatability. Basically, local detectors are designed to select a subset of pixels of an image. The goal is that if the operator is applied to two images which show the same scene but differ by some transformation like scale change, rotation, translation or viewpoint change, the same subset of pixels is selected. In practice one gets two subsets which show some overlap. The pixels in the overlapping part can be said to be detected repetitively. The repeatability score thus assesses the number of repetitively detected pixel locations. Measuring the repeatability score is straightforward: one counts the number of detections which correspond. However, identifying the corresponding detections is the difficult task therein. It will be dealt with in detail in Section 4.3. The repeatability score obviously measures the most basic property of a local detector.

The matching score measures a property at the next higher level: it evaluates the quality of the detections. One needs to consider a complete framework for local appearance based methods. After the detection of interest regions, a matcher is applied to identify corresponding detections in the two images. Such a matcher builds a description of a detection from the gray-value characteristics around it. Matching then amounts to finding a similar feature vector in the other image. Matching heavily relies on discriminative descriptions, i.e. the descriptions of two different detections should be easy to distinguish. One prerequisite therefore is that the considered areas around the detections show a characteristic gray-value variance. The matching score assesses how many of the detections are correctly matched. The results of course may differ for matching schemes which use different descriptors, but this allows finding detector-descriptor pairs which in combination achieve the best performance.

In addition to the repeatability and matching score we extend the previous evaluation framework with a new measure, the complementary score. The complementary score comes from the idea of combining two or more of the available detectors. This has already been done, e.g. in the Video Google system of Sivic and Zisserman [99], where Harris-Affine regions are used alongside MSER regions. This resulted in a better recognition rate than using one of the detectors alone. This raises the fundamental question which of the detectors can be used in combination to increase the performance. We call two detectors complementary if their detections do not overlap and are located in different areas. This diversity is measured with the complementary score. Ideally one would use a combination of all available methods. However, most applications are time critical and would not allow the computation of all possible methods. Here an evaluation allows selecting the best detector combination for the specific application and the available computing time.

Let us start with the details of the repeatability score.

4.1.1 Repeatability score

The repeatability score r_i is a measure computed from two images. Let us assume an image sequence I_1, ..., I_n as illustrated in Figure 4.4, where the images are taken with increasing viewpoint change. The repeatability score is calculated for image pairs, where one reference image is paired with all the others to get a sequence of increasing viewpoint angle. The arising pair sequence is then I_1 ↔ I_2, I_1 ↔ I_3, ..., I_1 ↔ I_n. The repeatability score r_i for image I_i is the ratio of the number of point-to-point (region-to-region) correspondences between the reference image I_1 and I_i and the smaller number of points (regions) detected in one of the two images. Only points (regions) located in the part of the scene present in both images are taken into account. It is given in Eq. (4.1).

\[ r_i = r_{1i} = \frac{|C_{1i}|}{\min(|R_1|, |R_i|)} \tag{4.1} \]

R_i is the set of all detected regions in image I_i and |.| denotes the cardinality of a set. C_ij is the set containing all true region correspondences between the images I_i and I_j. C_ij contains only single correspondences, i.e. no element of R_i corresponds to more than one element of R_j.

4.1.2 Matching score

The matching score m_i is the ratio of the number of correct matches and the smaller number of regions detected in one of the two images. It is given in Eq. (4.2).

\[ m_i = m_{1i} = \frac{|M_{1i}|}{\min(|R_1|, |R_i|)} \tag{4.2} \]

M_ij is the set containing all detected true region matches between the images I_i and I_j. In addition we define a matching score related to the number of possible matches |C_ij| (see Eq. (4.3)).

\[ m_i = m_{1i} = \frac{|M_{1i}|}{|C_{1i}|} \tag{4.3} \]
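A minimal sketch of Eqs. (4.1)–(4.3), assuming the correspondence set C_1i and the match set M_1i have already been determined as described in Section 4.3:

def repeatability_score(C_1i, R_1, R_i):
    """Eq. (4.1): true correspondences relative to the smaller number of
    detections (only detections in the commonly visible scene part)."""
    return len(C_1i) / min(len(R_1), len(R_i))

def matching_score(M_1i, R_1, R_i):
    """Eq. (4.2): correct matches relative to the smaller number of detections."""
    return len(M_1i) / min(len(R_1), len(R_i))

def matching_score_relative(M_1i, C_1i):
    """Eq. (4.3): correct matches relative to the geometrically possible matches."""
    return len(M_1i) / len(C_1i)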

4.1.3 Complementary score

The complementary score c^n_i relates the number of correctly matched non-overlapping regions between two different viewpoints to the sum of all matching detections. It is given in Eq. (4.4).

\[ c^n_i = \frac{|M^1_i \cup M^2_i \cup \ldots \cup M^n_i|}{|M^1_i| + |M^2_i| + \ldots + |M^n_i|} \tag{4.4} \]

M^j_i is the set of correctly matched correspondences for detector type j between the images I_1 and I_i, and n is the number of combined detectors. The complementary score lies between 0 and 1. A complementary score of 0 means that there are no non-overlapping regions, i.e. the detectors produce the same regions. A complementary score of 1 states that the detections of the detectors are completely different. Thus a complementary score close to 1 reveals good detector combinations.
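A sketch of Eq. (4.4). It assumes that matched regions are represented by identifiers such that regions of different detectors which overlap map to the same identifier; how that identification is done follows from the overlap criterion of Section 4.3.

def complementary_score(match_sets):
    """Eq. (4.4): size of the union of the correctly matched region sets of all
    combined detectors divided by the sum of their individual sizes.

    match_sets : one set of region identifiers per detector, where overlapping
                 regions of different detectors share the same identifier.
    """
    union = set().union(*match_sets)
    total = sum(len(m) for m in match_sets)
    return len(union) / total if total else 0.0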

4.2 Representation of the detections

The variety of local detectors is vast and their results may differ substantially. However, most of the detectors represent their results in a similar manner. In most cases the detectors return a center location and a measurement area around the center. Most of the difference lies in the representation of the measurement area. In the case of simple interest point detectors only a location is returned for a detection. Scale invariant detectors commonly return a center location and a circular measurement region based on the center location. Most of the affine invariant detectors return a center location and an elliptical measurement region based on the center location. Within this framework we thus distinguish between two representations, a point representation (PR) and a region representation (RR). The point representation only contains the x and y coordinates of the detection; the region representation contains the x and y coordinates of the detection and an elliptical measurement region centered at the given location. For interest point operators the point representation is used. For scale invariant detectors and affine invariant detectors the region representation is used. A special case, however, is the MSER detector. It returns a measurement region whose shape cannot be described by an ellipse. In detail, the detector returns a point set which describes the outline of the border of the detection. As an approximation of the region shape the ellipse defined by the covariance matrix of the border pixels is used.
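A small sketch of this approximation, assuming the detector output is an N x 2 array of contour pixels:

import numpy as np

def mser_to_ellipse(border_points):
    """Approximate an MSER contour by the ellipse of its border pixels."""
    pts = np.asarray(border_points, dtype=float)
    center = pts.mean(axis=0)                 # ellipse center
    cov = np.cov(pts, rowvar=False)           # 2x2 covariance of the border pixels
    return center, cov                        # ellipse: (x-c)^T cov^{-1} (x-c) = const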

4.3 Detection correspondence

Detecting the correspondences which are necessary for the calculation of the previously described measures is done by geometric means: a detection in one image is projected into the other image. The two detections correspond if the projection from the first image and the detection in the second image are at the same location. Here we have to distinguish between the two representations of the detections. Let us consider the point representation first.

In the point representation we have a detection p = [x y] in image I and a detection q = [x y] in image I'. By geometric means we project the detection p into image I' and denote it by p'. We define that p and q correspond if the Euclidean distance between q and p' is smaller than a threshold t_p. According to previous evaluations [71], t_p is set to 1.5 pixels.

In the region representation the two detections p and q are ellipses. p' is p transferred into the image I'. In general p' is not an ellipse anymore; the transfer may change the shape of the ellipse p into a complex form depending on the 3D structure of the region. The correspondence of p and q is determined by checking whether the areas of q and p' overlap. We therefore calculate the overlap of both structures as follows:

\[ \mathrm{overlap} = \frac{q \cap p'}{q \cup p'} \tag{4.5} \]

The two detections p and q correspond if the overlap is higher than a threshold t_r. According to previous evaluations [71], t_r is set to 50%. How to calculate the intersection and union areas of p and q is outlined in the next section.

4.3.1 Transferring an elliptic region

For correspondence detection it is necessary to compute how an ellipse detected in image I is seen from the vantage point of the second image I' and where the ellipse is located in image I'. We refer to this as transferring an ellipse from image I to image I'. The result depends strongly on the underlying 3D structure of the scene. If the elliptic image structure lies on a plane in 3D, the corresponding pixel coordinates in the other image form an ellipse too. What is more, the shape can be calculated analytically if the geometric relations between both images (vantage points) are known (see Appendix A). In every other case the shape of the corresponding pixel coordinates changes according to the underlying 3D structure. In general it is not a conic anymore, and it is no longer possible to calculate the corresponding shape analytically. An approximation of the resulting shape can be computed by sampling the original ellipse border with a raster and transferring each point individually into the other image. By connecting all points in the same order as in the original image we get a sampled (i.e. polygonal) representation of the resulting shape. The area covered is then defined by transferring the pixel coordinates inside the ellipse.

Let us denote an ellipse detected in image I as E_1 and let E'_1 be the ellipse detected in I and transferred to the other image I'. Ellipse E'_2 is the ellipse detected in I'. When the parameter form of the ellipses E_1, E'_1 and E'_2 is known, the values necessary to calculate the overlap can be computed by pixel counting. The intersection area q ∩ p' is computed by counting the number of pixels which lie within both ellipses E'_1 and E'_2. The union area q ∪ p' is computed by counting all pixels which lie within either E'_1 or E'_2, without counting the pixels belonging to both ellipses twice. The overlap can be computed at arbitrary accuracy by choosing an accordingly fine pixel raster.

However, the parameter form of E'_1 is only known for the planar case. For the non-planar case the transfer result is only determined by a set of pixels of arbitrary shape.
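For the planar case the pixel counting described above can be sketched as follows. The ellipse parameterization (center c and matrix A with the ellipse given by (x - c)^T A (x - c) <= 1) and the raster step are assumptions of this sketch; detector output may need to be converted into this form first.

import numpy as np

def ellipse_overlap_by_counting(c1, A1, c2, A2, step=0.25):
    """Overlap of Eq. (4.5) for two ellipses by counting raster points."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    # generous common bounding box from the largest semi-axes
    r1 = 1.0 / np.sqrt(np.linalg.eigvalsh(A1).min())
    r2 = 1.0 / np.sqrt(np.linalg.eigvalsh(A2).min())
    lo = np.minimum(c1 - r1, c2 - r2)
    hi = np.maximum(c1 + r1, c2 + r2)
    X, Y = np.meshgrid(np.arange(lo[0], hi[0], step),
                       np.arange(lo[1], hi[1], step))
    pts = np.stack([X.ravel(), Y.ravel()], axis=1)

    def inside(c, A):
        d = pts - c
        return np.einsum('ni,ij,nj->n', d, A, d) <= 1.0

    in1, in2 = inside(c1, A1), inside(c2, A2)
    return (in1 & in2).sum() / (in1 | in2).sum()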

4.3.2 Calculating the overlap area from the point set representation

By point set representation we denote the case where an ellipse is represented by the set of points within its border. For the non-planar case a transfer is only possible with this representation. As stated before, after the transfer the original elliptical region may end up with an arbitrary shape. The projective transfer can introduce gaps in the resulting structure, and depth discontinuities may even split up the transferred structure into several pieces. Calculating an exact solution for the transferred area under these circumstances is not possible. We therefore calculate an approximation of the needed areas.

After transferring the point set of E_1 into I' we can assign the points to two sets. One set P contains all points which are located inside the ellipse E'_2 and the other set Q contains all other points. We can define a ratio r as

\[ r = \frac{|P|}{|P| + |Q|}. \tag{4.6} \]

The intersection area is approximated by the area of the convex hull of the set of transferred points P. This approximation also gives a good estimate if there is a significant scale change and the transferred points are spread out.

The area of the union is approximated as the sum of the area of the original ellipse E'_2 and the area represented by Q. The area of E'_2 can be calculated exactly. However, it is not possible to approximate the area of Q by the convex hull as done for P, because Q is not assumed to represent one connected structure. The area of Q is therefore estimated from the ratio r between the point sets P and Q.

\[ \mathrm{area}(Q) = \frac{(1 - r)\,\mathrm{area}(P)}{r} \tag{4.7} \]

\[ \mathrm{overlap} = \frac{\mathrm{area}(P)}{\mathrm{area}(E'_2) + \mathrm{area}(Q)} \tag{4.8} \]

Figure 4.2 illustrates the area approximation. The black ellipse is the ellipse E'_2, whose area can be calculated exactly. The red ellipse is the exactly transferred ellipse E'_1. The convex hull of the part of E'_1 which is located within E'_2 is drawn in blue. The blue and red crosses mark the pixel locations which represent the transferred ellipse E'_1 as a point set and which are used for the area approximation.
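The approximation of Eqs. (4.6)–(4.8) can be sketched as follows; the inputs (the transferred pixel set, a mask marking which transferred points fall inside E'_2, and the exact area of E'_2) are assumed to be available from the transfer and from the ellipse parameters.

import numpy as np
from scipy.spatial import ConvexHull

def approximate_overlap(transferred_pts, inside_mask, area_E2):
    """Overlap of Eq. (4.5) approximated from the point set representation."""
    transferred_pts = np.asarray(transferred_pts, dtype=float)
    inside_mask = np.asarray(inside_mask, dtype=bool)
    P = transferred_pts[inside_mask]
    n_P, n_Q = len(P), int((~inside_mask).sum())
    if n_P < 3:                                  # too few points for a hull
        return 0.0
    r = n_P / (n_P + n_Q)                        # Eq. (4.6)
    area_P = ConvexHull(P).volume                # in 2D, 'volume' is the hull area
    area_Q = (1.0 - r) * area_P / r              # Eq. (4.7)
    return area_P / (area_E2 + area_Q)           # Eq. (4.8)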


Figure 4.2: Illustration of the overlap area approximation.

4.3.3 Justification of the approximation

We give an experimental justification of the presented approximation method by transferring ellipses and comparing the approximated overlap with the true overlap. The relative error of the approximation was calculated for various overlap situations with varying viewing angle (from -45° to 45°) and with increasing scale factor (from 0.5 to 2). The various test cases are illustrated in Figure 4.3. Figure 4.3(a-d) shows the scale steps 0.5, 1, 1.5, 2. Figure 4.3(e-h) shows the viewpoint angles of -45°, -30°, 0°, 45°. The initially uniformly distributed point set gets perspectively distorted. Figure 4.3(i-l) shows various overlap scenarios. Table 4.1 summarizes the results. The approximation error increases with increasing viewing angle. Especially for large scale changes the approximation introduces high errors. However, such cases can be identified by analyzing the distribution of the transformed point set and can then be highlighted for manual inspection.

Figure 4.3: (a-d) Scale change from 0.5 to 2. (e-h) Viewpoint change from -45° to 45°. (i-l) Various overlap scenarios.


viewing angle [°]   error [%]   error [%]   error [%]   error [%]
                    scale 0.5   scale 1.0   scale 1.5   scale 2.0
-45                 7.4         10.4        20.1        26.2
-40                 8.1         7.9         12.7        19.8
-35                 8.2         7.6         8.7         15.0
-30                 7.7         7.7         7.8         11.0
-25                 9.1         7.4         7.3         7.9
-20                 8.9         7.1         7.3         7.1
-15                 6.3         7.1         6.7         7.1
-10                 3.5         6.2         6.7         6.7
-5                  4.6         5.8         6.4         6.3
0                   5.6         5.3         6.3         6.6
5                   5.4         5.6         6.4         6.3
10                  3.5         6.2         6.8         6.7
15                  5.5         7.3         6.7         7.1
20                  8.9         7.1         7.3         7.1
25                  9.1         7.4         7.3         7.9
30                  8.3         7.8         7.8         11.1
35                  8.2         7.6         8.8         15.0
40                  8.1         7.9         12.7        19.8
45                  7.4         10.5        20.1        26.2

Table 4.1: Overlap approximation error compared to the exact overlap for viewing angles from -45° to 45° and scale changes from 0.5 to 2.

4.4 Point transfer using the trifocal tensor

For non-planar scenes the pixel-by-pixel transfer of the ellipses in point set representation can be computed using the trifocal tensor. The trifocal geometry describes the relations between images taken from 3 different vantage points. That means in trifocal geometry there are 3 images, say I, I' and I''. Point locations are denoted in the same way, p, p' and p'', where p is a homogeneous vector containing the x and y coordinates, p = [x y 1]^T. The geometry between the 3 images is encapsulated by the trifocal tensor T, which can be estimated from point correspondences p ↔ p' ↔ p'' in the 3 images. The point transfer property allows computing the location of a matched pair of points p ↔ p' in a third view I'', provided the trifocal tensor between the three views is known (see Appendix B for details). This relation can be written as

\[ p'' = f(T, p \leftrightarrow p'), \tag{4.9} \]

where T is the trifocal tensor. Assume that we want to transfer the pixels of an ellipse from view I to I''. One consequence of this relation is that for each ellipse point in I we need to know the corresponding point in a second image I' to carry out the transfer. This can be achieved by establishing a dense matching between the images I and I', i.e. for every pixel location in I we know the corresponding location in I'. This allows transferring each location of I to I''. The entities needed for the point transfer, i.e. the trifocal tensor and the dense matching, are further denoted as ground truth.
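A sketch of the point transfer of Eq. (4.9) in the standard formulation of [44]: the point p'' is obtained by contracting the tensor with p and a line l' through p'. Choosing l' perpendicular to the epipolar line of p is the usual way to avoid a degenerate choice; passing the fundamental matrix F of the first two views in for this purpose is an assumption of this sketch.

import numpy as np

def transfer_point(T, p1, p2, F):
    """Point transfer p'' = f(T, p <-> p') of Eq. (4.9).

    T  : 3x3x3 trifocal tensor, indexed as T[i, j, k]
    p1 : homogeneous point [x, y, 1] in view I
    p2 : homogeneous point [x, y, 1] in view I'
    F  : fundamental matrix of I and I' (epipolar line of p1 in I' is F @ p1),
         used only to pick a non-degenerate line through p2
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    le = F @ p1                                   # epipolar line of p1 in I'
    # line through p2 perpendicular to the epipolar line
    lp = np.array([le[1], -le[0],
                   (le[0] * p2[1] - le[1] * p2[0]) / p2[2]])
    # p''^k = p1^i * l'_j * T[i, j, k]
    p3 = np.einsum('i,j,ijk->k', p1, lp, T)
    return p3 / p3[2]

Applying this to every pixel of an ellipse in I, with the corresponding pixel in I' taken from the dense matching, yields the transferred point set used in Section 4.3.2.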



4.5 Ground truth generation

The ground truth is the geometric information for the test images which is necessary to perform the ellipse transfer between two images. It is composed of a dense matching between two nearby images and the trifocal tensors between the image triplets used for evaluation. A typical evaluation scenario is illustrated in Figure 4.4. A 3D scene (or object) is imaged from various vantage points. It is convenient to move the camera on a circular path around the 3D object so that the viewpoint change can be annotated in degrees. One image has to be chosen as the reference image. An image close to the reference image should be used for the dense matching; it will serve as the intermediate image for the point transfer. The other images of the sequence can then be used for the evaluation. However, there are basically no geometrical restrictions on an image sequence to be used for evaluation. It is not necessary to acquire the test images with a special setup (e.g. a turn-table). Nothing more than the images themselves is necessary to create the ground truth data and to do the evaluation. That means it is possible to create evaluation ground truth for whatever images one obtains (e.g. downloads from the internet).

4.5.1 Trifocal tensor

The trifocal tensor encapsulates the geometry between three images. It is the analogue of the fundamental matrix of the two-view case. The trifocal tensor can be calculated from 7 point correspondences across the three images [44]. The calculation of the trifocal tensor from the point correspondences is straightforward; the difficult part, however, is the detection of the point correspondences. Due to the nature of the test cases, wide-baseline methods are needed to generate the point correspondences. In our evaluation framework point correspondences are automatically established by detecting MSER regions [70] and matching them using the SIFT descriptor [67]. For cases where this automatic method fails the correspondences must be selected manually.

4.5.2 Dense matching

For dense matching of two nearby images there exists a variety of algorithms [46, 58, 93, 102]. However, one special requirement for our dense matching is sub-pixel accuracy, such that the points of the reference image lie on the pixel raster and the points in the intermediate image are sub-pixel shifted to achieve the best correlation. Therefore we do not simply employ one of the standard algorithms but implement a dense matching which fits our needs. Our matching method is outlined in Algorithm 1.

Algorithm 1 Dense matching
Interest point detection and matching on low resolution images
Robust fundamental matrix estimation (RANSAC)
Image rectification
Initial iterative point matching (enforcing the epipolar constraint)
Upgrade to dense sub-pixel matching

The first step of the dense matching is to estimate the fundamental matrix, a necessary precondition for image rectification. Harris corners [40] are extracted and matched using template matching with normalized cross correlation. This is done on re-sampled lower resolution versions of the images, which speeds up the initial matching enormously. The detected point correspondences are used to calculate the fundamental matrix. The Gold standard method for fundamental matrix estimation is used [44]; it is robust against outliers and minimizes the re-projection error. The next step is to rectify the images. The projective rectification method proposed by Hartley is used [43]. The method is able to work with uncalibrated images; the prerequisites are the fundamental matrix and a small set of point correspondences. It works by factorizing the fundamental matrix and estimating a matching pair of image transformations which are applied to the images. The corresponding points have to be outlier-free and very accurate, as inaccurate point matches affect the algorithm badly. Thus only a subset of the initial point matches which fit the epipolar geometry best is selected for that step. The rectified images are then matched again with an iterative method [50]. The algorithm returns a 4×4 grid matching at sub-pixel accuracy and enforces the epipolar constraint. For sub-pixel accuracy the method of Lan and Mohr [62] is used, which is reported to achieve a matching precision better than 0.1 pixels for selected interest points. In the next step the matching is densified by filling in the matches between the grid points. Because of the grid matching it is possible to restrict the search window for template matching to a very small area. In fact, by establishing an affine transformation between 3 neighboring grid points the expected position of a corresponding point in the other image can be calculated. In most cases only a sub-pixel correction of the point match is necessary. As a last step the point matches are de-rectified to obtain the point correspondences in the original image coordinates.
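The densification step between grid points can be illustrated with the following simplified stand-in (not the thesis implementation): on rectified images the corresponding point lies on the same scanline, so a small normalized cross correlation search around the disparity predicted from the neighboring grid matches, followed by parabolic interpolation, gives a sub-pixel estimate. Window size, search range and boundary handling are assumptions of this sketch.

import numpy as np

def refine_match(left, right, x, y, disp0, half=3, search=2):
    """Sub-pixel disparity at (x, y) of the rectified left image around a
    predicted integer disparity disp0 (boundary checks omitted)."""
    t = left[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    t = (t - t.mean()) / (t.std() + 1e-9)
    scores = []
    for d in range(disp0 - search, disp0 + search + 1):
        w = right[y - half:y + half + 1,
                  x - d - half:x - d + half + 1].astype(float)
        w = (w - w.mean()) / (w.std() + 1e-9)
        scores.append((t * w).mean())            # normalized cross correlation
    scores = np.array(scores)
    i = int(scores.argmax())
    if 0 < i < len(scores) - 1:                  # parabolic sub-pixel refinement
        a, b, c = scores[i - 1], scores[i], scores[i + 1]
        i = i + 0.5 * (a - c) / (a - 2 * b + c + 1e-12)
    return disp0 - search + i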

4.5.3 Ground truth quality

Inaccuracies of the generated ground truth directly affect the evaluation results. But first let us consider the relevant quality characteristics:

• False point correspondences
• Inaccurate point correspondences
• Regions without dense matching because of homogeneous texture
• Regions without dense matching because of occlusions
• Inaccurate trifocal tensors

Most of the listed characteristics concern the dense matching. Let us discuss the different cases in more detail. It is almost impossible to create a dense matching which is completely free of false point correspondences. Although one can enforce the epipolar constraint and an additional ordering constraint, this does not guarantee that all false correspondences get discarded. However, in most dense matching algorithms the number of false correspondences is close to zero. For our application, where a set of point matches is used to represent an ellipse, the occurrence of one or two false matches would hardly influence the overall result, because this number of false matches is negligible compared to the number of correct points used for the representation.

The accuracy of the point correspondences, however, is a very crucial issue. It directly affects the point transfer. In fact, pixel accurate matches are not accurate enough; sub-pixel accuracy is necessary. With the used sub-pixel method [62] the necessary accuracy can be achieved.


Another critical issue is when the dense matching does not cover the whole image. It is known that homogeneous and non-textured regions cause problems for correlation based matching algorithms. Correlation based matching requires a local variance of the gray values. If the non-textured region is bigger than the correlation window it is not possible to identify the matching pixel location. Thus such image parts may not be covered by dense matches. The consequence is that the representation of local detections for the evaluation is not complete, which may affect the evaluation results.

Parts without dense matching may also occur because of occlusions or depth discontinuities. As we are dealing with non-planar scenes and different vantage points such cases will definitely occur. However, as the dense matching is done on short-baseline images, the influence of occlusions and depth discontinuities is only minor compared to the parts missing because of non-textured regions.

The points discussed so far were issues of the dense matching. However, the accuracy of the trifocal tensor also directly affects the accuracy of the point transfer. The trifocal tensor is calculated from wide-baseline matches across three views. Inaccuracies in the estimation may result from a low number of point correspondences as well as from inaccurate point correspondences themselves. Wide-baseline images taken from widely different vantage points may show strong occlusions, which makes it difficult to establish point matches which are well distributed over the whole image area. Such configurations can also result in an inaccurate estimation of the trifocal geometry.

Now that we have identified the different effects which influence the quality of the ground truth, we can think about assessing the quality in a quantitative way. The following quantities can be measured:

• Re-projection error of the point correspondences<br />

• Re-projection error <strong>for</strong> trifocal tensor<br />

• Number of non-matched image pixels<br />

The re-projection error [44] of the dense point correspondences gives a measure of the accuracy of the matching. It is calculated by building the 3D reconstruction of the point matches, re-projecting it into the images and calculating the distance to the original point correspondences. To calculate the 3D reconstruction it is necessary to estimate the fundamental matrix from the point matches.

The re-projection error for the trifocal tensor is calculated similarly. The difference is that the 3D reconstruction is computed using the trifocal tensor and that the re-projection error is summed over 3 images.

The number of non-matched image pixels is easily computed by simple counting.
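A sketch of the first quantity, assuming a projective reconstruction from a canonical camera pair derived from the fundamental matrix (P = [I | 0], P' = [[e']_x F | e']); the Gold standard estimation mentioned above would additionally refine F itself. OpenCV is used here only for the triangulation.

import numpy as np
import cv2

def reprojection_error(F, pts1, pts2):
    """Mean re-projection error of point matches pts1 <-> pts2 (N x 2 arrays)
    in a projective reconstruction defined by the fundamental matrix F."""
    _, _, Vt = np.linalg.svd(F.T)                 # e' is the left null vector of F
    e2 = Vt[-1]
    ex = np.array([[0, -e2[2], e2[1]],
                   [e2[2], 0, -e2[0]],
                   [-e2[1], e2[0], 0]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([ex @ F, e2.reshape(3, 1)])

    X = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    err = 0.0
    for P, pts in ((P1, pts1), (P2, pts2)):
        proj = P @ X
        proj = (proj[:2] / proj[2]).T             # back to inhomogeneous pixels
        err += np.linalg.norm(proj - pts, axis=1).mean()
    return err / 2.0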

Another idea is not to evaluate the ground truth data itself but the ground truth generating methods. This can be done by generating a synthetic test scene (with known ground truth) and applying the ground truth generation to this scene. The estimated ground truth can be compared with the known ground truth and the estimation errors can be reported. This evaluation can measure the following values:

• The number of false point correspondences<br />

• Error distance of dense matches <strong>and</strong> synthetic matches in intermediate image<br />

• Transfer error of the estimated trifocal tensor



Figure 4.4: Two nearby images (e.g. the first two) from the whole sequence are used to create the dense matching. With the trifocal tensor it is possible to transfer a point location given in the first image into every other image of the sequence.

The listed values can be calculated in a straightforward way. For the synthetic scene all point correspondences are known, so the false correspondences can easily be identified and counted. The error distance between the established matches and the synthetically known point correspondences can also be calculated very easily; its standard deviation characterizes the quality of the matching. The quality of the trifocal tensor estimation can be characterized by the transfer error. For the synthetic data the position of every pixel of the reference image in the other views is known. When transferring the pixels from the reference image to another view using the synthetic point matches, the only error source lies in the trifocal tensor. The transfer error is the distance of the transferred point to the exact point location. This gives a quality measure for the trifocal tensor.

4.6 Experimental evaluation

Ground truth was calculated for 2 different complex scenes. The test scene "Group" shows two boxes and was acquired on a turntable. This scene is piece-wise planar. The second test scene "Room" shows a part of a room. This scene is of higher complexity than the first one. Both image sequences consist of 19 images and the viewpoint varies from 0° to 90°. Figure 4.5(a), (c) show examples of both scenes. Figure 4.5(b), (d) show the depth maps resulting from the dense matching. Black image parts contain no matches. The "Group" scene with a resolution of 896×1024 pixels is covered by matches to 96.5% (452167 pixels), excluding the background. The "Room" scene with a resolution of 800×600 pixels is covered by matches to 71.4% (342668 pixels). Most of the missing parts are due to large homogeneous regions. This does not severely bias the evaluation results because most detectors will not find regions in homogeneous image parts. For the interest point evaluation the average distance to the nearest matched point was 0.43 pixels. For the interest region evaluation the average coverage of the regions with matched points is 86%.


Figure 4.5: (a) Test scene "Group". (b) Depth map for the "Group" scene (not matched parts are black). (c) Test scene "Room". (d) Depth map for the "Room" scene.

4.6.1 Repeatability and matching score

We evaluate 7 different detectors under increasing viewpoint change. The compared values are the repeatability score and the matching score. The evaluated detectors are the Maximally Stable Extremal Regions (MSER) [70], the Hessian-Affine regions [73], the Harris-Affine regions [73], the intensity based regions (IBR) [112], the Difference of Gaussian keypoints (DOG) [67], and Harris and Hessian corners [40].

For the detectors we use the publicly available implementations from Mikolajczyk. Figure 4.6 shows the repeatability scores for the "Group" scene. The best performances are obtained by the MSER and the DOG detector. In fact, the repeatability score remains above 40% even for viewpoint changes up to 90°. Figure 4.7 shows the evaluation results for the "Room" scene. The best performance is achieved by the DOG and IBR detectors. The IBR detector in particular shows



Figure 4.6: (a) Repeatability score for the "Group" scene. (b) Absolute number of correspondences.

high repeatability scores for large viewpoint changes too. Overall, the repeatability scores for this complex scene are lower than those for the "Group" scene. This is because the "Group" scene is composed of only 2 piece-wise planar objects while the "Room" scene contains many more objects of arbitrary shape. Generally speaking, the results fulfill our expectations. While the repeatability of the simple interest corner detectors drops very fast with increasing viewpoint change, the scores for the more advanced affine invariant detectors stay quite high. However, the plot of the absolute number of repetitive detections shows that for the "Room" scene the number of repetitive detections of the MSER detector drops below 20 for viewpoint changes larger than 45°. For some algorithms such a low number of possible matches would not allow them to run robustly. Other approaches like the DOG detector are still able to produce more than 150 possible matches at such large viewpoint changes.

Figure 4.8(a) shows the matching scores for the "Group" scene relative to the number of



Figure 4.7: (a) Repeatability score for the "Room" scene. (b) Absolute number of correspondences.

detected regions. The number of matches is related to the smaller number of detected regions in both images. Figure 4.8(b) shows the matching scores related to the number of possible matches. Possible matches are region correspondences established geometrically using the ground truth. The first measure represents how many of the initial detections could be correctly matched. The second measure reveals how well the detector selects discriminative image regions, as it relates the number of correct matches to the number of possible matches. Figure 4.8(c) shows the absolute number of correct matches, which is interesting if subsequent algorithms require a certain number of correspondences, e.g. epipolar geometry estimation. Figure 4.9 shows the matching scores for the "Room" scene.

In this experiment we expect to see significant differences between the simple point detectors, the scale invariant detectors and the affine invariant detectors. In particular, the normalization of the affine invariant regions should compensate for the viewpoint change. And indeed, the MSER detector achieves on average the best matching scores. The DOG detector shows surprisingly low matching scores. While it starts similar to the other detectors, its matching score drops very fast with increasing viewpoint change. This is all the more surprising as the DOG detector was initially introduced together with the SIFT descriptor. Most impressive are the results for the simple point detectors. For small viewpoint changes their results rank among the top three. With increasing viewpoint change, however, the matching scores drop dramatically.

Comparing these results to the previous evaluation of Mikolajczyk [74] on planar scenes only, one can see two main differences. First, the MSER detector provides a significantly higher performance than the other affine invariant detectors, especially for the matching score. This comes from the fact that the MSER detector is not corner based and does not tend to detect regions on depth discontinuities. Second, the evaluations on the "Room" scene show that the achievable repeatability and matching scores for complex scenes are considerably lower than those achieved on planar scenes. This means one must expect a much lower number of matches in practice than the previous evaluations suggested.

4.6.2 Combining local detectors

This experiment evaluates the benefit gained by combining different detectors. A benefit can be gained if the combined detectors produce detections in different parts of the image. To assess this we measure the complementary score c_i^n. Figure 4.10 shows a cumulative plot of the relative numbers of non-overlapping matched interest regions from 5 different detectors for the "Group" and "Room" scenes. Every line shows how many new regions are added to the previous set of interest regions by the specific detector. The graphs show clearly that combining local detectors leads to a larger set of distinct image regions over a wide range of viewpoint changes. It is remarkable that the regions from a combination of all 5 detectors still contain less than 20% overlapping ones. In real applications, however, performance issues usually do not permit running all detectors on an input image. A good choice for combining 2 detectors would be the MSER and DOG detectors, which are apparently the 2 fastest detectors; the graphs in Figure 4.11 show only a small number of overlapping regions for this combination. Combining the Harris-Affine and Hessian-Affine detectors, in contrast, creates a significant number of overlapping regions, as seen in Figure 4.12. This is expected as the algorithms of both methods are quite similar.
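The counting of non-overlapping regions can be sketched as follows. This is a simplified illustration, not the evaluation code used here: regions are approximated by boolean masks, and the 40% overlap threshold is borrowed from the criterion stated in Section 5.7, so it is an assumption for this experiment.

```python
# Simplified sketch: count regions of a further detector that do not overlap
# the regions accumulated so far by more than max_overlap.
import numpy as np

def region_mask(center, radius, shape):
    """Boolean mask of a circular region (the actual regions are ellipses;
    circles are used here only to keep the sketch short)."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    return (xx - center[0]) ** 2 + (yy - center[1]) ** 2 <= radius ** 2

def add_non_overlapping(accepted_masks, candidate_masks, max_overlap=0.4):
    """Keep candidates whose overlap with every accepted region stays below max_overlap.

    Overlap is measured as intersection area divided by the smaller region area."""
    added = []
    for cand in candidate_masks:
        ok = True
        for acc in accepted_masks:
            inter = np.logical_and(cand, acc).sum()
            if inter / max(1, min(cand.sum(), acc.sum())) > max_overlap:
                ok = False
                break
        if ok:
            accepted_masks.append(cand)
            added.append(cand)
    return added

# Cumulative plot: start with the first detector's matched regions, then add the
# non-overlapping regions contributed by each further detector in turn.
```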


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
10 | 76.3 / 180 | 68.2 / 1113 | 48.5 / 586 | 56.4 / 457 | 57.6 / 364 | 69.3 / 1244 | 70.4 / 853
15 | 67.8 / 164 | 63.8 / 1056 | 45.0 / 543 | 50.9 / 412 | 52.6 / 318 | 59.6 / 1070 | 64.2 / 734
20 | 59.8 / 143 | 59.6 / 950 | 39.9 / 482 | 46.7 / 358 | 49.8 / 308 | 38.3 / 662 | 47.7 / 538
25 | 56.0 / 126 | 55.8 / 889 | 36.3 / 436 | 43.5 / 359 | 48.8 / 290 | 39.2 / 636 | 44.6 / 519
30 | 57.7 / 127 | 52.5 / 811 | 36.3 / 439 | 40.6 / 341 | 48.5 / 275 | 34.3 / 534 | 40.0 / 464
35 | 55.6 / 120 | 50.9 / 754 | 33.8 / 408 | 41.2 / 353 | 47.2 / 265 | 31.9 / 494 | 38.3 / 459
40 | 50.7 / 114 | 50.0 / 742 | 33.6 / 406 | 37.2 / 236 | 45.3 / 267 | 26.4 / 412 | 31.9 / 401
45 | 50.9 / 112 | 49.2 / 703 | 31.0 / 375 | 34.0 / 253 | 44.3 / 255 | 28.0 / 443 | 32.0 / 402
50 | 46.6 / 108 | 48.9 / 713 | 31.4 / 379 | 35.6 / 258 | 46.8 / 256 | 28.9 / 457 | 32.9 / 414
55 | 47.2 / 110 | 46.7 / 695 | 32.0 / 387 | 34.4 / 279 | 45.0 / 247 | 27.6 / 437 | 32.2 / 405
60 | 45.9 / 111 | 44.8 / 693 | 32.3 / 390 | 33.1 / 268 | 44.2 / 258 | 25.3 / 418 | 31.8 / 400
65 | 43.3 / 107 | 42.5 / 676 | 29.7 / 359 | 31.9 / 258 | 43.1 / 249 | 23.7 / 409 | 32.1 / 404
70 | 44.1 / 109 | 43.1 / 676 | 29.4 / 355 | 30.7 / 249 | 39.1 / 236 | 21.5 / 365 | 28.8 / 362
75 | 41.2 / 100 | 41.6 / 625 | 28.6 / 346 | 26.9 / 218 | 42.2 / 232 | 21.9 / 367 | 28.3 / 356
80 | 42.5 / 97 | 42.5 / 642 | 26.5 / 320 | 26.2 / 212 | 43.3 / 229 | 21.9 / 354 | 30.2 / 377
85 | 41.7 / 91 | 44.0 / 615 | 25.5 / 293 | 26.0 / 206 | 43.3 / 218 | 21.4 / 336 | 29.3 / 353
90 | 37.9 / 83 | 43.7 / 584 | 26.2 / 292 | 25.6 / 188 | 40.8 / 200 | 20.3 / 304 | 30.4 / 336

Table 4.2: Repeatability score [%] / absolute number of correspondences for the "Group" scene with changing viewpoint.


Figure 4.8: (a) Matching score for the "Group" scene relative to the number of detections. (b) Matching score for the "Group" scene relative to the number of possible matches. (c) Absolute number of correct matches.


Figure 4.9: (a) Matching score for the "Room" scene relative to the number of detections. (b) Matching score for the "Room" scene relative to the number of possible matches. (c) Absolute number of correct matches.


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
15 | 44.6 / 54 | 47.3 / 371 | 35.6 / 208 | 41.9 / 240 | 46.1 / 111 | 35.4 / 349 | 46.2 / 363
20 | 38.1 / 43 | 46.0 / 331 | 31.3 / 188 | 34.0 / 191 | 42.5 / 99 | 29.3 / 289 | 39.1 / 307
25 | 31.3 / 35 | 39.1 / 281 | 26.6 / 158 | 28.0 / 160 | 35.8 / 81 | 23.4 / 231 | 32.3 / 254
30 | 24.8 / 28 | 35.8 / 260 | 26.8 / 150 | 27.8 / 157 | 37.7 / 81 | 19.0 / 187 | 26.0 / 204
35 | 23.3 / 27 | 33.5 / 237 | 24.6 / 142 | 26.2 / 145 | 41.9 / 91 | 18.2 / 179 | 24.6 / 193
40 | 21.4 / 21 | 29.2 / 210 | 20.8 / 116 | 22.5 / 113 | 31.9 / 61 | 12.5 / 123 | 13.4 / 105
45 | 23.1 / 24 | 28.4 / 205 | 19.8 / 108 | 20.5 / 113 | 29.4 / 60 | 12.5 / 123 | 14.0 / 110
50 | 20.0 / 21 | 25.7 / 181 | 15.8 / 89 | 15.8 / 80 | 28.0 / 58 | 9.5 / 94 | 12.0 / 94
55 | 19.3 / 22 | 26.9 / 202 | 16.2 / 85 | 14.7 / 82 | 27.2 / 55 | 10.3 / 102 | 10.1 / 79
60 | 12.9 / 13 | 24.0 / 164 | 14.4 / 82 | 15.5 / 85 | 24.0 / 49 | 7.8 / 77 | 8.9 / 70
65 | 18.9 / 21 | 24.4 / 161 | 13.8 / 82 | 15.1 / 86 | 29.0 / 56 | 8.7 / 86 | 8.8 / 69
70 | 13.5 / 15 | 22.9 / 164 | 12.8 / 75 | 12.7 / 71 | 27.1 / 58 | 8.4 / 83 | 7.4 / 58
75 | 17.9 / 20 | 22.9 / 163 | 14.4 / 85 | 15.5 / 82 | 30.0 / 67 | 9.9 / 98 | 10.3 / 81
80 | 14.4 / 16 | 23.8 / 175 | 12.6 / 76 | 13.2 / 72 | 29.9 / 66 | 9.1 / 90 | 9.2 / 72
85 | 13.5 / 14 | 21.7 / 164 | 11.9 / 74 | 11.4 / 66 | 27.6 / 61 | 9.2 / 91 | 9.3 / 73
90 | 11.5 / 13 | 22.1 / 159 | 12.1 / 73 | 11.7 / 68 | 28.8 / 66 | 8.2 / 81 | 9.4 / 74

Table 4.3: Repeatability score [%] / absolute number of correspondences for the "Room" scene with changing viewpoint.


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
10 | 66.9 / 75 / 162 | 35.1 / 38.5 / 579 | 29.4 / 31.6 / 355 | 34.3 / 36.8 / 280 | 33.5 / 37.3 / 212 | 47.2 / 88.1 / 883 | 53.1 / 88.2 / 644
15 | 56.5 / 69 / 140 | 22.7 / 25.7 / 379 | 24.3 / 26.9 / 293 | 28.3 / 31.2 / 231 | 26.9 / 30.1 / 163 | 35.5 / 77.6 / 647 | 43.4 / 77.1 / 497
20 | 43.3 / 57.3 / 106 | 17.1 / 19.4 / 276 | 18.8 / 22.1 / 227 | 21.6 / 25.6 / 176 | 23.8 / 26.9 / 147 | 25.9 / 67.2 / 448 | 29.9 / 61.9 / 338
25 | 36.8 / 50.6 / 85 | 12.4 / 14.3 / 199 | 16.3 / 20.5 / 196 | 19.1 / 23.7 / 156 | 22.7 / 27.8 / 135 | 23.0 / 58.4 / 374 | 24.9 / 55.7 / 290
30 | 33.6 / 46.3 / 76 | 9.6 / 11.3 / 150 | 14.3 / 18.7 / 173 | 13.6 / 17.6 / 111 | 18.5 / 23.1 / 105 | 17.7 / 51.4 / 275 | 18.0 / 43.7 / 209
35 | 27.0 / 38.2 / 60 | 8.7 / 10.3 / 131 | 11.5 / 15.5 / 139 | 14.1 / 17.9 / 115 | 19.4 / 24.1 / 109 | 13.8 / 43.1 / 214 | 15.1 / 39 / 181
40 | 24.2 / 37.6 / 56 | 7.3 / 8.5 / 110 | 10.8 / 14.2 / 131 | 11.3 / 15.2 / 92 | 17.3 / 21.4 / 102 | 8.3 / 30.4 / 129 | 9.3 / 28.6 / 118
45 | 24.8 / 37.3 / 56 | 5.9 / 6.9 / 85 | 8.7 / 11.8 / 105 | 10.5 / 14.2 / 86 | 14.2 / 17.8 / 82 | 7.0 / 24.2 / 110 | 7.6 / 23.9 / 99
50 | 18.5 / 29.5 / 44 | 6.2 / 6.9 / 91 | 8.2 / 11.3 / 99 | 9.1 / 11.8 / 74 | 14.8 / 16.7 / 81 | 4.6 / 15.7 / 73 | 6.3 / 19.7 / 83
55 | 20.5 / 32.7 / 49 | 5.0 / 5.9 / 76 | 6.5 / 8.6 / 78 | 7.1 / 9.4 / 58 | 13.2 / 15.8 / 73 | 3.4 / 12.1 / 54 | 4.0 / 12.6 / 52
60 | 14.5 / 24.2 / 36 | 4 / 5 / 64 | 6.7 / 9 / 81 | 6.1 / 8.2 / 50 | 13.4 / 16.9 / 79 | 2.3 / 9 / 39 | 3.7 / 11.8 / 49
65 | 14.1 / 24.5 / 36 | 3.9 / 5 / 65 | 7.5 / 10.3 / 90 | 6.5 / 9.1 / 53 | 11.1 / 13.8 / 65 | 1.9 / 8 / 34 | 2.7 / 8.3 / 35
70 | 11.4 / 20.1 / 30 | 2.7 / 3.5 / 45 | 4.8 / 6.7 / 58 | 5.2 / 7.5 / 42 | 10.6 / 14 / 67 | 1.7 / 7.9 / 30 | 1.4 / 4.7 / 18
75 | 9.4 / 17.7 / 25 | 2.8 / 3.7 / 47 | 5.2 / 7.7 / 63 | 5 / 7.9 / 41 | 10.5 / 13.6 / 62 | 0.7 / 3.4 / 13 | 1.4 / 4.8 / 18
80 | 9.6 / 18.1 / 25 | 2.0 / 2.6 / 34 | 4.0 / 5.8 / 48 | 4.5 / 7.2 / 37 | 7.4 / 9.2 / 44 | 0.7 / 3.5 / 13 | 1.1 / 3.4 / 14
85 | 6.3 / 12.7 / 16 | 1.9 / 2.5 / 31 | 4.2 / 6.3 / 51 | 3.6 / 6 / 29 | 7.4 / 9.6 / 43 | 0.5 / 2.8 / 10 | 0.6 / 2 / 8
90 | 6.2 / 13.2 / 16 | 1.5 / 1.9 / 25 | 2.5 / 3.8 / 30 | 3.6 / 6.1 / 29 | 6.1 / 8 / 35 | 0.4 / 2.2 / 7 | 0.4 / 1.4 / 5

Table 4.4: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Group" scene with changing viewpoint.


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
15 | 19 / 50.8 / 31 | 11.2 / 18.1 / 102 | 12.8 / 21.7 / 91 | 16.6 / 24.8 / 106 | 12.8 / 19.8 / 36 | 16.4 / 56.4 / 202 | 2 / 5.1 / 19
20 | 9.2 / 29.4 / 15 | 6.7 / 10.9 / 58 | 8.5 / 14.6 / 60 | 11.5 / 18.4 / 73 | 12.8 / 22.9 / 36 | 9.9 / 39.8 / 115 | 1.2 / 3.8 / 12
25 | 5.5 / 20.5 / 9 | 4.8 / 9.3 / 42 | 7.3 / 14.1 / 51 | 9.0 / 17.4 / 57 | 8.9 / 18.7 / 25 | 5.1 / 24.5 / 57 | 0.4 / 1.6 / 4
30 | 4.3 / 20.6 / 7 | 3.4 / 7.2 / 31 | 6 / 12.2 / 42 | 5.7 / 11.3 / 36 | 7.8 / 16.4 / 22 | 3.5 / 21 / 39 | 0.2 / 0.9 / 2
35 | 3.7 / 18.2 / 6 | 2.2 / 4.5 / 20 | 3.2 / 6.7 / 23 | 3.9 / 8 / 25 | 4.6 / 9.6 / 13 | 2.0 / 13.3 / 24 | 0.2 / 1 / 2
40 | 4.6 / 25.9 / 7 | 1.3 / 3.5 / 12 | 2.7 / 7.1 / 19 | 2.6 / 6.5 / 16 | 4 / 11 / 11 | 0.7 / 7.1 / 9 | 0.2 / 1.8 / 2
45 | 3.7 / 20.7 / 6 | 0.7 / 1.7 / 6 | 0.9 / 2.4 / 6 | 1.9 / 4.8 / 12 | 3.9 / 11.3 / 11 | 0.9 / 8.7 / 11 | 0.1 / 0.8 / 1
50 | 3.7 / 24 / 6 | 0.7 / 1.9 / 6 | 2.4 / 7.6 / 17 | 1.0 / 3 / 6 | 1.5 / 4.6 / 4 | 0.4 / 5 / 5 | 0 / 0 / 0
55 | 2.5 / 13.8 / 4 | 0.7 / 1.7 / 6 | 1.6 / 5 / 11 | 0.9 / 2.9 / 6 | 2.1 / 6.1 / 6 | 0.3 / 3.8 / 4 | 0.1 / 1.2 / 1
60 | 0.7 / 5 / 1 | 0.2 / 0.7 / 2 | 0.6 / 1.8 / 4 | 0.5 / 1.6 / 3 | 2.6 / 8.6 / 7 | 0.2 / 2.5 / 2 | 0 / 0 / 0
65 | 1.8 / 10 / 3 | 0.2 / 0.6 / 2 | 0.6 / 1.7 / 4 | 0.3 / 0.9 / 2 | 2.2 / 6.2 / 6 | 0.2 / 2.2 / 2 | 0 / 0 / 0
70 | 1.2 / 9.5 / 2 | 0.2 / 0.6 / 2 | 0.6 / 1.7 / 4 | 0.2 / 0.6 / 1 | 1.1 / 2.9 / 3 | 0 / 0 / 0 | 0 / 0 / 0
75 | 1.2 / 7.1 / 2 | 0.1 / 0.3 / 1 | 0 / 0 / 0 | 0.6 / 2.2 / 4 | 1.8 / 4.3 / 5 | 0.1 / 1 / 1 | 0 / 0 / 0
80 | 0 / 0 / 0 | 0.1 / 0.3 / 1 | 0.7 / 2.3 / 5 | 0.2 / 0.5 / 1 | 1.1 / 2.8 / 3 | 0.1 / 1.1 / 1 | 0 / 0 / 0
85 | 0.6 / 4.2 / 1 | 0.1 / 0.3 / 1 | 0.4 / 1.4 / 3 | 0.5 / 1.6 / 3 | 1.8 / 4.1 / 5 | 0.2 / 2.2 / 2 | 0 / 0 / 0
90 | 1.2 / 10.5 / 2 | 0.3 / 0.9 / 3 | 0.3 / 0.9 / 2 | 0.5 / 1.6 / 3 | 1.1 / 2.5 / 3 | 0.2 / 2.4 / 2 | 0 / 0 / 0

Table 4.5: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Room" scene with changing viewpoint.


Figure 4.10: Relative numbers of non-overlapping matched regions for the combination of 5 detectors (MSER, Harris-Affine, Hessian-Affine, IBR, DOG). (a) "Group" scene. (b) "Room" scene.


Figure 4.11: Relative numbers of non-overlapping matched regions for the combination of the MSER and DOG detectors. (a) "Group" scene. (b) "Room" scene.


Figure 4.12: Relative numbers of non-overlapping matched regions for the combination of the Harris-Affine and Hessian-Affine detectors. (a) "Group" scene. (b) "Room" scene.


Chapter 5

Maximally Stable Corner Clusters (MSCC's)^1

^1 Based on the publications: F. Fraundorfer, M. Winter, and H. Bischof. MSCC: Maximally stable corner clusters. In Proc. 14th Scandinavian Conference on Image Analysis, Joensuu, Finland, pages 45–54, 2005 [36]. F. Fraundorfer, M. Winter, and H. Bischof. Maximally stable corner clusters: A novel distinguished region detector and descriptor. In Proc. 1st Austrian Cognitive Vision Workshop, Zell an der Pram, Austria, pages 59–66, 2005 [37].

The development of this novel local detector is motivated by the need for highly descriptive and discriminative regions for wide-baseline feature matching. Discriminability means that the detection has a unique appearance and is thus easy to distinguish from other detections. Descriptiveness means that the detection should possess a significant gray-value variance (e.g. texture) so that a meaningful feature vector can be built. Both measures play a crucial role in finding correspondences between different images, so it is not surprising that already the Moravec operator [79] accounted for them: by selecting points which show a high correlation difference when the image window is shifted only a little, point locations are detected which are highly descriptive. For instance, image regions with little texture (homogeneous regions) show a small response, while highly textured regions (which usually contain high intensity frequencies) show a high response. The aim for the new detector was therefore to identify highly textured, and thus descriptive, regions. The observation that a highly textured image region gives rise to a high number of Harris corners led to the idea of detecting distinguished regions based on conglomerations of Harris corners. Such regions can be detected by means of clustering, each detected cluster representing a distinguished region: the cluster center defines the position of the detection, whereas the outline of the region is defined by the cluster border. Regarding descriptiveness and discriminability it is interesting to think about the best possible detections. Obviously, the larger an image region, the larger its descriptiveness and discriminability; the most discriminating feature vector would be computed from a region extending from the center of the detection out to the image borders. However, only small regions are robust against occlusions. Thus, one is interested in detecting image regions that are as small as possible while offering a maximum of descriptiveness. Another important property of a local detector is its repeatability, i.e. detections are repeatedly reported at the same locations although the image undergoes transformations like rotation, scale change, lighting change, viewpoint change, etc. Such transformations easily occur in practice, and a good detector should be invariant to them, or at least robust


against them. Already in 1998, Schmid et al. [95] investigated state-of-the-art interest point detectors with respect to their robustness against certain classes of transformation, in particular rotation, scale and viewpoint change. The results for rotation and viewpoint change are reproduced in Figures 5.1 and 5.2. The experiments show the repeatability score of different interest point detectors under rotation and viewpoint change.

Figure 5.1: Results of the interest point detector evaluation of Schmid et al. [95]. The Harris detector shows high repeatability for rotated images. (a-b) Harris detections on the original and the rotated image. (c) Repeatability score. (Images from [95])

The experiments are reproduced here to stress the good repeatability score achieved by the simple Harris detector. Moreover, if single points are stable, then point clusters will be stable as well. In fact, point clusters will be even more stable, since a few missing single points will not affect the cluster itself. This leads straightforwardly to the idea of a new local detector based on clusters of interest points. In addition, point clusters provide a delineation of the region, yielding a textured and thus highly descriptive image region. In the following we describe a local detector based on this principle; we call the detected regions Maximally Stable Corner Clusters (MSCC).

Figure 5.2: Results of the interest point detector evaluation of Schmid et al. [95] on viewpoint changes. The Harris detector achieves high repeatability. (a-c) Examples of the test images; viewpoint change introduces perspective distortions. (d) Repeatability score. (Images from [95])

5.1 The MSCC detector

The detection of MSCC regions is equivalent to the detection of clusters in a 2-dimensional feature space. The features used are the x and y coordinates of the detected interest points. Clustering can be performed using graph-based methods, where each interest point represents a node. Clusters may appear in different sizes (scales) and may be nested, thus a hierarchical approach is needed. An important concept of the MSCC detector is a stability criterion which yields only reliable clusters: only clusters which are detected for varying scale parameters are selected. A detected MSCC is finally defined by the extent of the distribution of points contributing to the constellation.

The MSCC algorithm proceeds along the following three steps:

1. Detect single interest points all over the image, e.g. Harris corners.

2. Perform graph-based point clustering on multiple scales.

3. Select clusters which stay stable over a certain number of scales.

5.1.1 Interest point detection

To detect the interest points acting as cluster primitives we employ the Harris corner detector [40]. We select a large number of corners (all local maxima above the noise level) as corner primitives, which ensures that we are not dependent on a cornerness threshold. We do not apply non-maxima suppression, which would be common for other applications: in our case we are interested in Harris corners in close spatial proximity, and non-maxima suppression would thin out possible clusters.
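One possible way to obtain such corner primitives is sketched below using scikit-image; this is not the implementation used for the experiments, and the response scaling of the cornerness measure (and hence the threshold value) depends on the particular Harris implementation.

```python
# Illustrative corner extraction: keep all local maxima of the Harris response
# above a low threshold, without enforcing a suppression radius (min_distance=1),
# so that corners in close spatial proximity survive and can form clusters.
import numpy as np
from skimage.feature import corner_harris, corner_peaks

def harris_primitives(gray_image, sigma=0.5, threshold=1.0):
    """Return an (n, 2) array of (x, y) corner coordinates."""
    response = corner_harris(gray_image, sigma=sigma)
    peaks = corner_peaks(response, min_distance=1, threshold_abs=threshold)  # (row, col)
    return peaks[:, ::-1].astype(float)                                      # -> (x, y)
```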

5.1.2 Multi scale clustering

We would like to find high-density clusters of corners which are stable, i.e. a few missing corners or the addition of a few corners does not change the cluster structure. Since we do not know the number of clusters, we have to use a non-parametric clustering method. Clustering is performed by first computing the minimal spanning tree (MST) of the detected interest points and then removing edges so that the MST splits into multiple subtrees. Each subtree then corresponds to a cluster. The subdivision method is inspired by the MSER detector [70].

The MST is computed by interpreting the interest points with coordinates x_i = (x_1, x_2) as the nodes of an undirected weighted graph in 2D. The weight of the edge between two graph nodes i, j is their geometric distance

d_ij = sqrt((x_1^i − x_1^j)^2 + (x_2^i − x_2^j)^2),

to which we also refer as the edge length. The minimal spanning tree is the subset of edges which connects all nodes with the smallest cumulative edge length; by computing it we create edges between nearby nodes. A well-known method to compute the MST is Kruskal's algorithm [19]. Figure 5.3 shows a typical MST computed from detected Harris corners.

Figure 5.3: (a) Image with detected Harris corners. (b) MST computed from the Harris corners.

Given a threshold T on the edge length, we obtain a subdivision of the MST into subtrees by removing all edges with an edge length above this threshold. Different values of T produce different subdivisions of the MST, i.e. different point clusters. To create a multi-scale clustering we compute subdivisions of the MST for p regularly spaced thresholds T_1, ..., T_p between the minimal and maximal edge length occurring in the MST. An example of splitting an MST into subtrees is depicted in Figure 5.4. The full MST is shown in Figure 5.4(f). Five subdivisions are computed by applying 5 different thresholds T_1, ..., T_5 with T_i < T_{i+1}. Some subtrees stay the same for different thresholds, e.g. the two subtrees at the top of the image.

5.1.3 Selection of stable clusters

The previous step produced p different cluster sets. We are now interested in clusters which do not change their shape over several scales, i.e. those that are stable. As stability criterion for a cluster we compare the sets of their points: clusters consisting of the same set of points across r different scales are defined as stable and constitute the output of the MSCC detector. A similar stability criterion is used with great success by Matas et al. in the MSER detector [70].
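The clustering and stability selection (steps 2 and 3) can be summarized in a short sketch. The following is a minimal illustration, not the original implementation: it uses SciPy's minimum spanning tree and connected components, fixes the 1-pixel threshold spacing described in Section 5.4, and introduces a hypothetical minimum cluster size to discard tiny components.

```python
# Minimal sketch of MSCC clustering: MST on the corner points, multi-scale
# subdivision by edge-length thresholds, and selection of clusters that stay
# identical over p_r consecutive thresholds.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mscc_clusters(points, p_r=5, min_cluster_size=4):
    """points: (n, 2) array of corner coordinates; returns a list of point clusters."""
    dist = squareform(pdist(points))                      # dense Euclidean distance graph
    mst = minimum_spanning_tree(dist).toarray()           # MST edge weights (others are 0)
    edge_lengths = mst[mst > 0]
    # Regularly spaced thresholds between the shortest and longest MST edge (1 px steps).
    thresholds = np.arange(edge_lengths.min(), edge_lengths.max() + 1.0, 1.0)

    history = {}                                          # cluster (frozenset) -> consecutive count
    stable = set()
    for t in thresholds:
        # Remove all MST edges longer than the current threshold; the remaining
        # connected components are the clusters at this scale.
        pruned = np.where(mst <= t, mst, 0.0)
        _, labels = connected_components(pruned, directed=False)
        clusters = set()
        for label in np.unique(labels):
            members = frozenset(np.flatnonzero(labels == label))
            if len(members) >= min_cluster_size:
                clusters.add(members)
        # A cluster is stable if it reappears unchanged for p_r consecutive thresholds.
        history = {c: history.get(c, 0) + 1 for c in clusters}
        stable.update(c for c, count in history.items() if count >= p_r)
    return [points[list(c)] for c in stable]
```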

Figure 5.5 illustrates the method on a synthetic test image showing 4 differently sized squares. The Harris corner detection step produces several responses at the corners of the squares. Connecting the single points with the MST reveals a structure in which clustering can easily be done by removing the longer edges. Clusters of interest points are indicated by ellipses around them. The test image demonstrates the capability of detecting stable clusters at multiple scales, from very small clusters at the corners of the individual squares up to the cluster containing all detected interest points.

Figure 5.4: (a-e) Subdivisions of the MST with 5 regularly spaced thresholds T_1, ..., T_5. Note that the two top subtrees do not change for the first three thresholds; they are stable. (f) Full MST computed from the Harris corners.

Figure 5.5: Example of the MSCC detector on a synthetic test image (clustered interest points are indicated by ellipses around them).

5.2 Region representation

As mentioned before, an MSCC region is defined by a clustered set of points C. Unlike many other detectors, the MSCC clusters have arbitrary shapes; an approximate delineation may be obtained by convex hull construction or by fitting ellipses.

Delineation using the convex hull is the preferred method; ellipse fitting gives only a coarse estimate of the region delineation. As an ellipse, however, the detection can be described efficiently with 4 parameters: the length of the major axis a, the length of the minor axis b, the angle of the major axis α and the ellipse center C = (c_x, c_y).

The ellipse parameters are defined by the covariance ellipse (covariance matrix) of the point distribution C. The covariance matrix Σ is defined as

Σ = E[(X − E[X])(X − E[X])^T],    (5.1)


where X is a column vector with n scalar random variable components and E[X] is the expected value of X. In our case X is an n × 2 matrix whose rows are the x and y coordinates of the n points forming the corner cluster,

X = [x_1 y_1; ... ; x_n y_n].    (5.2)

Σ is then the 2 × 2 sample covariance matrix, estimated as

Σ = (1/(n − 1)) (X − E[X])^T (X − E[X]).    (5.3)

A 2 × 2 covariance matrix can be represented as an ellipse; let us denote it the region ellipse. The parameters of the region ellipse, the length of the major axis a_e, the length of the minor axis b_e and the rotation angle α_e of the major axis, are encoded in the covariance matrix and can be computed by eigenvalue decomposition of Σ. The eigenvalue decomposition gives λ_1 and λ_2 with λ_1 > λ_2, and we set a_e = λ_1 and b_e = λ_2. Figure 5.6 illustrates the region ellipse for an MSCC point cluster. The region ellipse is drawn in black; the black crosses mark the individual corners of the point cluster from which the covariance matrix Σ is computed. The region ellipse is rotated according to the angle α_e, pointing into the main direction of the point distribution. The main direction is defined by the eigenvector for λ_1, denoted as v_1 = (v_x, v_y), and

α_e = arctan(v_y / v_x).    (5.4)

The region delineation is now created by scaling the region ellipse to the size of the point distribution. The length of the major axis a is set to the maximum distance of a cluster point to the ellipse center C, which is the center of gravity of the point distribution:

a = max_i ‖C_i − C‖.    (5.5)

The length of the minor axis b is scaled accordingly,

b = b_e · a / a_e.    (5.6)

The scaling leads to the final region delineation, shown as the blue ellipse in Figure 5.6.

Similar to other local detectors, the covariance matrix Σ of the cluster points can be used for affine normalization as described by Baumberg et al. [6]. Transforming the point distribution with the inverse square root of Σ removes an affine distortion up to a remaining rotation. The normalized MSCC C_n is computed as

C_n = Σ^(−1/2) C.    (5.7)

Σ^(1/2) is the matrix square root, which can be computed by Cholesky decomposition. An example of MSCC normalization is shown in Figure 5.7. Figure 5.7(a) shows the detected MSCC region; the black crosses are the corners constituting the cluster, and a region delineation computed as the convex hull of the corners is shown for illustration. The region ellipse defined by the covariance matrix Σ is shown in black. Figure 5.7(b) shows the same MSCC region after applying an arbitrary affine transformation; the distortion effects are clearly visible. Figure 5.7(c) shows the normalized original region: the corners constituting the MSCC region are transformed with the inverse square root of Σ. The effect of the normalization is visualized with the region ellipse, which is transformed into a circle of unit radius. Normalizing the affine-distorted region of Figure 5.7(b) with its covariance matrix results in Figure 5.7(d). The resulting MSCC is in the same canonical coordinate system as the one of Figure 5.7(c); they differ only by an unknown rotation.

Figure 5.6: Region ellipse and region delineation for an MSCC point cluster. The black crosses mark the individual corners of the point cluster. The black ellipse is the region ellipse; the blue ellipse is a scaled version of it, resulting in the final region delineation.
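The computations of equations (5.1)–(5.7) are straightforward to express in code. The following is a minimal sketch under the conventions above (the cluster given as an (n, 2) array of points); function names are illustrative, not from the original implementation.

```python
# Minimal sketch of the region-ellipse parameters (Eqs. 5.3-5.6) and the affine
# normalization (Eq. 5.7) for a cluster of corner points given as an (n, 2) array.
import numpy as np

def region_ellipse(cluster):
    """Return (center, a, b, alpha): delineation ellipse of an MSCC point cluster."""
    center = cluster.mean(axis=0)                       # center of gravity
    centered = cluster - center
    sigma = centered.T @ centered / (len(cluster) - 1)  # 2 x 2 covariance (Eq. 5.3)
    eigvals, eigvecs = np.linalg.eigh(sigma)            # eigenvalues in ascending order
    a_e, b_e = eigvals[1], eigvals[0]                   # region-ellipse axes
    v1 = eigvecs[:, 1]                                  # main direction (for lambda_1)
    alpha = np.arctan2(v1[1], v1[0])                    # Eq. 5.4
    a = np.linalg.norm(centered, axis=1).max()          # Eq. 5.5: scale to the points
    b = b_e * a / a_e                                   # Eq. 5.6
    return center, a, b, alpha

def normalize_cluster(cluster):
    """Affine-normalize the cluster with the inverse square root of Sigma (Eq. 5.7)."""
    centered = cluster - cluster.mean(axis=0)
    sigma = centered.T @ centered / (len(cluster) - 1)
    # Inverse matrix square root via eigendecomposition (Cholesky works as well).
    eigvals, eigvecs = np.linalg.eigh(sigma)
    sigma_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return centered @ sigma_inv_sqrt
```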

5.3 Computational complexity

Steps 2 and 3 of the algorithm can be implemented very efficiently. It is possible to perform the multi-scale clustering as well as the selection of stable clusters already during the MST construction. The time complexity of the algorithm is therefore determined by the time complexity of the MST construction, which is in our case O(m log n) for Kruskal's algorithm [19], where m is the number of edges in the graph and n the number of nodes. Checking the stability of the clusters introduces a constant factor depending on the number of thresholds p, but produces only very little overhead.

Ultimately a linear time complexity of O(m) would be possible by using the randomized MST construction proposed by Karger et al. [55], which finds the MST in linear time with high probability.

Figure 5.7: (a) Original detected MSCC. (b) Affinely distorted MSCC. (c) Normalized original MSCC. (d) Normalized affinely distorted MSCC. Both normalized regions are in the same canonical coordinate system, differing only by a rotation.

5.4 Parameters

The properties of the MSCC detector can be adjusted with 3 parameters. In the following these 3 parameters are described in detail and suggestions for choosing useful values are given. Some parameters depend on the interest point detector; in our case we describe the method using the Harris corner detector.

Harris cornerness threshold p_h: When using the Harris corner detector, one parameter is the cornerness threshold p_h. The Harris corner detector computes a cornerness measure for every pixel position; a corner is defined by a high positive value. Usually corners show cornerness values in the range of 10^3 to 10^5. In our case we simply want to find all corners above the noise level, so a low threshold in the range of 1 to 100 works very well.

Gaussian filter size p_s: Another parameter of the Harris corner detector is the variance p_s of the involved Gaussian filters. Simply speaking, p_s defines the scale on which the corners are detected. Our application requires detection on a small scale, thus an appropriate value for p_s is in the range of 0.5 to 1.5.

Stability parameter p_r: The last parameter is the stability parameter p_r, which decides whether a cluster is stable and should be selected as a region. If a cluster fulfills the stability criterion for p_r threshold steps, the cluster is denoted as stable. The thresholds start with the minimal edge length in pixels and are increased by 1 pixel each step until the maximal edge length is reached. A high value produces only very stable clusters, lower values also less stable ones. Useful values for p_r are in the range of 5 to 10.

parameter | "Box" | "Group" | "Doors"
cornerness threshold of Harris detector p_h | 1 | 1 | 1
sigma of Harris detector p_s | 0.5 | 0.5 | 0.5
stability parameter p_r | 5 | 5 | 5

Table 5.1: Parameter values used for the detection examples.
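Putting the pieces together, the detector can be run with the parameter values of Table 5.1. The snippet below reuses the illustrative helpers harris_primitives and mscc_clusters sketched earlier in this chapter; it is not the original implementation, and the cornerness threshold scale in particular depends on the Harris implementation used.

```python
# Illustrative end-to-end call with the parameter values from Table 5.1
# (p_h = 1, p_s = 0.5, p_r = 5), using the hypothetical helpers sketched above.
from skimage.io import imread
from skimage.color import rgb2gray

image = rgb2gray(imread("box_view_01.png"))                     # hypothetical input image
corners = harris_primitives(image, sigma=0.5, threshold=1.0)    # p_s, p_h
regions = mscc_clusters(corners, p_r=5)                         # p_r
print(f"{len(corners)} corners -> {len(regions)} MSCC regions")
```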

5.5 Detection examples

This section shows detection examples for three different image sequences. Each sequence contains images with increasing viewpoint change up to wide-baseline cases, to demonstrate the repeatability of the MSCC detector under viewpoint change. The interest points are shown as red crosses, the MSCC regions as blue ellipses.

"Box" scene: Figure 5.8 shows the MSCC detections for the "Box" scene, a set of images of a box from different viewpoints acquired on a turntable. The images have a resolution of 800 × 600 pixels. Many regions are detected repeatedly in each image; the multi-scale clustering detects very small as well as large regions.

"Group" scene: Figure 5.9 shows the MSCC detections for the "Group" scene. The scene consists of two piecewise planar objects on a turntable. The overall viewpoint change for the whole image sequence is almost 90°. Again many regions are detected repeatedly in each image despite the large viewpoint change. The image resolution is 1024 × 896 pixels.

"Doors" scene: Figure 5.10 shows the MSCC detections for the "Doors" scene. The "Doors" image set is from a robot localization experiment. The image resolution is 720 × 288 pixels. The poster in the example contains a lot of written text; the MSCC detector manages to identify the different sections of the text as MSCC regions.

The parameter settings for the 3 scenes are given in Table 5.1.


Figure 5.8: Detection examples on the "Box" scene.

Figure 5.9: Detection examples on the "Group" scene.

Figure 5.10: Detection examples on the "Doors" scene.


5.6 Detector evaluation: Repeatability and matching score

The performance of the MSCC detector is compared to other approaches in terms of the repeatability and matching score (see Chapter 4 for details). The MSCC detector is evaluated on the planar "Doors" scene using the publicly available evaluation framework of Mikolajczyk and Schmid [74]. In addition, the MSCC detector is evaluated on the non-planar "Group" and "Room" scenes using the evaluation method of Chapter 4.

5.6.1 Evaluation of the "Doors" scene

The "Doors" scene consists of 10 images from a robot localization experiment. Figure 5.10 shows the images of the test set along with the detected MSCC regions. To comply with the evaluation framework, ellipses are fitted to the MSCC regions, i.e. the ellipse parameters are calculated from the covariance matrix of the interest points belonging to the region. We compare the repeatability score and the matching score of the MSCC detector to 4 other detectors under increasing viewpoint change up to 130°. For the matching score the SIFT descriptor [67] is used. Figure 5.11 shows the repeatability and matching score of the MSCC detector compared to the Maximally Stable Extremal Regions (MSER) [70], the Hessian-Affine regions (HESAFF) [73], the Harris-Affine regions (HARAFF) [73] and the intensity based regions (IBR) [112]. The experiment reveals a competitive performance of our novel detector compared to the other approaches, and the regions detected by our approach are consistently different from those of the other detectors (see also Section 5.7).

5.6.2 Evaluation of the "Group" and "Room" scene

Using the evaluation method described in Chapter 4 we compare the MSCC detector to 4 other local detectors on the "Group" and "Room" scenes. The repeatability and matching score of the MSCC detector is compared to the Maximally Stable Extremal Regions (MSER) [70], the Hessian-Affine regions [73], the Harris-Affine regions [73] and the intensity based regions (IBR) [112].

"Group" scene: Figure 5.12 shows the repeatability scores for the "Group" scene. The graph of the MSCC detector starts with a lower value than the other detectors. With increasing viewpoint change the repeatability score decreases at a rate similar to the other detectors; from 30° to 75°, however, it stays constant while the scores of the other detectors still decrease. For the last part, the MSCC detector matches the values of the Hessian-Affine and Harris-Affine detectors. Figure 5.12(b) shows that the MSCC detector produces as many regions as the MSER detector. Figure 5.14 shows the achieved matching scores. The matching score (relative to the number of possible matches) of the MSCC detector is competitive with the scores of the IBR, Harris-Affine and Hessian-Affine detectors and is only outperformed by the MSER detector. For the last part, the MSCC matching score is even higher than that of the Hessian-Affine and Harris-Affine detectors. Table 5.2 and Table 5.4 show the corresponding numbers.

"Room" scene: Figure 5.13 shows the repeatability scores for the "Room" scene. Up to 50° viewpoint change the repeatability score of the MSCC detector is similar to those of the MSER, Harris-Affine and Hessian-Affine detectors. For viewpoint changes of more than 50° the MSCC detector is second best, outperformed only by the IBR detector. Figure 5.15 shows the matching scores for the "Room" scene. None of the detectors achieves outstanding matching scores on this scene. The corresponding numbers are given in Table 5.3 and Table 5.5.

Figure 5.11: (a) Repeatability score for the "Doors" scene. (b) Matching score for the "Doors" scene.

5.7 Combining MSCC with other local detectors

This experiment evaluates the complementarity of the MSCC detector. This is done by counting the non-overlapping correctly matched regions from different detectors; regions from different detectors are counted as non-overlapping if they do not overlap by more than 40%. Matching is done using SIFT descriptors and nearest neighbor search (as implemented in Mikolajczyk's evaluation framework). The experiment is carried out on the "Doors" scene. Figure 5.16(a) shows the absolute number of matched MSER regions, of MSER regions combined with HESAFF regions, of the combination of MSER, HESAFF and HARAFF, of the combination of MSER, HESAFF, HARAFF and IBR, and of the combination of the previous detectors with the MSCC detector. Figures 5.16(b-e) show the region numbers for combining the MSCC detector with each of the other detectors individually. The graphs show that the MSCC detector is able to add a significant number of new matches to those of the other detectors. Figures 5.16(f) and (g) show an example for 120° viewpoint change: the dashed dark ellipses mark the matches from the combination of MSER, HESAFF, HARAFF and IBR, while the bright ellipses mark the additional matches obtained from the MSCC detector.

Figure 5.12: (a) Repeatability score for the "Group" scene. (b) Absolute number of correspondences.


Figure 5.13: (a) Repeatability score for the "Room" scene. (b) Absolute number of correspondences.


Figure 5.14: (a) Matching score for the "Group" scene relative to the number of detections. (b) Matching score for the "Group" scene relative to the number of possible matches. (c) Absolute number of correct matches.


Figure 5.15: (a) Matching score for the "Room" scene relative to the number of detections. (b) Matching score for the "Room" scene relative to the number of possible matches. (c) Absolute number of correct matches.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
10 | 76.3 / 180 | 48.5 / 586 | 56.4 / 457 | 57.6 / 364 | 41.5 / 207
15 | 67.8 / 164 | 45.0 / 543 | 50.9 / 412 | 52.6 / 318 | 37.5 / 187
20 | 59.8 / 143 | 39.9 / 482 | 46.7 / 358 | 49.8 / 308 | 34.7 / 173
25 | 56.0 / 126 | 36.3 / 436 | 43.5 / 359 | 48.8 / 290 | 30.5 / 152
30 | 57.7 / 127 | 36.3 / 439 | 40.6 / 341 | 48.5 / 275 | 24.4 / 122
35 | 55.6 / 120 | 33.8 / 408 | 41.2 / 353 | 47.2 / 265 | 20.4 / 101
40 | 50.7 / 114 | 33.6 / 406 | 37.2 / 236 | 45.3 / 267 | 21.2 / 106
45 | 50.9 / 112 | 31.0 / 375 | 34.0 / 253 | 44.3 / 255 | 21.8 / 109
50 | 46.6 / 108 | 31.4 / 379 | 35.6 / 258 | 46.8 / 256 | 23.4 / 117
55 | 47.2 / 110 | 32.0 / 387 | 34.4 / 279 | 45.0 / 247 | 22.4 / 112
60 | 45.9 / 111 | 32.3 / 390 | 33.1 / 268 | 44.2 / 258 | 23.2 / 116
65 | 43.3 / 107 | 29.7 / 359 | 31.9 / 258 | 43.1 / 249 | 22 / 110
70 | 44.1 / 109 | 29.4 / 355 | 30.7 / 249 | 39.1 / 236 | 22.2 / 111
75 | 41.2 / 100 | 28.6 / 346 | 26.9 / 218 | 42.2 / 232 | 21.1 / 105
80 | 42.5 / 97 | 26.5 / 320 | 26.2 / 212 | 43.3 / 229 | 25.2 / 111
85 | 41.7 / 91 | 25.5 / 293 | 26.0 / 206 | 43.3 / 218 | 25.4 / 101
90 | 37.9 / 83 | 26.2 / 292 | 25.6 / 188 | 40.8 / 200 | 25.6 / 91

Table 5.2: Repeatability score [%] / absolute number of correspondences for the "Group" scene with changing viewpoint.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
15 | 44.6 / 54 | 35.6 / 208 | 41.9 / 240 | 46.1 / 111 | 28.4 / 99
20 | 38.1 / 43 | 31.3 / 188 | 34.0 / 191 | 42.5 / 99 | 27.2 / 88
25 | 31.3 / 35 | 26.6 / 158 | 28.0 / 160 | 35.8 / 81 | 26.8 / 86
30 | 24.8 / 28 | 26.8 / 150 | 27.8 / 157 | 37.7 / 81 | 22.6 / 74
35 | 23.3 / 27 | 24.6 / 142 | 26.2 / 145 | 41.9 / 91 | 19.0 / 60
40 | 21.4 / 21 | 20.8 / 116 | 22.5 / 113 | 31.9 / 61 | 18.1 / 58
45 | 23.1 / 24 | 19.8 / 108 | 20.5 / 113 | 29.4 / 60 | 19.7 / 57
50 | 20.0 / 21 | 15.8 / 89 | 15.8 / 80 | 28.0 / 58 | 19.2 / 63
55 | 19.3 / 22 | 16.2 / 85 | 14.7 / 82 | 27.2 / 55 | 19.5 / 61
60 | 12.9 / 13 | 14.4 / 82 | 15.5 / 85 | 24.0 / 49 | 18.3 / 59
65 | 18.9 / 21 | 13.8 / 82 | 15.1 / 86 | 29.0 / 56 | 16.8 / 57
70 | 13.5 / 15 | 12.8 / 75 | 12.7 / 71 | 27.1 / 58 | 18.4 / 61
75 | 17.9 / 20 | 14.4 / 85 | 15.5 / 82 | 30.0 / 67 | 18.2 / 61
80 | 14.4 / 16 | 12.6 / 76 | 13.2 / 72 | 29.9 / 66 | 18.4 / 62
85 | 13.5 / 14 | 11.9 / 74 | 11.4 / 66 | 27.6 / 61 | 18.8 / 62
90 | 11.5 / 13 | 12.1 / 73 | 11.7 / 68 | 28.8 / 66 | 18.7 / 61

Table 5.3: Repeatability score [%] / absolute number of correspondences for the "Room" scene with changing viewpoint.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
10 | 66.9 / 75 / 162 | 29.4 / 31.6 / 355 | 34.3 / 36.8 / 280 | 33.5 / 37.3 / 212 | 13.8 / 27.5 / 69
15 | 56.5 / 69 / 140 | 24.3 / 26.9 / 293 | 28.3 / 31.2 / 231 | 26.9 / 30.1 / 163 | 9.2 / 19.0 / 46
20 | 43.3 / 57.3 / 106 | 18.8 / 22.1 / 227 | 21.6 / 25.6 / 176 | 23.8 / 26.9 / 147 | 11.8 / 24.8 / 59
25 | 36.8 / 50.6 / 85 | 16.3 / 20.5 / 196 | 19.1 / 23.7 / 156 | 22.7 / 27.8 / 135 | 10.6 / 24.2 / 53
30 | 33.6 / 46.3 / 76 | 14.3 / 18.7 / 173 | 13.6 / 17.6 / 111 | 18.5 / 23.1 / 105 | 8.2 / 21.4 / 41
35 | 27.0 / 38.2 / 60 | 11.5 / 15.5 / 139 | 14.1 / 17.9 / 115 | 19.4 / 24.1 / 109 | 5.1 / 18.3 / 25
40 | 24.2 / 37.6 / 56 | 10.8 / 14.2 / 131 | 11.3 / 15.2 / 92 | 17.3 / 21.4 / 102 | 5.2 / 18.4 / 26
45 | 24.8 / 37.3 / 56 | 8.7 / 11.8 / 105 | 10.5 / 14.2 / 86 | 14.2 / 17.8 / 82 | 3.0 / 10.4 / 15
50 | 18.5 / 29.5 / 44 | 8.2 / 11.3 / 99 | 9.1 / 11.8 / 74 | 14.8 / 16.7 / 81 | 5.6 / 19.1 / 28
55 | 20.5 / 32.7 / 49 | 6.5 / 8.6 / 78 | 7.1 / 9.4 / 58 | 13.2 / 15.8 / 73 | 4.2 / 13.6 / 21
60 | 14.5 / 24.2 / 36 | 6.7 / 9 / 81 | 6.1 / 8.2 / 50 | 13.4 / 16.9 / 79 | 5.0 / 16.9 / 25
65 | 14.1 / 24.5 / 36 | 7.5 / 10.3 / 90 | 6.5 / 9.1 / 53 | 11.1 / 13.8 / 65 | 3.2 / 11.4 / 16
70 | 11.4 / 20.1 / 30 | 4.8 / 6.7 / 58 | 5.2 / 7.5 / 42 | 10.6 / 14 / 67 | 4.0 / 15.8 / 20
75 | 9.4 / 17.7 / 25 | 5.2 / 7.7 / 63 | 5 / 7.9 / 41 | 10.5 / 13.6 / 62 | 2.8 / 11.4 / 14
80 | 9.6 / 18.1 / 25 | 4.0 / 5.8 / 48 | 4.5 / 7.2 / 37 | 7.4 / 9.2 / 44 | 2.4 / 9.5 / 12
85 | 6.3 / 12.7 / 16 | 4.2 / 6.3 / 51 | 3.6 / 6 / 29 | 7.4 / 9.6 / 43 | 2.2 / 8.3 / 11
90 | 6.2 / 13.2 / 16 | 2.5 / 3.8 / 30 | 3.6 / 6.1 / 29 | 6.1 / 8 / 35 | 1.4 / 5.4 / 7

Table 5.4: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Group" scene with changing viewpoint.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
15 | 19 / 50.8 / 31 | 12.8 / 21.7 / 91 | 16.6 / 24.8 / 106 | 12.8 / 19.8 / 36 | 5.4 / 24.1 / 21
20 | 9.2 / 29.4 / 15 | 8.5 / 14.6 / 60 | 11.5 / 18.4 / 73 | 12.8 / 22.9 / 36 | 2.7 / 10.8 / 10
25 | 5.5 / 20.5 / 9 | 7.3 / 14.1 / 51 | 9.0 / 17.4 / 57 | 8.9 / 18.7 / 25 | 2.9 / 11.7 / 11
30 | 4.3 / 20.6 / 7 | 6 / 12.2 / 42 | 5.7 / 11.3 / 36 | 7.8 / 16.4 / 22 | 2.3 / 12.3 / 9
35 | 3.7 / 18.2 / 6 | 3.2 / 6.7 / 23 | 3.9 / 8 / 25 | 4.6 / 9.6 / 13 | 1.3 / 7.7 / 5
40 | 4.6 / 25.9 / 7 | 2.7 / 7.1 / 19 | 2.6 / 6.5 / 16 | 4 / 11 / 11 | 0.3 / 2 / 1
45 | 3.7 / 20.7 / 6 | 0.9 / 2.4 / 6 | 1.9 / 4.8 / 12 | 3.9 / 11.3 / 11 | 0 / 0 / 0
50 | 3.7 / 24 / 6 | 2.4 / 7.6 / 17 | 1.0 / 3 / 6 | 1.5 / 4.6 / 4 | 0.3 / 1.9 / 1
55 | 2.5 / 13.8 / 4 | 1.6 / 5 / 11 | 0.9 / 2.9 / 6 | 2.1 / 6.1 / 6 | 1.3 / 7.9 / 5
60 | 0.7 / 5 / 1 | 0.6 / 1.8 / 4 | 0.5 / 1.6 / 3 | 2.6 / 8.6 / 7 | 0.8 / 5.2 / 3
65 | 1.8 / 10 / 3 | 0.6 / 1.7 / 4 | 0.3 / 0.9 / 2 | 2.2 / 6.2 / 6 | 1.2 / 8.2 / 5
70 | 1.2 / 9.5 / 2 | 0.6 / 1.7 / 4 | 0.2 / 0.6 / 1 | 1.1 / 2.9 / 3 | 1 / 6.4 / 4
75 | 1.2 / 7.1 / 2 | 0 / 0 / 0 | 0.6 / 2.2 / 4 | 1.8 / 4.3 / 5 | 0.7 / 4.3 / 3
80 | 0 / 0 / 0 | 0.7 / 2.3 / 5 | 0.2 / 0.5 / 1 | 1.1 / 2.8 / 3 | 0 / 0 / 0
85 | 0.6 / 4.2 / 1 | 0.4 / 1.4 / 3 | 0.5 / 1.6 / 3 | 1.8 / 4.1 / 5 | 0.5 / 3.2 / 2
90 | 1.2 / 10.5 / 2 | 0.3 / 0.9 / 2 | 0.5 / 1.6 / 3 | 1.1 / 2.5 / 3 | 0 / 0 / 0

Table 5.5: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Room" scene with changing viewpoint.


Figure 5.16: Absolute numbers of non-overlapping matched regions. (a) Combining all detectors. (b) Combining MSER and MSCC. (c) Combining IBR and MSCC. (d) Combining HARAFF and MSCC. (e) Combining HESAFF and MSCC. (f), (g) Matches for the combination of all detectors at 120° viewpoint change; the bright ellipses mark the additional matches obtained from the MSCC detector.


Chapter 6

Wide-baseline methods^1

^1 Based on the publication: F. Fraundorfer, K. Schindler, and H. Bischof. Piecewise planar scene reconstruction from sparse correspondences. Image and Vision Computing, 24(4):395–406, 2006 [38].

This chapter deals with image matching and 3D reconstruction for wide-baseline scenarios. The first section provides a solution to the problem of detecting corresponding regions in images taken from widely different viewpoints. The proposed method builds upon the detection of affine-invariant interest regions. Projective transformations introduced by the wide baseline are reduced by affine normalization; in the proposed method the remaining projective distortion is subsequently removed completely until both image patches are registered. In the registered image patches, point correspondences simply have the same pixel coordinates, and since the applied transformations are known, the pixel coordinates in the original image frame can be computed.

The second section describes a method to recover scene planes of arbitrary position and orientation from oriented images using homographies. Given at least 2 wide-baseline images, a piece-wise planar 3D reconstruction can be computed, and the input images are segmented into planar image parts. Planar regions are reconstructed using only sparse, affine-invariant sets of corresponding seed regions. These regions are iteratively expanded and refined using plane-induced homographies. The 3D reconstruction needs a calibrated setup, while the planar segmentation is possible for uncalibrated images too.

6.1 Wide-baseline region matching<br />

In the following, the wide-baseline region matching method is described, which is a key technique for the proposed map building and localization framework. The algorithm is a key ingredient of the plane segmentation and reconstruction method described in Section 6.2.1. The plane segmentation and reconstruction method is used to build piece-wise planar sub-maps (see Section 7.1). Another application of this algorithm is in linking sub-maps into a complete world map (see Section 7.1.5). It is also a key component of the global localization algorithm presented in Section 7.2. The method has been designed to exhibit the following properties:

1. Highly reliable matches, i.e. the algorithm produces a low number of outliers.

2. Exact point correspondences, i.e. with sub-pixel accuracy.

3. A high number of point correspondences.

1 Based on the publication:
F. Fraundorfer, K. Schindler, and H. Bischof. Piecewise planar scene reconstruction from sparse correspondences. Image and Vision Computing, 24(4):395-406, 2006 [38]




Property 1 is achieved by a two-step approach. In a first step, tentative correspondences are identified by nearest neighbor matching in feature space. The tentative matches, however, still contain a lot of outliers. In a second step the tentative matches are verified by area-based matching, calculating the correlation over the whole interest region. This verification step removes the remaining false matches with high reliability.

To achieve property 2, matching patches are exactly registered onto each other by an iterative registration procedure. Registration is performed with sub-pixel accuracy, which results in highly accurate point correspondences.

Unlike other approaches, this algorithm does not simply use the center point of a region match as the final correspondence. Instead, within the matched and registered image regions, new point correspondences are detected. Each matched image region yields about 20-50 new point correspondences (property 3). The registration is done by computing the inter-image homography for each region, which maps one region exactly onto the other. Therefore the method is restricted to planar interest regions; in fact, non-planar matches will be rejected by this method.

6.1.1 Matching and registration

Let us now have a closer look at the details of the method. It is a two-step approach consisting of generating tentative matches and verification (see Algorithm 2 for a compact description). First we describe the generation of the tentative matches. Input is a wide-baseline image pair I and I'. In each of the images interest regions are detected. We denote the set of interest regions in I with L and in I' with L'. The method is not restricted to one special detector; every affine interest region detector (see [76] for examples) can be used. After detection, a local affine frame (LAF) is computed for every region in L and L'. Next, the interest regions are normalized using the LAF. Normalization tries to remove the perspective distortion of a viewpoint change, so that two corresponding regions will appear almost identical. Some normalization methods create multiple normalized images for a single interest region; the multiple appearances are simply added to the region set. For the sets of normalized regions L and L', SIFT descriptors are extracted and stored in D and D'. Each entry in D and D' is a vector of length 128 describing the appearance of a normalized patch using orientation histograms. Corresponding interest regions can now be found by nearest neighbor search in this 128-dimensional feature space. For efficient matching a KD-tree K is built with the feature vectors in D'. Corresponding interest regions for the entries in D are then found by querying the KD-tree. The corresponding region for D_i is the closest feature in D' returned by the KD-tree query. As distance metric the Euclidean distance is used. To avoid random matches, a measure based on the ratio of the nearest to the second closest feature vector is used. A match is accepted if

d_0 / d_1 < d_th ,    (6.1)

where d_0 is the Euclidean distance between the query feature and the nearest neighbor, d_1 is the distance from the query feature to the second closest feature vector, and d_th is a user-set threshold. According to [67] an appropriate threshold is 0.8. We denote correspondences detected in this way as tentative matches. T is the set of tentative matches with T_i = (L_i, L'_j) and is the prerequisite for the verification step.
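The ratio test of Eq. (6.1) together with the KD-tree lookup is straightforward to implement. The following is a minimal sketch (not the thesis implementation), using SciPy's cKDTree on precomputed 128-dimensional descriptor arrays; the function and variable names and the default threshold of 0.8 follow the description above.

import numpy as np
from scipy.spatial import cKDTree

def tentative_matches(D, D_prime, d_th=0.8):
    """Return index pairs (i, j) of tentative matches D[i] <-> D_prime[j].

    D, D_prime: arrays of shape (n, 128) and (m, 128) holding the SIFT
    descriptors of the normalized interest regions of both images.
    """
    tree = cKDTree(D_prime)            # KD-tree over the second image's descriptors
    dists, idxs = tree.query(D, k=2)   # two nearest neighbors per query descriptor
    matches = []
    for i, ((d0, d1), (j0, _)) in enumerate(zip(dists, idxs)):
        # Eq. (6.1): accept only if the best match is clearly better
        # than the second best one.
        if d1 > 0 and d0 / d1 < d_th:
            matches.append((i, j0))
    return matches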

The tentative matches T are now verified by area-based matching. Correspondence is checked by normalized cross-correlation. This procedure is quite slow, but it is applied to the set of tentative matches only, which is significantly smaller than the initial set of detected regions. The cross-correlation is calculated on a registered pair of



interest regions. Due to the affine normalization, matching interest regions are already almost registered. This initial registration is improved by estimating the homography that transforms one interest region of the matching pair into the other one. This is done with an iterative method. Let us denote a match pair in T as t ↔ t'. First, a fixed number of interest points p' (we use Harris corners [40]) are detected in t'. This is justified by the fixed size of the patches of n_x × n_y pixels. We start with the assumption that both patches are already registered. Thus, we establish a set of point correspondences p ↔ p' with p = p' within the region. See Figure 6.1 for a step-wise illustration. p ↔ p' is a set of point pairs, represented by the blue crosses at k = 0 in the illustration. However, as the patches are not perfectly registered, the point matches in p ↔ p' do not represent the best matches. The point locations in p are shifted within a search window to the position of the optimal match (marked with the red cross in Figure 6.1). We define the optimal matching position as the one with the maximal correlation value. Finding the best position is done by searching: the correlation values for all pixel positions within a search window are calculated and the new position is the one with the highest value. The optimal position is refined to sub-pixel accuracy with an interpolation based on the correlation coefficients as described in [87]. Point matches with a maximal correlation value below a threshold c_th are removed from the set. The point matches established and refined in this way are used to compute a transformation which registers the patches t and t'. A homography h_k is estimated from p ↔ p' when at least 4 point correspondences could be established. Patch t is resampled by applying the homography h_k. This step is depicted in Figure 6.1 at k = 1. After the transformation, the difference between the guessed position (blue cross) and the optimal position (red cross) is diminished. However, a small difference still exists, because the calculated homography was not accurate enough². The process therefore needs to be iterated: point correspondences have to be established and refined again, and a new homography has to be computed and applied. Each such iteration registers t and t' more accurately. The process can be stopped when the difference between two successive iterations falls below a threshold ε, or when the estimated homography is equal to the identity matrix up to a given accuracy ε. If t and t' are exactly registered, the homography between both patches will be the identity matrix. The algorithm converges fast; usually fewer than 5 iterations are necessary (see the last row in Figure 6.1). To avoid artifacts introduced by iterative resampling, the subsequent transformations are concatenated and applied to the original image. See Algorithm 3 for a compact outline of the registration method.
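The exact sub-pixel interpolation of [87] is not reproduced here. A common alternative, shown below as a rough sketch under that assumption, fits a parabola through the correlation values around the integer peak and takes its vertex as the refined offset; names are illustrative only.

import numpy as np

def subpixel_peak(corr, x, y):
    """Refine an integer correlation peak (x, y) to sub-pixel accuracy.

    corr: 2D array of correlation values within the search window; the
    peak must not lie on the window border.
    Returns (x + dx, y + dy) from a 1D parabolic fit in each direction.
    """
    def parabolic_offset(c_minus, c_0, c_plus):
        denom = c_minus - 2.0 * c_0 + c_plus
        if abs(denom) < 1e-12:
            return 0.0
        return 0.5 * (c_minus - c_plus) / denom

    dx = parabolic_offset(corr[y, x - 1], corr[y, x], corr[y, x + 1])
    dy = parabolic_offset(corr[y - 1, x], corr[y, x], corr[y + 1, x])
    return x + dx, y + dy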

Point correspondences in the coordinate frame of the whole image can now be computed from every pixel location of the registered image pair. For every location in t' the corresponding location in t can be computed by inversely applying the homography sequence h_0, h_1, ..., h_n. A point location in t is p = h^{-1} p', where h is the composed homography sequence h = h_n ... h_1 h_0. p is now in the coordinate frame of the LAF. By applying the inverse affine transformation used for the patch normalization one gets back into the original image frame. t and t' were created by different LAFs, A_i and A'_i respectively. Point correspondences in the original image frame are given by:

p_o = A_i^{-1} p    (6.2)

p'_o = A'_i^{-1} h^{-1} p'    (6.3)
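A literal transcription of Eqs. (6.2) and (6.3) in numpy, assuming that the LAFs and the per-iteration homographies are stored as 3×3 matrices acting on homogeneous pixel coordinates (the helper names are illustrative, not the thesis code):

import numpy as np

def hom(p):
    """Lift a 2D point to homogeneous coordinates."""
    return np.array([p[0], p[1], 1.0])

def dehom(x):
    """Convert homogeneous coordinates back to a 2D point."""
    return x[:2] / x[2]

def original_frame_correspondence(p_prime, A_i, A_i_prime, hs):
    """Map a pixel p' of the registered patch back to both original images.

    p_prime:        location in the registered patch t'.
    A_i, A_i_prime: 3x3 LAF normalization matrices of t and t'.
    hs:             list of the registration homographies [h_0, ..., h_n].
    """
    h = np.eye(3)
    for h_k in hs:                                # h = h_n ... h_1 h_0
        h = h_k @ h
    p = np.linalg.inv(h) @ hom(p_prime)           # p = h^{-1} p'
    p_o = dehom(np.linalg.inv(A_i) @ p)           # Eq. (6.2)
    p_o_prime = dehom(np.linalg.inv(A_i_prime) @ np.linalg.inv(h) @ hom(p_prime))  # Eq. (6.3)
    return p_o, p_o_prime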

Multiple point correspondences obtained from a single region match are another main benefit of this method. Other wide-baseline matching methods return only a single point per region match; in [70], for example, only the center of gravity of the detected region is returned.

2 The accuracy of the applied sub-pixel interpolation is limited, thus a small deviation remains after one application of the warping.




Figure 6.1: Iterative registration procedure. At k = 0 the patches are aligned by LAF normalization only. Blue crosses denote the same location in both patches. The red cross indicates the position with the highest correlation value for the point location marked in the right patch. The dashed square illustrates the correlation window. The shifted point location (red) and the original point location (blue) are used to estimate a homography. The illustration shows only one point pair; for homography estimation additional point correspondences are established (≥ 4). The left patch is then resampled using the homography. After that we arrive at k = 1 and the procedure is repeated. Iteratively, new homographies are estimated and applied to the left patch until the patches are registered (see k = n). Usually this is achieved in a few iterations. Note that a part of the correlation window at k = 1 lies outside the defined image area; in the illustration the correlation window is drawn enlarged for readability. In the implementation one of course has to choose an appropriate window size to avoid problems at the borders.
on the borders.



Algorithm 2 Region matching algorithm
Detect interest regions in images I and I', resulting in interest region sets L and L'
Normalize each entry in L and L' with the LAF and resample to size 64 × 64 pixels
Compute SIFT descriptor for every entry in L and L', resulting in feature sets D and D'
Construct KD-tree K from feature set D'
for all entries in D do
  Query KD-tree K with D_i (the query returns the n closest feature vectors D'_closest and the Euclidean distances to the query feature D_i, d_closest = (d_0, d_1, d_2, ..., d_n), in ascending order)
  Store L_i <-> L_j, indexed by D_i <-> D'_closest,0, as tentative match in T if d_0 / d_1 < d_th
end for
for all entries in T do
  Register patches T_i = (t, t'). Registration returns correlation coefficient c_i, transfer distance e_i, homography h_i and point correspondences p_i within the patch
  Store T_i as final match in M if c_i > c_th ∧ e_i < e_th
end for

Algorithm 3 Registration
Input: t, t' ... image patches to register
Output: c ... correlation coefficient
Output: e ... point distance
Output: h ... homography matrix, to warp t' onto t
Output: p ... point matches in t
Output: p' ... point matches in t'
Detect n strongest Harris corners in t', store in p'
Initialize p ← p'
h ← 3 × 3 identity matrix
repeat
  for all entries in p do
    Compute d_i = (d_x, d_y) to maximize corr(p_i + d_i, p'_i, t, t')
  end for
  Remove p_i, p'_i with corr(p_i + d_i, p'_i, t, t') < c_th
  Estimate homography h_k (t → t') with p, p'
  h ← h h_k
  Warp t using h_k
  diff = ‖h_k − h_{k-1}‖
until (diff < ε)
c = (1/|p|) Σ_i corr(p_i + d_i, p'_i, t, t')
e = (1/|p|) Σ_i ‖d_i‖



6.2 Piece-wise planar scene reconstruction<br />

In this section we present a method for reconstructing the planar regions of a scene, which is useful for many man-made objects such as buildings or machinery parts. The approach works with inter-image homographies, which are a particularly interesting tool for the reconstruction of planar surfaces: they directly exploit the perspective mapping of planes and thus stay closer to the original data than methods which start with a conventional point-wise reconstruction and then segment the resulting point cloud or depth map. In the following we describe an automatic, image-driven method which simultaneously solves the region segmentation and the matching problem for the planar parts of a scene containing an unknown number of planar regions. This is achieved through a novel combination of state-of-the-art matching and 3D reconstruction methods. It uses well-defined interest points to initialize a piecewise planar model of the scene. Based on this initialization, the raw gray-values are used to refine the initial estimate and to achieve a planar scene segmentation.

Previous methods based on homographies either require lines to restrict each plane to a one-parameter family, require dense image matching, or deliver only sparse reconstructions [3-5, 94, 116, 119]. Our approach recovers scene planes of arbitrary position and orientation using only sparse point correspondences and homographies. Furthermore, the method delivers an approximate delineation of the detected planar object patches.

To obtain a Euclidean 3D reconstruction of the planar structure the camera setup needs to be calibrated, i.e. the projection matrices of all cameras must be known. However, the relations upon which the method is built are also valid for the uncalibrated case. Plane segmentation is then still possible, but plane reconstruction is limited to a projective reconstruction. In the following we assume a calibrated camera setup; the uncalibrated case is dealt with in Section 6.2.2.

6.2.1 Reconstruction using homographies

The idea when using homographies for planar reconstruction is to exploit the fact that a plane in 3D space, viewed by two perspective cameras, induces a homography between their two images. One can think of this as two consecutive perspective projections, one from the first image plane to the object plane and a second one from the object plane to the second image plane. Let the two cameras (without loss of generality) be given by their (3×4) projection matrices C_0 = [I|0] and C_1 = [A|a], and the plane by the homogeneous 4-vector p = [p_1, p_2, p_3, p_4]^T. Then the homography induced by p is given by [69]

H(p) = A + a v^T  with  v = -(1/p_4) (p_1, p_2, p_3)^T    (6.4)

The homography H(p) belongs to a subclass of homographies which has only 3 degrees of freedom, corresponding to the three parameters of a plane in 3D space. The constraints for this subclass are given by the epipolar geometry between the two images, which is encoded in the fundamental matrix F = [a]_× A:

H^T F + F^T H = 0    (6.5)

Given C_0, C_1 and p, the corresponding homography H(p) can be computed, and the image I_0 can be transformed with it: I'_0 = H I_0.




Figure 6.2: Detection of planar regions with homographies. The images of a plane p are related by a homography H, which transforms the first image I_0 to the second image I_1.

If a region in the scene is incident to p, the similarity between corresponding regions in I'_0 and I_1 will be high. A similarity measure S(p) such as the normalized cross-correlation can therefore be used to decide whether p describes the region. Furthermore, given C_0, C_1 and three or more corresponding point pairs on a planar region {x_0,i ↔ x_1,i}, the homography H(p) and the plane p can be computed. In this case the similarity S(p) between I'_0 and I_1 can be employed to find the image regions incident to p.

All these relations are already valid at the projective reconstruction level, since they are built upon the incidence relation, which is invariant under projective transformations.
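As a small numpy sketch of Eq. (6.4), assuming cameras in the canonical form C_0 = [I|0], C_1 = [A|a] and a plane given as a homogeneous 4-vector (variable names are illustrative, not the thesis code):

import numpy as np

def plane_induced_homography(C1, plane):
    """Compute H(p) = A + a v^T for cameras C0 = [I|0], C1 = [A|a].

    C1:    3x4 projection matrix of the second camera.
    plane: homogeneous plane vector (p1, p2, p3, p4) with p4 != 0.
    """
    A = C1[:, :3]
    a = C1[:, 3]
    p1, p2, p3, p4 = plane
    v = -np.array([p1, p2, p3]) / p4
    return A + np.outer(a, v)

# A point x0 on the plane's image in I0 maps to x1 ~ H(p) x0 in I1, so
# warping I0 with H(p) aligns the plane's projections in both views.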

6.2.2 Piece-wise planar reconstruction

The proposed reconstruction method starts with the detection of planar seed regions. Affine invariant detectors as described in Chapter 3 provide suitable seed regions. After detection and matching, the seed regions are grown by adding image points which are consistent with their respective plane-induced homographies. In an iterative framework, the detection of new points of a planar region is alternated with the optimal estimation of the homography based on the newly detected points³. This results in a segmentation of the images into scene planes and simultaneously in a 3D reconstruction of the segmented planes⁴. Algorithm 4 outlines the entire reconstruction method.

Initial homographies from sparse matches

Plane reconstruction starts with the detection of seed regions for the planes, i.e. corresponding image regions originating from a planar part of the scene. In a first step, interest regions are detected in both images of the image pair, leading to two sets of regions R_L, R_R. Region matching using the method described in Section 6.1 gives the set of corresponding regions M_L,R. Each

3 A similar iterative updating procedure has been employed in [89] for fundamental matrix estimation.
4 In the following we assume only two images, which we will call the 'left' image I_L and the 'right' image I_R (this is done only to make the explanation easier to read; the method can readily be extended to more than one 'right' image).



Algorithm 4 Piecewise planar reconstruction outline
Detect interest regions
Match regions (enforcing planarity constraint)
Estimate initial homographies from corresponding regions
repeat
  Grow regions by extrapolation of local homographies
  Generate new point correspondences in the extended regions
  Update homographies with new set of correspondences
until Homographies do not change anymore
Forward project planar regions onto 3D planes
matched pair M_L,R provides a set of point correspondences. These point correspondences are then used to locally estimate the plane-induced homography of the planar region.
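In the calibrated case, one way to obtain the initial plane of a seed region (and with it, via Eq. (6.4), its seed homography) is to triangulate its point correspondences and fit a plane to the 3D points. A rough sketch under these assumptions, not the thesis implementation, could look as follows:

import numpy as np

def triangulate(C0, C1, x0, x1):
    """Linear (DLT) triangulation of one correspondence x0 <-> x1."""
    A = np.vstack([
        x0[0] * C0[2] - C0[0],
        x0[1] * C0[2] - C0[1],
        x1[0] * C1[2] - C1[0],
        x1[1] * C1[2] - C1[1],
    ])
    X = np.linalg.svd(A)[2][-1]
    return X / X[3]

def fit_plane(points_h):
    """Algebraic least-squares plane through homogeneous 3D points (Nx4)."""
    # The plane p minimizing |points_h @ p| is the last right singular vector.
    return np.linalg.svd(np.asarray(points_h))[2][-1]

def initial_plane(C0, C1, pts0, pts1):
    """Plane of a seed region from its matched image points (calibrated case)."""
    X = [triangulate(C0, C1, x0, x1) for x0, x1 in zip(pts0, pts1)]
    return fit_plane(X)

The resulting plane vector can then be fed into the plane_induced_homography sketch given after Section 6.2.1 to obtain the initial homography of the seed region.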

Region growing

Starting from the corresponding planar seed regions, a region-growing scheme can be employed to find the remaining parts of the planar regions they belong to. For each plane, the initially estimated plane-induced homography H of the seed region is used to transform the right image: I'_R = H I_R. With the new image, the seed regions can be expanded by conventional region growing. The homogeneity criterion for adding a pixel x to the region is a high similarity between I_L(x) and I'_R(x). In our implementation, similarity is checked by thresholding the normalized cross-correlation (NCC) in the neighborhood of x. This concept is depicted in Figure 6.3.
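A minimal sketch of this homogeneity test, assuming the right image has already been warped with the current homography (e.g. with cv2.warpPerspective) and using a small window around each candidate pixel; the names and the window size are illustrative, not the thesis values:

import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized gray-value windows."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def is_plane_pixel(I_L, I_R_warped, x, y, win=7, ncc_th=0.5):
    """Homogeneity criterion: accept (x, y) if the local NCC between the
    left image and the homography-warped right image exceeds ncc_th."""
    h = win // 2
    patch_l = I_L[y - h:y + h + 1, x - h:x + h + 1]
    patch_r = I_R_warped[y - h:y + h + 1, x - h:x + h + 1]
    if patch_l.shape != (win, win) or patch_r.shape != (win, win):
        return False  # too close to the image border
    return ncc(patch_l, patch_r) > ncc_th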

Iterative homography improvement

Since each homography has been computed only from points within the seed region, using it for growing is an extrapolation, and the accuracy thus decreases rapidly with increasing distance from the seed region. Therefore, an iterative scheme is required: in the new, extended region, interest points are detected in the left image (our implementation uses the Harris detector). With the current estimate of the homography, these points are transferred to the right image and refined with the sub-pixel matching method of Lan and Mohr [62], which is reported to achieve a matching precision better than 0.1 pixels for selected interest points. With the new, larger set of accurate correspondences, the homography H is updated, and region growing is continued with a new, more accurate image I'_R.

A stopping criterion for the iteration can now easily be derived: if an iteration does not add new point correspondences to the point set, the homography estimate remains unchanged, and further iterations would not change the region anymore. Experiments show that the method converges fast; the algorithm generally finishes in fewer than 10 iterations.

To speed up region growing, a hierarchical representation is used. The images are downscaled during the intermediate growing steps, while the detection and matching of the interest points is done at full resolution. Let us assume that the input images are reduced by a factor N = 2^k. The speedup due to the reduction is twofold: firstly, the required area A_w of the correlation window decreases by a factor of N². The examples shown in Section 6.2.3 have been computed with N = 2 and A_w = (15 × 15) pixels. Secondly, the number of iterations decreases, since the tolerance for corresponding image points x_L and x'_R is raised from 1 to N pixels. After




Figure 6.3: Detecting planar regions with homographies. (a) Left image I_L. (b) Right image I_R. (c) Right image I'_R after transformation with the homography induced by the top plane. (d) Overlay of I_L and I'_R with two rectangular windows marked. (e),(f) The upper window in I_L and I'_R; the similarity is high. (g),(h) The lower window in I_L and I'_R; the similarity is low.

convergence, the final growing step is repeated in the full resolution images to obtain the optimal<br />

result.<br />

The uncalibrated case

So far, we have assumed a calibrated setup, i.e. the projection matrices of all cameras are known, and the principal aim was a Euclidean 3D reconstruction of the planar structures. However, the relations upon which the method is built are also valid in the uncalibrated case, when we have only a set of images with unknown camera parameters. In this case the algorithm can still recover the scene planes, and we will argue that for scenes with a lot of planar structure this facilitates the subsequent orientation and self-calibration.

Given the corresponding regions, the homography now has to be estimated from four correspondences, without using the as yet unknown epipolar constraint. As in the calibrated case, a robust estimator such as RANSAC [28] should be used to make sure that the estimate is not corrupted by any remaining matching errors. There is a subtle difference between the two methods here, which may lead to slight differences in the results: in the calibrated case, both the plane corresponding to the homography and the 3D point corresponding to the two image points are known in Euclidean space. Therefore, one can use the orthogonal distance from the point to the plane to find inliers. In the uncalibrated case, no Euclidean frame is available, hence we use the symmetric transfer error d(x_1, H x_0)² + d(x_0, H⁻¹ x_1)² in the image plane. Note that the uncalibrated case has a degenerate situation: if the camera which took the images underwent only a rotation around its projection center, then the two entire image planes are always related by a single homography, which is not due to any 3D plane.
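A minimal sketch of the symmetric transfer error used as the inlier test in the uncalibrated case (in practice one could also rely on cv2.findHomography with its RANSAC flag; the function names and the pixel threshold below are illustrative assumptions):

import numpy as np

def transfer(H, x):
    """Apply a 3x3 homography to a 2D point given as (x, y)."""
    v = H @ np.array([x[0], x[1], 1.0])
    return v[:2] / v[2]

def symmetric_transfer_error(H, x0, x1):
    """d(x1, H x0)^2 + d(x0, H^-1 x1)^2 for one correspondence x0 <-> x1."""
    e_fwd = np.sum((np.asarray(x1) - transfer(H, x0)) ** 2)
    e_bwd = np.sum((np.asarray(x0) - transfer(np.linalg.inv(H), x1)) ** 2)
    return e_fwd + e_bwd

def inliers(H, pts0, pts1, thresh=4.0):
    """Indices of correspondences consistent with H (threshold in pixels^2)."""
    return [i for i, (x0, x1) in enumerate(zip(pts0, pts1))
            if symmetric_transfer_error(H, x0, x1) < thresh]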



When dealing with scenes which contain a large amount of planar structure, recovering these structures beforehand can benefit subsequent structure-and-motion steps. During the growing stage, a large and well-distributed set of correspondences is recovered, which has already been checked for correctness, because the correspondences satisfy the homography, a stronger constraint than the fundamental matrix. This large and outlier-free point set enables a reliable and accurate estimation of the fundamental matrix. Note that as soon as at least two planar structures are found which are different and cannot be merged, it is guaranteed that we are not dealing with a degenerate case of motion estimation, since

1. the camera motion cannot be a pure rotation, otherwise all detected homographies would be the same and would eventually merge;

2. the recovered scene points cannot be coplanar, since that would again imply that all detected homographies would be the same.

Although we have not further investigated this issue, we conjecture that in the case of more than two images, the large amount of correct and well-distributed points would also benefit self-calibration to upgrade the projective reconstruction to a Euclidean one with a method such as [88].

In Section 6.2.3, some experiments are given for the uncalibrated case, which show that in practice the segmentation into planar scene parts is almost the same as for the calibrated case.

6.2.3 Experimental evaluation

In this section we present experiments on synthetic and real image data. First, we evaluate the reconstruction accuracy and the robustness of the method under large baseline changes on synthetic image data. Second, we show the performance of the method on practically relevant scenes in experiments with real image data.

Synthetic Images

The 'Cube' data-set consists of images with a resolution of 800×800 pixels, which have been rendered from a CAD model of a cube. Each plane has been textured differently using real-world images from the freely available 'Graffiti' image database of Mikolajczyk⁵. The interior and exterior orientation are known from the rendering. Seed regions were detected by an extended version of the salient region detector [31]. The following results were obtained using the calibrated method. Figure 6.4(a) shows the left image with the detected matching planar regions. Figure 6.4(b) shows the initial seed regions. The homographies and planes are calculated from the point correspondences gained in the region matching process. Figure 6.4(c-g) shows the intermediate steps of iterative region growing and homography estimation, as described in the previous section. Figure 6.4(h) shows the final segmentation of the image. The delineation has been improved by intersecting the final planes and snapping to the reprojected intersection lines. The reconstruction is very accurate: Table 6.1 compares the reconstructed edge lengths and the angles of the planes to the z-axis with the ground truth.

The detection and delineation of the planar scene regions can be regarded as a segmentation of the input images into planar regions. For the synthetic 'Cube' data-set, ground truth is also available for this segmentation process, i.e. the correct label is known for every pixel. A

5 'Graffiti' images from Krystian Mikolajczyk, available at http://www.robots.ox.ac.uk/~vgg/research/affine/



plane | edge length (ground truth) | edge length (reconstruction) | angle to z-axis [°] (ground truth) | angle to z-axis [°] (reconstruction)
1 | 1 | 1.0030 | 90 | 89.89
2 | 1 | 1.0029 | 0 | 0
3 | 1 | 1.0035 | 90 | 89.91

Table 6.1: Comparison of edge length and angle to z-axis with ground truth.

quantitative evaluation has therefore been carried out to assess the performance of the proposed algorithm. The algorithm was run on image pairs with increasing baseline and the pixel sets assigned to the visible planes were compared to the ground truth. We counted the number of pixels wrongly assigned to a plane (false positives) and the number of missed pixels (false negatives). No parameter tuning was allowed for different baselines; all image pairs were treated with the same parameter values, which are listed in Table 6.2. The experiment was conducted with both the calibrated and the uncalibrated method. The segmentation results of the evaluation are illustrated in Figure 6.5 (calibrated method). Numerical values for the calibrated method are given in Table 6.3 (top plane) and Table 6.4 (front plane); Figure 6.6(a) and Figure 6.7(a) show the corresponding graphs. The results for the uncalibrated method are given in Table 6.5 (top plane) and Table 6.6 (front plane); the corresponding graphs are shown in Figure 6.6(b) and Figure 6.7(b).

An important observation is that the proposed homography-based region-growing scheme can handle larger baselines than the employed region-matching method. The critical breakdown point was reached when the region matcher was no longer able to provide seed regions (in most cases at more than 60° viewpoint change), while at this point the regions could still be correctly recovered when starting from manually selected seed regions. The evaluation has therefore also been carried out with manual initialization, to show the capabilities of the homography-based region-growing scheme. The stability to viewpoint changes could thus be further improved if a better wide-baseline region matching method were available.

For every test case fewer than 5 iterations were necessary to obtain the resulting segmentation. In summary,

• ≈95% of the points on a visible planar region are correctly assigned (most of the missed pixels are due to homogeneous image regions),

• the rate of non-plane points assigned to a planar region is ≈1% (most of these wrongly classified pixels are located on depth edges at the border of the plane and are therefore difficult to match),

• the error rates are almost constant over a wide range of viewing angles and baselines.

The segmentation results of the calibrated and the uncalibrated method are comparable. However, the calibrated method seems to be more robust against outliers in the point sets used for the initial homographies, leading to more accurate estimates of the initial homographies. This is indicated by the front plane reconstruction at more than 65°, where the uncalibrated method was not able to grow the region from its initial homography while the calibrated method could.




Figure 6.4: Results for the synthetic 'Cube' data-set. (a) Left image with detected seed regions (salient region detector). (b) Seed regions. (c-g) Region growing iterations 1-5; the gray image parts depict the iteratively growing planar regions. (h) Final delineated segmentation. (i) View of the recovered 3D model.

6.2.4 Real Images

The 'Laptop' data-set consists of two images with a resolution of 2160×1440 pixels taken with a calibrated camera. The images were oriented and the described method was applied for reconstruction. Seed regions were detected by an extended version of the salient region detector [31]. Figure 6.8 shows the different steps leading from sparse correspondences to the final segmentation. Region growing converged after five iterations. The scene contains five major planes,



Figure 6.5: Segmentation results for the synthetic 'Cube' data-set with view angle changes from 5° to 75° (calibrated method).

which are more or less textured. Figure 6.8(f) shows that all five planes have been correctly detected and separated. Attention should be drawn to the table and the keyboard of the laptop: both areas are parallel and fairly close to each other, yet the method is accurate



parameter | value
cornerness threshold of Harris detector | 100
size of correlation window | (15×15) pixels
threshold for normalized cross correlation | 0.5

Table 6.2: Parameter values used for the quantitative evaluation of the algorithm. See text for details.

[Figure 6.6: two graphs of plane pixel percentages (Top FN, Top FP, Front FN, Front FP) over view angles from 5° to 75°.]

Figure 6.6: Comparison of 'Cube' plane reconstructions to ground truth for different view angles. Seed regions have been selected manually. (a) Calibrated method. (b) Uncalibrated method.

enough to allow a correct separation. One may notice that the reconstructed planes show holes in homogeneous image regions. In these regions the 'similarity' between the images does not convey any geometric information, hence no reliable reconstruction is possible. We refer to this as the 'safe' reconstruction. If we assume that homogeneous parts within a planar region are part of the region, the missing areas can be filled. This leads to nicer models, but of course this assumption is a heuristic and may in certain cases lead to an incorrect reconstruction. Figure 6.8(b) shows the variance of the gray-values within the correlation window of the left image (dark areas denote low variance in the correlation window, i.e. homogeneous regions). Figure 6.8(f)



[Figure 6.7: two graphs of plane pixel percentages (Top FN, Top FP, Front FN, Front FP) over view angles from 5° to 60°.]

Figure 6.7: Comparison of 'Cube' plane reconstructions to ground truth for different view angles. Seed regions have been matched automatically. (a) Calibrated method. (b) Uncalibrated method.

shows the 'safe' reconstruction of the scene, while the 3D model in Figure 6.8(g) has been created using the homogeneity assumption.

The 'Oberkapfenberg castle' data-set is an outdoor scene recorded with the same calibrated camera at a resolution of 2160 × 1440 pixels. The images were oriented and the described method was applied for reconstruction. As seed regions the MSER regions [70] were used. The eight major planes have been recovered; the results are shown in Figure 6.9. In this particular case, holes in the reconstruction are not exclusively due to homogeneous image parts: some walls we intended to reconstruct were not built completely planar. But even in this complex, partially cluttered scene a reconstruction of the overall structure has been possible.

The results of the experiments can be summarized as follows. The experiments with synthetic data showed that the method allows an accurate reconstruction of the scene planes. They also demonstrate that the method can cope with large baseline changes with almost constant error rates. The experiments on real scenes show the application to practically relevant reconstruction tasks; in both scenes the major planes could be reconstructed. The experiments revealed that in real-world scenes difficulties with non-textured regions and with not completely planar structures occur. However, we showed that it is possible to overcome these



viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 123432 6592 ( 5.34 ) 1082 ( 0.88 ) 7684 ( 6.23 ) 812 ( 0.66 )<br />

10 123491 6131 ( 4.96 ) 1217 ( 0.99 ) 6255 ( 5.07 ) 1203 ( 0.97 )<br />

15 123505 6267 ( 5.07 ) 1505 ( 1.22 ) 6615 ( 5.36 ) 1471 ( 1.19 )<br />

20 123518 6685 ( 5.41 ) 1430 ( 1.16 ) 6239 ( 5.05 ) 1907 ( 1.54 )<br />

25 123558 6405 ( 5.18 ) 1354 ( 1.10 ) 5970 ( 4.83 ) 1763 ( 1.43 )<br />

30 123540 6264 ( 5.07 ) 1389 ( 1.12 ) 6147 ( 4.98 ) 1465 ( 1.19 )<br />

35 123597 6264 ( 5.07 ) 1246 ( 1.01 ) 6175 ( 5.00 ) 1363 ( 1.10 )<br />

40 123540 6227 ( 5.04 ) 1219 ( 0.99 ) 6351 ( 5.14 ) 1185 ( 0.96 )<br />

45 123564 6193 ( 5.01 ) 980 ( 0.79 ) 6135 ( 4.97 ) 1012 ( 0.82 )<br />

50 123559 6355 ( 5.14 ) 824 ( 0.67 ) 6291 ( 5.09 ) 863 ( 0.70 )<br />

55 123594 6511 ( 5.27 ) 612 ( 0.50 ) 6339 ( 5.13 ) 739 ( 0.60 )<br />

60 123544 6700 ( 5.42 ) 519 ( 0.42 ) 5948 ( 4.81 ) 911 ( 0.74 )<br />

65 123571 7043 ( 5.70 ) 394 ( 0.32 ) — —<br />

70 123551 7114 ( 5.76 ) 380 ( 0.31 ) — —<br />

75 123507 7393 ( 5.99 ) 373 ( 0.30 ) — —<br />

Table 6.3: Comparison of ’Cube’ top plane reconstruction to ground truth <strong>for</strong> different view<br />

angles (calibrated method).<br />

viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 200935 11396 ( 5.67 ) 1157 ( 0.58 ) 11120 ( 5.53 ) 2226 ( 1.11 )<br />

10 196926 11070 ( 5.62 ) 1166 ( 0.59 ) 11046 ( 5.61 ) 1291 ( 0.66 )<br />

15 190304 12596 ( 6.62 ) 1257 ( 0.66 ) 12284 ( 6.45 ) 1664 ( 0.87 )<br />

20 181260 12373 ( 6.83 ) 1308 ( 0.72 ) 12805 ( 7.06 ) 864 ( 0.48 )<br />

25 169992 11247 ( 6.62 ) 1251 ( 0.74 ) 11536 ( 6.79 ) 886 ( 0.52 )<br />

30 156766 9953 ( 6.35 ) 1235 ( 0.79 ) 9974 ( 6.36 ) 1092 ( 0.70 )<br />

35 141781 8615 ( 6.08 ) 1167 ( 0.82 ) 8655 ( 6.10 ) 1051 ( 0.74 )<br />

40 125572 6865 ( 5.47 ) 1061 ( 0.84 ) 6753 ( 5.38 ) 1169 ( 0.93 )<br />

45 108269 4878 ( 4.51 ) 1003 ( 0.93 ) 4861 ( 4.49 ) 949 ( 0.88 )<br />

50 90277 3450 ( 3.82 ) 954 ( 1.06 ) 3405 ( 3.77 ) 980 ( 1.09 )<br />

55 72018 2352 ( 3.27 ) 902 ( 1.25 ) 2305 ( 3.20 ) 905 ( 1.26 )<br />

60 53702 1788 ( 3.33 ) 914 ( 1.70 ) — —<br />

65 35725 2127 ( 5.95 ) 879 ( 2.46 ) — —<br />

Table 6.4: Comparison of ’Cube’ front plane reconstruction to ground-truth <strong>for</strong> different view<br />

angles (calibrated method).<br />

problems by using some heuristic assumptions.



viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 123432 8014 ( 6.49 ) 754 ( 0.61 ) 7791 ( 6.31 ) 764 ( 0.62 )<br />

10 123491 6027 ( 4.88 ) 1477 ( 1.20 ) 6487 ( 5.25 ) 1019 ( 0.83 )<br />

15 123505 6388 ( 5.17 ) 1553 ( 1.26 ) 6632 ( 5.37 ) 1376 ( 1.11 )<br />

20 123518 6800 ( 5.51 ) 1360 ( 1.10 ) 6385 ( 5.17 ) 1667 ( 1.35 )<br />

25 123558 6449 ( 5.22 ) 1356 ( 1.10 ) 7662 ( 6.20 ) 1072 ( 0.87 )<br />

30 123540 6285 ( 5.09 ) 1327 ( 1.07 ) 6010 ( 4.86 ) 1381 ( 1.12 )<br />

35 123597 6325 ( 5.12 ) 1320 ( 1.07 ) 6446 ( 5.22 ) 1151 ( 0.93 )<br />

40 123540 6340 ( 5.13 ) 1250 ( 1.01 ) 6339 ( 5.13 ) 1199 ( 0.97 )<br />

45 123564 6168 ( 4.99 ) 1004 ( 0.81 ) 6158 ( 4.98 ) 1005 ( 0.81 )<br />

50 123559 6245 ( 5.05 ) 861 ( 0.70 ) 6225 ( 5.04 ) 864 ( 0.70 )<br />

55 123594 6339 ( 5.13 ) 738 ( 0.60 ) 6274 ( 5.08 ) 749 ( 0.61 )<br />

60 123544 6493 ( 5.26 ) 609 ( 0.49 ) 5950 ( 4.82 ) 897 ( 0.73 )<br />

65 123571 6190 ( 5.01 ) 646 ( 0.52 ) — —<br />

70 123551 6591 ( 5.33 ) 572 ( 0.46 ) — —<br />

75 123507 6915 ( 5.60 ) 485 ( 0.39 ) — —<br />

Table 6.5: Comparison of ’Cube’ top plane reconstruction to ground truth <strong>for</strong> different view<br />

angles (uncalibrated method).<br />

viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 200935 11038 ( 5.49 ) 2545 ( 1.27 ) 11087 ( 5.52 ) 2346 ( 1.17 )<br />

10 196926 11312 ( 5.74 ) 1021 ( 0.52 ) 10855 ( 5.51 ) 1519 ( 0.77 )<br />

15 190304 12503 ( 6.57 ) 1417 ( 0.74 ) 12316 ( 6.47 ) 1626 ( 0.85 )<br />

20 181260 12222 ( 6.74 ) 1430 ( 0.79 ) 12487 ( 6.89 ) 1002 ( 0.55 )<br />

25 169992 11179 ( 6.58 ) 1331 ( 0.78 ) 11122 ( 6.54 ) 1285 ( 0.76 )<br />

30 156766 9852 ( 6.28 ) 1248 ( 0.80 ) 9979 ( 6.37 ) 948 ( 0.60 )<br />

35 141781 8530 ( 6.02 ) 1257 ( 0.89 ) 8486 ( 5.99 ) 1322 ( 0.93 )<br />

40 125572 6810 ( 5.42 ) 1165 ( 0.93 ) 6736 ( 5.36 ) 1165 ( 0.93 )<br />

45 108269 4851 ( 4.48 ) 995 ( 0.92 ) 4847 ( 4.48 ) 993 ( 0.92 )<br />

50 90277 3396 ( 3.76 ) 930 ( 1.03 ) 3388 ( 3.75 ) 932 ( 1.03 )<br />

55 72018 2330 ( 3.24 ) 886 ( 1.23 ) 2310 ( 3.21 ) 860 ( 1.19 )<br />

60 53702 1516 ( 2.82 ) 867 ( 1.61 ) — —<br />

Table 6.6: Comparison of ’Cube’ front plane reconstruction to ground-truth <strong>for</strong> different view<br />

angles (uncalibrated method).




Figure 6.8: Results for the 'Laptop' data-set. (a) Left image with detected seed regions (salient region detector). (b) Confidence map (dark regions denote low variance in the correlation window, i.e. homogeneous regions). (c-e) First three region growing iterations. (f) Final segmentation. (g) 3D model (homogeneous areas have been filled).




Figure 6.9: Results for the 'Oberkapfenberg castle' data-set. (a) Left image with detected seed regions (MSER detector). (b) Final segmentation. (c) 3D model ('safe' reconstruction).


Chapter 7<br />

Living in a piecewise planar world 1<br />

This chapter explains how the most important tasks in mobile robotics, map building and localization, can be accomplished using the wide-baseline methods described in the previous chapters. The approach to be presented differs in multiple points from previous work. One notable property of the new approach is that a dense 3D reconstruction augmented with partial texture is used as the world representation. Current vision-based SLAM approaches as described in Chapter 2 use much simpler primitives as world representation, like 3D lines, 3D points or small planar fiducial markers. Irrespective of the primitives used, previous approaches created only sparse world representations. Throughout this chapter we describe the advantages of our method over previous methods and explain the new method in detail. We will show that the proposed world representation yields valuable benefits. A second novelty of our method is that global localization is possible from a single landmark correspondence only. This enables localization in extreme situations, e.g. when large occlusions occur. Large occlusions or major temporary scene changes pose a big challenge for state-of-the-art localization methods. Robot localization is deeply connected to the underlying world representation, and our method of localizing with a single landmark is made feasible by the new world representation.

The great potential of the proposed world representation resides in the use of 3D plane patches as map primitives instead of 3D points and 3D lines. The geometrical constraints introduced by plane primitives proved extremely valuable, definitely being worth the more complex map building algorithms. However, by using 3D plane primitives we introduce a strong assumption into our world representation, namely that the world is piece-wise planar. The world contains a lot of structure which cannot be modelled by simple plane primitives, and some may consider this assumption too strict. But locally, a piece-wise planar approximation will always come close to the original structure. Moreover, man-made places contain a high degree of planar structure, and most robotic platforms are only capable of driving indoors. Furthermore, the 3D reconstructions used as maps serve robot localization only, so it is not necessary to model all the details; it is only necessary to model enough detail to allow successful localization. In return, the following particular benefits are gained by the piece-wise planar world description:

Localization from a single landmark: A single 3D landmark is a small planar patch with 6 3D parameters. This gives more constraints for pose estimation than a single 3D

1 Based on the publication:
F. Fraundorfer and H. Bischof. Global localization from a single feature correspondence. In Proc. 30th Workshop of the Austrian Association for Pattern Recognition, Obergurgl, Austria, pages 151-160, 2006 [34]




point landmark. In fact, a single plane match already allows pose estimation, while this is not possible with a single 3D point landmark (see the sketch after this list).

Additional geometric constraints: Landmarks which are located on one and the same 3D plane are connected by geometric constraints. Plane projective relations are much more restrictive than general projective relations; a planar homography can be used very efficiently to verify feature matches geometrically.

Feature reduction: By selecting only landmarks located on 3D planes, the number of features stored in the map is reduced significantly. The map uses less memory, and the computation time for feature matching naturally depends on the number of features. The selection also increases robustness and reliability: non-planar features may change their appearance more significantly than planar features under viewpoint changes. Such landmarks are a source of ambiguities in feature matching, and mismatches, which cause problems in pose estimation, will occur more frequently.

Easier matching of planar landmarks: State-of-the-art wide-baseline methods assume that landmarks undergo a planar projective transformation under viewpoint change. The currently most advanced matching methods approximate this projective transformation by an affine transformation to create viewpoint-normalized descriptors. Landmarks located on 3D corners strongly violate this assumption; such features would cause trouble for matching algorithms and should not be stored as landmarks in the map.

Increased accuracy: The accuracy of the 3D reconstruction can be increased with plane information. 3D point reconstructions are coupled by geometric constraints, and the 3D coordinates can be optimized to lie exactly on a plane.
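To illustrate the first point, a single planar landmark carries enough geometry for a full 6DOF pose. The sketch below is an illustration, not the thesis algorithm of Section 7.2: it assumes the four corner points of a planar map patch are known in 3D together with their measured image projections and a calibrated camera, and uses OpenCV's planar PnP solver.

import numpy as np
import cv2

def pose_from_planar_patch(corners_3d, corners_2d, K):
    """6DOF camera pose from a single planar landmark.

    corners_3d: 4x3 array, the patch corners in map coordinates (coplanar).
    corners_2d: 4x2 array, the corresponding image measurements in pixels.
    K:          3x3 camera calibration matrix.
    Returns (R, t) mapping map coordinates into the camera frame.
    """
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K, distCoeffs=None,
        flags=cv2.SOLVEPNP_IPPE,  # PnP variant for planar targets
    )
    if not ok:
        raise RuntimeError("pose estimation failed")
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec

A single 3D point landmark, in contrast, constrains only the viewing ray and can never fix all six pose parameters.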

In the following, a batch method for map building is presented. Input for the method is an image sequence acquired by a mobile robot equipped with a single perspective camera. The camera needs to be calibrated beforehand. Structure-from-motion algorithms and wide-baseline stereo methods are applied to build the piece-wise planar world representation. The created map can then be used for purely vision-based global localization: a mobile robot equipped with a single perspective camera can estimate its pose with respect to the world map from a single camera image.

The localization approach to be presented is analogous to [56] in that it computes the robot pose from 3D-2D point correspondences. The novelty is the use of small planar patches as 3D landmarks and that the pose can be computed from a single landmark correspondence. This allows localization under extreme conditions, where other methods, which usually require a high number of correspondences, would fail. The novel localization approach is presented in the second part of this chapter.

7.1 Map building<br />

The world is represented as a network of linked metric sub-maps (see Figure 7.1 for an illustration). Each sub-map has its own local coordinate system, and each link between two sub-maps represents a rigid transformation (containing rotation, translation and scaling) connecting both local coordinate systems. Thus it is possible to express a position within a specific sub-map in each local coordinate system. Furthermore, each sub-map contains the transformation into


Figure 7.1: The world is represented as a network of linked metric sub-maps.<br />

one global world coordinate system, yielding one big metric world representation. Robot localization is done within the scope of a single sub-map: the pose is initially expressed in the local coordinate system but can be transferred into the global world coordinate system with the corresponding transformation. A single sub-map is created by 3D reconstruction from a short-baseline image pair. The links between the sub-maps are established via wide-baseline feature matching. Map building is treated as an off-line process. Images are acquired by one or multiple robots (either controlled manually or using additional sensors, e.g. a laser range finder). From this unordered pile of images the environment map is constructed in three steps. In a first step, the image pile is partitioned into smaller piles containing similar images, which will correspond to sub-maps. Next, single sub-maps are created using two images of each smaller pile. In a last step, the individual sub-maps are linked to form the complete world representation. The map created in this way can then be used on a mobile robot equipped only with a single camera for global localization within the mapped environment. In the following, the three steps are outlined in detail; global localization within the proposed map is dealt with subsequently.

7.1.1 Sub-map identification

Starting point is a large set of images I_1, ..., I_n taken at a high frame rate. We assume that the ordering of the images is not known, i.e. that we do not know which images are subsequent to others. The task of this step is to partition the whole set into sub-sets C_1, ..., C_c containing images with short-baseline variation only. Each partition will then act as a sub-map. The partitioning is done by means of clustering: a global similarity criterion is used to group visually similar images into clusters. The requirement for the images in each partition is that a stereo reconstruction is possible.


Figure 7.2: Sub-map identification: An image sequence is partitioned into clusters of visually similar images. Each cluster acts as a sub-map. Images within one partition should show small baseline variations only. There should be some overlap between images from subsequent clusters.

Furthermore, the images in adjacent partitions should have an overlapping part (necessary for sub-map linking). Figure 7.2 illustrates the partitioning of an image sequence.

As similarity measure the Euclidean distance between SIFT descriptors [67] is used. For each image a single SIFT feature vector is computed from a low-resolution version of the image; for feature extraction the images are re-sampled to 64 × 64 pixels. Each image is thus represented by a single feature vector of length 128. This results in n feature vectors x_1, ..., x_n corresponding to the images I_1, ..., I_n, with

x = (x_1, ..., x_128) ∈ ℝ^128.  (7.1)

This results in a 128-dimensional feature space. Visually similar images will form clusters, and the partitions can be found by clustering. Simple k-means clustering [25] worked well on this problem. The algorithm was run with different initial cluster numbers and the solution yielding the most compact clusters was selected. Alternatively, algorithms could be used which do not require an initial guess for the number of clusters, e.g. hierarchical clustering [25] or mean shift clustering [18]. Clustering returns c sets of feature vectors C_1, ..., C_c and the corresponding cluster centers x̄_1, ..., x̄_c. The cluster center is the mean value of the feature vectors of a cluster, written as

x̄_i = (1 / |C_i|) ∑_{x_j ∈ C_i} x_j.  (7.2)

For each cluster two images are chosen to represent the sub-map; the remaining images are not processed further. The two selected images must allow a 3D stereo reconstruction as well as landmark extraction. The selection of the two images is done in feature space. The first image is the one corresponding to the median cluster center. We define the median cluster center as

x_median = argmin_{x_j ∈ C} |x_j − x̄|.  (7.3)

The second image is selected in the following way. For each feature vector within the cluster the Euclidean distance to the feature vector of the first image x_median is calculated. The image corresponding to the feature vector with median distance is then selected as second image:

x = argmedian_{x_j ∈ C} |x_j − x_median|.  (7.4)

To verify the selected images, region matching as described in Chapter 6 is performed. The region matches must satisfy the epipolar constraint, otherwise another image of the cluster gets selected. The sub-map identification step also works as data reduction: from the initially large set of n images only 2c images (c ≪ n) are passed on to the subsequent steps.
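To make the selection rule concrete, the following Python/NumPy sketch implements equations 7.3 and 7.4 for one cluster. It assumes the per-image feature vectors and a cluster assignment (e.g. from an off-the-shelf k-means implementation) are already available; the function and variable names are illustrative only, not part of the original implementation.

```python
import numpy as np

def select_representative_images(X, labels, cluster_id):
    """Pick the two representative images of one cluster (eqs. 7.3 and 7.4).

    X       : (n, 128) array of per-image global SIFT feature vectors
    labels  : (n,) cluster assignment for every image (e.g. from k-means)
    returns : indices of the first and second representative image
    """
    idx = np.flatnonzero(labels == cluster_id)   # images of this cluster
    C = X[idx]                                   # their feature vectors
    center = C.mean(axis=0)                      # cluster center (eq. 7.2)

    # First image: feature vector closest to the cluster center (eq. 7.3).
    d_center = np.linalg.norm(C - center, axis=1)
    first = int(np.argmin(d_center))

    # Second image: median distance to the first image's vector (eq. 7.4).
    d_first = np.linalg.norm(C - C[first], axis=1)
    order = np.argsort(d_first)
    second = int(order[len(order) // 2])

    return idx[first], idx[second]
```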

7.1.2 Sub-map creation

Let us define a sub-map as a 9-tuple

S = ⟨T_L^W, K, I, L, C, Π, D, A, P_L⟩.  (7.5)

Table 7.1 provides a quick overview of the sub-map components; in the following they are discussed in detail. The key components of the sub-map are the landmarks. Landmarks are interest regions detected in image I_i. The position in the image and the 3D position in the local coordinate system of the sub-map are known. A landmark description in the form of a feature vector is available too; it allows the detection of corresponding landmarks. Only image regions which are planar are used as landmarks. I, L, D, Π and A are used to store the landmarks. L is a set of size n containing one image patch for each landmark (n is the number of detected landmarks). The image patches are stored normalized (and re-sampled) with a size of 64 × 64 pixels. Normalization is done by applying an affine transformation. The normalization transformation is different for each landmark and describes how to transform the image patch of the landmark from the original image coordinate system into a canonical coordinate system.


Sub-map component   Description
T_L^W               rigid transformation into the global coordinate system
K                   camera calibration matrix
I                   plane index image
L                   landmark image patches
C                   plane covariances
Π                   3D planes
D                   landmark SIFT descriptors
A                   landmark normalization transformations
P_L                 camera matrix of the local coordinate system

Table 7.1: Components of the piece-wise planar sub-map.

A is a set of size n holding a transformation for each landmark in L; an entry of A is an affine transformation matrix of size 3 × 3. D is a set of size n of feature vectors providing a description for each landmark. Each entry of D is a SIFT feature vector of length 128, computed from the corresponding normalized image patch in L. I and Π are used to represent the 3D coordinates of a landmark. Π is a set of size p of 3D plane descriptions of the sub-map, where p is the number of planes detected in the sub-map. Each plane is described by a 6-vector (parameterized with normal vector and one 3D point) representing the 3D parameters within the local coordinate system. Each landmark is located on one of these planes in 3D space. The corresponding mapping is stored in I, an index image holding the information which pixel in the image space corresponds to which plane in 3D. The map also contains uncertainties for the 3D planes: the set C of size p contains covariance matrices for the different 3D planes. K and P_L define the local coordinate system. P_L is the camera matrix which connects the 3D planes to the image coordinates, and K is the corresponding 3 × 3 camera calibration matrix. T_L^W represents a rigid transformation into the global coordinate system. It is a 4 × 4 similarity transformation matrix (rotation, translation, scale) which transforms 3D points from the local into the global coordinate system. A sub-map defined in this way contains all necessary information for global localization. In the following the computation of the various entries is described.
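As a concrete illustration, the 9-tuple of equation 7.5 could be held in a small container such as the following Python sketch. The field names mirror Table 7.1, but the array shapes and storage choices are assumptions made for illustration, not part of the original definition.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SubMap:
    """Piece-wise planar sub-map S = <T_L^W, K, I, L, C, Pi, D, A, P_L> (eq. 7.5)."""
    T_WL: np.ndarray                   # 4x4 similarity transform into the world frame
    K: np.ndarray                      # 3x3 camera calibration matrix
    index_image: np.ndarray            # I: per-pixel plane identifier
    patches: List[np.ndarray]          # L: n normalized 64x64 landmark patches
    plane_covs: List[np.ndarray]       # C: p covariance matrices of the 3D planes
    planes: np.ndarray                 # Pi: p x 6 plane parameters (normal + point)
    descriptors: np.ndarray            # D: n x 128 SIFT descriptors of the patches
    norm_transforms: List[np.ndarray]  # A: n affine 3x3 normalization transforms
    P_L: np.ndarray                    # 3x4 camera matrix of the local frame
```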

7.1.3 Structure computation

We will now describe how to extract the 3D map structure from two images. From the sub-map identification step a short-baseline stereo image pair I, I′ is established. The goal of the structure computation is to identify planes in the image scene and compute a 3D reconstruction of these planes. The reconstruction will only contain planes; non-planar image parts are discarded. This results in a piece-wise planar 3D reconstruction of the scene. The segmentation of the scene into planar regions is done with the method described in Chapter 6. A prerequisite for this method is that the camera poses are known. Thus in a first step the camera positions for the images I, I′ have to be computed. DoG interest points are detected in both images I, I′ and SIFT descriptors D, D′ are computed for every detected interest point. Corresponding points are identified by nearest-neighbor search in feature space.
are identified by nearest neighbor search in feature space. As distance measure the Euclidean


As distance measure the Euclidean distance is used. Two features correspond if

d_01 / d_02 < t  (7.6)

where d_01 is the Euclidean distance between the query feature and the nearest feature point from D′, d_02 is the Euclidean distance to the second-closest feature, and t is a distance threshold. Good results can be achieved with t = 0.8. This distance ratio test has been suggested by Lowe [67] for SIFT feature matching. The feature correspondences established in this way can now be used for estimating the camera poses. As already mentioned, we assume calibrated cameras, i.e. the calibration matrix K is known. Thus we can estimate the essential matrix, which encodes the camera positions of the two viewpoints. Essential matrix estimation is performed using the 5-point algorithm of Nister [83] within a standard RANSAC scheme [28]. The essential matrix E can be decomposed into two camera matrices P, P′ where P is the canonical camera matrix and P′ defines the second camera position in the local canonical coordinate frame (see equations 7.7 and 7.8):

P = [I | 0]  (7.7)

P′ = [R | t]  (7.8)

R is a 3 × 3 rotation matrix and t is a translation 3-vector defining the baseline of the stereo pair. P, P′ are input parameters for the subsequent plane segmentation and reconstruction step. The algorithm also requires, as input, initial guesses for small planar regions in the images I, I′ together with the corresponding inter-image homographies. For that, MSER regions are detected in I and I′. Region matching is performed (as described in Section 6.1), which returns point correspondences within each interest region and the corresponding homography transform. This constitutes the initial guesses for the plane segmentation algorithm. Plane segmentation and reconstruction is then performed, yielding the following map components:

• An index image I.

• A set of detected and reconstructed planes Π. Each plane is represented by a 6-vector giving the full 3D parameters in the local coordinate frame.

• Covariances for each plane giving an uncertainty measure for the reconstruction accuracy.

The structure computation is completed by a post-processing step. Planes which extend behind the camera planes are removed. This consistency check deletes incorrectly reconstructed image parts, resulting in higher robustness. The different steps of structure reconstruction are illustrated in Figure 7.3.
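As an illustration of the first part of this pipeline (ratio-test matching followed by essential-matrix estimation and decomposition), the sketch below uses OpenCV as a stand-in for the components described above. OpenCV's findEssentialMat also uses a 5-point solver inside RANSAC, but the parameter values and function boundaries here are illustrative assumptions, not the exact implementation used in this work.

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K, ratio=0.8):
    """Match SIFT features with the ratio test (eq. 7.6) and recover R, t."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Ratio test: keep a match only if d01 / d02 < ratio.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    raw = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in raw if m.distance < ratio * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Essential matrix via a 5-point solver inside RANSAC, then decompose
    # into P = [I|0] and P' = [R|t] (eqs. 7.7, 7.8).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```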

7.1.4 Landmark extraction

Up to now the sub-map is still missing its essential components, the landmarks. Closing this gap is the goal of the next step: landmark appearance has to be connected to 3D information and incorporated into the sub-map. In the following we describe an approach using MSER interest regions as landmarks. However, the definition of the sub-map is general enough to allow the use of any other kind of interest region. The necessary steps are:


Figure 7.3: Sub-map creation: (a),(b) Short-baseline image pair (with landmarks shown). (c) Index image resulting from plane segmentation. (d) Reconstructed 3D planes. (e) Planar landmarks in 3D.

• Detection of interest regions (MSERs)

• Normalization using a local affine frame (LAF)

• SIFT descriptor extraction

• Computation of the 3D coordinates of the landmarks (by projection onto the corresponding 3D plane)

First, interest regions are detected in one of the images of the sub-map's short-baseline pair, in our case MSER regions. Each region is represented by its region border, which is simply stored as a list of the image coordinates of every border pixel. Next, normalization of the regions is performed. We use one of the methods described in [85]: the border is searched for points of maximal concavity or convexity. Two such points A, B together with the region's center of gravity C define a local affine frame (LAF). CA and CB are the axes of a 2D coordinate system that has undergone an arbitrary affine transformation. Normalization can be done by applying a transformation which restores the orthogonality of CA and CB and scales the axes to unit length. The LAF is transformed into a canonical coordinate system, defined as a 64 × 64 pixel image patch, and the extracted MSER region is re-sampled into this normalized frame. Multiple LAFs can be constructed for a single MSER region, yielding different normalized MSER regions. In our framework each new normalization is simply added to the set of landmarks L.
a single MSER region yielding different normalized MSER regions. In our framework each new


In a next step the SIFT descriptor is computed from the normalized MSER regions. The size of 64 × 64 is well suited for computing the SIFT orientation histogram. For each normalized MSER region a feature vector of length 128 is computed; D is the set of all feature vectors of the extracted landmarks. Next, for each landmark the corresponding 3D coordinates are computed. This is done with the index image I representing the segmentation into scene planes. First, the plane corresponding to each landmark has to be identified. The index image I basically works as a look-up table: every gray-value in the index image I works as a plane identifier. For every pixel within a landmark we look at the same pixel position in I and read the plane identifier, which indexes the planes in Π. For robustness, we build a histogram over the looked-up plane identifiers, and the plane with the maximal histogram value gets assigned to the landmark. This approach allows us to deal with imperfect segmentation. Now for every landmark pixel the 3D coordinate can be calculated by computing the intersection of the corresponding 3D plane with the ray connecting the pixel in the image plane and the camera center. This results in an exactly planar 3D reconstruction of the landmark. Furthermore it is not necessary to explicitly store the 3D coordinates; they can simply be computed when needed from the index image I and the plane set Π. An illustration of reconstructed landmarks in 3D is given in Figure 7.3(e).
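The plane look-up and back-projection just described can be written down compactly. The sketch below assumes the sub-map camera is the canonical one (P = [I | 0], so viewing rays are K⁻¹x) and that a plane is stored as a normal n and a point p on the plane; the names and the pixel ordering are illustrative assumptions.

```python
import numpy as np

def landmark_plane_id(index_image, landmark_pixels):
    """Assign a landmark to a plane by voting over the index image I."""
    ids = [index_image[v, u] for (u, v) in landmark_pixels]   # (u, v) = (col, row)
    values, counts = np.unique(ids, return_counts=True)
    return int(values[np.argmax(counts)])                     # most frequent plane id

def backproject_to_plane(x, K, plane_normal, plane_point):
    """Intersect the viewing ray of pixel x with a 3D plane (n, p).

    Assumes the canonical camera P = [I|0], so the ray is X(s) = s * K^{-1} x_h.
    """
    ray = np.linalg.inv(K) @ np.array([x[0], x[1], 1.0])
    s = (plane_normal @ plane_point) / (plane_normal @ ray)   # ray-plane intersection
    return s * ray                                            # exact 3D point on the plane
```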

7.1.5 Sub-map linking

Let us consider two sub-maps,

A = ⟨T_L^W(A), K^(A), I^(A), L^(A), C^(A), Π^(A), D^(A), A^(A), P_L^(A)⟩  (7.9)

and

B = ⟨T_L^W(B), K^(B), I^(B), L^(B), C^(B), Π^(B), D^(B), A^(B), P_L^(B)⟩.  (7.10)

Each sub-map defines its own local coordinate system. The coordinate systems are Euclidean and differ by

• an arbitrary scale factor s,

• a 3 × 3 rotation matrix R,

• and a translation vector t of length 3.

Together this represents a scaled rigid point transformation (or similarity transform). 3D points p^(B) in sub-map B can be transformed into the coordinate frame of sub-map A by

p^(A) = s(R p^(B) + t).  (7.11)

In homogeneous coordinates the sequence of transformations can be encapsulated by

RT = s [ R  t ; 0ᵀ  1 ].  (7.12)

RT is a 4 × 4 transformation matrix acting on homogeneous 3D points. The transformation is written as

p_h^(A) = RT p_h^(B)  (7.13)

where p_h^(A) and p_h^(B) are the homogeneous counterparts of p^(A) and p^(B). The goal of sub-map linking is to estimate the values of the parameters s, R and t. The necessary parameters can be estimated from 3D point correspondences between the two sub-maps with the following steps:


• Establishing 3D point correspondences

• Calculating the scale factor s

• Estimating R and t of the rigid transformation

Let us first focus on the generation of 3D point correspondences. By wide-baseline region matching (see Section 6.1), landmark correspondences between both sub-maps are detected. For matching, the already extracted features L^(A), D^(A) and L^(B), D^(B) can be used. Sub-map linking is possible from a single region match; a higher number of matches will, however, increase the robustness of the method. Let us continue with the case of a single region match only. The region matching returns multiple point correspondences q ↔ q′ (on the order of 20-100) within the region match. As already described in Section 7.1.4, 3D coordinates can be computed by projecting q and q′ onto their corresponding planes. The resulting 3D points Q and Q′ are defined in the local coordinate systems of the sub-maps. Point correspondences in 3D, Q ↔ Q′, are directly known from the 2D point correspondences and do not contain outliers². The scale s between the sub-maps is the first parameter we estimate from the 3D point correspondences Q ↔ Q′. Two point pairs Q_i ↔ Q′_i and Q_j ↔ Q′_j are arbitrarily selected from Q ↔ Q′. The scale change from sub-map B to A is defined as

s = ‖Q_i − Q_j‖ / ‖Q′_i − Q′_j‖  (7.14)

where ‖Q_i − Q_j‖ is the Euclidean distance. Before we continue with the next steps, the scaling transform has to be applied to the points Q′ from sub-map B, resulting in the scaled coordinates

Q′_s = sQ′.  (7.15)

After scaling, the two 3D point sets Q and Q′_s differ only by a rigid transformation. The rigid transformation parameters R and t are computed from Q ↔ Q′_s using the quaternion-based method described by Horn [47]. Now all parameters of the similarity transform are known and it is possible to transform each 3D point in the local coordinate system of sub-map B into the coordinate frame of sub-map A.
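The alignment step can be sketched in a few lines of NumPy. For brevity the rigid part below uses the well-known SVD-based (Kabsch/Umeyama) absolute-orientation solution instead of Horn's quaternion formulation; on outlier-free correspondences both recover the same R and t. This is a minimal illustration, not the implementation used in the thesis.

```python
import numpy as np

def link_submaps(Q, Qp):
    """Estimate s, R, t aligning sub-map B points Qp with sub-map A points Q.

    Q, Qp : (n, 3) arrays of corresponding, outlier-free 3D points.
    The scale follows eq. 7.14; points are mapped as p_A = R (s p_B) + t,
    an equivalent parameterization of the similarity transform of eq. 7.11.
    """
    # Scale from one arbitrary point pair (eq. 7.14), then apply it (eq. 7.15).
    s = np.linalg.norm(Q[0] - Q[1]) / np.linalg.norm(Qp[0] - Qp[1])
    Qs = s * Qp

    # Rigid alignment of the centered, scaled point sets (SVD-based).
    cQ, cQs = Q.mean(axis=0), Qs.mean(axis=0)
    H = (Qs - cQs).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                 # proper rotation, det(R) = +1
    t = cQ - R @ cQs
    return s, R, t
```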

We have just developed the method to link two sub-maps. Let us now focus on the problem of linking a number of n sub-maps into one complete environment map. The idea is to store the information about which sub-maps are linked in a graph-like structure, together with the corresponding coordinate transforms. For that we adapt the approach introduced by Schaffalitzky et al. [92], where a set of n images is spatially organized by means of wide-baseline feature matching. The key ingredient is full two-view image matching including epipolar geometry estimation; two views get linked only if the epipolar geometry could be established successfully. However, this usually requires a high overlap in the image data, which is in general not the case with our image data; in particular, we are interested in the cases with small overlapping area. In our approach we start with the computation of a distance matrix over all occurring landmarks. We use the already detected landmarks contained in the sub-map representation; in detail, we calculate a distance matrix over all descriptors D of each sub-map. As distance metric the normalized cross-correlation is used. The correlation value has the favorable property of being limited between −1 and 1, which allows the use of absolute thresholds.

² Dealing with outliers in the registration of two 3D point clouds is very challenging. The way the 3D point correspondences are generated in our method eases the solution of this problem.


The size of the distance matrix is N × N, where N = |D_0| + |D_1| + ... + |D_n|. Each entry of the distance matrix represents a tentative link, and in the following the tentative links are verified, starting with the link with the highest correlation value. In the first iteration the match with the highest correlation value is searched and verified with the wide-baseline matching method described in Section 6.1. If the match is confirmed, the sub-maps are linked with the previously described method using the detected landmark correspondences. A graph structure G is established where the sub-maps represent the nodes and the detected links between sub-maps are the edges; a confirmed match adds an edge to G. Once a match is confirmed, no further links are searched for the two already linked sub-maps. If the match could not be confirmed, the match with the second-highest correlation value is investigated. The algorithm ends if all sub-maps are linked or if all entries of the distance matrix have been processed. This could lead to a worst-case computational complexity of O(N²); in practice, however, one can also end the algorithm if the correlation values of the remaining match pairs drop below some threshold c_th.

Please note that this algorithm does not guarantee a completely linked graph G: G can contain isolated clusters or individual nodes. However, this depends only on the provided image data. A complete environment map can be constructed by acquiring additional images.
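A minimal sketch of this greedy linking loop is given below. It operates on a per-pair correlation matrix (a simplification of the full N × N descriptor-level matrix) and delegates the wide-baseline verification and the similarity-transform estimation to the routines described above, represented here by the hypothetical callables verify_match and estimate_link; the threshold value is likewise only an assumption.

```python
import itertools

def build_map_graph(submaps, ncc, verify_match, estimate_link, c_th=0.7):
    """Greedy sub-map linking from pairwise landmark correlations.

    ncc[i][j]     : best normalized cross-correlation between descriptors of
                    sub-maps i and j (simplified per-pair score)
    verify_match  : hypothetical wide-baseline verification routine
    estimate_link : hypothetical similarity-transform estimation (s, R, t)
    """
    edges = []
    # Tentative links, best correlation first.
    pairs = sorted(itertools.combinations(range(len(submaps)), 2),
                   key=lambda ij: -ncc[ij[0]][ij[1]])
    for i, j in pairs:
        if ncc[i][j] < c_th:                  # remaining candidates too weak
            break
        matches = verify_match(submaps[i], submaps[j])
        if matches:                           # confirmed: add an edge to the graph G
            edges.append((i, j, estimate_link(submaps[i], submaps[j], matches)))
    return edges
```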

7.2 Localization

In the following we define the pose of the robot to be the rotation and position of the single camera mounted on the robot. The pose is defined within a local sub-map by a rotation R and a translation t with reference to the origin of the coordinate system. The origin of the local sub-map coincides with the camera center of the sub-map's landmarks. This relation is illustrated in Figure 7.4.

The matched landmarks define 2D ↔ 2D point correspondences between the current view π_i and the sub-map's view π_0. 3D points can be created by projecting the 2D points onto the 3D planes of the sub-map. This yields 3D ↔ 2D correspondences from which the pose R, t can be computed. The pose estimation method proposed by Lu et al. [68] is known to be fast and robust, and it can deal with planar landmarks, i.e. 3D points located on a plane, which makes the algorithm suitable for our world representation. Pose estimation within such a local sub-map S gives the pose R_L, t_L within the local coordinate frame of the sub-map. The goal is now to compute the pose of the robot R_W, t_W within the global world coordinate system W on the basis of the pose estimates within local sub-maps S, each in a different local coordinate frame L. Each sub-map contains the necessary transform to the global coordinate frame, denoted T_L^W. T_L^W is a 4 × 4 point transform matrix: a 3D point X_L in the sub-map is transformed into the global coordinate system by X_W = T_L^W X_L. The following steps outline the pose estimation and the transfer of the pose into the global coordinate frame.

From region matching, 2D−2D point correspondences x ↔ x′ between a map image and the current image are retrieved; x are the points from the map image and x′ are the points from the current view. The next step is to create 3D−2D point correspondences out of the 2D−2D point correspondences. Projecting the points x onto the corresponding map plane Π_L gives the according 3D points X_L. The 3D points X_L are in the local coordinate frame of the sub-map L defined by the camera matrix P_L = K[R_L|t_L]. Next, the 2D points from the current image are normalized by the calibration matrix K, resulting in x̂′ = K⁻¹x′. Normalization resolves some numerical issues (see Section 7 in [44]). Now the 3D−2D correspondences X_L ↔ x̂′ necessary for the pose estimation algorithm have been set up. Pose estimation is performed and returns R_L, t_L, which is the pose of the camera for the current camera image in the coordinate frame of the local sub-map.


Figure 7.4: The pose of the robot is defined as rotation R and translation t from the origin of the local sub-map.

Transforming R_L, t_L into the global coordinate system using T_L^W is done as follows. First the camera center C_L is expressed explicitly as

C_L = −R_L^T t_L.  (7.16)

The camera center C_L can be transformed directly using the point transform T_L^W:

C_W = T_L^W C_L.  (7.17)

Then the rotation R_L is transformed into the world coordinate frame using only the rotational part R_L^W of the decomposition

T_L^W = [R_L^W | t_L^W] S_L^W,  (7.18)

R_W = R_L^W R_L.  (7.19)

Having computed R_W and C_W, the camera matrix in the world coordinate system P_W can be set up:

P_W = K [R_W | −R_W C_W],  (7.20)

where K is the camera calibration matrix.
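Equations 7.16-7.20 translate directly into a few lines of linear algebra. The sketch below assumes T_L^W is stored as a 4 × 4 homogeneous similarity matrix with the scale folded into its upper-left 3 × 3 block and a last row of (0, 0, 0, 1); variable names are illustrative and the code is not taken from the original implementation.

```python
import numpy as np

def local_pose_to_world(R_L, t_L, T_WL, K):
    """Lift a pose estimated in a sub-map frame into the world frame (eqs. 7.16-7.20)."""
    C_L = -R_L.T @ t_L                              # camera center, eq. 7.16
    C_W = (T_WL @ np.append(C_L, 1.0))[:3]          # eq. 7.17

    A = T_WL[:3, :3]                                # s * R_WL
    s = np.cbrt(np.linalg.det(A))                   # uniform scale factor
    R_WL = A / s                                    # rotational part, eq. 7.18
    R_W = R_WL @ R_L                                # eq. 7.19

    P_W = K @ np.hstack([R_W, (-R_W @ C_W).reshape(3, 1)])   # eq. 7.20
    return R_W, C_W, P_W
```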

7.2.1 Localization from a single landmark

The situation of pose estimation from a single landmark is illustrated in Figure 7.5. The 3D−2D point correspondences from within a single landmark are basically outlier-free, which is assured by the registration process. The 3D points are located exactly on a plane, because they are computed by projecting 2D points onto a scene plane; thus they do not contain noise. The 2D−2D point correspondences, however, are obtained by correlation-based matching and are therefore assumed to be distorted by noise. We assume that the 2D points within the landmark from the current view are distorted by Gaussian noise. In the following we check experimentally how this Gaussian noise influences the pose estimation accuracy.

Figure 7.5: Pose estimation from 3D ↔ 2D point correspondences. The 3D points are exact, the 2D points are assumed to be disturbed by Gaussian noise. The effect of the noise is that the rays from the point correspondences do not intersect exactly at the camera center.

The influence of Gaussian-noise-distorted 2D points is evaluated with synthetic data. Figure 7.6 shows the results of the Lu and Hager pose estimation for our special situation. Pose estimation for synthetic 3D−2D point correspondences has been performed with noise added to the 2D coordinates of the landmark points from the query image only. Gaussian noise of standard deviation σ = 0.1, 0.3 and 0.7 (in pixels) was added to the 2D points, and the experiment was repeated 1000 times. In Figure 7.6 each point denotes an estimated camera position. Figures 7.6(a-c) show the distribution of the camera position for Gaussian noise with σ = 0.1 in different views; the blue cross marks the camera position computed without noise. Noisy 2D coordinates create a characteristic distribution around the true position: perpendicular to the line connecting the true position and the 3D coordinates the points are spread out widely, while the depth distribution is small. This experiment shows that Gaussian noise significantly influences the pose estimation from a small number of point correspondences within a small image region (landmark).
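A simulation of this kind can be reproduced in a few lines. The sketch below uses OpenCV's iterative PnP solver as a stand-in for the Lu-Hager method and an invented planar landmark geometry, so it only illustrates the procedure, not the exact experiment reported here.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 400.0], [0.0, 800.0, 300.0], [0.0, 0.0, 1.0]])

# Synthetic planar landmark: a small grid of 3D points on the plane Z = 2.
xs, ys = np.meshgrid(np.linspace(-0.1, 0.1, 5), np.linspace(-0.1, 0.1, 5))
X = np.stack([xs.ravel(), ys.ravel(), np.full(xs.size, 2.0)], axis=1)

# Ground-truth view: camera at the origin looking along +Z.
x_true, _ = cv2.projectPoints(X, np.zeros(3), np.zeros(3), K, None)
x_true = x_true.reshape(-1, 2)

positions = []
for _ in range(1000):
    x_noisy = x_true + rng.normal(0.0, 0.3, x_true.shape)    # sigma = 0.3 px
    ok, rvec, tvec = cv2.solvePnP(X, x_noisy, K, None,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if ok:
        R, _ = cv2.Rodrigues(rvec)
        positions.append((-R.T @ tvec).ravel())               # estimated camera center

positions = np.array(positions)
print("std of estimated camera centers:", positions.std(axis=0))
```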

Next we investigate a way to alleviate the influence of noise in the 2D−2D point correspondences. We analyze the effect of estimating the pose from small samples of correspondences instead of using all 3D ↔ 2D correspondences. In the following experiment the pose is estimated from 1000 random samples of size 5, 10 and 20 out of 56 correspondences obtained by correlation-based matching. The estimated poses are shown in Figure 7.7. Sub-sampling generates a distribution of poses around the pose computed with all available correspondences.


Figure 7.6: Effect of Gaussian noise added to the 2D points in pose estimation. The blue dot marks the noise-free pose estimate. (a-c) σ = 0.1 (d) σ = 0.3 (e) σ = 0.7

In fact, the pose estimated from all correspondences is optimal with respect to the point correspondences from within the landmark. However, we can evaluate the solutions by means of additional correspondences from outside our landmark. Let us assume a set of 3D ↔ 2D point correspondences distributed over the whole image, denoted Q ↔ q. A pose estimate R, t can then be used to compute 2D coordinates q̂ with

q̂ = [R|t] Q.  (7.21)


Figure 7.7: Effect of pose estimation from random samples. The blue dot marks the pose estimate using all available correspondences. (a) Sample size = 5 (b) Sample size = 10 (c) Sample size = 20

A re-projection error ɛ can be defined as the summed distance between q and q̂,

ɛ = ∑_i ‖q_i − q̂_i‖.  (7.22)

Given multiple pose estimates, the most accurate one can be identified as the one with the smallest re-projection error ɛ. We analyzed the re-projection error for the pose estimates shown in Figure 7.7. The re-projection error is coded in the point color: a dark green coded pose has a re-projection error smaller than the median re-projection error, while for a light green coded pose the re-projection error is larger than the median error. The results are shown in Figure 7.8. The pose estimate using all available point correspondences is marked as a blue dot; the pose estimate with the smallest re-projection error is the red dot. It is evident that the best pose estimate does not coincide with the all-points solution. Furthermore, the color coding reveals the area in 3D space where the best pose estimates are located. The figure also shows that the distribution gets more compact when bigger sample sets are used; a small sample size will create widely spread-out hypotheses.
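The hypothesis scoring used here amounts to projecting the additional 3D points with each candidate pose and summing the 2D residuals (eq. 7.22). A minimal NumPy sketch with illustrative names is given below; it projects with the calibration matrix K, so if the 2D points have already been normalized by K, the identity matrix can be passed instead.

```python
import numpy as np

def reprojection_error(R, t, K, Q, q):
    """Sum of 2D distances between observed points q and projections of Q (eq. 7.22)."""
    Q_cam = (R @ Q.T + t.reshape(3, 1)).T           # 3D points in the camera frame
    proj = (K @ Q_cam.T).T
    q_hat = proj[:, :2] / proj[:, 2:3]              # perspective division
    return np.linalg.norm(q - q_hat, axis=1).sum()

def best_hypothesis(hypotheses, K, Q, q):
    """Select the pose (R, t) with the smallest re-projection error."""
    return min(hypotheses, key=lambda h: reprojection_error(*h, K, Q, q))
```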

The conclusion is that the all-points solution does not guarantee the best pose estimate. A sub-sampling scheme computing multiple pose estimates will in general also contain a better pose estimate than the all-points solution, and by scoring the different hypotheses with the re-projection error the best hypothesis can be selected. However, when dealing with a single landmark match, additional 3D−2D point correspondences are not available to compute a re-projection error. A scoring method is needed for which additional landmark matches are not necessary; in the next section we introduce such a method.

Figure 7.8: Pose estimates color coded with the re-projection error ɛ. Dark green dots: ɛ ≤ median, light green dots: ɛ ≥ median. The blue dot marks the pose estimate using all available correspondences. The red dot marks the pose estimate with the smallest re-projection error. (a) Sample size = 5 (b) Sample size = 10 (c) Sample size = 20. The plots show the translation components t_x, t_y, t_z.

7.2.2 The local plane score

In the previous section we explained how additional 3D−2D point correspondences can be used to score a pose hypothesis. In the absence of additional landmark matches, however, no such point correspondences are available. In the following we introduce a new measure, the local plane score (lp-score, Π_l-score). The lp-score is based on information implicitly stored in the piece-wise planar structure of the world map. It will be shown that with the lp-score good hypotheses can be selected, similarly to the re-projection error ɛ.

We use the fact that each landmark is a small planar patch Π_l and in most cases is part of a bigger plane structure Π (see Figure 7.9 for an illustration). Given a pose hypothesis P_h it is possible to create 2D ↔ 2D point correspondences for the complete extent of the plane Π. We call Π the support area of Π_l. 3D ↔ 2D point correspondences within the support area can be created by projecting image locations onto the 3D plane Π.


Figure 7.9: The landmark (green plane) is part of a bigger planar structure (blue plane). Given a pose estimate C_i, 2D−2D point correspondences for the bigger structure can be computed between π_0 and π_i. The transfer error that the homography estimated from the landmark point set achieves on the additional points defines a quality measure for the pose estimate.

The created 3D points can then be projected into the image plane of the current view I, which is subject to pose estimation, using the pose hypothesis P_h. The 2D ↔ 2D point correspondences Q = (q, q′) created in this way are noise-free and exact. We want to stress that these point correspondences are not necessarily located in the field of view of the current image I; the correspondence is determined purely geometrically based on the pose hypothesis P_h. From landmark matching we already have a set of 2D ↔ 2D point correspondences Q_l = (q_l, q′_l) within the landmark. For Q_l a homography H_l can be computed which relates q_l and q′_l by

q′_l = H_l q_l.  (7.23)

This relation must also hold for the point correspondences Q of the support region if they are computed from a correct pose hypothesis. Let q̂′ be the 2D points obtained from an incorrect hypothesis with

q̂′ = H_l q.  (7.24)

Then q̂′ ≠ q′ and the difference can be quantified with the transfer error ɛ_t (see Figure 7.9 for an illustration). The transfer error ɛ_t is defined as

ɛ_t = ∑_i ‖q′_i − q̂′_i‖.  (7.25)

We denote the transfer error ɛ_t as our local plane score (lp-score). The lp-score is computed for every pose estimate. Figure 7.10 shows some results: each point is a pose estimate and the lp-score is coded in the point color. The poses corresponding to the n smallest lp-scores are coded dark green, the remaining poses light green; n is set to 10% of the number of all poses. The pose estimate using all available point correspondences is marked as a blue dot, the pose estimate with the smallest lp-score as a red dot. The results show that the lp-score is consistent with the results from the re-projection error but can be computed from a single landmark and map information.
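The lp-score therefore only needs the landmark homography and the synthetic support-region correspondences generated from the map plane and the pose hypothesis. A hedged sketch (OpenCV only for the homography fit, otherwise plain NumPy, with illustrative names) is:

```python
import cv2
import numpy as np

def lp_score(H_l, q, q_prime):
    """Local plane score: homography transfer error over the support region (eq. 7.25).

    H_l       : 3x3 homography fitted to the landmark correspondences (eq. 7.23)
    q, q_prime: (n, 2) synthetic support-region correspondences generated from
                the map plane and the pose hypothesis
    """
    q_h = np.hstack([q, np.ones((len(q), 1))])       # homogeneous 2D points
    mapped = (H_l @ q_h.T).T
    q_hat = mapped[:, :2] / mapped[:, 2:3]           # eq. 7.24
    return np.linalg.norm(q_prime - q_hat, axis=1).sum()

# The landmark homography itself can be fitted with a (normalized) DLT, e.g.:
# H_l, _ = cv2.findHomography(q_l, q_l_prime, 0)     # 0 = plain least squares
```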

Figure 7.10: Pose estimates color coded with the lp-score. Dark green dots: ɛ ≤ median, light green dots: ɛ ≥ median. The blue dot marks the pose estimate using all available correspondences. The red dot marks the pose estimate with the smallest lp-score. (a) Sample size = 5 (b) Sample size = 10 (c) Sample size = 20. The plots show the translation components t_x, t_y, t_z.

7.2.3 Algorithms

In the following we describe two algorithms for pose estimation from a single landmark. The first one uses an epipolar criterion to select the best solution; this method can only be applied if other landmark matches are available to compute the epipolar distance. The second one uses the lp-score. Both algorithms are very similar, differing only in the scoring function.

Epipolar criterion method

Let us start with landmark extraction and matching. Similar to the sub-map linking step, MSER regions are extracted from the image of the current view. A LAF is computed for each region and region normalization is performed. A SIFT descriptor is computed from the normalized image patch. Landmark correspondences are detected with the method of Section 6.1. The subsequent pose estimation step requires planar landmarks, thus the planarity has to be checked for each image region; non-planar landmarks are discarded. During map building, images are segmented into piece-wise planar areas: if a landmark is located within the detected planar areas, its planarity is confirmed. The region matching algorithm establishes point correspondences within the support region of the feature (the area the SIFT descriptor is computed from). For each landmark we now compute a set of pose hypotheses Q. For this, 3D−2D point correspondences are first computed from the 2D−2D point correspondences by projection onto the 3D plane. Multiple hypotheses are generated by drawing smaller samples (of size p) from the whole set of 3D−2D correspondences. For each of the n random sub-sets S the pose is computed as follows. First the 5-point algorithm [83] is used to create an initial rotation R from the 2D−2D point correspondences. With this initial rotation R the pose is computed from the 3D−2D correspondences using the iterative method of Lu and Hager [68]. The pose hypothesis is added to Q if it represents a valid configuration, i.e. the 3D points are in front of the camera. This is repeated for all matched landmarks. Once all pose hypotheses have been created, the epipolar distance is computed for each hypothesis and the pose with the smallest epipolar distance is selected. The algorithm is outlined in Algorithm 5.

Algorithm 5 Pose estimation from a single planar landmark (solution selection with epipolar criterion)
Q ← [] {list holding possible solutions R, t for pose estimation}
for all region correspondences do
  project 2D points onto the plane to create 3D-2D matches
  for i = 1 to n do
    select a random subset S of size p from the 3D-2D correspondences
    compute R, t from S using the 5-point algorithm
    3D-2D pose estimation with initial rotation R
    add pose (R, t) to Q if the 3D points are located in front of the camera
  end for
end for
for i = 1 to length(Q) do
  calculate the mean epipolar distance using R, t from Q(i) on the 2D-2D correspondences over all matching regions
end for
return R, t with minimal epipolar distance


lp-score method

The lp-score method is very similar to the previous method; landmark extraction and matching are identical, and the reader is referred to the previous section. Again, for each landmark we compute a set of pose hypotheses Q. For this, 3D−2D point correspondences are first computed from the 2D−2D point correspondences by projection onto the 3D plane. Multiple hypotheses are generated by drawing smaller samples (of size p) from the whole set of 3D−2D correspondences. For each of the n random sub-sets S the pose is computed as follows. First the 5-point algorithm [83] is used to create an initial rotation R from the 2D−2D point correspondences. With this initial rotation R the pose is computed from the 3D−2D correspondences using the iterative method of Lu and Hager [68]. The pose hypothesis is added to Q if it represents a valid configuration, i.e. the 3D points are in front of the camera. In addition, a homography is estimated from the 2D−2D sub-sample; the homographies for the valid pose estimates are maintained in a list H. This is repeated for all matched landmarks. Once all pose hypotheses have been created, the lp-score is computed for each hypothesis using the corresponding homography. If there is only a single landmark match, the pose with the smallest lp-score is selected as the resulting pose. In the case of multiple landmark matches the resulting pose is found by clustering: a subset Q̄ is selected containing the k entries of Q with the smallest lp-scores, so that Q̄ contains pose estimates from various landmarks. Clustering is then applied to the translational part of Q̄ to find the dominant cluster, and the resulting pose is the one closest to the center of the dominant cluster. This approach was found to be more robust in the case of multiple landmark matches. The algorithm is outlined in Algorithm 6.

Algorithm 6 Pose estimation from a single planar landmark (solution selection with lp-score)
Q ← [] {list holding possible solutions R, t for pose estimation}
H ← [] {list holding a homography for each pose in Q}
for all region correspondences do
  project 2D points onto the plane to create 3D-2D matches
  for i = 1 to n do
    select a random subset S of size p from the 3D-2D correspondences
    compute R, t from S using the 5-point algorithm
    3D-2D pose estimation with initial rotation R
    compute homography h from the 2D-2D matches corresponding to subset S using the normalized DLT
    add pose (R, t) to Q and homography h to H if the 3D points are located in front of the camera
  end for
end for
for all entries in H do
  calculate the median transfer error of H(i) on the 2D-2D correspondences generated from the map plane
end for
select the subset Q̄ of Q containing the k entries with the smallest transfer error
compute clustering of the poses in Q̄ (use translation only)
return R, t of the pose with median position within the dominant cluster


Chapter 8

Map building and localization experiments

In map building and localization all methods presented so far work together. The landmarks used in map building are chosen based on the evaluation described in Chapter 4. Map building also uses the wide-baseline methods for region matching and scene reconstruction described in Chapter 6. Localization likewise uses the wide-baseline region matching and the local detectors. Thus the experiments conducted in this chapter do not only evaluate the map building and localization methods proposed in the previous chapter but implicitly also measure the performance of the wide-baseline methods described previously. Localization will only be successful if the localization algorithm can work with reliable landmarks, and the same holds for the map building algorithm. Thus successful localization and map building results do not only confirm the benefits of localization using a piece-wise planar world map but also confirm the capabilities of the wide-baseline region matching, the piece-wise planar scene reconstruction, and the validity of the local detector evaluation results.

The experiments in this chapter are carried out using our mobile robot platform, the ActivMedia PeopleBot¹. They include map building experiments in a room-size office environment as well as a large-scale mapping of a hallway. The "Office" environment is rather small but nevertheless represents a typical localization scenario; accurate localization results would be expected for such an environment. The "Hallway" scenario is much more challenging: it has much bigger extents and contains less useful texture for extracting landmarks. Nevertheless we will demonstrate successful map building, where the overall map consists of 13 sub-maps.

The mapping results will be compared to a map created by elaborate laser mapping. The laser map will also act as ground truth for the localization experiment: the robot positions computed by visual localization are compared to the positions from laser-based localization. The experiments will demonstrate that the proposed localization method achieves results competitive with the current state-of-the-art methods proposed in [96] and [56] while requiring only a single landmark match, and is therefore superior to these methods.

¹ http://www.activrobots.com/robots/peoplebot.html


8.1 Experimental setup

The experimental setup consists of an ActivMedia PeopleBot equipped with a range of sensors including a laser range finder and an additional single camera. In the following the components are described in detail.

8.1.1 ActivMedia PeopleBot

The mobile robot used for these experiments is an ActivMedia PeopleBot. The robot comes fully equipped and operational with extensive localization and navigation software, except for visual localization and navigation. We chose to work with an already elaborate robotics platform in order to focus solely on visual localization. The PeopleBot has a size of 47 cm × 38 cm × 112 cm. It features a differential drive and is able to rotate in place. The safe maximum speed is about 0.8 m/s. The onboard sensors include wheel encoders, range-finding sonar and infrared sensors. The range-finding sonar consists of 24 ultrasonic transducers arranged to provide 360-degree coverage; the sonar range extends from 15 cm to 7 m. Due to the height of the robot, it also includes infrared sensors pointing in a forward-upward direction to detect obstacles above the sonar array. The combination of sonar and infrared sensors already allows safe navigation in an unknown environment. In addition the robot is equipped with a laser range finder (LRF), which allows precise localization and map building. One of the main reasons to use the PeopleBot is its height: it allows the camera to be mounted at a height advantageous for vision-based localization. Figure 8.1 shows the mobile robot and its sensor configuration.

Figure 8.1: The ActivMedia PeopleBot and its sensor configuration (color camera, laser range finder, infrared sensor, ultrasonic sensors) used for our localization and map building experiments.


8.1.2 Laser range finder

The mobile robot is equipped with a SICK LMS-200 laser range finder (LRF). The LRF provides range measurements over a field-of-view of 180° with an angular resolution of 0.5°; thus in our configuration we get 360 range measurements per laser reading. The range measurements are accurate up to +/- 15 mm for a range of 1 m to 8 m. The LRF is mounted at a height of about 30 cm. The robot comes with a laser localization and navigation system, which allows the creation of accurate maps using the LRF. The software ScanStudio stitches individual laser measurements together into a global map and performs a final global registration. The maps and robot positions created in this way are used as ground truth for the visual localization experiments.

8.1.3 Camera setup

For our experiments the mobile robot has been equipped with a 2-megapixel digital camera, an LU-205 from Lumenera. The camera features a CMOS sensor and is able to capture color images with a maximal resolution of 1600 × 1200 pixels; the achieved frame rate at this resolution is 15 frames per second. For our actual experiments we used images with a resolution of 800 × 600 by using the sub-sampling option of the camera.

The camera is equipped with a 4.8 mm wide-angle lens. The wide-angle lens has a field-of-view of about 90° but introduces heavy radial distortion. Thus the captured images have to be re-sampled to remove the radial distortion before further processing. Figure 8.2(a) shows an image before removal of the radial distortion, Figure 8.2(b) the re-sampled image with the radial distortion removed. Re-sampling is done with bilinear interpolation. The radial distortion as well as the principal point and the focal length of the camera setup are estimated using the calibration toolbox of Bouguet².
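Given intrinsics and distortion coefficients from such a calibration, the undistortion step can be reproduced, for example, with OpenCV. The numeric values and file names below are placeholders, not the calibration of our camera.

```python
import cv2
import numpy as np

# Placeholder intrinsics / distortion coefficients (not the actual calibration).
K = np.array([[420.0, 0.0, 400.0],
              [0.0, 420.0, 300.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.35, 0.12, 0.0, 0.0, 0.0])   # k1, k2, p1, p2, k3

img = cv2.imread("frame.png")
undistorted = cv2.undistort(img, K, dist)        # re-samples with bilinear interpolation
cv2.imwrite("frame_undistorted.png", undistorted)
```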

Figure 8.2: (a) Original image of the camera setup using a wide-angle lens, showing heavy radial distortion. (b) Re-sampled image with the radial distortion removed.

8.2 Map building experiments

In the next two sections results for the map building algorithm are shown. One experiment was carried out in an office room, another in a hallway of the university building. The experiments will demonstrate the capability of building piece-wise planar maps using the algorithm presented in the previous chapter.

² Camera calibration toolbox for Matlab (available at www.vision.caltech.edu/bouguetj/)

8.2.1 Office environment

The first experiment to be discussed is visual mapping of an office room of about 12m² in size. For this, a mobile robot was driven through the office capturing images and acquiring range measurements with a laser range finder. The goal of this experiment is to construct a piece-wise planar world map from the image data and compare it to the map created from the laser measurements. The map created from the laser range finder readings acts as ground truth. The software ScanStudio³ has been used to create a floor plan of the office room. Figure 8.3 shows the map built from the laser range data as well as the path of the robot run. The robot positions from where laser readings were taken are marked with blue circles. The path of the robot is drawn in black and interpolated between the laser readings.

Figure 8.3: Test environment "Office": floor plan created from laser range finder data. Circles mark the positions of laser readings. The robot's path is drawn in black and interpolated between the laser readings.

³ ScanStudio is available from http://www.activrobots.com/


During the robot run 951 images were taken at a frame rate of 3 frames per second. To reduce the number of input images for the map building algorithm, a manual preselection of the images has been performed: images with too small a baseline for stereo reconstruction have been removed. Map building has been performed as described in the previous section. First, sub-maps are identified in the image stack. Second, each sub-map is reconstructed separately. Finally, the reconstructed sub-maps are linked together to form a global world map. For the "Office" environment 5 sub-maps were identified and reconstructed to build a world map. Figure 8.4 shows the short-baseline image pairs used for sub-map reconstruction. Figure 8.5 shows the individual sub-map reconstructions. The top row of the images shows the reconstructed planar structures of the scene which contain landmarks; the bottom row shows the extracted planar landmarks only. Map building does not create an entire 3D reconstruction; only the parts of a scene useful for robot localization are reconstructed and extracted. The 5 sub-maps together form the complete world map of the "Office" environment. Linking works by searching for landmark correspondences within the different sub-maps. With a single landmark correspondence the similarity transform between two sub-maps can be computed and the sub-maps can be linked; only a small overlap is needed to link two sub-maps. The linked world map is shown in Figure 8.6(a). Figure 8.6(b) shows details of the left corner. Table 8.1 summarizes the intrinsics of the created world map: the number of sub-maps, the number of planes, the number of landmarks and the metric size.
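As a rough illustration of the linking step, the sketch below estimates a similarity transform (scale, rotation, translation) between two sub-maps from the 3D points of one corresponding landmark using the closed-form Umeyama/Procrustes alignment. The variable names are illustrative and this is a generic stand-in, not the exact procedure used in the thesis.

```python
import numpy as np

def similarity_from_correspondence(P, Q):
    """Estimate s, R, t with Q ~ s * R @ P + t from matched 3D points.

    P, Q: (N, 3) arrays holding the 3D points of one landmark expressed in
    the coordinate frames of the two sub-maps to be linked.
    """
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    # Cross-covariance and its SVD give the rotation (Umeyama / Procrustes).
    U, S, Vt = np.linalg.svd(Qc.T @ Pc)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / np.sum(Pc ** 2)           # isotropic scale
    t = mu_q - s * R @ mu_p
    return s, R, t

# Usage: transform every point of sub-map A into the frame of sub-map B.
# s, R, t = similarity_from_correspondence(landmark_pts_A, landmark_pts_B)
# linked_pts = (s * (R @ submap_A_points.T)).T + t
```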

To analyze the quality of the world map we compare the visual map with the laser map. As the laser map is available as a floor plan only, we compare a bird's eye view of the visual map with the laser map. The comparison of the visual map and the laser map is shown in Figure 8.7. The laser map is shown in blue and the visual map is shown in green. The alignment of both maps (translation, rotation and scale) has been done manually. The laser map has been acquired at a height of about 30cm above floor level. The visual map in most cases represents scene structures at different heights, thus the room outlines of the visual map and the laser map do not necessarily coincide. This is particularly visible on the right part of the map, where the wall shows an indentation at a height of about 1m: the laser range finder measures the wall at 30cm height while the visual method reconstructed the poster in the indentation. The bird's eye view of the world map reveals that the rectilinear structure got accurately reconstructed. Furthermore, the piece-wise planar representation allows an accurate alignment of the plane structures from different sub-maps. The top wall is represented by 3 different sub-maps and the linking process manages an exactly collinear representation without a gap.

"Office" map intrinsics
Number of sub-maps        5
Number of planes         18
Number of landmarks    1294
Map size [m]          4 × 3

Table 8.1: Intrinsics of the "Office" map.


Figure 8.4: Short-baseline image pairs (a)-(e) used for sub-map reconstruction.


Figure 8.5: Individual sub-maps (a)-(e) of the "Office" environment. The top row shows the reconstructed planar structures containing landmarks; the bottom row shows the extracted planar landmarks only.


Figure 8.6: (a) Piece-wise planar world map of the "Office" environment consisting of 5 linked sub-maps (3D view). (b) Enlarged detail.


Figure 8.7: Laser map (blue) and bird's eye view of the visual map (green) overlaid.


8.2.2 Hallway environment

The second mapping experiment has been conducted in a hallway of our university building. The "Hallway" environment is very challenging. First, the area to be mapped is large, 30m × 8m. Second, the environment contains many un-textured areas. A total of 2293 images was acquired over multiple robot runs, performed on different days. A selection of the whole image set was used to create the world map. For map building, sub-maps were first identified within the image set. In a next step each identified sub-map was reconstructed. The last map building step was the linking of the sub-maps. Map building resulted in the reconstruction and linking of 13 sub-maps. Table 8.2 summarizes the details of the "Hallway" map. Figure 8.9 shows two views of the piece-wise planar world map in 3D. A bird's eye view of the map is shown in Figure 8.8. Each sub-map is shown in a different color. The color coding reveals the sub-map structure of the whole map and shows that the sub-maps differ in size. The upper right part of the map contains a high number of small sub-maps; in that area the sub-maps overlap substantially. The sub-maps in the lower part are bigger, and there sub-map linking has to be performed over wide baselines. The upper and lower parts of the map are linked only by a single landmark contained in the upper left sub-map (brown). Despite being linked by a single landmark only, both parts are nicely parallel, which demonstrates the capabilities of the sub-map reconstruction and linking methods.

"Hallway" map intrinsics
Number of sub-maps       13
Number of planes         37
Number of landmarks    2093
Map size [m]         30 × 8

Table 8.2: Intrinsics of the "Hallway" map.


Figure 8.8: Bird's eye view of the visual map of the "Hallway" environment. The color coding shows the individual sub-maps.


Figure 8.9: (a) Piece-wise planar world map of the "Hallway" environment consisting of 13 linked sub-maps (3D view). (b) Enlarged detail.


8.3 Localization experiments

This section shows the results of different localization experiments within the "Office" and the "Hallway" environments, demonstrating the capabilities and limitations of the proposed approach.

8.3.1 Localization accuracy

To assess the accuracy of the proposed localization method, the pose estimates are compared to pose estimates obtained with the laser range finder, which act as ground truth. As test scenario the "Office" environment has been chosen. The robot was moved to 18 distinct positions (L1-L18) in the room where laser measurements were taken and images were captured. The laser range finder only reports a 2D position and the heading of the robot, while the proposed localization method produces a full 3D position. Thus only the x and y components of the position and only one rotation angle of the heading can be compared to the laser results. For comparison, a position error is calculated as the Euclidean distance between the corresponding laser position and visual position. In addition, a rotation error between the robot heading from the laser and from the visual localization is computed as the absolute difference of the two headings. Table 8.3 shows the average error, the median error, the minimal error, the maximal error and the standard deviation over all locations for the position and rotation errors. Figure 8.10 illustrates the localization results using the epipolar criteria algorithm. The tested positions are labelled L1 to L18. Blue circles mark the pose estimates from the laser localization; green circles mark the pose estimates from the visual localization. Visual pose estimation failed for the positions L4 and L5, the top left positions. Both positions are already very close to the wall, and this area of the wall is mainly un-textured: no landmarks could be detected in the images from this viewpoint. The laser localization, however, had no problems with these positions. Localization has also been performed using the lp-score method. The localization results achieved with the lp-score are almost identical to those of the epipolar criteria; localization for the positions L4 and L5 also failed for the lp-score method. The difference lies only in the accuracy of the pose estimates. Figure 8.11 shows the localization results using the lp-score algorithm.
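For reference, the position and rotation errors summarized in Table 8.3 can be computed as follows. This is only a minimal sketch assuming the poses are given as (x, y, heading) tuples; the array names are hypothetical, and the angular difference is additionally wrapped to the shorter arc, which the thesis text does not state explicitly.

```python
import numpy as np

def pose_errors(laser_poses, visual_poses):
    """laser_poses, visual_poses: (N, 3) arrays of (x [m], y [m], heading [deg])."""
    dxy = laser_poses[:, :2] - visual_poses[:, :2]
    pos_err = np.linalg.norm(dxy, axis=1)                 # Euclidean distance per location
    dtheta = np.abs(laser_poses[:, 2] - visual_poses[:, 2])
    rot_err = np.minimum(dtheta, 360.0 - dtheta)          # absolute heading difference, wrapped
    summary = {name: (fn(pos_err), fn(rot_err))
               for name, fn in [("average", np.mean), ("median", np.median),
                                ("min", np.min), ("max", np.max), ("std", np.std)]}
    return pos_err, rot_err, summary
```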

The visual position estimates show only a small deviation from the laser positions. The position estimation with the epipolar criteria algorithm is accurate up to an average error of 0.061m; the median positional error is 0.059m. The average rotational error is 3.93°. The lp-score algorithm shows slightly higher errors but gives in general similar results. Note that the lp-score algorithm uses no additional landmarks for hypothesis selection. Both algorithms achieve an average epipolar error of about 0.9 pixels on additional landmark matches. The average landmark size used for pose estimation is for both algorithms around 800 pixels. The achieved localization accuracy is competitive with the accuracies reported for the methods of [96] and [56], but requires only a single landmark match of around 800 pixels in size.


                                  epipolar criteria   lp-score
position error (average) [m]            0.061           0.078
position error (std.) [m]               0.031           0.037
position error (min) [m]                0.018           0.027
position error (max) [m]                0.120           0.134
position error (median) [m]             0.059           0.084
rotation error (average) [°]            3.93            4.09
rotation error (std.) [°]               1.88            2.27
rotation error (min) [°]                1.68            0.84
rotation error (max) [°]                7.80            7.39
rotation error (median) [°]             3.92            4.39
avg. epipolar distance [pixel]          0.89            0.92
avg. landmark area [pixel]              876             813

Table 8.3: Positional and rotational error of vision-based localization compared to laser ground truth.

Figure 8.10: Localization experiment with the epipolar criteria. Blue circles mark the laser ground truth; green circles mark the positions estimated by visual localization. For L3 and L4 no landmark matches for visual localization could be detected.


Figure 8.11: Localization experiment with the lp-score. Blue circles mark the laser ground truth; green circles mark the positions estimated by visual localization. For L3 and L4 no landmark matches for visual localization could be detected.


8.3.2 Path reconstruction

The main application of global localization is to compute an initial robot position to initialize a probabilistic SLAM framework, e.g. [78]. The robot position is then usually maintained by an extended Kalman filter. The Kalman filter includes the knowledge of previous robot positions and also previous speed and heading estimates. Propagating these values probabilistically results in a smooth reconstruction of the traversed path.

In the absence of a SLAM framework the traversed path has to be reconstructed by global localization alone, and each position estimate is then computed independently. Figure 8.12 shows a part of the robot's path through the "Office" environment reconstructed by global localization. The path consists of 204 independent pose estimates. The pose estimates are marked with black dots, together forming the traversed path. The red dots mark gross outliers. Table 8.4 summarizes the corresponding numbers. Of 204 total pose estimates, 21 showed a large deviation from the original path (from laser localization) and are thus marked as gross outliers. Such gross outliers would be detected by the Kalman filtering. The average epipolar distance, computed as a measure of accuracy, is 1.45 pixels.

Figure 8.13 shows the reconstruction of a robot's path in the "Hallway" environment; in fact, three different sections of a robot's path are shown. In total 124 positions have been computed. The corresponding paths are drawn as black dots. For this scenario no laser ground truth is available, thus outlier detection was not applied.

            #frames (positions)   #correct (%)    #bad estimates (%)   avg. epipolar distance (std. dev.) [pixel]
Office              204           183 (89.7%)         21 (10.3%)                  1.45 (1.0)
Hallway             124               -                   -                       1.49 (0.74)

Table 8.4: Number of correct and bad pose estimates for the path reconstruction experiment. The accuracy of the pose estimates is expressed by the epipolar distance.


Figure 8.12: Reconstruction of a robot's path through the "Office" environment by global localization. Each position of the path is estimated independently. The path is visualized by the black dots; red dots mark gross outliers.


Figure 8.13: Reconstruction of a robot's path (3 different sections) through the "Hallway" environment by global localization. Each position of the path is estimated independently. The path is visualized by the black dots.


8.3.3 Evaluation of the sub-sampling scheme

This experiment investigates the increase in accuracy gained by the proposed sub-sampling scheme compared to the standard application of the Lu et al. [68] pose estimation using all 3D-2D correspondences, denoted as the LH method. As a measure of accuracy the epipolar distance between 2D image points and epipolar lines is used. Each pose is computed from a single landmark only. The other detected landmark matches are then used to assess the quality of the computed pose by computing the epipolar distance between the 2D point correspondences and the epipolar lines.

The proposed sub-sampling scheme (see Chapter 7) creates n subsets of size p from the point correspondences of a region. That means every region generates n solutions, of which the best is selected. We will show in this experiment that in most cases there exists a subset which produces a better solution than computing the pose from all correspondences. We investigate the two hypothesis selection strategies proposed in the previous chapter, the epipolar criteria and the lp-score method. The LH method uses all correspondences of a landmark for pose estimation. The results of the three methods are compared by means of the epipolar distance measure.
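To make the evaluated procedure concrete, the sketch below shows the generic structure of the sub-sampling scheme with hypothesis selection by the epipolar criterion: n random subsets of size p are drawn from the 3D-2D correspondences of the single landmark, a pose is computed from each subset, and the hypothesis with the smallest epipolar distance on the remaining landmark matches is kept. The functions estimate_pose_3d2d and epipolar_distance are placeholders standing in for the pose estimation of Lu et al. [68] and the distance measure described above; this is not the thesis implementation.

```python
import numpy as np

def select_pose_by_subsampling(pts3d, pts2d, check_matches, estimate_pose_3d2d,
                               epipolar_distance, n=50, p=10, rng=None):
    """Return the pose hypothesis with the smallest epipolar error.

    pts3d, pts2d   : 3D-2D correspondences of the single landmark used for pose estimation
    check_matches  : additional landmark matches used only to score each hypothesis
    estimate_pose_3d2d(pts3d, pts2d) -> pose (e.g. R, t)
    epipolar_distance(pose, check_matches) -> mean point-to-epipolar-line distance [pixel]
    """
    rng = rng or np.random.default_rng()
    best_pose, best_err = None, np.inf
    for _ in range(n):                                       # n hypotheses ...
        idx = rng.choice(len(pts3d), size=p, replace=False)  # ... each from a subset of size p
        pose = estimate_pose_3d2d(pts3d[idx], pts2d[idx])
        err = epipolar_distance(pose, check_matches)         # epipolar criterion on the other matches
        if err < best_err:
            best_pose, best_err = pose, err
    return best_pose, best_err
```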

For our algorithm n was set to 50 and p was set to 10. Figure 8.14 and Table 8.5 summarize the results. We calculated the poses for 3 different sequences (all part of the robot's path through the office). Each sequence uses a different sub-map for localization. The table shows the achieved average epipolar distance (of the best solutions), the minimal and maximal epipolar distances, and the standard deviation. It is evident that our proposed algorithm achieves a smaller epipolar error than the LH method. The sub-sampling scheme with the epipolar criteria achieved the smallest epipolar distances. The lp-score shows slightly higher epipolar distances but is still better than the LH method. The result is even more impressive when looking at individual frames: e.g., for frame 15 in sequence 2 the epipolar distance achieved by the LH method was 13.15 pixels while our method came down to 0.75 pixels, an order of magnitude improvement. The differences between both methods are illustrated in Figure 8.15, where the positions of a forward motion sequence are computed; all the camera positions should therefore be aligned in a row. The result with the sub-sampling method (epipolar criteria) shows only a little deviation from a straight line. The results obtained with the LH method, however, show large deviations; one can hardly see that this should be the path of a strict forward motion.


Figure 8.14: Comparison of the average epipolar distance [pixel] for our sub-sampling method (lp-score and epipolar criteria) and the LH method for 3 different image sequences (S1, S2, S3) and all sequences together. Our sub-sampling methods produce a smaller error than the LH method.

                                     avg. epidist   min. epidist   max. epidist   std. epidist
                                        [pixel]        [pixel]        [pixel]        [pixel]
sequence 1 (31 frames)
sub-sampling (epipolar criteria)         1.07           0.87           1.40           0.12
sub-sampling (lp-score)                  1.40           0.94           1.86           0.24
LH method                                1.47           0.98           2.55           0.31

sequence 2 (21 frames)
sub-sampling (epipolar criteria)         1.42           1.03           1.97           0.28
sub-sampling (lp-score)                  1.55           1.10           2.16           0.31
LH method                                5.60           0.43          28.33           8.67

sequence 3 (9 frames)
sub-sampling (epipolar criteria)         1.09           0.90           1.33           0.14
sub-sampling (lp-score)                  1.28           1.02           1.81           0.24
LH method                                1.53           1.05           1.93           0.29

all sequences (61 frames)
sub-sampling (epipolar criteria)         1.19           0.87           1.97           0.25
sub-sampling (lp-score)                  1.44           0.94           2.16           0.28
LH method                                2.90           0.43          28.33           5.39

Table 8.5: Epipolar distances for pose estimation using the LH method and the sub-sampling method (lp-score and epipolar criteria).


Figure 8.15: Poses computed by global localization for a forward motion sequence using (a) the sub-sampling algorithm and (b) the LH method. The poses estimated by the LH method show large deviations from a strict forward motion.


8.4 Summary

The results gained by the experiments in this chapter are very satisfying. Map building as well as localization experiments with the proposed methods were carried out successfully. Map building has been demonstrated for two different scenarios, the "Office" scenario and the "Hallway" scenario. The "Office" scenario was used to demonstrate the accuracy of the visual map building by comparing it to a map created by a laser range finder. The "Hallway" scenario, on the other hand, is much more challenging; it demonstrates the capability of the method to build maps on a larger scale. The "Hallway" map consists of 13 different sub-maps.

The localization experiments confirm that a piece-wise planar world map is a beneficial world representation for visual global localization. The 3D plane parameters for each landmark incorporated in the piece-wise planar map allow pose estimation from only a single landmark match. The localization experiments in the "Office" environment reveal that the achieved accuracy is competitive with the current state-of-the-art methods [96] and [56]. A detailed analysis of the proposed sub-sampling scheme and the lp-score quantifies the accuracy improvements by measuring the epipolar distance. The results show a significant improvement by sub-sampling and hypothesis selection.


Chapter 9

Conclusion

More than 25 years have passed since Moravec [80] presented the first astonishing results in visual robot localization, but progress has been slower than people expected from the advances of the early days. Visual robot localization turned out to be a challenging task and research is still going on. A lot of people have already participated in this challenge and their research provided bits and pieces towards a reliable and robust visual localization system for mobile robots. Systems like those reported in [96] and [56] are already on the edge of fully operational and reliable visual localization and mapping for constrained indoor environments. However, even if today's systems work reliably in 95% of the cases, 5% are still missing, and the last 5% may represent a bigger challenge than the already achieved 95%. It is reasonable to expect that closing the last 5% gap will require a set of different and specialized methods and algorithms to be developed and integrated into current systems, keeping lots of researchers busy.

This thesis focused on the development of such a specialized method for visual global localization. Global localization is a hard problem and very important in mobile robotics. Global localization is a key technique for resolving the following situations:

• Initial position after power on
• Kidnapped robot problem
• Recovery from failure
• Loop closing
• Homing

Analyzing the current state-of-the-art in visual global localization revealed the deficiencies of the current approaches and showed the necessity for further research. This resulted in the following research issues chosen as the main objectives of this thesis:

Robust global localization: A lot of different effects influence the performance of global localization, and occlusion of the landmarks is one of the worst. Most of the time a mobile robot is forced to operate in a dynamically changing environment where people are moving around close to the robot and occlude large parts of the robot's view. Thus global localization needs to be robust to such occlusions and should be possible even in cases where only a few landmarks are visible.


Accurate pose estimation from a small number of landmark matches: The accuracy of pose estimation increases with the number of detected landmark matches. Current methods require about 10-20 landmark matches to achieve a reasonable accuracy. The goal is to achieve accurate pose estimation for cases with fewer than 10 landmark matches, or even for the case of a single landmark match.

Reliable detection of landmark correspondences: The correspondence problem is an inherent problem in mobile robotics, not only when using vision sensors but for all sensor modalities. Especially mis-matches pose great difficulties for localization algorithms. For vision sensors, recent advances in wide-baseline image matching successfully demonstrated how to tackle the correspondence problem under a variety of image transforms, including illumination change, viewpoint change, etc. Thus the application of wide-baseline methods for landmark matching in mobile robotics is very promising.

The above listed main objectives were tackled by applying wide-baseline methods to the field of mobile robotics. The research resulted in the following main contributions:

Performance evaluation of local detectors: Local detectors are a key ingredient for wide-baseline image matching. A wide variety of different methods already exists, each having different advantages and disadvantages, and for each application the best fitting method should be chosen. The detector comparison of Mikolajczyk et al. [76] provides a basis for such an assessment; however, it does not evaluate the detectors on scenes significant for mobile robot applications, and the evaluation method is not applicable to the realistic, complex scenes that are encountered in mobile robot experiments. One contribution of this thesis therefore was the development of a method to evaluate the different local detectors on realistic, complex scenes. The resulting comparison showed a significant difference to the previous evaluation on the restricted test cases.

Maximally Stable Corner Clusters: The analysis of the new detector evaluation results led to the development of a new local detector, the Maximally Stable Corner Cluster (MSCC) detector. MSCC regions are clusters of simple corner points in images, robustly detected by applying a stability criterion. A comparison with other methods revealed that MSCC regions are detected at image locations left out by the other methods; thus they are complementary to them. This complementarity is the key property of the new detector, as it allows an effective combination with other current state-of-the-art detectors.

3D piece-wise planar world map: The proposed piece-wise planar world map incorporates a higher degree of structural information in the world representation than other methods, e.g. [56, 96]. Landmarks are defined by a small plane patch (6DOF), a SIFT descriptor and the original appearance from the image. New methods for wide-baseline region matching and piece-wise planar scene reconstruction were developed to build the piece-wise planar world map.

Global localization from a single landmark: The piece-wise planar world representation, including plane structures, allowed the development of a new localization algorithm which enables pose estimation from a single landmark match. Accurate pose estimation is already possible from an image region with an area of only 400 pixels. This allows global localization to deal with a high level of occlusion, as is necessary for crowded environments.


Map building and localization experiments demonstrated the capabilities of the proposed approach. Map building was successfully shown for two indoor environments, the "Office" and the "Hallway" environment. The "Hallway" environment represents a large, challenging environment of 30m × 8m. By comparison with a ground truth created by laser mapping, the localization experiments showed accuracies competitive with current state-of-the-art methods, while using only a single landmark match for pose estimation.

In summary, global localization as proposed in this thesis results in accurate pose estimates, even despite heavy occlusions and few landmark matches. Finally, Table 9.1 shows how the key aspects of the new method compare to the current state-of-the-art.

Authors | World map | Sensor system | Map features | Landmark matching | Map building | Global localization (#landmarks*) | Pose representation

Se, Lowe, Little [96] | sparse metric | stereo | 3D points + SIFT | feature matching | SLAM | tri-angulation, map-alignment (>= 10) | 2D (3DOF)
Karlsson et al. [56] | sparse metric | monocular | 3D points + SIFT + appearance | feature matching | SLAM | 3D-2D (>= 4) | 2D (3DOF)
Davison et al. [21] | sparse metric | active stereo | 3D points | correlation | SLAM | tri-angulation (>= 3) | 3D (6DOF)
Bosse et al. [10] | sparse metric | omnidirectional | 3D points + 3D lines + vanishing points | nearest neighbor | batch | map matching (approx. 30) | 3D (6DOF)
Goedeme et al. [39] | topological | monocular, omnidirectional | 2D lines + color descriptor + intensity descriptor | feature matching | batch | line matching and voting | topological location
Kosaka et al. [59] | sparse metric, CAD-model | monocular | 3D lines | nearest neighbor | manual | - | 2D (3DOF)
Hayet et al. [45] | sparse metric | monocular | quadrangular 3D planes + PCA descriptor | feature matching | batch | 3D-2D (1) | 3D (6DOF)
Fraundorfer | piece-wise planar metric | monocular | unconstrained 3D planes + SIFT + appearance | feature matching, registration | batch | 3D-2D (1) | 3D (6DOF)

Table 9.1: Main characteristics of the current state-of-the-art approaches compared to the proposed approach. (* necessary landmark matches for robust pose estimation)

9.1 Future work

The methods developed in this thesis provide a strong basis for future work in several interesting directions. Let me describe four of them in more detail. The first interesting direction is to couple visual odometry with global localization; the second, much more challenging, is to integrate the presented method for global localization into a complete probabilistic SLAM framework. As a third interesting direction we discuss the integration of point and line features into the piece-wise planar map. Finally, we discuss the use of geometric constraints for landmark matching provided by the piece-wise planar map.

Coupling with visual odometry: A mobile robot relying solely on the presented global localization will run into trouble if it faces an environment which lacks features that can be used as landmarks. This easily happens, e.g., when the robot comes close to a plain wall: no distinct landmarks can be extracted and thus no landmark matches are available for pose estimation. Such situations can be overcome by the use of visual odometry as described in [84]. Visual odometry computes the motion of the robot from two subsequent frames. Point correspondences for epipolar geometry estimation between two subsequent frames can easily be detected by tracking, e.g. KLT tracking [107]. The current robot position is then computed from the last known position obtained from global localization and the frame-to-frame motion sequence from visual odometry. The motion sequence from visual odometry is computed by adding up all the small frame-to-frame motions. This inevitably accumulates the small errors of the frame-to-frame motion estimation and results in an error proportional to the length of the motion sequence. However, visual odometry is only necessary to navigate the robot back to a position where a landmark for global localization can be spotted again, and for such a short time the visual odometry will be accurate enough. Furthermore, global localization and visual odometry could simply run in parallel and the robot position could be computed by fusing both measurements. Global localization then only needs to be carried out from time to time to correct the pose estimate from visual odometry. For fusing odometry and global localization poses the method developed by Smith et al. [100] can be used. In their approach the pose is represented using exponential maps; this representation eases the probabilistic propagation and fusion of different measurements. Fusing global localization and visual odometry in such a way will result in a very robust localization method.
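A minimal sketch of such a frame-to-frame visual odometry loop is given below, here using OpenCV's KLT tracker and essential-matrix-based relative pose as a generic stand-in rather than the exact pipeline of [84]. The calibration matrix K and the image list are assumed to be given, the translation is only known up to scale, and the chaining convention is one common choice, not a definitive implementation.

```python
import cv2
import numpy as np

def visual_odometry(frames, K):
    """Accumulate frame-to-frame motion; frames is a list of grayscale images."""
    R_acc, t_acc = np.eye(3), np.zeros((3, 1))       # pose relative to the first frame
    prev = frames[0]
    prev_pts = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=8)
    for cur in frames[1:]:
        # KLT tracking of the previous corners into the current frame.
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, prev_pts, None)
        good = status.ravel() == 1
        p0, p1 = prev_pts[good], cur_pts[good]
        # Relative motion from the epipolar geometry of the two frames.
        E, inliers = cv2.findEssentialMat(p0, p1, K, method=cv2.RANSAC, threshold=1.0)
        _, R, t, _ = cv2.recoverPose(E, p0, p1, K, mask=inliers)
        # Chain the small frame-to-frame motions; errors and scale drift accumulate.
        t_acc = t_acc + R_acc @ t
        R_acc = R @ R_acc
        prev, prev_pts = cur, cv2.goodFeaturesToTrack(cur, 500, 0.01, 8)
    return R_acc, t_acc
```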

Integration into a SLAM framework: Integration of the global localization into a probabilistic SLAM framework is a straightforward but challenging task. The first challenge is to implement map and landmark updates into the so far off-line map building process. This requires an uncertainty representation for the map structures. Planes in 3D are described by a 3D point and a 3D normal vector, in sum 6 parameters. The uncertainty of the 3D point can be described by a 3 × 3 covariance matrix. The uncertainty of the 3D normal vector can be described similarly: a 3 × 3 covariance matrix can represent the angular uncertainties of the vector. The needed representation is identical to the case of uncertainty propagation for a camera position, as a camera is defined by the image plane and the principal point. The uncertainty propagation developed for cameras can therefore be applied directly to the plane structures of the map. In [100] the exponential map is used to propagate the uncertainty of camera positions and to update the positions with additional measurements. An initial uncertainty of the reconstructed planes can be derived from the distribution of the 3D points defining the 3D plane. Beside the uncertainty representation of the planes, the uncertainty propagation of the pose estimation algorithm has to be derived. However, having solved these problems, the piece-wise planar world representation and global localization can be integrated into a probabilistic SLAM framework as proposed in [78].

Integrating points, lines and planes into a common world map: Integration of points, lines and planes into a common world representation would prove very beneficial. Point and line features add extra value to map areas where no planar landmarks have been detected, so global localization can also make use of points and lines for pose estimation. However, the integration of points and lines can be much more than simply storing them in the map database. One requirement of the integration would be the consistency of the different feature types. When adding a line feature to the world map it can be checked whether the line originates from the intersection of two planes, which allows a refinement of the line position. Conversely, line features can be used to get an exact delineation of the map planes. For point features located on a map plane this information can be used to refine the point so that it is positioned exactly on the plane. Global localization can then use either points, lines or planes or, much more interesting, combinations of the feature types for pose estimation. Map planes also introduce a visibility criterion which can be used to detect landmark mis-matches: tentative landmark matches which lie behind a scene plane, and thus are not visible from the current robot position, can be discarded as mis-matches. One of the biggest benefits of such an integration, however, is that it allows localization from another feature type when, e.g., no planes are visible.

Geometric constraints for landmark matching: As already stated, the detection of landmark correspondences is a key problem and very hard to solve. In the presented approach corresponding landmarks are identified based on their appearance. Although this approach is very reliable, mis-matches may still occur, especially when multiple landmarks with identical appearance exist. However, geometric constraints allow the location of neighboring landmarks to be predicted for a tentative landmark match. For planar landmarks the appearance of a landmark from a different viewpoint can also be computed by a projective transformation. Thus a tentative landmark match can be verified by checking the location and appearance of the neighboring landmarks. This will significantly improve the reliability of landmark matching.
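To illustrate the appearance prediction mentioned above: for a planar landmark, the homography induced by the landmark plane between the stored view and a hypothesised new view can be composed from the relative pose and the plane parameters and used to warp the stored appearance. The sketch below uses the common plane-induced homography formula H = K (R − t nᵀ/d) K⁻¹; all inputs, names and the sign convention for (n, d) are assumptions for illustration, not the thesis implementation.

```python
import cv2
import numpy as np

def predict_landmark_appearance(patch, K, R, t, n, d):
    """Warp the stored landmark patch into a hypothesised new viewpoint.

    patch : stored appearance of the planar landmark (image patch from the first view)
    K     : 3x3 camera calibration matrix
    R, t  : relative rotation and translation from the stored view to the new view
    n, d  : plane normal (3-vector) and plane offset in the stored view's frame,
            following the plane convention assumed by H = K (R - t n^T / d) K^-1
    """
    # Plane-induced homography between the two views of the landmark plane.
    H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
    return cv2.warpPerspective(patch, H, (patch.shape[1], patch.shape[0]))
```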


Appendix A

Projective transformation of ellipses

A.1 Projective ellipse transfer

The method for detector evaluation described in Chapter 4 needs to transfer ellipses from one image to an image viewed from a different viewpoint. In this section we therefore discuss how an ellipse transforms when it is viewed from a different viewpoint. We consider the planar case, i.e. the ellipse lies on a plane in the scene. In this case we can analytically calculate the shape of the ellipse (determine the ellipse parameters) in the image from the other viewpoint. We will see that it comes down to applying a projective transformation, in the form of a matrix multiplication, to our ellipse representation.

The mapping between two different views of a plane is described by a perspectivity [11]. In computer vision a perspectivity is usually denoted as a homography, sometimes a collineation or projective transformation; all the terms are synonymous. A homography can be represented as a 3 × 3 matrix. Consider two image planes I and I'. Let the mapping between both planes be the homography H. One can now calculate the position of a point x' in I' by

x' = Hx    (A.1)

where x is the point position in I. x and x' are homogeneous 3-vectors of the form x = [x y 1]^T, composed of the x, y-coordinates in the image coordinate system. Based on this transformation rule we can deduce the transformation rule for an ellipse.

An ellipse is a conic, as are a parabola and a hyperbola. Conics arise as conic sections when a cone is intersected by a plane. A conic can be represented by the following inhomogeneous equation:

ax^2 + bxy + cy^2 + dx + ey + f = 0.    (A.2)

Putting this into homogeneous form, i.e. by replacing x → x_1/x_3, y → x_2/x_3, it reads as follows:

a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 x_3 + e x_2 x_3 + f x_3^2 = 0.    (A.3)

The conic equation can also be written in matrix form

x^T C x = 0.    (A.4)

We call C the conic coefficient matrix and it is given by

C = \begin{bmatrix} a & b/2 & d/2 \\ b/2 & c & e/2 \\ d/2 & e/2 & f \end{bmatrix}.    (A.5)



If we now apply the transformation x' = Hx to the conic C, this results in the conic C' = H^{-T} C H^{-1}. This transformation rule can be shown easily:

x^T C x = x'^T [H^{-1}]^T C H^{-1} x'    (A.6)
        = x'^T H^{-T} C H^{-1} x'.    (A.7)

Writing C' = H^{-T} C H^{-1} reduces the relation to x'^T C' x' = 0, which is the transformation rule for a conic.

The most important fact for us here is that a conic transformed by a projective transformation H still results in a conic. That means that (except for degenerate transforms) an ellipse transferred into another image stays an ellipse.

This property is illustrated in Figure A.1. The figure shows the effects of transforming the original ellipse in Figure A.1(a) by various perspective transformations with increasing perspectivity. The original image also contains two tangents to the ellipse, which are transformed by the same perspectivity; they help to make the effects of the projective transformation better visible. In addition the ellipse is overlaid with single points that are more or less equally distributed. The points are transformed with the same projective transformation as the ellipse and the lines. In the transformed image the points still lie on the ellipse, but the initial uniform distribution has changed: the points move along the ellipse perimeter in the direction of the fore-shortening. This is also observable at the intersections with the tangents.

An important observation is that the center point of the original ellipse and the center point of the transformed ellipse are not connected by the transforming perspectivity. In other words, applying the point transform (homography) to the original ellipse center does not yield the center of the transformed ellipse.

To calculate a projectively transformed conic with the previous method the conic must be represented in its matrix form. In the case of ellipses, however, two other representations are very common, the parameter form and the second moment matrix. In the parameter form one usually specifies the ellipse by a 5-vector E = [x, y, a, b, α]. In this representation x, y are the coordinates of the ellipse center, the values a, b are the lengths of the major and minor semi-axes, and α is the rotation of the ellipse. One can convert this representation into the matrix form by first setting up a matrix for the canonic ellipse form and then applying a translation for the center point and a rotation for the angle. The canonic representation C_C can be set up as follows:

C_C = \begin{bmatrix} a^{-2} & 0 & 0 \\ 0 & b^{-2} & 0 \\ 0 & 0 & -1 \end{bmatrix}.    (A.8)

Applying rotation and translation leads to the matrix form C,

C = T^T R^T C_C R T    (A.9)

where R is a 3 × 3 2D rotation matrix

R = \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}    (A.10)

and T is a 3 × 3 2D translation matrix

T = \begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \\ 0 & 0 & 1 \end{bmatrix}.    (A.11)
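The construction in Eqs. (A.8)-(A.11) and the transfer rule C' = H^{-T} C H^{-1} from Eqs. (A.6)-(A.7) translate directly into a few lines of numpy. This is only an illustrative sketch, not code from the thesis.

```python
import numpy as np

def conic_from_params(x, y, a, b, alpha):
    """Build the conic matrix C = T^T R^T C_C R T from E = [x, y, a, b, alpha] (Eqs. A.8-A.11)."""
    C_C = np.diag([a ** -2, b ** -2, -1.0])
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T = np.array([[1.0, 0.0, -x], [0.0, 1.0, -y], [0.0, 0.0, 1.0]])
    return T.T @ R.T @ C_C @ R @ T

def transfer_conic(C, H):
    """Map a conic into the second view: C' = H^{-T} C H^{-1} (Eqs. A.6-A.7)."""
    Hinv = np.linalg.inv(H)
    return Hinv.T @ C @ Hinv
```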


Figure A.1: (a) Original ellipse. (b-e) Transformed ellipses.

The conic representation in matrix form is independent of scaling: multiplying the conic matrix C by some non-zero scalar s still represents the same conic.

To calculate the ellipse parameters E = [x, y, a, b, α] from the matrix representation one can go the inverse way of the construction. For this, let us write down the conic construction in more detail.

C = T^T R^T C_C R T    (A.12)
  = \begin{bmatrix} I & 0 \\ t^T & 1 \end{bmatrix} \begin{bmatrix} r^T & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} c_c & 0 \\ 0^T & -1 \end{bmatrix} \begin{bmatrix} r & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} I & t \\ 0^T & 1 \end{bmatrix}    (A.13)
  = \begin{bmatrix} r^T c_c r & r^T c_c r\, t \\ t^T r^T c_c r & t^T r^T c_c r\, t - 1 \end{bmatrix}    (A.14)
  = \begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}.    (A.15)

Here I is the 2 × 2 identity matrix, t = [-x, -y]^T is a 2-vector representing the translation to the ellipse center, r = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix} is a 2 × 2 rotation matrix, and c_c = \begin{bmatrix} a^{-2} & 0 \\ 0 & b^{-2} \end{bmatrix} is the upper 2 × 2 part of the canonic conic matrix.

From Eq. (A.12) it is evident that the translation vector t can be calculated from the conic matrix C with simple matrix arithmetic:

\begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix} = r^T c_c r\, t    (A.16)

t = \left( r^T c_c r \right)^{-1} \begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix}    (A.17)
  = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}^{-1} \begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix}    (A.18)

It is a very nice property that the translation can be extracted from an arbitrarily scaled conic matrix. Consider a scaling s; Eq. (A.18) is then rewritten as

t = \begin{bmatrix} s c_{11} & s c_{12} \\ s c_{21} & s c_{22} \end{bmatrix}^{-1} \begin{pmatrix} s c_{13} \\ s c_{23} \end{pmatrix}    (A.19)
  = \frac{s}{s} \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}^{-1} \begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix}    (A.20)

and one can see that the scaling s cancels out. The ellipse angle α can be extracted from the conic matrix using the following equation¹:

\alpha = \frac{1}{2} \arctan \frac{2 c_{21}}{c_{22} - c_{11}}    (A.21)

Eq. (A.21) can be verified by taking a closer look at the coefficients of the conic matrix:

c_{11} = \frac{1}{a^2} \cos^2\alpha + \frac{1}{b^2} \sin^2\alpha    (A.22)

c_{22} = \frac{1}{a^2} \sin^2\alpha + \frac{1}{b^2} \cos^2\alpha    (A.23)

¹ This formula returns the correct angles for an interval from 0 to π/2. By returning α modulo π the angle is correct for an interval from 0 to π. That is sufficient because a conic in matrix form is defined uniquely only for an interval from 0 to π: constructing a conic matrix for an angle α + π gives the same matrix as for α.


c_{21} = -\frac{1}{a^2} \cos\alpha \sin\alpha + \frac{1}{b^2} \cos\alpha \sin\alpha    (A.24)
       = -\frac{1}{a^2} \frac{1}{2} \sin 2\alpha + \frac{1}{b^2} \frac{1}{2} \sin 2\alpha    (A.25)
       = \frac{1}{2} \sin 2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right)    (A.26)

c_{22} - c_{11} = \cos^2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right) + \sin^2\alpha \left( \frac{1}{a^2} - \frac{1}{b^2} \right)    (A.27)
               = \left( \frac{1}{b^2} - \frac{1}{a^2} \right) \left( \cos^2\alpha - \sin^2\alpha \right)    (A.28)
               = \left( \frac{1}{b^2} - \frac{1}{a^2} \right) \left( \tfrac{1}{2} + \tfrac{1}{2}\cos 2\alpha - \tfrac{1}{2} + \tfrac{1}{2}\cos 2\alpha \right)    (A.29)
               = \left( \frac{1}{b^2} - \frac{1}{a^2} \right) \cos 2\alpha    (A.30)

\frac{c_{21}}{c_{22} - c_{11}} = \frac{\frac{1}{2}\sin 2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right)}{\cos 2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right)} = \frac{1}{2} \tan 2\alpha    (A.31)

\alpha = \frac{1}{2} \arctan \left( \frac{2 c_{21}}{c_{22} - c_{11}} \right)    (A.32)

With α we can calculate the rotation matrix R, which is needed to extract the last parameters a and b. Before we come to this, it is necessary to remove the arbitrary scale from the conic matrix. Unlike translation and rotation, the calculation of the axes is sensitive to arbitrary scaling. From Eq. (A.14) we can see that the upper 2 × 2 part of the conic matrix is equal to r^T c_c r. The matrix c_c contains the desired values of the axes a and b, and they can be recovered by removing the applied rotations. However, an arbitrary scaling multiplies directly into the axis lengths, and therefore we first have to calculate the scaling factor and remove it (if it is not equal to 1). It is possible to recover the scaling s from the coefficient c_{33} of the conic matrix. The equation for c_{33} with an unknown scaling factor s is

c_{33} = s \left( t^T r^T c_c r\, t - 1 \right)    (A.33)
       = s \left( t^T \frac{1}{s} \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} t - 1 \right)    (A.34)
       = t^T \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} t - s    (A.35)

which leads to the following equation for s:

s = t^T \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} t - c_{33}.    (A.36)

Now we have all ingredients to recover the matrix c_c:

c_c = \frac{1}{s}\, r \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} r^T.    (A.37)


Finally, the axis lengths a and b are

a = \frac{1}{\sqrt{c_{c,11}}}    (A.38)

b = \frac{1}{\sqrt{c_{c,22}}}.    (A.39)
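The extraction of Eqs. (A.18), (A.21) and (A.36)-(A.39) can be summarized in a short numpy routine. Again this is only an illustrative sketch (using arctan2 instead of the plain arctan of Eq. (A.21) to avoid a division by zero), not the thesis implementation.

```python
import numpy as np

def params_from_conic(C):
    """Recover E = [x, y, a, b, alpha] from a (possibly scaled) conic matrix C."""
    A = C[:2, :2]                                   # scaled version of r^T c_c r (Eq. A.14)
    tvec = np.linalg.solve(A, C[:2, 2])             # Eq. (A.18); equals [-x, -y], scale cancels
    x, y = -tvec[0], -tvec[1]
    alpha = 0.5 * np.arctan2(2.0 * C[1, 0], C[1, 1] - C[0, 0])   # Eq. (A.21)
    s = tvec @ A @ tvec - C[2, 2]                   # Eq. (A.36): the unknown scale
    c, sn = np.cos(alpha), np.sin(alpha)
    r = np.array([[c, -sn], [sn, c]])
    c_c = (r @ A @ r.T) / s                         # Eq. (A.37): de-rotate and de-scale
    a = 1.0 / np.sqrt(c_c[0, 0])                    # Eq. (A.38)
    b = 1.0 / np.sqrt(c_c[1, 1])                    # Eq. (A.39)
    return x, y, a, b, alpha
```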

A.2 Affine approximation of ellipse transfer

In this section we discuss the ellipse transfer method used in the evaluation method of Mikolajczyk and Schmid [74]. It approximates the projective transformation with an affine transformation. The method works by transforming the ellipse shape with an affine transformation and centering the ellipse around a new ellipse center. The new ellipse center is obtained by transferring the original ellipse center to the other image with the homography. To obtain the new ellipse shape, the second moment matrix of the ellipse is transformed with an affine transformation which is an approximate estimate of the true projective transformation.

Such an approximation was chosen in [74] because the authors were interested in establishing a corresponding center point. However, one must be aware that there can be quite large approximation errors. Figure A.2 shows a comparison of the projective and affine ellipse transfer. The ellipse resulting from the projective transfer is drawn in black, the result of the affine transfer in green. In Figure A.2(b-e) one can see the differences between both methods when transforming the original ellipse in Figure A.2(a). The centers of the green ellipses are at the position of the original ellipse center transformed by the homography.


Figure A.2: Comparison of projective (in black) and affine (in green) ellipse transfer. (a) Original ellipse. (b-e) Transformed ellipses.


Appendix B

The trifocal tensor and point transfer

B.1 The trifocal tensor

The trifocal tensor encapsulates the geometry between three images. It is the analogue of the fundamental matrix for the three-view case. The trifocal tensor, its computation and its properties are described in detail in [41, 42, 101, 109]. The trifocal tensor consists of three 3×3 matrices and thus has 27 elements. However, the tensor has only 18 DOF and is determined up to an arbitrary scale factor. The trifocal tensor defines various relationships between points and lines in three views. These incidence relations are trilinear equations and are therefore often denoted as trilinearities. The incidence relations are listed in Table B.1. T_i^{jk} is the trifocal tensor in tensor notation. Point correspondences between three views are given as x ↔ x′ ↔ x″; similarly, line correspondences are given as l ↔ l′ ↔ l″. The trilinearities are the basic equations for the computation of the trifocal tensor. The trifocal tensor can be computed from point or line correspondences between three views. With the use of the trilinearities an equation system of the form At = 0 can be generated, where t contains the 27 entries of the trifocal tensor. To solve for the 27 entries of T_i^{jk} up to scale, 26 equations are necessary. With more than 26 equations a least squares solution can be computed. Using point correspondences (the point-point-point incidence relation), at least 7 point correspondences are necessary, as each point-point-point incidence gives 4 linearly independent equations (a six-point algorithm, which produces up to three possible solutions, has been proposed in [108]). Each of the trilinearities can be used to generate the equation system. The different methods for the computation of the trifocal tensor are listed and described in detail in [44].
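To make the construction of the equation system concrete, the following Python/NumPy sketch builds A from point-point-point correspondences via the corresponding trilinearity of Table B.1 and solves At = 0 in the least squares sense. The function names and the storage order T[i, p, q] = T_i^{pq} are illustrative assumptions, and data normalization as well as the enforcement of the internal tensor constraints, as discussed in [44], are omitted here.

    import numpy as np

    # Levi-Civita symbol epsilon[a, b, c]
    EPS = np.zeros((3, 3, 3))
    for a, b, c in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        EPS[a, b, c] = 1.0
        EPS[a, c, b] = -1.0

    def trilinearity_rows(x1, x2, x3):
        """Rows of A t = 0 contributed by one point-point-point correspondence
        x1 <-> x2 <-> x3 (homogeneous 3-vectors). Implements
        x^i (x'^j eps_jpr)(x''^k eps_kqs) T_i^{pq} = 0_rs: 9 equations per
        correspondence, 4 of them linearly independent."""
        M2 = np.einsum('j,jpr->pr', x2, EPS)   # contraction of x' with epsilon
        M3 = np.einsum('k,kqs->qs', x3, EPS)   # contraction of x'' with epsilon
        # coeff[r, s, i, p, q] multiplies the tensor entry T_i^{pq} in eq. (r, s)
        coeff = np.einsum('i,pr,qs->rsipq', np.asarray(x1, float), M2, M3)
        return coeff.reshape(9, 27)

    def linear_trifocal_tensor(pts1, pts2, pts3):
        """DLT-style linear estimate of T from >= 7 point correspondences:
        stack the rows and take the null vector of A as the 27 tensor entries."""
        A = np.vstack([trilinearity_rows(p1, p2, p3)
                       for p1, p2, p3 in zip(pts1, pts2, pts3)])
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1].reshape(3, 3, 3)         # T[i, p, q] = T_i^{pq}, up to scale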

B.2 Point transfer

Knowing the trifocal tensor for a set of three images, it is possible to transfer point locations known in two of the images into the third one. This is often denoted as the point transfer property. The trilinearities again provide the basis for the point transfer; the transfer property also holds for line correspondences. In the following, the algorithm to transfer a point given in the first two views into the third view is outlined in detail (as described in [44]):

1. Extract the fundamental matrix F_21 from the trifocal tensor and correct x ↔ x′ to the exact correspondence x̂ ↔ x̂′ (by optimal triangulation and re-projection into the images).


Trilinearities

Line-line-line correspondence:      (l_r ε^{ris}) l′_j l″_k T_i^{jk} = 0^s
Point-line-line correspondence:     x^i l′_j l″_k T_i^{jk} = 0
Point-line-point correspondence:    x^i l′_j (x″^k ε_{kqs}) T_i^{jq} = 0_s
Point-point-line correspondence:    x^i (x′^j ε_{jpr}) l″_k T_i^{pk} = 0_r
Point-point-point correspondence:   x^i (x′^j ε_{jpr}) (x″^k ε_{kqs}) T_i^{pq} = 0_{rs}

Table B.1: Summary of the incidence relations (trilinearities) imposed by the trifocal tensor.

2. Next compute the line l′ through x̂′ which is perpendicular to the epipolar line of x̂, defined by l′_e = F_21 x̂. Then l′ = (l_2, −l_1, −x̂_1 l_2 + x̂_2 l_1)^T with l′_e = (l_1, l_2, l_3)^T and x̂′ = (x̂_1, x̂_2, 1)^T.

3. The transferred point is x″^k = x̂^i l′_j T_i^{jk}, where l′_j T_i^{jk} is the homography mapping H_i^k = H_13(l′).

The point transfer into the other views works similarly; the corresponding equations are given in Table B.2.

view 2,3 → 1:  l′_e = F_21^T x̂″,  l′ = (l_2, −l_1, −x̂″_1 l_2 + x̂″_2 l_1)^T,  H_13(l′) = H_i^k = l′_j T_i^{jk},  x = H_13(l′)^{-1} x″

view 1,3 → 2:  l″_e = F_31 x̂,  l″ = (l_2, −l_1, −x̂_1 l_2 + x̂_2 l_1)^T,  H_12(l″) = H_i^j = l″_k T_i^{jk},  x′ = H_12(l″) x

view 1,2 → 3:  l′_e = F_21 x̂,  l′ = (l_2, −l_1, −x̂_1 l_2 + x̂_2 l_1)^T,  H_13(l′) = H_i^k = l′_j T_i^{jk},  x″ = H_13(l′) x

Table B.2: Relations to transfer a point into each view using the trifocal tensor.
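The following Python/NumPy sketch carries out the 1,2 → 3 transfer of the table for a single correspondence. It is an illustrative helper (the function name and the indexing convention T[i, j, k] = T_i^{jk} are assumptions), it takes F_21 in the convention x′^T F_21 x = 0, and it skips the optimal correction of the correspondence from step 1.

    import numpy as np

    def transfer_point_view3(T, F21, x1, x2):
        """Transfer the correspondence x1 <-> x2 (homogeneous points in views
        1 and 2) into view 3 with the trifocal tensor T, following steps 2-3."""
        x1 = np.asarray(x1, dtype=float)
        x2 = np.asarray(x2, dtype=float)
        x2 = x2 / x2[2]                      # normalize so that x2 = (x, y, 1)
        l_e = F21 @ x1                       # epipolar line of x1 in view 2
        # line through x2 perpendicular to the epipolar line (step 2)
        l_perp = np.array([l_e[1], -l_e[0],
                           -x2[0] * l_e[1] + x2[1] * l_e[0]])
        # homography H_13(l'): H[k, i] = l'_j T_i^{jk} (step 3)
        H = np.einsum('j,ijk->ki', l_perp, T)
        x3 = H @ x1                          # transferred point x''^k
        return x3 / x3[2]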


Bibliography

[1] S. Atiya and G. Hager. Real-time vision-based robot localization. IEEE Transactions on Robotics and Automation, 9:785–800, 1993.
[2] N. Ayache and O. Faugeras. Maintaining representations of the environment of a mobile robot. IEEE Transactions on Robotics and Automation, 5(6):804–819, 1989.
[3] C. Baillard and A. Zisserman. A plane-sweep strategy for the 3d reconstruction of buildings from multiple images. In International Archives of Photogrammetry and Remote Sensing, volume 32, pages 56–62, 2000.
[4] C. Baillard, C. Schmid, A. Zisserman, and A. W. Fitzgibbon. Automatic line matching and 3d reconstruction of buildings from multiple views. In Proc. ISPRS Conference on Automatic Extraction of GIS Objects from Digital Imagery, Munich, pages 69–80, 1999.
[5] J. Bauer, K. Karner, and K. Schindler. Plane parameter estimation by edge set matching. In Proc. 26th Workshop of the Austrian Association for Pattern Recognition, Graz, Austria, pages 29–36, 2002.
[6] A. Baumberg. Reliable feature matching across widely separated views. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, South Carolina, pages 774–781, 2000.
[7] P. R. Beaudet. Rotationally invariant image operators. International Joint Conference on Pattern Recognition, pages 579–583, 1978.
[8] P. Besl and N. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.
[9] J. Bigün and G. H. Granlund. Optimal orientation detection of linear symmetry. In Proc. 1st International Conference on Computer Vision, London, UK, pages 433–438, 1987.
[10] M. Bosse, P. Newman, J. Leonard, and S. Teller. An atlas framework for scalable mapping. In IEEE International Conference on Robotics and Automation, pages 1234–1240, 2003.
[11] D. A. Brannan, M. F. Esplen, and J. J. Gray. Geometry. Cambridge University Press, 1999.
[12] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47(1-3):139–159, 1991.
[13] M. Brown and D. Lowe. Invariant features from interest point groups. In Proc. 13th British Machine Vision Conference, Cardiff, UK, pages 253–262, 2002.

[14] J. Buhmann, W. Burgard, A. Cremers, D. Fox, T. Hofmann, F. Schneider, J. Strikos, and S. Thrun. The mobile robot Rhino. AI Magazine, 16(1), 1995.
[15] J. Canny. Finding edges and lines in images. In MIT AI-TR, 1983.
[16] G. Carneiro and A. Jepson. Phase-based local features. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages I: 282–296, 2002.
[17] H. Christensen, N. Kirkeby, S. Kristensen, and L. Knudsen. Model-driven vision for in-door navigation. Robotics and Autonomous Systems, 12:199–207, 1994.
[18] D. Comaniciu and P. Meer. Mean shift analysis and applications. In Proc. 7th International Conference on Computer Vision, Kerkyra, Greece, pages 1197–1203, 1999.
[19] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge MA, 1990.
[20] I. Cox. Blanche: An experiment in guidance and navigation of an autonomous robot vehicle. IEEE Transactions on Robotics and Automation, 7(2):193–204, 1991.
[21] A. Davison and D. Murray. Simultaneous localization and map-building using active vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):865–880, 2002.
[22] A. J. Davison. Mobile Robot Navigation Using Active Vision. PhD thesis, University of Oxford, 1999.
[23] G. de Souza and A. Kak. Vision for mobile robot navigation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):237–267, 2002.
[24] D. DeMenthon and L. Davis. Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15(1-2):123–141, 1995.
[25] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, 2001.
[26] P. J. Elbischger. Circular data - About the representation and calculation of oriented and directed 2D data. Technical Report ICG-TR-1, Institute for Computer Graphics and Vision, Graz University of Technology, 2003.
[27] S. P. Engelson and D. V. McDermott. Error correction in mobile robot map learning. In Proc. IEEE International Conference on Robotics and Automation, Washington D.C., US, pages 2555–2560, 1992.
[28] M. A. Fischler and R. C. Bolles. RANSAC random sampling consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of ACM, 26:381–395, 1981.
[29] J. Folkesson, P. Jensfelt, and H. Christensen. Vision slam in the measurement subspace. In Proc. IEEE International Conference on Robotics and Automation, Barcelona, Spain, pages 30–35, 2005.
[30] W. Förstner and E. Gülch. A fast operator for detection and precise location of distinct points, corners and centres of circular features. In ISPRS Intercommission Workshop, Interlaken, June 1987.

[31] F. Fraundorfer and H. Bischof. Detecting distinguished regions by saliency. In Proc. 13th Scandinavian Conference on Image Analysis, Göteborg, Sweden, pages 208–215, 2003.
[32] F. Fraundorfer and H. Bischof. Evaluation of local detectors on non-planar scenes. In Proc. 28th Workshop of the Austrian Association for Pattern Recognition, Hagenberg, Austria, pages 125–132, 2004.
[33] F. Fraundorfer and H. Bischof. A novel performance evaluation method of local detectors on non-planar scenes. In Workshop Proceedings Empirical Evaluation Methods in Computer Vision, IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, 2005.
[34] F. Fraundorfer and H. Bischof. Global localization from a single feature correspondence. In Proc. 30th Workshop of the Austrian Association for Pattern Recognition, Obergurgl, Austria, pages 151–160, 2006.
[35] F. Fraundorfer, S. Ober, and H. Bischof. Natural, salient image patches for robot localization. In Proc. International Conference on Pattern Recognition, Cambridge, UK, pages 881–884, 2004.
[36] F. Fraundorfer, M. Winter, and H. Bischof. MSCC: Maximally stable corner clusters. In Proc. 14th Scandinavian Conference on Image Analysis, Joensuu, Finland, pages 45–54, 2005.
[37] F. Fraundorfer, M. Winter, and H. Bischof. Maximally stable corner clusters: A novel distinguished region detector and descriptor. In Proc. 1st Austrian Cognitive Vision Workshop, Zell an der Pram, Austria, pages 59–66, 2005.
[38] F. Fraundorfer, K. Schindler, and H. Bischof. Piecewise planar scene reconstruction from sparse correspondences. Image and Vision Computing, 24(4):395–406, 2006.
[39] T. Goedeme, M. Nuttin, T. Tuytelaars, and L. Van Gool. Markerless computer vision based localization using automatically generated topological maps. In European Navigation Conference GNSS, Rotterdam, 2004.
[40] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[41] R. Hartley. Projective reconstruction from line correspondences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Washington, pages 903–907, 1994.
[42] R. Hartley. Lines and points in three views and the trifocal tensor. International Journal of Computer Vision, 22(2):125–140, 1997.
[43] R. Hartley. Theory and practice of projective rectification. International Journal of Computer Vision, 35(2):115–127, 1999.
[44] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge, 2000.
[45] J. Hayet, F. Lerasle, and M. Devy. Planar landmarks to localize a mobile robot. In 8th International Symposium on Intelligent Robotic Systems, Reading, UK, pages 163–169, 2000.

[46] H. Hirschmüller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, pages II: 807–814, 2005.
[47] B. Horn. Closed form solutions of absolute orientation using unit quaternions. Journal of the Optical Society of America, 4(4):629–642, 1987.
[48] P. Hough. Method and means for recognizing complex patterns. 1962.
[49] D. Huttenlocher, G. Klanderman, and W. Rucklidge. Comparing images using the hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):850–863, 1993.
[50] ICG. Giplib - general image processing library, 2005. URL http://www.icg.tu-graz.ac.at/research/ComputerVision/giplib.
[51] M. Jogan, A. Leonardis, H. Wildenauer, and H. Bischof. Mobile robot localization under varying illumination. In Proc. International Conference on Pattern Recognition, Quebec City, Canada, pages II: 741–744, 2002.
[52] T. Kadir and M. Brady. Saliency, scale and image description. International Journal of Computer Vision, 45(2):83–105, 2001.
[53] T. Kadir, A. Zisserman, and M. Brady. An affine invariant salient region detector. In Proc. 7th European Conference on Computer Vision, Prague, Czech Republic, pages I: 228–241, 2004.
[54] R. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME: Journal of Basic Engineering, pages 35–45, 1960.
[55] D. R. Karger, P. N. Klein, and R. E. Tarjan. A randomized linear-time algorithm to find minimum spanning trees. Journal of the Association for Computing Machinery, 42(2):321–328, 1995.
[56] N. Karlsson, E. Di Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, and M. E. Munich. The vslam algorithm for robust localization and mapping. In Proc. IEEE International Conference on Robotics and Automation, Barcelona, Spain, pages 24–29, 2005.
[57] L. Kitchen and A. Rosenfeld. Gray level corner detection. Pattern Recognition Letters, 1:95–102, 1982.
[58] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages III: 82–96, 2002.
[59] A. Kosaka and A. Kak. Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties. Computer Vision, Graphics and Image Processing, 56(3):271–329, 1992.
[60] J. Kosecka and X. Yang. Location recognition and global localization based on scale invariant features. In Workshop on Statistical Learning in Computer Vision, Proc. 7th European Conference on Computer Vision, Prague, Czech Republic, 2004.

[61] U. Köthe. Edge and junction detection with an improved structure tensor. Proc. 25th DAGM Pattern Recognition Symposium, Magdeburg, Germany, pages 25–32, 2003.
[62] Z. Lan and R. Mohr. Direct linear sub-pixel correlation by incorporation of neighbor pixels information and robust estimation of window transformation. Machine Vision and Applications, 10(5-6):256–268, 1998.
[63] T. Lindeberg. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, 21(2):224–270, 1994.
[64] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):79–116, 1998.
[65] S. Livatino. Acquisition and Recognition of Natural Landmarks for Vision-Based Autonomous Robot Navigation. PhD thesis, Aalborg University, 2003.
[66] D. Lowe. Object recognition from local scale-invariant features. In Proc. 7th International Conference on Computer Vision, Kerkyra, Greece, pages 1150–1157, 1999.
[67] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[68] C. Lu, G. Hager, and E. Mjolsness. Fast and globally convergent pose estimation from video images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6):610–622, 2000.
[69] Q.-T. Luong and T. Vieville. Canonical representations for the geometries of multiple projective views. Computer Vision and Image Understanding, 64(2):193–229, 1996.
[70] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. 13th British Machine Vision Conference, Cardiff, UK, pages 384–393, 2002.
[71] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, pages II: 257–263, 2003.
[72] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proc. of the 8th International Conference on Computer Vision, Vancouver, Canada, pages 525–531, 2001.
[73] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages I: 128–142, 2002.
[74] K. Mikolajczyk and C. Schmid. Comparison of affine-invariant local detectors and descriptors. In Proc. 12th European Signal Processing Conference, Vienna, Austria, 2004.
[75] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63–86, 2004.
[76] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43–72, 2005.

[77] N. Molton, A. Davison, and I. Reid. Locally planar patch features for real-time structure from motion. In Proc. 14th British Machine Vision Conference, London, UK, 2004.
[78] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proc. of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002.
[79] H. Moravec. Towards automatic visual obstacle avoidance. In Proc. of the 5th International Joint Conference on Artificial Intelligence, page 584, 1977.
[80] H. Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. In tech. report CMU-RI-TR-80-03, Robotics Institute, Carnegie Mellon University, September 1980. Available as Stanford AIM-340, CS-80-813 and republished as a Carnegie Mellon University Robotics Institute Technical Report to increase availability.
[81] H. Moravec and A. Elfes. High resolution maps from wide angle sonar. In Proc. IEEE International Conference on Intelligent Robots and Systems, pages 116–121, 1985.
[82] J. Neira, M. I. Ribeiro, and J. D. Tardos. Mobile robot localisation and map building using monocular vision. In International Symposium On Intelligent Robotics Systems, Stockholm, Sweden, 1997.
[83] D. Nister. An efficient solution to the five-point relative pose problem. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, pages II: 195–202, 2003.
[84] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, pages I: 652–659, 2004.
[85] S. Obdrzalek and J. Matas. Object recognition using local affine frames on distinguished regions. In Proc. 13th British Machine Vision Conference, Cardiff, UK, pages 113–122, 2002.
[86] C. Olson, L. Matthies, M. Schoppers, and M. Maimone. Robust stereo ego-motion for long distance navigation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, South Carolina, pages II: 453–458, 2000.
[87] R. Perko. Computer Vision For Large Format Digital Aerial Cameras. PhD thesis, Graz University of Technology, 2004.
[88] M. Pollefeys, R. Koch, and L. Van Gool. Self calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In Proc. 6th International Conference on Computer Vision, Bombay, India, pages 90–96, 1998.
[89] P. Pritchett and A. Zisserman. Wide baseline stereo matching. In Proc. 6th International Conference on Computer Vision, Bombay, India, pages 754–760, 1998.
[90] V. Ramachandran. In Spatial vision in humans and robotics, L. Harris, editor, Cambridge University Press, 1991.
[91] K. Rohr. Localization properties of direct corner detectors. Journal of Mathematical Imaging and Vision, 4:139–150, 1994.

[92] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or ’How do I organize my holiday snaps?’. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages I: 414–431, 2002.
[93] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, 2002.
[94] K. Schindler. Generalized use of homographies for piecewise planar reconstruction. In Proc. 13th Scandinavian Conference on Image Analysis, Göteborg, Sweden, pages 470–476, 2003.
[95] C. Schmid, R. Mohr, and C. Bauckhage. Comparing and evaluating interest points. In Proc. 6th International Conference on Computer Vision, Bombay, India, pages 230–235, 1998.
[96] S. Se, D. G. Lowe, and J. J. Little. Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21(3):364–375, 2005.
[97] R. Sedgewick. Algorithms. Addison-Wesley, 2nd edition, 1988.
[98] R. Sim and G. Dudek. Mobile robot localization from learned landmarks. In Proc. of the IEEE/RSJ Conference on Intelligent Robots and Systems, pages 1060–1065, Victoria, Canada, 1998.
[99] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. 9th IEEE International Conference on Computer Vision, Nice, France, pages 1470–1477, 2003.
[100] P. Smith, T. Drummond, and K. Roussopoulos. Computing map trajectories by representing, propagating and combining pdfs over groups. In Proc. 9th IEEE International Conference on Computer Vision, Nice, France, pages 1275–1282, 2003.
[101] M. Spetsakis and Y. Aloimonos. A multi-frame approach to visual motion perception. International Journal of Computer Vision, 6(3):245–255, 1991.
[102] J. Sun, Y. Li, S. Kang, and H. Shum. Symmetric stereo matching for occlusion handling. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, pages II: 399–406, 2005.
[103] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21–71, 1998.
[104] S. Thrun, M. Bennewitz, W. Burgard, A. Cremers, F. Dellaert, D. Fox, D. Hähnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. MINERVA: A second generation mobile tour-guide robot. In Proc. IEEE International Conference on Robotics and Automation, Detroit, US, pages 1999–2005, 1999.
[105] S. Thrun, D. Hähnel, D. Ferguson, M. Montemerlo, R. Triebel, W. Burgard, C. Baker, Z. Omohundro, S. Thayer, and W. Whittaker. A system for volumetric robotic mapping of abandoned mines. In Proc. IEEE International Conference on Robotics and Automation, Taipei, Taiwan, pages 4270–4275, 2003.

[106] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
[107] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, 1991.
[108] P. Torr and A. Zisserman. Robust parameterization and computation of the trifocal tensor. In Proc. 7th British Machine Vision Conference, Edinburgh, UK, 1996.
[109] B. Triggs. Matching constraints and the joint image. In Proc. 5th International Conference on Computer Vision, Boston, Massachusetts, pages 338–343, 1995.
[110] T. Tsubouchi and S. Yuta. Map assisted vision system of mobile robots for reckoning in a building environment. In Proc. IEEE International Conference on Robotics and Automation, Raleigh, US, pages 1978–1984, 1987.
[111] T. Tuytelaars and L. Van Gool. Content-based image retrieval based on local affinely invariant regions. In Visual Information and Information Systems, pages 493–500, 1999.
[112] T. Tuytelaars and L. Van Gool. Matching widely separated views based on affine invariant regions. International Journal of Computer Vision, 1(59):61–85, 2004.
[113] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In Proc. 11th British Machine Vision Conference, Bristol, UK, pages 412–422, 2000.
[114] M. Vincze, M. Ayromlou, C. Beltran, A. Gasteratos, S. Hoffgaard, O. Madsen, W. Ponweiser, and M. Zillich. A system to navigate a robot into a ship structure. Machine Vision and Applications, 14(1):15–25, 2003.
[115] E. W. Weisstein. Hessian. Eric Weisstein’s World of Mathematics. http://mathworld.wolfram.com/Hessian.html, 1999-2003.
[116] T. Werner and A. Zisserman. New techniques for automated architecture reconstruction from photographs. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 541–555, 2002.
[117] A. Witkin. Scale-space filtering. In International Joint Conference on Artificial Intelligence, pages 1019–1022, 1983.
[118] D. C. Yuen and B. A. MacDonald. Considerations for the mobile robot implementation of panoramic stereo vision system with a single optical centre. In Proc. Image and Vision Computing New Zealand, Auckland, pages 335–340, 2002.
[119] A. Zisserman, T. Werner, and F. Schaffalitzky. Towards automated reconstruction of architectural scenes from multiple images. In Proc. 25th Workshop of the Austrian Association for Pattern Recognition, Berchtesgaden, Germany, pages 9–23, 2001.
